Applying Data Science Techniques to Analyze the Yelp Dataset: Part 4

In this part I analyse the Yelp dataset from the customers' perspective. In Part 1, I cleaned the data and made it ready for analysis. In Part 2 and Part 3, I analysed the dataset from the business perspective.

Types of Users:

Not all users are the same: some write lots of reviews, some write only a few, some are popular and some are not. Let’s start by classifying the reviewers of Edinburgh. I want to classify them based on two different feature sets:

  1. Based on reviewers' popularity and activity on Yelp.
  2. Based on the number of different business categories users reviewed on Yelp.

Based on Reviewers' Popularity and Activity on Yelp:

I used K-Means clustering to cluster the user dataset. For this, I used the whole user dataset, which contains users from all the available countries. Once a cluster was assigned to each user, I kept only the users who have reviewed a business in Edinburgh for further inference. I used the following features:

  1. Number of compliments (cute, hot, list, plain, writer, photo, note, funny, profile and cool)
  2. Number of votes (cool, funny and useful)
  3. Number of years the user has been an elite user
  4. Number of friends
  5. Average rating
  6. Number of reviews
  7. Number of fans

Determining the value of K for the K-Means algorithm: I used the elbow method and plotted the average distance of observations from their cluster centroid for different values of K. Initially I did the clustering using K = 5. On comparing the users' characteristics, it turned out that the 3 middle clusters were not much different from each other. Hence, to simplify the result, I did the clustering again using K = 3.
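
The post does not say which tool was used for the clustering, so here is a minimal plain-Python sketch of the idea: run K-Means for several values of K and record the average distance of observations from their assigned centroid. The function names, the toy data and the use of a few random restarts are my assumptions for illustration, not the original code.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k observations picked at random

    def nearest(p):
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iterations):
        labels = [nearest(p) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Move the centroid to the mean of its members; keep it if the cluster is empty.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p) for p in points]


def average_distance(points, centroids, labels):
    """Mean Euclidean distance from each observation to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) for p, lab in zip(points, labels)) / len(points)


def elbow_curve(points, k_values, restarts=5):
    """Lowest average distance over a few random restarts, for each K."""
    curve = {}
    for k in k_values:
        curve[k] = min(average_distance(points, *kmeans(points, k, seed=s))
                       for s in range(restarts))
    return curve


# Hypothetical, already-scaled toy features (say review count and fans) in three loose groups.
offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
toy_users = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]

curve = elbow_curve(toy_users, range(1, 6))
for k, dist in curve.items():
    print(k, round(dist, 2))
```

Plotting `curve` and looking for the K after which the distance stops dropping sharply gives the elbow. On real Yelp features, which have very different ranges (votes versus average rating, for example), the columns would probably need scaling first; whether the original analysis did that is not stated.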

Looking at the users' characteristics, I named the clusters as follows:

  • Outstanding Users: There are very few of them, only 75 in the whole user dataset. They have written thousands of reviews, and their reviews are well appreciated by other members in the form of votes and compliments. They have a huge fan following and a very good network of friends.
  • Active Users: They are very active and write thousands of reviews. They do get votes and compliments and have fans and friends, but far fewer than outstanding users.
  • Average Users: All the contributors who are not outstanding or active are average contributors. They make up 99% of the whole user dataset.

From the cluster plot, it seems as if the Average User count is very low compared to the other two. So, I plotted all the clusters individually. I found that the outstanding users are few but sparsely spread (Figure (b)), whereas the Average Users are very numerous but concentrated (Figure (a)), which is why they look like a small group in the combined plot.


Also, when I compared the percentage of outstanding and active users in Edinburgh with that of the overall user base, I found that their concentration is comparatively higher in Edinburgh (Table-1).
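
The comparison above comes down to computing each cluster's share twice: once over all users, and once over only the users who reviewed an Edinburgh business. A small sketch follows, with made-up user IDs and cluster labels (none of these values come from the dataset or from Table-1):

```python
from collections import Counter


def cluster_shares(labels):
    """Percentage of users in each cluster."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical cluster assignments for the whole user dataset.
user_cluster = {
    "u1": "Average", "u2": "Average", "u3": "Average", "u4": "Average",
    "u5": "Average", "u6": "Average", "u7": "Active", "u8": "Outstanding",
}
# Hypothetical set of users with at least one review of an Edinburgh business.
edinburgh_reviewers = {"u1", "u2", "u7", "u8"}

overall = cluster_shares(user_cluster.values())
edinburgh = cluster_shares(c for u, c in user_cluster.items() if u in edinburgh_reviewers)

for name in overall:
    print(name, overall[name], edinburgh.get(name, 0.0))
```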

Based on the Number of Different Business Categories Users Reviewed on Yelp:

Another way of analysing user activity is to determine how many different categories a user reviews. I chose 4 categories as the threshold: if a user has reviewed businesses from 4 or more different categories, then he or she is an enthusiastic Yelp user. I used network analysis for this task and created a directed network from users to the business categories they reviewed.
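
Stated without the network, the rule is simply "count the distinct categories per user and compare with 4". Here is a minimal sketch of that rule; the `(user_id, category)` pairs and the function names are hypothetical:

```python
from collections import defaultdict

ENTHUSIAST_THRESHOLD = 4  # threshold chosen in the text above


def categories_per_user(review_pairs):
    """Map each user to the set of distinct categories they reviewed."""
    seen = defaultdict(set)
    for user_id, category in review_pairs:
        seen[user_id].add(category)  # a set ignores repeat reviews in one category
    return seen


def enthusiastic_users(review_pairs, threshold=ENTHUSIAST_THRESHOLD):
    """Users who reviewed at least `threshold` distinct categories."""
    return {u for u, cats in categories_per_user(review_pairs).items()
            if len(cats) >= threshold}


# Hypothetical (user_id, main category) pairs, one per review.
toy_pairs = [
    ("u1", "Restaurants"), ("u1", "Restaurants"), ("u1", "Nightlife"),
    ("u2", "Restaurants"), ("u2", "Nightlife"), ("u2", "Food"), ("u2", "Hotels & Travel"),
    ("u3", "Food"),
]

category_counts = {u: len(c) for u, c in categories_per_user(toy_pairs).items()}
enthusiasts = enthusiastic_users(toy_pairs)
print(category_counts, enthusiasts)
```

In the network view, the number of distinct categories is the user node's out-degree once duplicate edges are removed.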

Approach:

  • Merge the Review dataset and the Business dataset on business_id, and merge the result with the User dataset on user_id.
  • I used Gephi to answer this question through visualisation. I created two CSV files and deleted all the duplicate rows from them:
    • One listing all users (ID and name) and all main categories (I gave an ID to each main category).
    • One listing the links, where Source is the user ID and Destination is the main category ID.
  • Using these two files, I created a directed network graph from users to business categories and then sized the nodes by their in-degree. Since a user’s in-degree is always 0, a user node will always be smaller than a category node.
  • I first started with all the users and then kept filtering based on degree range. All the users who have reviewed only 1 business category have degree 1. By filtering them out, I was left with all the users who have reviewed 2 or more business categories. I repeated the process until only users who reviewed 4 or more categories remained.
  • I recorded the percentage of nodes remaining (which approximates the percentage of users) after every iteration.
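
The merge and the two Gephi input files described in the steps above can be sketched with the standard library alone. The record layout (a `main_category` field on each business, a `name` field on each user) and the `cat_N` ID scheme are assumptions for illustration. Gephi's spreadsheet importer expects the edge columns to be called Source and Target, so the sketch uses Target for what the text calls Destination.

```python
import csv
import io


def build_gephi_tables(reviews, businesses, users):
    """Join reviews to businesses and users; return (nodes, edges) without duplicates."""
    category_of = {b["business_id"]: b["main_category"] for b in businesses}
    name_of = {u["user_id"]: u["name"] for u in users}
    category_ids = {}  # main category name -> generated node ID
    nodes = {}         # node ID -> label (users and categories together)
    edges = set()      # a set drops duplicate user -> category rows
    for r in reviews:
        user_id, business_id = r["user_id"], r["business_id"]
        if user_id not in name_of or business_id not in category_of:
            continue  # behave like an inner join: skip unmatched reviews
        category = category_of[business_id]
        cat_id = category_ids.setdefault(category, f"cat_{len(category_ids) + 1}")
        nodes[user_id] = name_of[user_id]
        nodes[cat_id] = category
        edges.add((user_id, cat_id))
    return nodes, sorted(edges)


def to_csv(header, rows):
    """Render rows as CSV text; write this string to a file to import it into Gephi."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical toy records standing in for the three Yelp files.
businesses = [
    {"business_id": "b1", "main_category": "Restaurants"},
    {"business_id": "b2", "main_category": "Restaurants"},
    {"business_id": "b3", "main_category": "Nightlife"},
]
users = [{"user_id": "u1", "name": "Ann"}, {"user_id": "u2", "name": "Bob"}]
reviews = [
    {"user_id": "u1", "business_id": "b1"},
    {"user_id": "u1", "business_id": "b2"},  # same category again: duplicate edge
    {"user_id": "u1", "business_id": "b3"},
    {"user_id": "u2", "business_id": "b1"},
]

nodes, edges = build_gephi_tables(reviews, businesses, users)
nodes_csv = to_csv(["Id", "Label"], nodes.items())
edges_csv = to_csv(["Source", "Target"], edges)
print(nodes_csv)
print(edges_csv)
```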

Analysis and Inference:

I used the Yifan Hu layout for visualisation, as it clusters similar nodes, and sized the nodes by their in-degree.

Iteration 0: Total Number of Users 4986 (100%)


The initial graph has all the users who have reviewed 1 or more business categories. The big blue dot in the middle is Restaurants. It has been reviewed by the most users, hence its in-degree is very high.




Iteration 1: Total Number of Users 1769 (35.76%)

In the next iteration, I filtered for a degree range of 2 or more.

  • Around 65% of Yelp users use Yelp to review only one business category.
  • 4 business categories stand out: Restaurants, Nightlife, Food and Hotels & Travel.
  • We can see three big clusters of people who review only:
    • Restaurants and Nightlife
    • Restaurants and Food (they may be foodies who are not interested in Nightlife)
    • Restaurants and Hotels & Travel (they may be tourists and hence interested in these two categories)




Iteration 2: Total Number of Users 848 (17.37%)

I filtered for a degree range of 3 or more. The resulting graph tells us that:

  • Around 83% of Yelp users use Yelp to review up to 2 business categories.
  • 4 business categories stand out: Restaurants, Nightlife, Food and Hotels & Travel.
  • We can see two big clusters of people who review only:
    • Restaurants, Food and Nightlife (they might be locals)
    • Restaurants, Hotels & Travel and Nightlife (they might be tourists)





Iteration 3: Total Number of Users 444 (9.31%)

I filtered for a degree range of 4 or more. The graph tells us that:

  • Around 91% of Yelp users use Yelp to review up to 3 business categories.
  • Around 9% are “Enthusiastic Yelp Users”, who use Yelp for at least 4 categories.
  • They are valuable because they also promote Yelp activity in other business categories.
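
The degree-range filtering used in iterations 1 to 3 can be reproduced outside Gephi as a rough sketch: count each node's total degree and keep the nodes at or above the cut-off. The toy edges below are hypothetical. The second part is an arithmetic check on the reported figures: the percentages match (users + 22) / (4986 + 22), which suggests that 22 category nodes were counted alongside the users. The number 22 is my inference from the figures, not something stated in the post.

```python
from collections import Counter


def degree_filter(edges, min_degree):
    """Nodes whose total degree (in + out) is at least min_degree."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return {node for node, d in degree.items() if d >= min_degree}


# Hypothetical de-duplicated user -> category edges.
toy_edges = [
    ("u1", "Restaurants"),
    ("u2", "Restaurants"), ("u2", "Nightlife"),
    ("u3", "Restaurants"), ("u3", "Nightlife"), ("u3", "Food"),
    ("u4", "Restaurants"), ("u4", "Nightlife"), ("u4", "Food"), ("u4", "Hotels & Travel"),
]
all_nodes = degree_filter(toy_edges, 1)
for min_degree in (2, 3, 4):
    kept = degree_filter(toy_edges, min_degree)
    print(min_degree, sorted(kept), round(100 * len(kept) / len(all_nodes), 2))


def percent_of_nodes(users_left, total_users=4986, category_nodes=22):
    """Share of nodes left, assuming category nodes are never filtered out."""
    return round(100 * (users_left + category_nodes) / (total_users + category_nodes), 2)


# User counts and percentages as reported for iterations 1, 2 and 3.
reported = {1769: 35.76, 848: 17.37, 444: 9.31}
recomputed = {users_left: percent_of_nodes(users_left) for users_left in reported}
print(recomputed)
```

Measured against users alone, the shares would be slightly lower (for example 1769 / 4986 is about 35.5%), which is consistent with the earlier remark that the node percentage only approximates the user percentage.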






Conclusion: Now we know the active users as well as the enthusiastic users in Edinburgh. Yelp should find ways to promote them and motivate others to be like them.
