Applying Data Science Techniques to Analyze the Yelp Dataset, Part 4

This is the part where I analyse the Yelp dataset from the customers' perspective. In Part 1, I cleaned the data and made it analysis-ready. In Parts 2 and 3, I analysed the dataset from the business perspective.

Types of Users:

Not all users are the same: some write lots of reviews, some write only a few, some are popular, and some are not. Let's start by classifying the reviewers of Edinburgh. I want to classify them based on two different feature sets:

  1. Based on reviewers' popularity and activity on Yelp.
  2. Based on the number of different business categories users reviewed on Yelp.

Based on Reviewers' Popularity and Activity on Yelp:

I used K-Means clustering to cluster the user dataset. For this, I used the whole User dataset, which contains users from all the available countries. Once a class was assigned to each user, I kept only the users who have reviewed a business in Edinburgh for further inference. I used the following features:

  1. Number of compliments (cute, hot, list, plain, writer, photo, note, funny, profile and cool)
  2. Number of votes (cool, funny and useful)
  3. Number of years the user has been an elite user
  4. Number of friends
  5. Average rating
  6. Number of reviews
  7. Number of fans

Determining the value of K for the K-Means algorithm: I used the Elbow method and plotted the average distance of observations from their cluster centroid for different values of K. Initially I did the clustering using K = 5. On comparing the users' characteristics, it turned out that the 3 middle clusters were not much different from each other. Hence, to simplify the result, I did the clustering again using K = 3.
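As a rough illustration of the Elbow method described above, here is a minimal sketch in plain Python. It is not the code used for this analysis: the synthetic 2-D points stand in for the real user feature vectors, and the helper names (`kmeans`, `assign`, `average_distance`) are my own. It runs a basic K-Means for several values of K and records the average distance of each observation from its cluster centroid.

```python
import math
import random


def assign(points, centroids):
    """Return, for every point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=100, seed=0):
    """Basic Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen observations
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # Move the centroid to the mean of its members.
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, assign(points, centroids)


def average_distance(points, centroids, labels):
    """Average distance of observations from their own cluster centroid."""
    return sum(math.dist(p, centroids[l]) for p, l in zip(points, labels)) / len(points)


# Synthetic stand-in data (an assumption, not Yelp data): three separated groups.
rng = random.Random(42)
blobs = [(0, 0), (10, 10), (20, 0)]
points = [(rng.gauss(cx, 1), rng.gauss(cy, 1)) for cx, cy in blobs for _ in range(30)]

elbow = {}
for k in range(1, 6):
    centroids, labels = kmeans(points, k)
    elbow[k] = average_distance(points, centroids, labels)
    print(k, round(elbow[k], 2))
```

Plotting `elbow` against K and looking for the point where the curve stops dropping sharply gives a candidate K. A single random start can land in a poor local optimum, so in practice one would likely repeat each run with several seeds and keep the best result.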

Looking at the users' characteristics, I named the three clusters Average Users, Active Users and Outstanding Users.

From the cluster plot, it seems that the Average User count is very low compared to the other two clusters. So, I plotted each cluster individually. I found that the Outstanding Users are few but sparse (Figure b), whereas the Average Users are very numerous but concentrated (Figure a).


Also, when I compared the percentage of Outstanding Users and Active Users in Edinburgh with their percentage among all users, I found that the concentration levels are comparatively higher in Edinburgh (Table-1).
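The comparison above boils down to computing each cluster's share twice, once over all users and once over Edinburgh reviewers only. A minimal sketch follows, assuming each user already carries a cluster name. The user IDs, labels and resulting numbers are made-up toy values, not the figures from Table-1.

```python
from collections import Counter


def cluster_shares(labels):
    """Percentage of users falling in each cluster."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical toy data: user_id -> cluster name assigned by the clustering step.
user_cluster = {
    "u1": "Average",
    "u2": "Average",
    "u3": "Average",
    "u4": "Active",
    "u5": "Outstanding",
}
# Hypothetical set of users with at least one review of an Edinburgh business.
edinburgh_reviewers = {"u1", "u4", "u5"}

overall = cluster_shares(user_cluster.values())
edinburgh = cluster_shares(
    label for uid, label in user_cluster.items() if uid in edinburgh_reviewers
)

for name in overall:
    print(name, round(overall[name], 1), round(edinburgh.get(name, 0.0), 1))
```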

Based on the Number of Different Business Categories Users Reviewed on Yelp:

Another way of analysing user activity is to determine how many different categories a user reviews. I chose 4 categories as the threshold: if a user has reviewed businesses from 4 or more different categories, then he or she is an Enthusiastic Yelp User. I used network analysis for this task and created a directed network between users and the business categories they reviewed.
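The threshold rule itself can also be written without a graph tool. The sketch below is only an illustration under assumptions: the `(user_id, main_category)` pairs are toy values and the function names are hypothetical. It counts the distinct categories per user and applies the 4-category threshold.

```python
from collections import defaultdict

ENTHUSIASTIC_THRESHOLD = 4  # threshold chosen in the text


def categories_per_user(reviews):
    """reviews: iterable of (user_id, main_category) pairs; duplicates are allowed."""
    seen = defaultdict(set)
    for user_id, category in reviews:
        seen[user_id].add(category)  # a set ignores repeated categories
    return seen


def enthusiastic_users(reviews, threshold=ENTHUSIASTIC_THRESHOLD):
    """Users who reviewed businesses from at least `threshold` distinct categories."""
    return {
        user_id
        for user_id, categories in categories_per_user(reviews).items()
        if len(categories) >= threshold
    }


# Hypothetical toy reviews.
reviews = [
    ("u1", "Restaurants"), ("u1", "Nightlife"), ("u1", "Food"), ("u1", "Shopping"),
    ("u2", "Restaurants"), ("u2", "Restaurants"), ("u2", "Food"),
    ("u3", "Nightlife"),
]

print(enthusiastic_users(reviews))
```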

Approach:
  • Merge the Review dataset and the Business dataset on business_id, and merge the result with the User dataset on user_id.
  • I used Gephi to answer this question through visualisation. I created:
    • A CSV containing all users (ID and name) and the main categories (I gave an ID to each main category).
    • A CSV where the source is the user ID and the destination is the main category ID.
    • I deleted all the duplicate rows.
  • Using these two files, I created a directed network graph from users to business categories and then sized the nodes using the in-degree attribute. Since a user's in-degree is always 0, a user node will always be smaller than a category node.
  • I started with all the users and then kept filtering based on degree range. All the users who have reviewed only 1 business category have degree 1. By filtering them out, I was left with the users who have reviewed 2 or more business categories. I repeated the process until only the users who reviewed 4 or more categories remained.
  • I recorded the percentage of nodes remaining (which approximates the percentage of users) after every iteration.
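The data preparation steps above can be sketched as follows. This is a hedged illustration, not the original code: the input records are toy values, the `cat_N` ID scheme and function names are my own, and the `Id`/`Label` and `Source`/`Target` headers are the column names I would expect Gephi's spreadsheet import to recognise.

```python
import csv
import io


def build_gephi_tables(reviews, businesses, users):
    """
    reviews:    list of dicts with 'user_id' and 'business_id'
    businesses: dict mapping business_id -> main category name
    users:      dict mapping user_id -> user name
    Returns (nodes, edges) ready to be written as CSV rows.
    """
    category_ids = {}  # main category name -> generated node ID
    edges = set()      # a set removes duplicate (user, category) rows
    for review in reviews:
        category = businesses.get(review["business_id"])
        if category is None or review["user_id"] not in users:
            continue   # behaves like an inner join: unmatched rows are skipped
        cat_id = category_ids.setdefault(category, f"cat_{len(category_ids)}")
        edges.add((review["user_id"], cat_id))
    user_nodes = [(uid, users[uid]) for uid in sorted({u for u, _ in edges})]
    category_nodes = [(cid, name) for name, cid in category_ids.items()]
    return user_nodes + category_nodes, sorted(edges)


def to_csv(header, rows):
    """Serialise rows to CSV text; write this string to a file for Gephi."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical toy inputs.
reviews = [
    {"user_id": "u1", "business_id": "b1"},
    {"user_id": "u1", "business_id": "b2"},
    {"user_id": "u1", "business_id": "b1"},  # duplicate user/category pair
    {"user_id": "u2", "business_id": "b3"},
]
businesses = {"b1": "Restaurants", "b2": "Nightlife", "b3": "Restaurants"}
users = {"u1": "Alice", "u2": "Bob"}

nodes, edges = build_gephi_tables(reviews, businesses, users)
nodes_csv = to_csv(["Id", "Label"], nodes)
edges_csv = to_csv(["Source", "Target"], edges)
print(nodes_csv)
print(edges_csv)
```

Collecting the edges in a set is what drops the duplicate rows, so each user contributes at most one edge per category and the user's degree equals the number of distinct categories reviewed.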
Analysis and Inference:

I used the Yifan Hu layout for visualisation, as it clusters similar nodes, and sized the nodes based on the in-degree attribute.

Iteration 0: Total Number of Users 4986 (100%)


The initial graph has all the users who have reviewed 1 or more business categories. The big blue dot in the middle is Restaurants; it has received the most reviews, hence its in-degree value is very high.




Iteration 1: Total Number of Users 1769 (35.76%)

In the next iteration, I filtered for degree range 2 or more:

  • Around 65% of Yelp users use Yelp to review only one business category.
  • 4 business categories stand out: Restaurants, Nightlife, Food and Hotels & Travel.
  • We can see three big clusters of people who review only:
    • Restaurants and Nightlife
    • Restaurants and Food (they may be foodies who are not interested in Nightlife)
    • Restaurants and Hotels & Travel (they may be tourists and hence interested in these two categories)




Iteration 2: Total Number of Users 848 (17.37%)

I filtered for degree range 3 or more. The resulting graph tells us that:

  • Around 83% of Yelp users use Yelp to review up to 2 business categories.
  • 4 business categories stand out: Restaurants, Nightlife, Food and Hotels & Travel.
  • We can see two big clusters of people who review only:
    • Restaurants, Food and Nightlife (they might be locals)
    • Restaurants, Hotels & Travel and Nightlife (they might be tourists)





Iteration 3: Total Number of Users 444 (9.31%)

I filtered for degree range 4 or more. The graph tells us that:

  • Around 91% of Yelp users use Yelp to review up to 3 business categories.
  • Around 9% are "Enthusiastic Yelp Users", who use Yelp for at least 4 categories.
  • They are valuable because they promote Yelp activity in other business categories as well.
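The iterations above can be reproduced outside Gephi with a simple degree count. The sketch below makes two assumptions: the edge list is toy data, and category nodes always survive the degree filter. The second assumption would be one reason why the percentage of nodes only approximates the percentage of users.

```python
from collections import Counter


def degree_filter_summary(edges, max_degree=4):
    """
    edges: de-duplicated (user_id, category_id) pairs.
    Returns one tuple per iteration:
    (minimum degree, users kept, % of users kept, % of nodes kept).
    """
    out_degree = Counter(user for user, _ in edges)  # distinct categories per user
    categories = {category for _, category in edges}
    total_users = len(out_degree)
    total_nodes = total_users + len(categories)
    summary = []
    for min_degree in range(1, max_degree + 1):
        kept = sum(1 for degree in out_degree.values() if degree >= min_degree)
        pct_users = 100.0 * kept / total_users
        # Assumption: every category node stays in the filtered graph.
        pct_nodes = 100.0 * (kept + len(categories)) / total_nodes
        summary.append((min_degree, kept, round(pct_users, 2), round(pct_nodes, 2)))
    return summary


# Hypothetical toy edge list: u1 reviews 4 categories, u2 reviews 2, u3 and u4 one each.
edges = [
    ("u1", "c1"), ("u1", "c2"), ("u1", "c3"), ("u1", "c4"),
    ("u2", "c1"), ("u2", "c2"),
    ("u3", "c1"),
    ("u4", "c1"),
]

summary = degree_filter_summary(edges)
for row in summary:
    print(row)
```

In this toy graph the category nodes make up half of all nodes, so the two percentages diverge strongly. With thousands of users and only a handful of categories, as in the Edinburgh graph, they would be expected to stay close.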


Conclusion: We now know who the Active Users and the Enthusiastic Yelp Users in Edinburgh are. Yelp should do something to promote them and motivate others to be like them.
