Applying Data Science Techniques to Analyze the Yelp Dataset. Part – 3

This is the third article in this series:

Part 1: How to get Data ready for analysis

Part 2: Yelp Popularity Over time

In this post I analyse whether all business categories are equally active on Yelp. First I use MDS to check whether there is a skew in the distribution, then I investigate why this skew exists.

I compare all the main categories (a derived feature explained in Part 1) on the following features, which I will call activity-related features from now on:

  1. Number of businesses registered in that category
  2. Number of reviews it received
  3. Number of users who reviewed that category
  4. Number of check-ins it received
  5. Number of photos posted for that category
  6. Number of tips it received

For this, I transformed the data by aggregating all the activity features across the main categories. The transformed dataset has 22 main categories (i.e. the number of observations) and 6 features.
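The aggregation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`main_category`, `reviews`, `users`, `checkins`, `photos`, `tips`) and the helper `aggregate_by_category` are hypothetical and are not taken from the Yelp schema or from the code used for this series.

```python
from collections import defaultdict

# Hypothetical per-business records; field names and values are made up
# for illustration and do not come from the Yelp dataset.
businesses = [
    {"main_category": "Restaurants", "reviews": 120, "users": 95,
     "checkins": 300, "photos": 40, "tips": 25},
    {"main_category": "Restaurants", "reviews": 80, "users": 70,
     "checkins": 150, "photos": 10, "tips": 12},
    {"main_category": "Shopping", "reviews": 15, "users": 14,
     "checkins": 20, "photos": 2, "tips": 3},
]

FEATURES = ["reviews", "users", "checkins", "photos", "tips"]


def aggregate_by_category(records):
    """Count businesses and sum each activity feature per main category."""
    totals = defaultdict(lambda: {"businesses": 0, **{f: 0 for f in FEATURES}})
    for rec in records:
        row = totals[rec["main_category"]]
        row["businesses"] += 1
        for feature in FEATURES:
            # Simplification: summing per-business user counts would count a
            # user twice if they reviewed two businesses in one category.
            row[feature] += rec[feature]
    return dict(totals)


activity = aggregate_by_category(businesses)
for category, row in activity.items():
    print(category, row)
```

Each key of `activity` is one observation (a main category) and each inner dictionary holds the 6 activity-related features.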

I reduced the dimensions using multi-dimensional scaling (MDS) to find out which main categories lie close together and which lie far apart based on the activity-related features. I used a scree plot to decide how many dimensions to reduce to so that the stress stays low. I noticed that the stress drops significantly when going from 2 dimensions to 3 dimensions.
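To make the stress-versus-dimensions idea concrete, here is a minimal sketch. The post does not state which MDS variant or software was used, so this example swaps in classical (Torgerson) MDS implemented with NumPy, measures the fit with Kruskal's stress-1, and runs on random placeholder data shaped like the 22 × 6 table. The function names are my own.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(D, k):
    """Classical (Torgerson) MDS: embed a distance matrix D in k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    vals = np.clip(eigvals[top], 0.0, None)   # guard against tiny negatives
    return eigvecs[:, top] * np.sqrt(vals)


def stress(D, Y):
    """Kruskal stress-1 between original distances D and embedding Y."""
    D_low = pairwise_distances(Y)
    return np.sqrt(((D - D_low) ** 2).sum() / (D ** 2).sum())


# Placeholder data: 22 observations x 6 features (random, NOT the Yelp values).
rng = np.random.default_rng(0)
X = rng.random((22, 6))
D = pairwise_distances(X)

# Values one would plot in a scree plot: stress for each target dimension.
stress_by_dim = {k: stress(D, classical_mds(D, k)) for k in range(1, 7)}
for k, s in stress_by_dim.items():
    print(k, round(float(s), 4))
```

Plotting `stress_by_dim` against the number of dimensions gives the scree plot; the point where the curve flattens suggests how many dimensions to keep.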

Plotting the MDS after reducing 6 dimensions to 3 dimensions

Based on the MDS plot, it can be inferred that not all business categories are equally active on Yelp. There are two extremes: one is Restaurants and the other is the cluster of 18 business categories in the bottom left corner. The remaining categories, Shopping, Food and Nightlife, are also far from Restaurants but a little closer to the cluster.

To investigate the MDS results further:

  1. I created a heat map of the number of reviews against Main Categories and Review Year. Review Year has been derived from review date.

Shopping is an interesting category: it has the second highest number of businesses and has received the second highest number of reviews. It received 1835 reviews in 2010, after which the count dropped to 124 in 2011. Reviews for the Shopping category have been almost stable since 2011. Hence, I can say that the Shopping category was more active in 2010 than in later years.

  2. Since no time information is given for the other features, they are plotted against the main category only.
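For the heat map in step 1, the underlying table is simply a count of reviews per main category and review year. A minimal sketch of that tabulation follows; the review records, the field names and the ISO date format are assumptions made for illustration, and the counts are not the real ones.

```python
from collections import Counter
from datetime import date

# Hypothetical review records; the real data would come from the Yelp
# review file joined with the main category derived in Part 1.
reviews = [
    {"main_category": "Shopping", "date": "2010-03-14"},
    {"main_category": "Shopping", "date": "2010-07-02"},
    {"main_category": "Shopping", "date": "2011-01-20"},
    {"main_category": "Restaurants", "date": "2011-05-09"},
]


def review_counts(records):
    """Count reviews per (main category, review year)."""
    counts = Counter()
    for rec in records:
        year = date.fromisoformat(rec["date"]).year  # derive the review year
        counts[(rec["main_category"], year)] += 1
    return counts


counts = review_counts(reviews)
categories = sorted({category for category, _ in counts})
years = sorted({year for _, year in counts})

# Rows are categories, columns are years: the grid a heat map would colour.
matrix = [[counts.get((c, y), 0) for y in years] for c in categories]
for category, row in zip(categories, matrix):
    print(category, dict(zip(years, row)))
```

The nested list `matrix` can be passed to any plotting library that draws heat maps.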


Looking at all the plots, Restaurants is always marked near the 4th line, whereas Nightlife and Food are always marked near the 2nd line. Shopping fluctuates between the 1st and 2nd lines, and all the others are near the 1st line. This tells us that Restaurants is the category where we see the maximum number of photos, businesses, check-ins, tips and reviews.

We can say that the Restaurants category has been the most active, followed by Food and Nightlife, then Shopping, and then all the other categories. Yelp users use it for

