Applying Data Science Techniques to Analyze the Yelp Dataset. Part 1


Online reviews are the new word of mouth for critiquing a business or service. Yelp not only gives people a platform to voice their views but also connects them to the views of their friends and of people they are interested in, much as Facebook does. Businesses and users are the two foundational pillars of Yelp, so I am analysing Yelp activity in Edinburgh, UK from both the user and the business perspective. Reviews, check-ins, tips and photos play an important role in measuring Yelp’s success. In this post, I will talk about data collection, along with data wrangling and transformation. The next few posts will cover the analysis from the business perspective and from the users’ perspective, followed by a conclusion on my findings. From the business perspective, I focus on the development of Yelp over time and the presence of different business categories. From the user perspective, I investigate types of users based on their activity and enthusiasm (using Yelp to review more than 4 different business categories). I also analyse the review ratings and the usefulness of the reviews in that section.

2. Collection of Data

2.1 JSON Files

I used the dataset available from Yelp [1], which contains information about local businesses in 10 cities across 4 countries and a social network of 687K users with a total of 4.2M social edges. Yelp provides the data in 6 JSON files, covering businesses, user profiles and activities, user reviews, tips by users, uploaded photos and user check-ins. I converted all the JSON files to CSV so that the data is easier to read in a tabular format and to interpret further.
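A minimal sketch of such a JSON-to-CSV conversion, assuming each file holds one JSON object per line. The function name and the sample records are illustrative and are not taken from the actual conversion script.

```python
import csv
import io
import json


def json_lines_to_csv(lines, out):
    """Write newline-delimited JSON records to `out` as CSV; return the row count."""
    records = [json.loads(line) for line in lines if line.strip()]

    # Use the union of all keys, in first-seen order, as the CSV header.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for record in records:
        # Nested values (lists, dicts) are stored as JSON strings in a single cell.
        writer.writerow({
            key: json.dumps(value) if isinstance(value, (list, dict)) else value
            for key, value in record.items()
        })
    return len(records)


# Illustrative records; the real files have many more fields.
sample = [
    '{"business_id": "b1", "city": "Edinburgh", "categories": ["Pubs", "Nightlife"]}',
    '{"business_id": "b2", "city": "Edinburgh", "stars": 4.5}',
]
buffer = io.StringIO()
row_count = json_lines_to_csv(sample, buffer)
csv_text = buffer.getvalue()
```

For a real file, the same function can be called with an open file handle for `lines` and a file opened with `newline=""` for `out`.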

2.2 Web Scraping

Yelp lists 22 primary business categories, and for my analysis I decided to stick to these rather than going further down into sub-categories. In the Yelp dataset, the categories of an observation are not listed in order: the primary category may not be listed first, and may appear second or further down the list. Since the category list provided online [1] dates from 2013, I used web scraping to get the latest list of 22 primary business categories from the Yelp website [2]. I then used this list to classify each observation of the dataset by matching the words in its category feature against the scraped list.

3. Data Wrangling

3.1 Missing Values

The business dataset had many missing values for some features. I devised my analytical questions to work around this, drafting them to have minimal or no dependency on the attributes with missing data. I dropped attributes with more than 80% missing values. For attributes where the analysis required a numerical value and could not handle a blank (missing data), I filled in the value “0”. I took a similar approach for the other datasets, such as users and user reviews.
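The two rules above (drop attributes with more than 80% missing values, fill blank numeric values with 0) can be sketched in plain Python as follows. The helper names, column names and sample rows are assumptions made for illustration, and a blank is taken to mean `None` or an empty string.

```python
def is_missing(value):
    """Treat None and the empty string as missing."""
    return value is None or value == ""


def drop_sparse_columns(rows, threshold=0.8):
    """Drop every column whose share of missing values exceeds `threshold`."""
    columns = list(rows[0].keys())
    keep = [
        col for col in columns
        if sum(is_missing(row.get(col)) for row in rows) / len(rows) <= threshold
    ]
    return [{col: row.get(col) for col in keep} for row in rows]


def fill_blank_numbers(rows, numeric_columns):
    """Replace missing values with 0, but only in the listed numeric columns."""
    return [
        {
            col: 0 if col in numeric_columns and is_missing(value) else value
            for col, value in row.items()
        }
        for row in rows
    ]


# Illustrative rows: "hours" is missing in 5 of 6 rows (about 83%).
rows = [
    {"business_id": "b1", "review_count": 12, "hours": ""},
    {"business_id": "b2", "review_count": "", "hours": ""},
    {"business_id": "b3", "review_count": 5, "hours": ""},
    {"business_id": "b4", "review_count": 8, "hours": ""},
    {"business_id": "b5", "review_count": 3, "hours": "9-17"},
    {"business_id": "b6", "review_count": 7, "hours": ""},
]
cleaned = fill_blank_numbers(drop_sparse_columns(rows), {"review_count"})
```

Here "hours" is dropped because it crosses the 80% threshold, while the single blank in "review_count" is filled with 0.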

3.2 Derived Features

Main Category is derived from the category feature of the business dataset and the list of primary business categories (obtained via web scraping). For example, if the category feature of an observation is [‘Restaurant’, ‘fast food’, ‘Pizza’], then Main Category is “Restaurant”, as it exists in the primary category list. This new feature is used extensively in my analysis.
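A small sketch of this matching step, assuming a case-insensitive comparison against the primary list. The three-item list below is only an illustrative subset standing in for the 22 scraped categories, and the function name is hypothetical.

```python
# Illustrative subset; the real list holds the 22 scraped primary categories.
PRIMARY_CATEGORIES = ["Restaurant", "Nightlife", "Shopping"]


def main_categories(categories, primary=PRIMARY_CATEGORIES):
    """Return the primary categories found in `categories`, in listed order."""
    lookup = {name.lower(): name for name in primary}
    matches = []
    for category in categories:
        name = lookup.get(category.strip().lower())
        # Position in the list does not matter; duplicates are skipped.
        if name is not None and name not in matches:
            matches.append(name)
    return matches


first = main_categories(["Restaurant", "fast food", "Pizza"])
second = main_categories(["Pizza", "restaurant"])
none_found = main_categories(["Pizza", "fast food"])
```

Because the function scans the whole list, a primary category is found even when it is not listed first.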

UserClass: I initially added this as an empty column, which I will populate after applying K-Means clustering.

3.3 New Observations

For the analysis related to business categories, I cloned the business dataset and added new observations to the clone. In the business dataset, the category feature of some observations matched more than one primary business category. To deal with this, I inserted new observations so that each observation in the cloned business dataset contains only one Main Category. For example, if the category in the business dataset looks like [Nightlife, PUB, Restaurant], then the same row is inserted twice, once with Main Category as “Nightlife” and once with Main Category as “Restaurant”.
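The row duplication can be sketched as below. The lookup table is an illustrative two-item subset of the primary categories, the field names are assumptions, and rows with no matching primary category are simply left out of the expanded copy in this sketch.

```python
# Illustrative subset of primary categories, keyed by lower-case name.
PRIMARY = {"nightlife": "Nightlife", "restaurant": "Restaurant"}


def expand_by_main_category(businesses):
    """Return a copy with one row per (business, matching primary category)."""
    expanded = []
    for business in businesses:
        for category in business["categories"]:
            main = PRIMARY.get(category.lower())
            if main is not None:
                # Copy the row and attach exactly one Main Category to it.
                expanded.append({**business, "main_category": main})
    return expanded


businesses = [
    {"business_id": "b1", "categories": ["Nightlife", "PUB", "Restaurant"]},
    {"business_id": "b2", "categories": ["Restaurant", "Pizza"]},
]
expanded = expand_by_main_category(businesses)
```

The original list is left untouched, which mirrors working on a clone rather than on the business dataset itself.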

3.4 Creating a subset for Edinburgh

For this analysis, I am concentrating on only one city, “Edinburgh, UK”. I started with the business dataset and kept all the rows where city contains “Edin”. I used “Edin” to include the various names in use, such as “Edinburgh Central” and “Edinburgh stn”. It also helped in dealing with spelling errors like “Edinbura”. This gave me a list of businesses from Edinburgh. From that list, I used the business IDs to filter the other datasets: user, photos, review, tips and check-ins. Do note that the filtered users are not necessarily users who live in Edinburgh; they are the users who have reviewed Edinburgh businesses.
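A minimal sketch of this filtering chain on made-up records. The field names are assumptions, and the lower-casing of the city name is an extra precaution not mentioned in the text.

```python
businesses = [
    {"business_id": "b1", "city": "Edinburgh"},
    {"business_id": "b2", "city": "Edinburgh Central"},
    {"business_id": "b3", "city": "Las Vegas"},
    {"business_id": "b4", "city": "Edinbura"},  # misspelling still matches "Edin"
]
reviews = [
    {"review_id": "r1", "business_id": "b1", "user_id": "u1"},
    {"review_id": "r2", "business_id": "b3", "user_id": "u3"},
    {"review_id": "r3", "business_id": "b4", "user_id": "u2"},
]

# Step 1: keep businesses whose city contains "Edin".
edinburgh = [b for b in businesses if "edin" in b["city"].lower()]
edinburgh_ids = {b["business_id"] for b in edinburgh}

# Step 2: keep only reviews written about those businesses.
edinburgh_reviews = [r for r in reviews if r["business_id"] in edinburgh_ids]

# Step 3: the "Edinburgh users" are whoever wrote those reviews.
reviewer_ids = {r["user_id"] for r in edinburgh_reviews}
```

The same ID set can be reused to filter the photos, tips and check-ins in exactly the way the reviews are filtered here.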

3.5 Merging

The individual datasets are related to one another, and these relationships can be shown in an entity relationship diagram (Figure 1). Based on the relationships shown in the diagram, Business and Photos can be merged on business_id, and the same is true of check-ins. Business and Users are not directly related but can be merged through reviews or tips. The dataset does not provide any relationship between users, check-ins and photos.
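To illustrate how Business and Users can be joined through reviews, here is a plain-Python sketch on made-up records. Only business_id comes from the text; user_id, user_name and the other field names are assumptions chosen for the example.

```python
businesses = [
    {"business_id": "b1", "name": "Cafe A"},
    {"business_id": "b2", "name": "Pub B"},
]
users = [
    {"user_id": "u1", "user_name": "Ann"},
    {"user_id": "u2", "user_name": "Bob"},
]
reviews = [
    {"review_id": "r1", "business_id": "b1", "user_id": "u1", "stars": 4},
    {"review_id": "r2", "business_id": "b2", "user_id": "u1", "stars": 5},
    {"review_id": "r3", "business_id": "b1", "user_id": "u2", "stars": 3},
]

# Index both sides by their key so each review can look up its partners.
business_by_id = {b["business_id"]: b for b in businesses}
user_by_id = {u["user_id"]: u for u in users}

# Each review row carries both keys, so it acts as the bridge between
# a business and a user (an inner join on business_id and user_id).
merged = [
    {**review, **business_by_id[review["business_id"]], **user_by_id[review["user_id"]]}
    for review in reviews
    if review["business_id"] in business_by_id and review["user_id"] in user_by_id
]
```

Photos and check-ins need only the first lookup, since they join to Business directly on business_id.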


Link to Part 2 
