Yelp Challenge. Tianshu Fan, Xinhang Shao. University of Washington. June 7, 2013

1 Introduction

In this project, we took the Yelp challenge and generated some interesting results about restaurants. Yelp provides data about businesses, reviews, users, and check-in sets in the Greater Phoenix, AZ metropolitan area. The original data were in JSON format; they were parsed and imported into Postgres using JDBC. The Yelp online contest raised several questions, mainly focused on finding useful information in the data. The one we attempted was to predict the rating of a restaurant from the review text only. A basic TF-IDF method was used, achieving a mean absolute error (MAE) of 1.13 and a root mean square error (RMSE) of 1.49. Some interesting results were also generated by queries.

2 Data Importing

The original data are in JSON format, with one JSON object per line. Some fields are lists, and some are nested JSON objects. To convert the data to a relational schema, a Java toolkit called JSON.simple was used to parse the JSON into flat tables. It is a SAX-like parser that processes data in a streaming fashion without using up main memory. A JSON object is like a map, whose values can be retrieved by name. The parser also handles nested JSON objects and lists. Lists are JSONArray objects, which are essentially java.util.List instances.

At first, the parsed data were written to text files and then copied into the database, as in Homework 1. However, the review text caused some trouble. One problem was the encoding. The other was special characters, such as slashes and newlines. With JDBC, these problems were never encountered. JDBC also allowed us to update the TF-IDF scores of reviews conveniently.

Figure 1 is the E/R diagram of the Yelp data. We mainly focused on restaurants, which account for about two thirds of all reviews (159,429 out of 229,906). Check-in sets were not used. Some fields were discarded because they carried little information (null for most records) or were not useful in the queries.
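The import step described above can be sketched as follows. This is only an illustration in Python, not the authors' Java code: SQLite stands in for Postgres, the table layout and the two sample records are made up, and parameter binding plays the role that JDBC played in the project. It shows nested JSON objects being flattened into columns, and review text with a newline or a slash being stored without any manual escaping.

```python
import json
import sqlite3

def flatten(obj, prefix=""):
    """Flatten nested JSON objects into one level; nested keys are joined with '_'."""
    row = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            row.update(flatten(value, name + "_"))
        else:
            row[name] = value
    return row

# Made-up records in the one-object-per-line layout described in the text.
sample_lines = [
    '{"review_id": "r1", "stars": 5, "text": "Great sushi :)\\nWill return", '
    '"votes": {"funny": 0, "useful": 2}}',
    '{"review_id": "r2", "stars": 1, "text": "Cold food / slow service", '
    '"votes": {"funny": 1, "useful": 0}}',
]

rows = [flatten(json.loads(line)) for line in sample_lines]

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
conn.execute(
    "CREATE TABLE review (review_id TEXT PRIMARY KEY, stars INTEGER, "
    "text TEXT, votes_useful INTEGER)"
)
# Bound parameters carry the text as data, so newlines and slashes need no escaping.
conn.executemany(
    "INSERT INTO review VALUES (:review_id, :stars, :text, :votes_useful)", rows
)

stored = conn.execute(
    "SELECT text FROM review WHERE review_id = 'r1'"
).fetchone()[0]
```

Lists such as the category field are left untouched by `flatten`, since they would map to separate tables rather than to columns.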
All businesses are in the greater Phoenix, AZ metropolitan area, so city and state were omitted. One difficulty was that most restaurants belonged to more than one category. At first, concatenated strings or arrays were considered for the category field, but for search efficiency, "is a" relations were used instead. Ten "is a" tables were used to represent cuisines from different countries or areas. Some close categories with few restaurants were merged. For example, Table Japanese has 174 records, merged from 125 Japanese and 94 Sushi Bars in the original data, which means that 45 restaurants have both tags (125 + 94 − 174 = 45). Some categories, such as Barbeque and Steakhouses, were neglected. One restaurant may belong to several such tables, but duplication was minimized after we reassigned restaurants to the merged categories.

Figure 1: The E/R diagram of the relations

3 Review rating prediction

First, we preprocessed the review text to remove punctuation. The reason is to increase the accuracy of the prediction and, at the same time, to reduce the variation in the word list. However, we kept emoticons such as :) and =). The work of J. Martineau and T. Finin [1] only predicts whether a movie review is more positive or more negative. Our case, predicting a star rating, is more complicated.

Second, to calculate the tf-idf score, we used the following steps:

1. Import all 5-star reviews into Java, count the frequency of each word using a HashMap, and sort by word count using a TreeMap. Manually pick a list of words with the highest frequencies, discarding stop words, non-relevant words and low-frequency words. This is the positive list, which includes 183 words.

2. Repeat for 1-star reviews to get a negative list with 103 words.

3. Use all reviews to find the idf of each word in the lists, which is the logarithm of the total number of reviews $D$ divided by the number of reviews $D_w$ in which the word appears:

$$\mathrm{idf}_w = \log \frac{D}{D_w}$$

4. For each review, count the frequency of each word in the positive list and the negative list, and compute tf with the equation below:

$$\mathrm{tf}_{w,d} = 0.5 + \frac{0.5\, f(w,d)}{\max\{f(k,d) : k \in d\}}$$

$$\mathrm{tfidf} = \mathrm{tf} \times \mathrm{idf}$$

5. Compute the average tfidf score of the words in the positive list as $P$, and the average tfidf score of the words in the negative list as $N$.

Third, we used three methods to predict the star rating. The first method is described in G. Ganu, N. Elhadad, and A. Marian [2], who proposed a sentiment-based text rating using the formula

$$\mathrm{TextRating} = \left[\frac{P}{P+N} \times 4\right] + 1$$

which gives a score on a scale from 1 to 5. However, this method assumes a linear distribution of positive tfidf. The second method predicts the star value from the positive tfidf percentage, according to which range of the cumulative percentage of the original star ratings it falls in. The third method tries to deal with the case where people negate a positive word to express a negative feeling:

$$\mathrm{rating} = \begin{cases} 3 + 2\,\dfrac{P_{tfidf} - P_{min}}{P_{max} - P_{min}}, & \text{if } P_{tfidf} > N_{tfidf} \\[2ex] 3 - 2\,\dfrac{N_{tfidf} - N_{min}}{N_{max} - N_{min}}, & \text{if } N_{tfidf} > P_{tfidf} \end{cases}$$

4 Results

4.1 Data inconsistencies

When importing data, we found some inconsistencies among tables. About 1700 users in Table Reviews could not be found in Table Users, so a foreign key constraint could not be added. For some users, the number of reviews counted from Table Reviews did not match the review count shown in Table Users. The review count was also inconsistent between Table Restaurants and Table Reviews. One reason could be that only Phoenix users were in Table Users, while people from other places may also comment on restaurants in Phoenix. However, it is more likely that the tables were not updated concurrently, so the information is less valuable. If possible, Yelp should improve the database maintenance.

4.2 Review rating prediction result

The accuracy of the prediction is measured by MAE (mean absolute error) and RMSE (root mean square error), as mentioned in F. Li, N. Liu, H. Jin, K. Zhao, Q. Yang and X. Zhu [3].
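The scoring steps of Section 3 (with the first prediction method) and the two error measures can be sketched together as follows. This is a minimal Python illustration rather than the authors' Java implementation. The three tokenized reviews, the one-word lists and the star values are made up, the averages P and N are taken over every word in a list (one reading of step 5), and returning 3.0 when P + N is zero is an added assumption.

```python
import math
from collections import Counter

def idf(word, reviews):
    """log(D / D_w): D reviews in total, D_w of them contain the word."""
    containing = sum(1 for tokens in reviews if word in tokens)
    return math.log(len(reviews) / containing) if containing else 0.0

def augmented_tf(word, tokens):
    """0.5 + 0.5 * f(w, d) / max_k f(k, d) for one tokenized review d."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    return 0.5 + 0.5 * counts[word] / max(counts.values())

def list_score(word_list, tokens, reviews):
    """Average tf-idf over every word in a list (P or N in the text)."""
    scores = [augmented_tf(w, tokens) * idf(w, reviews) for w in word_list]
    return sum(scores) / len(scores)

def text_rating(p, n):
    """Method 1: P / (P + N) * 4 + 1, a score between 1 and 5."""
    if p + n == 0:
        return 3.0  # assumption: no sentiment evidence maps to the midpoint
    return p / (p + n) * 4 + 1

def mae(predicted, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean square error."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

# Made-up data: three tokenized reviews with their true star values.
reviews = [
    ["great", "food", "great", "service"],
    ["terrible", "food"],
    ["great", "place", "but", "terrible", "wait"],
]
stars = [5, 1, 3]
positive, negative = ["great"], ["terrible"]

predictions = [
    text_rating(list_score(positive, r, reviews), list_score(negative, r, reviews))
    for r in reviews
]
print(predictions, mae(predictions, stars), rmse(predictions, stars))
```

Because the tf formula never drops below 0.5 for a non-empty review, a list word that is absent from a review still contributes to P or N, which pulls the predictions toward the middle of the scale.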

Table 1: Error of each method

| Method   | MAE  | RMSE |
|----------|------|------|
| Method 1 | 1.13 | 1.49 |
| Method 2 | 1.64 | 1.95 |
| Method 3 | 1.25 | 1.67 |

The two values for each method are listed in Table 1. As we can see, the first method gives the best result, and method 3 also gives a good prediction. However, method 2 performs poorly. The results depend highly on the word lists we chose. We also plot three figures, one for each method. The x-axis is the review number, and the y-axis is the star value. The blue lines are the true star values and the red dotted lines are the predicted star values. As we can see, method 2 predicts 5-star reviews poorly, and method 3 predicts 3-star reviews poorly, whereas method 1 predicts well for each star value. However, in many cases the predicted star value differs from the true one by 1 to 3.

Figure 2: The E/R diagram of the relations

4.3 Results from queries

4.3.1 Spatial distribution of restaurants

There are 4503 restaurants in total. From the spatial distribution in Figure 2, we can see that restaurants are concentrated in small areas and most of the space has no restaurants at all. The number of restaurants was counted within each cell of 0.02 degree latitude by 0.02 degree longitude.

4.3.2 Relationship between rating stars and other facts

Figure 3 (a) (c) (d) show the relationship between the ratings and other facts, such as the number of restaurants, the average number of reviews per restaurant, and the number of funny/useful/cool votes. They all have the same distribution. Restaurants with 3.5 to 4.5 stars are the most popular, and have more reviews and review votes.
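The grid counting used for the spatial distribution in Section 4.3.1 can be sketched as follows. This is a Python illustration under assumptions: only the 0.02 degree cell size comes from the text, while the coordinates are made-up points near Phoenix and the function names are hypothetical.

```python
import math
from collections import Counter

CELL = 0.02  # cell size in degrees of latitude and longitude, as stated in the text

def cell_of(lat, lon, size=CELL):
    """Map a coordinate to the integer index of the grid cell that contains it."""
    return (math.floor(lat / size), math.floor(lon / size))

def count_per_cell(points):
    """Count how many points fall in each grid cell."""
    return Counter(cell_of(lat, lon) for lat, lon in points)

# Made-up restaurant coordinates; the first two share a cell.
restaurants = [(33.451, -112.071), (33.455, -112.075), (33.512, -111.930)]
counts = count_per_cell(restaurants)
```

Cells that contain no restaurant never appear in `counts`, which matches the observation that most of the area is empty.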

Figure 3: The E/R diagram of the relations

Figure 4: (a) Restaurant distribution in 2D. (b) 3D view of the spatial distribution of restaurants. The height represents the number of restaurants.

4.3.3 Facts about restaurant categories

Figure 3 (b) and Figure 4 (a) (c) show the relationship between restaurant categories and other facts. European food and Middle Eastern food have the fewest restaurants, yet have the highest average ratings. Mexican food is the most popular, but its rating is among the lowest. People who go to American (New) restaurants also like writing reviews (the average number of reviews per restaurant is the highest), while Chinese restaurants have the fewest reviews per restaurant. American (Traditional) restaurants get the lowest rating.

Figure 5: (a) Number of restaurants for each rating. (b) Restaurant category distribution. (c) The average number of reviews per restaurant for each rating. (d) The number of votes for funny, useful and cool reviews for each rating.

4.3.4 Review categories

Figure 3(d) is a histogram that shows the distribution of the number of restaurant categories that people write reviews for. The x-axis is the number of restaurant categories a person has reviewed, and the y-axis is the number of people who wrote reviews for that number of categories. There are 36,473 distinct users (reviewers) in Table Reviews. 47

This query was one of the most complicated. A temporary table was created containing user id and category. For each record in Table Reviews, if the business id can be found in a category table, the distinct combination of user id and category name is inserted into the temporary table. This is repeated for each category table. Then group by and count are applied twice: first on user id, and then on the resulting number of categories.

4.3.5 Review count for days of the week and months

Figure 3(b) shows the total number of reviews on each day of the week. According to the check-in information from other groups, people go to restaurants most frequently on Thursdays and Fridays. However, the number of reviews does not vary much across the days of the week. Friday has the fewest reviews, and Monday has the most. It can be inferred that people usually write reviews in the next one or two days. The same statistics were computed for months. The number of reviews for each month is also quite close, with a maximum of 14,707 in August and a minimum of 11,957 in February.
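The double aggregation described in Section 4.3.4 can be sketched as follows. This is a Python illustration with SQLite standing in for Postgres. The table names, column names and rows are hypothetical, and only two category tables are shown instead of ten.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
conn.executescript("""
    CREATE TABLE reviews (user_id TEXT, business_id TEXT);
    CREATE TABLE japanese (business_id TEXT);
    CREATE TABLE mexican (business_id TEXT);
    CREATE TEMP TABLE user_category (user_id TEXT, category TEXT);
""")
conn.executemany("INSERT INTO reviews VALUES (?, ?)", [
    ("u1", "b1"), ("u1", "b2"), ("u1", "b3"),
    ("u2", "b1"),
    ("u3", "b3"), ("u3", "b3"),
])
conn.executemany("INSERT INTO japanese VALUES (?)", [("b1",), ("b2",)])
conn.executemany("INSERT INTO mexican VALUES (?)", [("b3",)])

# One pass per category table: record each distinct (user, category) pair.
# Table names come from this fixed tuple, never from user input.
for table in ("japanese", "mexican"):
    conn.execute(f"""
        INSERT INTO user_category
        SELECT DISTINCT r.user_id, '{table}'
        FROM reviews r JOIN {table} c ON r.business_id = c.business_id
    """)

# First GROUP BY: categories per user. Second GROUP BY: users per category count.
histogram = conn.execute("""
    SELECT n_categories, COUNT(*) AS n_users
    FROM (SELECT user_id, COUNT(*) AS n_categories
          FROM user_category
          GROUP BY user_id)
    GROUP BY n_categories
    ORDER BY n_categories
""").fetchall()
```

The `DISTINCT` in the insert keeps a user who reviewed several restaurants of the same category from being counted more than once for that category.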

5 Summary

We imported the Yelp data about restaurants into Postgres, and found some inconsistencies between the original tables. A simple TF-IDF method was used to predict the star rating from the review text alone, achieving an MAE of 1.13 and an RMSE of 1.49. Some interesting results from queries on the data were also shown.

References

[1] J. Martineau and T. Finin. Delta TFIDF: An improved feature space for sentiment analysis. In Proceedings of the 3rd AAAI International Conference on Weblogs and Social Media, 2009, pp. 258-261.

[2] G. Ganu, N. Elhadad, and A. Marian. Beyond the stars: Improving rating predictions using review text content. In 12th International Workshop on the Web and Databases, 2009.

[3] F. Li, N. Liu, H. Jin, K. Zhao, Q. Yang, and X. Zhu. Incorporating reviewer and product information for review rating prediction. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Three, 2011, pp. 1820-1825.