DATA MINING CAPSTONE FINAL REPORT


ABSTRACT

This report summarizes the tasks accomplished for the Data Mining Capstone. The tasks are based on Yelp review data, mainly for restaurants. Six tasks were accomplished. The first task is to visualize the customer review text for all restaurants: a word cloud of frequent terms is plotted, and topics are detected from the review text; in addition, a topic comparison for two Chinese restaurants is provided and visualized. The second task is to build a cuisine map based on the similarity between cuisines as reflected in the customer review text. The top fifty cuisines are first selected for inclusion in the cuisine map, and Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), and clustering methods, i.e. hierarchical clustering and k-means clustering, are used to build the similarity matrix, heat map, and cuisine map. The third task is to recognize dish names from the customer review text of a particular cuisine; Chinese cuisine is chosen. A labeled dish-name list for Chinese cuisine is first revised manually; two algorithms, TopMine and SegPhrase, are then used to mine a comprehensive Chinese dish list from the review text for Chinese restaurants and the labeled dish-name list. The fourth and fifth tasks are to detect popular dishes and recommend good restaurants for certain dishes; again, Chinese cuisine is chosen, and 700 dish names from task 3 are used as the pool of Chinese dishes. The top 100 most popular dishes and their corresponding tastiness are found by mining the customer review text and the review scores, i.e. stars. We also recommend the top 100 most popular restaurants for two popular Chinese dishes, i.e. orange chicken and fried rice.

The sixth task is to predict whether a set of restaurants will pass the public health inspection tests, given the corresponding Yelp text reviews along with some additional information such as the locations of these restaurants and the cuisines they offer. In addition, this report highlights the following: (1) the most useful data mining results produced through these specific data mining tasks, and the people who might benefit from such results; (2) the novel ideas and methods explored to carry out the tasks; (3) the new knowledge people can learn from the project activities, particularly through the experiments. The report is organized as follows. Chapters 1 to 5 introduce the six tasks: Chapters 1 to 3 address tasks 1 to 3, Chapter 4 addresses tasks 4 and 5, and Chapter 5 addresses task 6. Chapter 6 discusses the usefulness of the results. Chapter 7 presents the novel methods used in carrying out the tasks. Chapter 8 summarizes the new knowledge discovered throughout the capstone.

TABLE OF CONTENTS

1 CHAPTER 1 SUMMARY OF TASK 1: EXPLORATION OF DATA SET
  1.1 Tools Used
  1.2 Major Packages
  1.3 Data Import
  1.4 Data Preprocess
  1.5 Topic Model Fitting
  1.6 Comparison of Topics for Two Chinese Restaurants
  1.7 Discussion on the Topics for the Two Chinese Restaurants
    1.7.1 Similarity
    1.7.2 Difference
2 CHAPTER 2 SUMMARY OF TASK 2: CUISINE CLUSTERING AND MAP CONSTRUCTION
  2.1 Tools Used
  2.2 Major Packages
  2.3 Data Import
  2.4 Data Preprocess
  2.5 Similarity Matrix without IDF
  2.6 Similarity Matrix with IDF
  2.7 Similarity Matrix with Clustering
    2.7.1 Hierarchical Clustering
    2.7.2 k-means Clustering
  2.8 Conclusions
3 CHAPTER 3 SUMMARY OF TASK 3: DISH RECOGNITION
  3.1 Task 3.1: Manual Tagging
  3.2 Task 3.2: Mining Additional Dish Names
    3.2.1 Corpus Preparation
  3.3 Dish Name Identification Using TopMine
    3.3.1 Parameters
    3.3.2 Opinion about the Result
    3.3.3 Improvement
  3.4 Dish Name Identification Using SegPhrase
    3.4.1 Parameters
    3.4.2 Opinion about the Result
4 CHAPTER 4 SUMMARY OF TASKS 4 & 5: POPULAR DISHES AND RESTAURANT RECOMMENDATION
  4.1 Data Preparation
    4.1.1 Corpus
    4.1.2 Dish List
  4.2 Tools and Packages
    4.2.1 Attached Base Packages
    4.2.2 Other Attached Packages
  4.3 Task 4: Popular Dishes
    4.3.1 Popularity Analysis
    4.3.2 Sentiment Analysis
    4.3.3 Illustration
  4.4 Task 5: Popular Restaurants
    4.4.1 Popularity Analysis
    4.4.2 Sentiment Analysis
    4.4.3 Illustration
  4.5 Conclusions
5 CHAPTER 5 SUMMARY OF TASK 6: HYGIENE PREDICTION
  5.1 Tools Used
    5.1.1 Packages
  5.2 Text Preprocessing
  5.3 Training Method 1: Logistic Regression
    5.3.1 Text Representation Techniques
      5.3.1.1 Unigram
    5.3.2 Additional Features Used
    5.3.3 Learning Algorithm
    5.3.4 Results Analysis
  5.4 Training Method 2: Random Forest
    5.4.1 Text Representation Techniques
      5.4.1.1 Unigram
      5.4.1.2 Topic Model
    5.4.2 Additional Features Used
    5.4.3 Learning Algorithm
    5.4.4 Results Analysis
  5.5 Method Comparison
6 CHAPTER 6 USEFULNESS OF RESULTS
  6.1 Cuisine Maps
    6.1.1 Usefulness for Customers
    6.1.2 Usefulness for Restaurant Owners
  6.2 Dish Recognizer
  6.3 Popular Dishes Detection
  6.4 Restaurant Recommendation
  6.5 Hygiene Prediction
7 CHAPTER 7 NOVELTY OF EXPLORATION
  7.1 Hierarchical Clustering in Cuisine Map Development
  7.2 TopMine Output Used as the Input for SegPhrase in Dish Recognition
  7.3 Top Frequent Unigram Terms and Topic Model Used in Hygiene Prediction
8 CHAPTER 8 CONTRIBUTION OF NEW KNOWLEDGE
  8.1 Some Advantages of Random Forest over Logistic Regression
    8.1.1 Random Forest Is Less Prone to Overfitting than Logistic Regression
    8.1.2 Logistic Regression Is Not Good at Handling Missing Feature Values
9 CHAPTER 9 IMPROVEMENT TO BE DONE
10 REFERENCES

List of Figures

Figure 1.1 Word cloud
Figure 1.2 The topics of the sampled restaurants
Figure 1.3 The topics of the first Chinese Restaurant CR1
Figure 1.4 The topics of the second Chinese Restaurant CR2
Figure 2.1 Similarity matrix without IDF
Figure 2.2 Similarity matrix using IDF
Figure 2.3 Similarity matrix and hierarchical cluster
Figure 2.4 Similarity matrix and k-means cluster
Figure 4.1 Illustration for popular dish names
Figure 4.2 Illustration for popular restaurants for orange chicken
Figure 4.3 Illustration for popular restaurants for fried rice

List of Tables

TABLE 2.1 Cluster list
TABLE 5.1 Prediction Score obtained by Logistic Regression
TABLE 5.2 Prediction Score obtained by Random Forest & Unigram
TABLE 5.3 Prediction Score obtained by Random Forest & Topic Model

1 CHAPTER 1 SUMMARY OF TASK 1: EXPLORATION OF DATA SET

In this chapter, we explore the Yelp data set. In particular, we mine customers' reviews of restaurants in order to find topics. We mine the topics with a Latent Dirichlet Allocation (LDA) model and plot them in a circular tree for visualization. In addition, we mine and compare the topics of two Chinese restaurants.

1.1 Tools Used

R version 3.1.3
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

1.2 Major Packages

jsonlite_0.9.14, tm_0.6-2, topicmodels_0.2-2, igraph_1.0.1

1.3 Data Import

First, we read yelp_academic_dataset_business.json into the variable BUSINESS and yelp_academic_dataset_review.json into the variable REVIEW using the jsonlite package. Second, we select all the restaurants from BUSINESS by finding the entries that have Restaurants in the categories column; we denote this selected data set RESTAURANTS. Third, we merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW by the business_id column, which also eliminates the entries in REVIEW that are not for restaurants.
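The report performs this import-filter-merge in R with jsonlite. Purely as an illustration of the same three steps, here is a Python sketch; the file names and fields follow the Yelp academic dataset, but the two records below are made up:

```python
import json

# Hypothetical stand-ins for the two Yelp JSON files (one object per line).
business_lines = [
    '{"business_id": "b1", "name": "China Chili", "categories": ["Restaurants", "Chinese"]}',
    '{"business_id": "b2", "name": "Quick Lube", "categories": ["Automotive"]}',
]
review_lines = [
    '{"business_id": "b1", "stars": 5, "text": "Great kung pao chicken."}',
    '{"business_id": "b2", "stars": 2, "text": "Slow service."}',
]

BUSINESS = [json.loads(line) for line in business_lines]
REVIEW = [json.loads(line) for line in review_lines]

# Keep only businesses whose categories contain "Restaurants".
RESTAURANTS = {b["business_id"]: b for b in BUSINESS
               if "Restaurants" in b["categories"]}

# Inner join on business_id: this also drops reviews of non-restaurants.
RESTAURANTS_REVIEW = [
    {**RESTAURANTS[r["business_id"]], **r}
    for r in REVIEW if r["business_id"] in RESTAURANTS
]

print(len(RESTAURANTS_REVIEW))          # 1
print(RESTAURANTS_REVIEW[0]["name"])    # China Chili
```

The join doubles as the filter: the review of the non-restaurant business b2 is discarded, just as the merge in R eliminates non-restaurant reviews.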

Then, we randomly select 10,000 samples from RESTAURANTS_REVIEW and record them in the data set restaurants_review. A relatively small data set is sampled due to limits on time and computing capacity.

1.4 Data Preprocess

We then convert the data into a corpus using the tm package:
We build a corpus from the text column of restaurants_review.
We convert all words to lower case.
We remove anything other than English letters or spaces.
We remove stop words.
We remove extra white space.
We wrap the text to paper width, i.e. each line has at most 60 characters.
We heuristically complete stemmed words.
We construct a document-term matrix.

We plot the word cloud to see what the major words are.

Figure 1.1 Word cloud
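The cleaning steps above are tm transformations in R. A minimal Python sketch of the same pipeline on one invented review (the stop-word list here is a tiny hypothetical subset of tm's English list):

```python
import re
from collections import Counter

# Hypothetical stop-word list; tm's built-in English list is much longer.
STOPWORDS = {"the", "was", "and", "a", "is", "it", "very"}

def preprocess(text):
    """Mirror the cleaning steps: lower-case, keep letters/spaces only,
    drop stop words, collapse extra whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z ]", " ", text)
    return [t for t in text.split() if t not in STOPWORDS]

review = "The Kung-Pao chicken was GREAT, and the service was very good!!"
tokens = preprocess(review)
print(tokens)  # ['kung', 'pao', 'chicken', 'great', 'service', 'good']

# One row of a document-term matrix is just the token counts of one review.
term_counts = Counter(tokens)
```

Stacking one such Counter per review gives the document-term matrix that the word cloud and the topic model are built from.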

From the word cloud we can see the common words people use to describe a restaurant, e.g. good, food, great, place, just, like, and service. In addition, we can see food names such as salad, pizza, cheese, and sushi. We can also see that chicken is a very common food in the US. All in all, the word cloud contains a lot of information that makes sense.

1.5 Topic Model Fitting

We fit the document-term matrix to an LDA model using the LDA function in the topicmodels package, setting the number of topics to 10. The plot is shown in Figure 1.2, where ten words of each topic are presented. From Figure 1.2 we can tell that people mention food, place, service, good, and great in many topics, which is to be expected. In Figure 1.2, the words in topic i are suffixed with i; for example, topic 1 contains food 1, great 1, and so on.

Figure 1.2 The topics of the sampled restaurants
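The report fits LDA with the topicmodels package in R. As a rough illustration of what the model does under the hood, here is a minimal collapsed Gibbs sampler for LDA on a toy corpus; everything below (corpus, K = 2 instead of the report's 10, hyperparameters) is invented for illustration:

```python
import random
from collections import defaultdict

# Toy corpus of tokenized reviews (hypothetical).
docs = [["food", "great", "service"],
        ["pizza", "cheese", "great"],
        ["service", "slow", "food"],
        ["pizza", "crust", "cheese"]]
K = 2                      # number of topics (the report uses 10)
ALPHA, BETA = 0.1, 0.01    # symmetric Dirichlet hyperparameters
vocab = sorted({w for d in docs for w in d})

random.seed(0)
# z[d][i]: topic assigned to the i-th word of doc d, initialized at random.
z = [[random.randrange(K) for _ in doc] for doc in docs]
ndk = [[0] * K for _ in docs]               # doc-topic counts
nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
nk = [0] * K                                # topic totals
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

for _ in range(200):  # collapsed Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            weights = [(ndk[d][k] + ALPHA) * (nkw[k][w] + BETA) /
                       (nk[k] + BETA * len(vocab)) for k in range(K)]
            t = random.choices(range(K), weights)[0]
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

for k in range(K):
    top = sorted(vocab, key=lambda w: -nkw[k][w])[:3]
    print(f"topic {k}:", top)
```

The top words per topic printed at the end correspond to the ten-word topic lists plotted in Figure 1.2; the topicmodels package uses more sophisticated estimation, but the generative model is the same.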

1.6 Comparison of Topics for Two Chinese Restaurants

We randomly select two Chinese restaurants: CR1 with business_id -3WVw1TNQbPBzaKCaQQ1AQ and CR2 with business_id -mz0zr0dw6zasg7_ah1r8a. We carry out the same procedure as above and obtain the LDA-based topic plots shown in Figure 1.3 and Figure 1.4.

Figure 1.3 The topics of the first Chinese Restaurant CR1

Figure 1.4 The topics of the second Chinese Restaurant CR2

1.7 Discussion on the Topics for the Two Chinese Restaurants

1.7.1 Similarity

Both topic 2 for CR1 and topic 3 for CR2 contain good, dish, order, beef, and place. This is not surprising, because beef is very common in the USA; it is very likely that good-tasting dishes containing beef are often ordered in both restaurants. The topics for both restaurants contain China and Chinese as well as other common words such as food, good, place, chicken, and dish.

1.7.2 Difference

The major words of the topics depend largely on the names and menus of the restaurants. In the topics of restaurant CR1, people are clearly talking about chili and spiciness, since the restaurant is called China Chili and probably serves a lot of spicy food. In restaurant CR2, however, fried, egg, roll, and pork appear often, because the second restaurant is called Sing High and serves barbecued pork slices, egg rolls, and fried won ton.

2 CHAPTER 2 SUMMARY OF TASK 2: CUISINE CLUSTERING AND MAP CONSTRUCTION

In this chapter, we mine the data set to construct cuisine maps for visually understanding the landscape of different types of cuisines and their similarities. The cuisine map can help users understand what cuisines are available and how they relate, which allows for the discovery of new cuisines and thus facilitates exploration of unfamiliar cuisines. The cuisine map is built from the categories and customer reviews of restaurants in the Yelp data.

2.1 Tools Used

R version 3.1.3
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

2.2 Major Packages

reshape2_1.4.1, plyr_1.8.3, ggplot2_1.0.1, scales_0.2.5, HSAUR_1.3-7, cluster_2.0.3, corrplot_0.73, proxy_0.4-15, tm_0.6-2, NLP_0.1-8, jsonlite_0.9.16

2.3 Data Import

First, we read yelp_academic_dataset_business.json into the variable BUSINESS and yelp_academic_dataset_review.json into the variable REVIEW using the jsonlite package. Second, we select all the restaurants from BUSINESS by finding the entries that have Restaurants in the categories column; we denote this selected data set RESTAURANTS. Third, we merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW by the business_id column, which also eliminates the entries in REVIEW that are not for restaurants.

2.4 Data Preprocess

First, we find the most popular cuisines by counting the frequency of cuisine names in the categories column. We pick the top 50 to build the cuisine map:

 [1] "American (New)"     "American (Traditional)" "Nightlife"            "Bars"
 [5] "Mexican"            "Italian"                "Breakfast & Brunch"   "Pizza"
 [9] "Steakhouses"        "Sandwiches"             "Burgers"              "Sushi Bars"
[13] "Japanese"           "Chinese"                "Seafood"              "Buffets"
[17] "Fast Food"          "Thai"                   "Asian Fusion"         "Mediterranean"
[21] "French"             "Cafes"                  "Sports Bars"          "Barbeque"
[25] "Pubs"               "Coffee & Tea"           "Vietnamese"           "Delis"
[29] "Vegetarian"         "Lounges"                "Greek"                "Wine Bars"
[33] "Desserts"           "Bakeries"               "Gluten-Free"          "Diners"
[37] "Indian"             "Korean"                 "Salad"                "Chicken Wings"
[41] "Hot Dogs"           "Tapas Bars"             "Arts & Entertainment" "Southern"
[45] "Tapas/Small Plates" "Middle Eastern"         "Hawaiian"             "Vegan"
[49] "Gastropubs"         "Dim Sum"

Then, we eliminate the entries that are not in these 50 categories from RESTAURANTS_REVIEW, randomly sample 10,000 entries, and record them in the data set restaurants_review. A relatively small data set is sampled due to limits on time and computing capacity. We then convert the data into a corpus using the tm package: we build a corpus from the text column of restaurants_review; convert all words to lower case; remove anything other than English letters or spaces; remove stop words; remove extra white space; and wrap the text to paper width, i.e. each line has at most 60 characters.

2.5 Similarity Matrix without IDF

First, we construct a document-term matrix from the corpus prepared in Section 2.4 using Term Frequency (TF); we do not apply Inverse Document Frequency (IDF). Second, we calculate the similarity matrix from the document-term matrix as 1 minus the cosine distance and plot it in Figure 2.1. As can be seen from Figure 2.1, the similarity values lie between 0 and 1, and the similarity of a cuisine to itself is 1, as expected. We can observe many sets of cuisines that are very similar to each other, which is consistent with common sense. To name a few:

American (New), American (Traditional), Nightlife, and Bars
Italian and Pizza
Delis and Sandwiches
Fast Food and Burgers
Cafes and Breakfast & Brunch
Japanese and Sushi Bars
Mediterranean, Greek, and Middle Eastern
Vegetarian and Gluten-Free
Chinese, Asian Fusion, and Dim Sum
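The similarity computation (1 minus cosine distance between the term-frequency vectors of two cuisines) can be sketched as follows; the three toy cuisine vectors are invented for illustration:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """1 - cosine distance between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical aggregated term counts for three cuisines.
tf = {
    "Italian":    Counter({"pizza": 40, "pasta": 30, "cheese": 20}),
    "Pizza":      Counter({"pizza": 60, "cheese": 25, "crust": 15}),
    "Sushi Bars": Counter({"sushi": 50, "roll": 30, "fresh": 20}),
}

sim = {(x, y): cosine_similarity(tf[x], tf[y]) for x in tf for y in tf}
print(sim[("Italian", "Italian")])                                # 1.0
print(sim[("Italian", "Pizza")] > sim[("Italian", "Sushi Bars")])  # True
```

Computing this value for every pair of cuisine vectors yields exactly the symmetric matrix plotted in Figure 2.1, with ones on the diagonal.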

Figure 2.1 Similarity matrix without IDF

2.6 Similarity Matrix with IDF

The results presented in the previous section make a lot of sense: the similarity values between similar cuisines are indeed higher than those between not-so-similar or very different cuisines. However, the differences are not very pronounced. Therefore, we use IDF to enhance the contrast. We prepare another document-term matrix using TF-IDF and calculate the similarity matrix with the same method (cosine distance). The resulting similarity matrix is shown in Figure 2.2.

Figure 2.2 Similarity matrix using IDF

As can be seen from Figure 2.2, the similarity values between cuisines that are actually similar to each other are now significantly higher than the values between cuisines that have less in common. For example, Dim Sum is a type of Chinese food, yet in Figure 2.1 it appears highly similar to Japanese, Sushi Bars, and Seafood, and its similarity to Chinese is not significantly higher than its similarity to those three. In Figure 2.2, however, the similarity of Dim Sum to Japanese, Sushi Bars, and Seafood is much weaker, while its similarity to Chinese is enhanced.

For another example: based on Figure 2.1, Greek seems to be very similar to American (New), American (Traditional), Nightlife, and Bars, while its similarity to Mediterranean and Middle Eastern does not look significant. Based on Figure 2.2, the similarity between Greek, Mediterranean, and Middle Eastern is much easier to see.

2.7 Similarity Matrix with Clustering

We improved the similarity matrix by using TF-IDF in Section 2.6. However, related cuisines are sometimes located far from each other, which makes the cuisine map awkward to use. For instance, Middle Eastern, Mediterranean, and Greek are far apart in the cuisine maps shown in Figure 2.1 and Figure 2.2 even though they are quite similar; it takes considerable effort to spot this relationship by eye. Therefore, we apply hierarchical clustering and k-means clustering to make the relationships between similar cuisines easier to visualize.

2.7.1 Hierarchical Clustering

We first try hierarchical clustering. A heat map is plotted in Figure 2.3 to show the similarity. In Figure 2.3, the similarity relationships are very clear, since cuisines that are very similar to each other are located close together and cuisines that are different are far apart. For example, Middle Eastern, Mediterranean, and Greek are now in one cluster and next to each other. Other interesting clusters are formed as well, such as Japanese and Sushi Bars, and Fast Food and Burgers.
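The reordering that makes the heat map readable comes from agglomerative clustering: repeatedly merging the two closest clusters. A minimal average-linkage sketch over a hypothetical cuisine distance matrix (distances are 1 minus similarity, and all values below are made up):

```python
# Invented distances (1 - similarity) between four cuisines.
dist = {
    frozenset({"Greek", "Mediterranean"}): 0.2,
    frozenset({"Greek", "Middle Eastern"}): 0.3,
    frozenset({"Mediterranean", "Middle Eastern"}): 0.25,
    frozenset({"Greek", "Burgers"}): 0.9,
    frozenset({"Mediterranean", "Burgers"}): 0.85,
    frozenset({"Middle Eastern", "Burgers"}): 0.95,
}

def d(a, b):
    """Average-linkage distance between two clusters (tuples of labels)."""
    pairs = [dist[frozenset({x, y})] for x in a for y in b]
    return sum(pairs) / len(pairs)

clusters = [("Greek",), ("Mediterranean",), ("Middle Eastern",), ("Burgers",)]
merges = []
while len(clusters) > 1:
    # Find and merge the two closest clusters.
    a, b = min(((a, b) for i, a in enumerate(clusters)
                for b in clusters[i + 1:]), key=lambda p: d(*p))
    clusters.remove(a); clusters.remove(b)
    clusters.append(a + b)
    merges.append(a + b)

print(merges[0])  # ('Greek', 'Mediterranean') merge first: they are closest
```

The merge order defines the dendrogram, and laying the leaves out in that order is what places Greek, Mediterranean, and Middle Eastern side by side on the heat map.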

Figure 2.3 Similarity matrix and hierarchical cluster

2.7.2 k-means Clustering

We also carry out k-means clustering on our data set using the TF-IDF document-term matrix. The results are shown in Figure 2.4. We set k = 5; the five clusters are shown in different colors and listed in TABLE 2.1. The clustering result makes sense but is not as good as the result obtained by hierarchical clustering.
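The report runs k-means with k = 5 on the TF-IDF matrix in R. A bare-bones Lloyd's-algorithm sketch on invented 2-D points (k = 2 here so the two blobs are easy to see) illustrates the assign-then-update loop:

```python
import random

def kmeans(points, k, iters=20, seed=1):
    """Plain Lloyd's algorithm on lists of (x, y) tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p[0] - centers[i][0]) ** 2
                                            + (p[1] - centers[i][1]) ** 2)
            groups[i].append(p)
        # Update step: each center moves to its group's mean.
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups

# Two obvious blobs of hypothetical cuisine coordinates.
pts = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9)]
groups = kmeans(pts, k=2)
print(sorted(len(g) for g in groups))  # [2, 3]
```

Unlike hierarchical clustering, k-means produces a flat partition with no ordering between clusters, which is one reason the heat map from Figure 2.3 is easier to read than the colored partition of Figure 2.4.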

TABLE 2.1 Cluster list

Cluster 1: "Sushi Bars" "Japanese"
Cluster 2: "Mexican" "Breakfast & Brunch" "Steakhouses" "Sandwiches" "Burgers" "Chinese"
Cluster 3: "Seafood" "Buffets" "Fast Food" "Thai" "Asian Fusion" "Mediterranean" "French" "Cafes" "Sports Bars" "Barbeque" "Pubs" "Coffee & Tea" "Vietnamese" "Delis" "Vegetarian" "Lounges" "Greek" "Wine Bars" "Desserts" "Bakeries" "Gluten-Free" "Diners" "Indian" "Korean" "Salad" "Chicken Wings" "Hot Dogs" "Tapas Bars" "Arts & Entertainment" "Southern" "Tapas/Small Plates" "Middle Eastern" "Hawaiian" "Vegan" "Gastropubs" "Dim Sum"
Cluster 4: "Italian" "Pizza"
Cluster 5: "American (New)" "American (Traditional)" "Nightlife" "Bars"

Figure 2.4 Similarity matrix and k-means cluster

2.8 Conclusions

In this chapter, we investigated the construction of a cuisine map based on the categories and customer reviews in the Yelp data. Both TF and TF-IDF were used to build document-term matrices. Similarity matrices were obtained based on cosine distance and are plotted in Figures 2.1 to 2.4. We found that TF-IDF enhances the similarity between cuisines that are indeed similar and weakens the similarity between cuisines that have less in common. We also carried out hierarchical clustering and k-means clustering to make the cuisine map easier for the reader to use.

3 CHAPTER 3 SUMMARY OF TASK 3: DISH RECOGNITION

In this chapter, we investigate the mining of Chinese dish names from the Yelp review data on Chinese restaurants. We subset the reviews on Chinese restaurants from the original data set and identify available dish names in Chinese cuisine using TopMine [1] and SegPhrase [2].

3.1 Task 3.1: Manual Tagging

First, we revise the label file for Chinese cuisine manually: we remove false-positive non-dish-name phrases and change false-negative dish-name phrases to positive labels. Second, we add more annotated phrases in the same format by searching menus of Chinese restaurants.

3.2 Task 3.2: Mining Additional Dish Names

3.2.1 Corpus Preparation

We import the data into R. We read yelp_academic_dataset_business.json into the variable BUSINESS and yelp_academic_dataset_review.json into the variable REVIEW using the jsonlite package. We select all the restaurants from BUSINESS by finding the entries that have Restaurants in the categories column; we denote this selected data set RESTAURANTS. We merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW by the business_id column, which also eliminates the entries in REVIEW that are not for restaurants.

We subset RESTAURANTS_REVIEW by selecting the entries with Chinese in the categories column and save the result as RESTAURANTS_REVIEW_CHINESE. We then take the text column of RESTAURANTS_REVIEW_CHINESE and save it into a .txt file with one review per line.

3.3 Dish Name Identification Using TopMine

3.3.1 Parameters

We keep the default parameter values except that we set maxpattern to 6, since we believe a dish name is likely to contain 1 to 6 words.

3.3.2 Opinion about the Result

We run the TopMine package and obtain more than 10k phrases. Some of the most frequent ones are indeed dish names, such as:

dim sum 2849
fried rice 2511
egg rolls 1777
orange chicken 1599

These are typical Chinese dishes found in the US. If you have never been to a Chinese restaurant, you may want to try these dishes, since they are apparently very popular here. My personal favorite is dim sum, which originally comes from Canton and Hong Kong. However, many frequent phrases are not dish names, for example Chinese food (frequency 2853) and Chinese restaurant (frequency 2108). This is because TopMine is a general phrase-mining algorithm, not one designed specifically for dish names: these non-dish phrases really are used frequently in the reviews and are therefore legitimate frequent phrases, which means the algorithm itself works very well.
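TopMine itself performs frequent-pattern mining with a statistical significance test for merging words into phrases. As a much cruder stand-in that still conveys the idea, simply counting bigram frequencies across reviews already surfaces candidates like fried rice; the three toy reviews and their counts below are invented:

```python
from collections import Counter

# Hypothetical tokenized reviews of Chinese restaurants.
reviews = [
    "the fried rice was amazing".split(),
    "best fried rice and egg rolls in town".split(),
    "egg rolls were greasy but fried rice saved it".split(),
]

# Count every adjacent word pair across all reviews.
bigrams = Counter()
for tokens in reviews:
    bigrams.update(zip(tokens, tokens[1:]))

for (w1, w2), n in bigrams.most_common(2):
    print(w1, w2, n)
# fried rice 3
# egg rolls 2
```

As with TopMine, nothing here knows what a dish is: any frequently co-occurring pair (e.g. "chinese food") would rank just as highly, which is exactly the limitation discussed above.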

3.3.3 Improvement

Given the limitations of the tools we have, we re-prepare our corpus so that frequent phrases other than dish names are removed beforehand. Specifically, we remove the word Chinese and the following words from the original corpus to improve the phrase-mining result: good, food, service, great, one, like, love, pretty, place, menu, ordered, order, best, try, nice, well, didnt, dont, ive, eat, back, also, got, always, come, people, get, will, can, really, just, time, little, us, meal, diner, first, table, definitely. We remove these words because they appear quite often in the corpus, as shown in the word cloud in Figure 1.1, but are very unlikely to appear in a Chinese dish name. After this procedure the results are much better: most of the top frequent phrases are now dish names.

3.4 Dish Name Identification Using SegPhrase

3.4.1 Parameters

We prepare a label file and set the algorithm parameter AUTO_LABEL=1. The first part of the label file is the label we revised manually in Task 3.1. The second part comes from the result of TopMine: we take the first 2k frequent phrases from the TopMine output, replace each frequency with label 1, and then manually remove false positives.

3.4.2 Opinion about the Result

Using this label file and the SegPhrase package, we obtain a very good dish-name list. Below are the top phrases in the list:

orange chicken, hot and sour soup, cashew chicken, sea bass, hot pot, kung pao chicken, brown rice, shaved ice, white rice, char siu, chow mein, won ton, steamed rice, fried rice, bok choy, sweet and sour pork

4 CHAPTER 4 SUMMARY OF TASKS 4 & 5: POPULAR DISHES AND RESTAURANT RECOMMENDATION

In this chapter, we detect popular dishes of a specific cuisine (Chinese cuisine) and popular restaurants for specific dishes (orange chicken and fried rice). Popularity is measured by the frequency with which a dish appears in reviews. We also carry out some sentiment analysis based on the stars each dish or restaurant receives in reviews.

4.1 Data Preparation

4.1.1 Corpus

Read yelp_academic_dataset_business.json into the variable BUSINESS and yelp_academic_dataset_review.json into the variable REVIEW using the jsonlite package.
Select all the restaurants from BUSINESS by finding the entries that have Restaurants in the categories column; denote this selected data set RESTAURANTS.
Merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW by the business_id column; this also eliminates the entries in REVIEW that are not for restaurants.
Subset RESTAURANTS_REVIEW by selecting the entries with Chinese in the categories column and save the result as CHINESE_REVIEW.
Take the text column of CHINESE_REVIEW as the corpus.
Convert the corpus to ASCII encoding.
Strip extra whitespace from the corpus.
Remove punctuation marks from the corpus.
Remove numbers from the corpus.

4.1.2 Dish List We used the top 500 dish names from the dish mining results obtained in Task 3. We read the txt file (each line is a dish name) into R using function readlines. 4.2 Tools and Packages R version 3.2.2 (2015-08-14) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 4.2.1 Attached Base Packages: stats, graphics, grdevices, utils, datasets methods, base 4.2.2 Other Attached Packages: dplyr_0.4.3, tm_0.6-2, NLP_0.1-8, ggplot2_1.0.1, jsonlite_0.9.16, qdap_2.2.2, RColorBrewer_1.1-2, qdaptools_1.3.1, qdapregex_0.5.0, qdapdictionaries_1.0.6 4.3 Task 4: Popular Dishes In this section, we detect the top 100 most popular dishes in Chinese cuisine. 4.3.1 Popularity Analysis The measurement for popularity of a dish is defined as the frequency that the dish appears in customers reviews. If a dish name appears more than one time in the same piece of review, it is only counted once. We obtain a data frame with m rows and n 3) columns, where m is the number of reviews and n is the number of dishes. Each row represents an individual review. In each row, the first m columns are the counts of the m dishes. Basically, if a dish name appears in the review, then the value in the corresponding column is 1, otherwise it is 0. The n 1 st column is the stars 20

corresponding to the review, the (n + 2)-nd column is the name of the corresponding restaurant, and the (n + 3)-rd column is the restaurant's overall star rating.

4.3.2 Sentiment Analysis

A frequently ordered (mentioned) dish is not necessarily tasty. We use the stars given by reviewers in their reviews as an indicator of the tastiness of the dishes mentioned in those reviews. For example, if a reviewer mentions fried rice and orange chicken in a review and gives five stars for the experience, then fried rice and orange chicken each earn a tastiness of 5 from that review. We sum the stars each dish earns from reviewers as its overall tastiness, which is then normalized to the range 1 to 5.

4.3.3 Illustration

The results are presented in Figure 4.1, where the x-axis shows the top 100 popular dish names and the y-axis shows the corresponding frequency-based popularity. Color indicates the tastiness of the dishes. There is a strong correlation: tastier dishes tend to be ordered (mentioned) more often, which makes sense in practice.
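The per-review dish indicators and the normalized tastiness can be sketched as follows. This is an illustrative Python re-expression of the R computation, with a hypothetical two-dish list and three toy reviews:

```python
# Hypothetical dish list and reviews; the report uses the top 500 dish
# names from Task 3 over the full CHINESE_REVIEW corpus.
dishes = ["orange chicken", "fried rice"]
reviews = [
    {"text": "loved the orange chicken and the fried rice", "stars": 5},
    {"text": "orange chicken orange chicken", "stars": 4},  # repeats count once
    {"text": "the fried rice was bland", "stars": 2},
]

# Binary indicator per review: 1 if the dish name appears at least once.
rows = [[1 if d in r["text"] else 0 for d in dishes] for r in reviews]

# Popularity: number of reviews mentioning each dish.
popularity = [sum(col) for col in zip(*rows)]

# Tastiness: total stars earned over mentioning reviews, rescaled to [1, 5].
totals = [sum(r["stars"] * row[j] for r, row in zip(reviews, rows))
          for j in range(len(dishes))]
lo, hi = min(totals), max(totals)
tastiness = [1 + 4 * (t - lo) / (hi - lo) if hi > lo else 3.0 for t in totals]
```

Here both dishes are mentioned in two reviews each, but orange chicken accumulates 9 stars against fried rice's 7, so after min-max rescaling their tastiness scores are 5.0 and 1.0 respectively.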

Figure 4.1 Illustration for popular dish names

4.4 Task 5: Popular Restaurants

In this part, we mine the popular restaurants for a specific dish. Without loss of generality, we use orange chicken and fried rice as two examples because they are two of the most popular dishes in Chinese cuisine, as shown in Figure 4.1. Other dish names can be used for this task, since the method and code we use to obtain the results in this section are universal.

4.4.1 Popularity Analysis

We group the data frame obtained in Task 4 by restaurant and calculate the total count of each dish for each restaurant. We use this total count as the popularity of the restaurant with respect to a dish. For example, for the restaurant Panda Express, the total count of orange chicken is 145 while the total count of fried rice is 87. For the restaurant Chino Bandido, the total count of orange chicken is 36 while the total count of fried rice is 406. Thus Panda Express is more popular for its orange chicken, whereas Chino Bandido is more popular for its fried rice.
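The grouping step above can be sketched with a simple grouped sum. The rows below are hypothetical stand-ins for the review-level counts in the Task 4 data frame (the report's actual totals, e.g. 145 orange chicken mentions for Panda Express, come from the full corpus):

```python
from collections import defaultdict

# Hypothetical (restaurant, dish, count) rows from the Task 4 data frame.
rows = [
    ("Panda Express", "orange chicken", 1),
    ("Panda Express", "orange chicken", 1),
    ("Panda Express", "fried rice", 1),
    ("Chino Bandido", "fried rice", 1),
    ("Chino Bandido", "fried rice", 1),
]

# Group by restaurant and sum each dish's review-level counts; the total
# is the restaurant's popularity with respect to that dish.
totals = defaultdict(int)
for restaurant, dish, count in rows:
    totals[(restaurant, dish)] += count
```

This mirrors the dplyr group-by/summarise pattern used in the R workflow.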

4.4.2 Sentiment Analysis

A restaurant may serve a lot of orange chicken or fried rice simply because of the population in its area or its low prices. We want to know whether customers are actually happy after having its orange chicken or fried rice, that is, how tasty those dishes are. We use the overall star rating of the restaurant as the measurement.

4.4.3 Illustration

The results are presented in Figure 4.2 and Figure 4.3. The x-axis represents the top 100 restaurants that serve orange chicken or fried rice, and the y-axis represents the popularity of the corresponding restaurants. Color indicates the tastiness of the dishes.

Figure 4.2 Illustration for popular restaurants for orange chicken

Figure 4.3 Illustration for popular restaurants for fried rice

4.5 Conclusions

We believe the figures above can serve as a good guide for people who want to try Chinese food. They can find the most popular dishes in Figure 4.1 and find which restaurants serve the best orange chicken and fried rice in Figure 4.2 and Figure 4.3.

5 CHAPTER 5 SUMMARY OF TASK 6: HYGIENE PREDICTION

In this chapter, we predict whether a set of restaurants will pass the public health inspection tests, given the corresponding Yelp text reviews along with additional information such as the locations of and cuisines offered by these restaurants. Two text representation techniques are used: unigram and topic model. Two learning algorithms are used: logistic regression and random forest. Additional features are used, such as Categories, Stars, and Zipcode.

5.1 Tools Used

R version 3.1.3
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

5.1.1 Packages

topicmodels_0.2-2, qdap_2.2.2, qdapTools_1.3.1, qdapRegex_0.5.0, qdapDictionaries_1.0.6, tm_0.6-2, NLP_0.1-8, quanteda_0.7.2-1, randomForest_4.6-10, caret_6.0-52

5.2 Text Preprocessing

We preprocess the review text as follows, using the tm package. Convert the text to ASCII encoding. Strip extra whitespace from the text. Remove punctuation marks from the text. Remove numbers from the text.
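The preprocessing pipeline is implemented with tm transformers in R; a minimal Python sketch of the same four steps (ASCII conversion, whitespace stripping, punctuation and number removal) might look like this:

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    # Convert to ASCII, dropping characters with no ASCII equivalent.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Replace punctuation marks and numbers with spaces.
    text = re.sub(r"[^\sA-Za-z]", " ", text)
    # Collapse extra whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

For example, accented characters are reduced to their ASCII base letters and digits disappear entirely, which matches the tm-style cleaning described above.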

5.3 Training Method 1: Logistic Regression

5.3.1 Text Representation Techniques

5.3.1.1 Unigram

First, we obtain word frequencies from the reviews in the training data and select the top N frequent words. Here we set N = 301 and N = 1451. Second, we use the counts of the frequent words in each restaurant's review text as its text-based features. The qdap package is used.

5.3.2 Additional Features Used

Stars, Zipcode, and Categories

5.3.3 Learning Algorithm

Logistic Regression

5.3.4 Results Analysis

The results are presented in TABLE 5.1, where Score is the score given by the Coursera grader.

TABLE 5.1 Prediction Score obtained by Logistic Regression

          # of unigram features   Additional Features              Score
Scheme 1  301                     Stars and Zipcode                0.55778821485
Scheme 2  301                     Stars, Zipcode, and Categories   0.53725827475
Scheme 3  1451                    Stars and Zipcode                0.509439655385
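The unigram featurization in 5.3.1.1 can be sketched as follows. The toy reviews and N below are hypothetical (the report uses N = 301 and N = 1451 over the full training corpus, via the qdap package in R):

```python
from collections import Counter

# Hypothetical training reviews; each restaurant's reviews are one string.
train_reviews = [
    "clean kitchen clean tables great food",
    "dirty floor dirty kitchen",
    "great food great service",
]

# Step 1: select the top-N frequent words across the training reviews.
N = 3
freq = Counter(w for r in train_reviews for w in r.split())
vocab = [w for w, _ in freq.most_common(N)]

# Step 2: represent any review text by the counts of the vocabulary words.
def featurize(text: str) -> list[int]:
    counts = Counter(text.split())
    return [counts[w] for w in vocab]
```

Words outside the top-N vocabulary contribute nothing to the feature vector, which is why the choice of N matters for the results in TABLE 5.1.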

From TABLE 5.1 we can observe the following. (1) The score is lower when the additional feature Categories is used. This is probably because some categories in the testing data set do not appear in the training data set. (2) When more unigram features (frequent words) are used, the score is lower. This is probably due to overfitting.

5.4 Training Method 2: Random Forest

5.4.1 Text Representation Techniques

5.4.1.1 Unigram

First, we obtain word frequencies from the reviews in the training data and select the top N frequent words. Here we set N = 841 and N = 1451. Second, we use the counts of the frequent words in each restaurant's review text as its text-based features. The qdap package is used.

5.4.1.2 Topic Model

First, we mine 10, 50, and 100 topics from the training data. Second, we count the words in each restaurant's reviews that belong to the topics and use the counts as text-based features. The topicmodels package is used.

5.4.2 Additional Features Used

Stars, Zipcode, and Categories
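The topic-model featurization in 5.4.1.2 can be sketched as follows, assuming the topics have already been mined (in the report, with the topicmodels package; the two word lists below are hypothetical):

```python
# Hypothetical word lists for two mined topics.
topics = [
    {"clean", "spotless", "tidy"},      # a hypothetical "hygiene" topic
    {"rice", "chicken", "noodles"},     # a hypothetical "food" topic
]

def topic_features(text: str) -> list[int]:
    # Count how many words of the review fall inside each topic's word list;
    # one count per topic becomes one feature.
    words = text.lower().split()
    return [sum(w in topic for w in words) for topic in topics]
```

With 10, 50, or 100 topics this yields a 10-, 50-, or 100-dimensional feature vector, a much lower dimension than the unigram representation.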

5.4.3 Learning Algorithm

Random Forest. The caret and randomForest packages are used.

5.4.4 Results Analysis

We use two text representation techniques and different numbers of features. The results are shown in TABLE 5.2 and TABLE 5.3, respectively.

In TABLE 5.2, we observe the following. (1) Results are improved by using the additional feature Categories. (2) More unigram features improve the result. It seems that a large number of unigram features does not cause overfitting in these two cases. More tests were not carried out because more features would result in unbearable training time.

TABLE 5.2 Prediction Score obtained by Random Forest & Unigram

          # of unigram features   Additional Features              Score
Scheme 1  841                     Stars, Zipcode, and Categories   0.56127128414
Scheme 2  1451                    Stars and Zipcode                0.559788032673
Scheme 3  1451                    Stars, Zipcode, and Categories   0.561925058925

In TABLE 5.3, we observe that more topics does not necessarily mean a better result; overfitting occurs when the number of topics becomes larger.

TABLE 5.3 Prediction Score obtained by Random Forest & Topic Model

          # of topics   Additional Features              Score
Scheme 1  10            Stars, Zipcode, and Categories   0.520164615311
Scheme 2  50            Stars, Zipcode, and Categories   0.552275390162
Scheme 3  100           Stars, Zipcode, and Categories   0.540265423658

5.5 Method Comparison

From the results we can tell that logistic regression tends to overfit with relatively small numbers of features, whereas random forest is less prone to overfitting. Overall, random forest provides slightly better results than logistic regression, but the former takes much more computing time than the latter.

Comparing the results in TABLE 5.2 and TABLE 5.3, we observe that the topic model method on average produces results similar to unigram, while its best result does not outperform unigram. The reason could be as follows: on one hand, the topic model reduces the dimension of the features and enhances the important ones; on the other hand, we may lose some information useful for prediction during the dimension reduction.

6 CHAPTER 6 USEFULNESS OF RESULTS

In this chapter, we summarize the useful results obtained through the data mining capstone.

6.1 Cuisine Maps

In Chapter 2, we built several cuisine maps which show the similarity between 50 different cuisines.

6.1.1 Usefulness for Customers

These maps can be very useful for customers who want to explore new cuisines. For instance, according to the cuisine map in Figure 2.2, Mediterranean, Greek, and Middle Eastern are three very similar cuisines; people who like one of them may want to try the other two.

6.1.2 Usefulness for Restaurant Owners

These maps can also benefit restaurant owners who want to extend their businesses, helping them choose to open new restaurants next to or far away from certain other restaurants. For example, the owner of a cafe may want to open a new cafe next to a restaurant that specializes in breakfast and brunch, since the two are very similar according to the cuisine map and people love to grab a cup of coffee before or after breakfast.

6.2 Dish Recognizer

We recognized some dishes in Task 3, as introduced in Chapter 3. This is useful for businesspeople who want to open restaurants: it is very helpful to know what dishes are served in a cuisine before opening a restaurant of that cuisine.

6.3 Popular Dishes Detection

We detected the top 100 popular Chinese dishes with their corresponding tastiness in Task 4, as introduced in Chapter 4. This is extremely useful for people who like or want to try Chinese food: they can find the most popular and tasty dishes and avoid ordering dishes that are not well received. In addition, this result is also very useful to owners of Chinese restaurants and to businesspeople who want to start Chinese restaurants; for them, offering more popular dishes is more likely to bring in more customers and hence more profit.

6.4 Restaurant Recommendation

We recommended the top 100 restaurants that serve orange chicken and fried rice in Task 5, as presented in Chapter 4. This is quite useful for customers who want to try these two popular dishes.

6.5 Hygiene Prediction

This result helps customers select clean restaurants and avoid restaurants with poor hygiene.

7 CHAPTER 7 NOVELTY OF EXPLORATION

7.1 Hierarchical Clustering in Cuisine Map Development

When we build the cuisine map with clustering, hierarchical clustering is used, as shown in Figure 2.3. The hierarchical relations between cuisines are shown together with the similarity matrix. This helps users find clusters based on their own needs: instead of fixing the number of clusters beforehand, we allow users to choose how many clusters they want, or simply to find cuisines that are connected by the hierarchical links.

7.2 TopMine Output Used as the Input for SegPhrase in Dish Recognition

In recognizing dish names for Chinese cuisine, we use the output of TopMine as input for SegPhrase, so that SegPhrase has a more comprehensive labeled dish list. The first part of the labeled list is the one we revised manually in Task 3.1; the second part comes from the TopMine results. This method turned out to be very effective, resulting in a score of 12 out of 10 according to the grader.

7.3 Top Frequent Unigram Terms and Topic Model Used in Hygiene Prediction

In training the hygiene prediction models, we use two text representation techniques: unigram and topic model. For unigram, we detect the top N frequent terms instead of using all the terms in the corpus, and then use the counts of these N words in the reviews as features. For the topic model, we first mine topics from customer reviews and then use the word counts in the topics as features. Both methods are effective according to the grader: an F1 score of 0.56 is obtained using the top term counts as features, and an F1 score of 0.55 using the topic model.
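The F1 scores quoted above combine precision and recall. A minimal sketch of the computation on hypothetical binary labels (1 = fails inspection):

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    # Count true positives, false positives, and false negatives.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

With hypothetical labels [1, 1, 0, 1, 0] and predictions [1, 0, 0, 1, 1], precision and recall are both 2/3, giving an F1 of about 0.667.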

8 CHAPTER 8 CONTRIBUTION OF NEW KNOWLEDGE

8.1 Some Advantages of Random Forest over Logistic Regression

In carrying out Task 6, we trained both a logistic regression model and a random forest with the same numbers of features and compared the results. Here are some advantages of random forest over logistic regression found during the experiments.

8.1.1 Random Forest is Less Prone to Overfitting than Logistic Regression

We found that random forest provides better results as more and more features are included, without showing overfitting, though we did not carry out experiments with more than 1500 features. Logistic regression, however, shows signs of overfitting when fewer than 1500 features are used. This suggests that random forest is less prone to overfitting than logistic regression.

8.1.2 Logistic Regression is not Good at Handling Missing Feature Values

When we use logistic regression as the prediction algorithm with restaurant categories as a feature, warnings occur because some restaurant categories that do not appear in the training data appear in the testing data, which worsens the prediction results. Random forest, on the other hand, seems able to cope with this situation and even provides better predictions when restaurant categories are used as a feature.
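The unseen-category situation in 8.1.2 can be illustrated with a simple one-hot encoder: a vocabulary fit on the training categories has no column for a category that first appears in the test set, so that information is silently dropped (in some R model-matrix setups it triggers warnings instead). The category names below are hypothetical:

```python
# Hypothetical per-restaurant category lists from the training data.
train_categories = [["Chinese"], ["Pizza"], ["Chinese", "Buffet"]]

# Fit the encoding vocabulary on training data only.
vocab = sorted({c for cats in train_categories for c in cats})

def one_hot(cats: list[str]) -> list[int]:
    # Categories outside the training vocabulary map to all-zero columns.
    return [int(v in cats) for v in vocab]
```

A test-time restaurant tagged only "Ramen" encodes to an all-zero vector, carrying no category signal at all, which is consistent with the degraded logistic regression results observed above.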

9 CHAPTER 9 IMPROVEMENTS TO BE DONE

Several things can be done to improve this project. First, web-based tools can be developed for interactive illustration of the results. Second, an updating algorithm can be developed to refresh the results efficiently when more data become available, instead of carrying out the data mining from scratch. Third, a location-based restaurant and dish recommendation system can be developed, which would be more helpful for customers in specific places.

10 REFERENCES

[1] El-Kishky, Ahmed, et al., "Scalable topical phrase mining from text corpora," Proceedings of the VLDB Endowment, 8.3 (2014): 305-316.

[2] Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren, and Jiawei Han, "Mining Quality Phrases from Massive Text Corpora," Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15), Melbourne, Australia, May 2015. (* equally contributed)