DATA MINING CAPSTONE FINAL REPORT


ABSTRACT

This report summarizes the tasks accomplished for the Data Mining Capstone. The tasks are based on Yelp review data, mainly for restaurants. Six tasks are accomplished.

The first task is to visualize customer review text for all restaurants. A word cloud of frequent words is plotted, and topics are detected from the review text for all restaurants. In addition, the topics of two Chinese restaurants are compared and visualized.

The second task is to build a cuisine map based on the similarity between cuisines, using customer review text. The top fifty cuisines are first identified for inclusion in the cuisine map. Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), and clustering methods, i.e. hierarchical clustering and k-means clustering, are used to build the similarity matrix, heat map, and cuisine map.

The third task is to recognize dish names in the customer review text of a certain cuisine. Chinese cuisine is chosen for this task. A labeled dish name list for Chinese cuisine is revised manually. Then two algorithms, i.e. TopMine and SegPhrase, are used to mine a comprehensive Chinese dish list based on the review text for Chinese restaurants and the labeled dish name list.

The fourth and fifth tasks are to detect popular dishes and recommend good restaurants for certain dishes. Again, Chinese cuisine is chosen for these tasks. 700 dish names from task 3 are used as a pool of Chinese dishes. The top 100 most popular dishes and their corresponding tastiness are found by mining customer review text and review scores, i.e. stars. We also recommend the top 100 most popular restaurants for two popular Chinese dishes, i.e. orange chicken and fried rice.

The sixth task is to predict whether a set of restaurants will pass the public health inspection tests, given the corresponding Yelp text reviews along with some additional information such as the locations of these restaurants and the cuisines they offer.

In addition, in this report we highlight the following: (1) the most useful data mining results produced through these specific data mining tasks and the people who might benefit from such results; (2) the novel ideas and methods explored to carry out the tasks; (3) new knowledge people can learn from the project activities, particularly through the experiments.

The report is organized as follows. Chapter 1 to Chapter 5 introduce the six tasks: Chapters 1 to 3 address tasks 1 to 3, Chapter 4 addresses tasks 4 and 5, and Chapter 5 addresses task 6. Chapter 6 introduces the useful results. Chapter 7 presents the novel methods used in carrying out the tasks. Chapter 8 summarizes the new knowledge discovered throughout the capstone.

TABLE OF CONTENTS

CHAPTER 1 SUMMARY OF TASK 1: EXPLORATION OF DATA SET
    Tools Used
    Major Packages
    Data Import
    Data Preprocess
    Topic Model Fitting
    Comparison of Topics for Two Chinese Restaurants
    Discussion on the Topics for the Two Chinese Restaurants
        Similarity
        Difference
CHAPTER 2 SUMMARY OF TASK 2: CUISINE CLUSTERING AND MAP CONSTRUCTION
    Tools Used
    Major Packages
    Data Import
    Data Preprocess
    Similarity Matrix without IDF
    Similarity Matrix with IDF
    Similarity Matrix with Clustering
        Hierarchical Clustering
        k-means Clustering
    Conclusions
CHAPTER 3 SUMMARY OF TASK 3: DISH RECOGNITION
    Task 3.1: Manual Tagging
    Task 3.2: Mining Additional Dish Names
        Corpus Preparation
    Dish Name Identify Using TopMine
        Parameters
        Opinion about the Result
        Improvement
    Dish Name Identify Using SegPhrase
        Parameters
        Opinion about the Result
CHAPTER 4 SUMMARY OF TASK 4 & 5: POPULAR DISHES AND RESTAURANT RECOMMENDATION
    Data Preparation
        Corpus
        Dish List
    Tools and Packages
        Attached Base Packages
        Other Attached Packages
    Task 4: Popular Dishes
        Popularity Analysis
        Sentiment Analysis
        Illustration
    Task 5: Popular Restaurants
        Popularity Analysis
        Sentiment Analysis
        Illustration
    Conclusions
CHAPTER 5 SUMMARY OF TASK 6: HYGIENE PREDICTION
    Tools Used
    Packages
    Text Preprocessing
    Training Method 1: Logistic Regression
        Text Representation Techniques
            Unigram
        Additional Features Used
        Learning Algorithm
        Results Analysis
    Training Method 2: Random Forest
        Text Representation Techniques
            Unigram
            Topic Model
        Additional Features Used
        Learning Algorithm
        Results Analysis
    Method Comparison
CHAPTER 6 USEFULNESS OF RESULTS
    Cuisine Maps
        Usefulness for Customers
        Usefulness for Restaurant Owners
    Dish Recognizer
    Popular Dishes Detection
    Restaurant Recommendation
    Hygiene Prediction
CHAPTER 7 NOVELTY OF EXPLORATION
    Hierarchical Clustering in Cuisine Map Development
    TopMine Output Used as the Input for SegPhrase in Dish Recognition
    Top Frequent Unigram Terms and Topic Model are Used in Hygiene Prediction
CHAPTER 8 CONTRIBUTION OF NEW KNOWLEDGE
    Some Advantages of Random Forest over Logistic Regression
    Random Forest is Less Prone to Overfitting than Logistic Regression
    Logistic Regression is not Good at Handling Missing Feature Value
CHAPTER 9 IMPROVEMENT TO BE DONE
REFERENCES

List of Figures

Figure 1.1 Word cloud
Figure 1.2 The topics of the sampled Restaurant
Figure 1.3 The topics of the first Chinese Restaurant CR1
Figure 1.4 The topics of the second Chinese Restaurant CR2
Figure 2.1 Similarity matrix without IDF
Figure 2.2 Similarity matrix using IDF
Figure 2.3 Similarity matrix and hierarchical cluster
Figure 2.4 Similarity matrix and k-means cluster
Figure 4.1 Illustration for popular dish names
Figure 4.2 Illustration for popular restaurants for orange chicken
Figure 4.3 Illustration for popular restaurants for fried rice

List of Tables

TABLE 2.1 Cluster list
TABLE 5.1 Prediction Score obtained by Logistic Regression
TABLE 5.2 Prediction Score obtained by Random Forest & Unigram
TABLE 5.3 Prediction Score obtained by Random Forest & Topic Model

CHAPTER 1 SUMMARY OF TASK 1: EXPLORATION OF DATA SET

In this chapter, we explore the Yelp data set. In particular, we mine customer reviews of restaurants in order to find topics. We mine the topics with a Latent Dirichlet Allocation (LDA) model and plot them in a circular tree for visualization. In addition, we mine and compare the topics of two Chinese restaurants.

1.1 Tools Used

R version
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

1.2 Major Packages

jsonlite_ tm_0.6-2 topicmodels_0.2-2 igraph_

1.3 Data Import

First, we read yelp_academic_dataset_business.json into variable BUSINESS and yelp_academic_dataset_review.json into variable REVIEW using the jsonlite package.

Second, we select all the restaurants from BUSINESS by finding the entries that have Restaurants in the categories column. We denote this selected data set as RESTAURANTS.

Third, we merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW on the business_id column. This also eliminates the entries in REVIEW that are not for restaurants.
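The three import steps can be illustrated with a short sketch. It is written in Python with only the standard library, not in the R used for the report, and the sample records (business names, ids, review text) are hypothetical stand-ins for the two Yelp files; only the field names business_id, categories, stars and text follow the description above.

```python
import json

# Hypothetical stand-ins for lines of the two Yelp JSON files (one JSON object per line).
business_lines = [
    '{"business_id": "b1", "name": "Noodle House", "categories": ["Chinese", "Restaurants"]}',
    '{"business_id": "b2", "name": "Quick Cuts", "categories": ["Hair Salons"]}',
]
review_lines = [
    '{"business_id": "b1", "stars": 5, "text": "Great fried rice."}',
    '{"business_id": "b2", "stars": 3, "text": "Decent haircut."}',
    '{"business_id": "b1", "stars": 4, "text": "Good dim sum."}',
]

# Step 1: parse both data sets.
business = [json.loads(line) for line in business_lines]
review = [json.loads(line) for line in review_lines]

# Step 2: keep only the businesses that list "Restaurants" among their categories.
restaurants = {b["business_id"]: b for b in business if "Restaurants" in b["categories"]}

# Step 3: inner join on business_id; reviews of non-restaurants drop out.
restaurants_review = [
    {**restaurants[r["business_id"]], **r}
    for r in review
    if r["business_id"] in restaurants
]
```

In this toy input the haircut review disappears after the join, which mirrors the remark that merging also removes the reviews that are not for restaurants.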

Then, we randomly select 10,000 samples from RESTAURANTS_REVIEW and record them in data set restaurants_review. A relatively small data set is sampled because of limited time and computing capacity.

1.4 Data Preprocess

We then convert the data into a corpus using the tm package:

- We build a corpus based on the text column of restaurants_review.
- We convert all the words into lower case.
- We remove anything other than English letters or spaces.
- We remove stop words.
- We remove extra white space.
- We make the text fit the paper width, i.e. each line has at most 60 characters.
- We heuristically complete stemmed words.
- We construct a document-term matrix.
- We plot the word cloud to see what the major words are.

Figure 1.1 Word cloud
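The cleaning steps and the document-term matrix can be sketched as follows. This is a minimal Python illustration, not the tm pipeline used in the report: the stop-word list is a tiny hypothetical one, and line wrapping, stem completion and the word cloud are left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; tm ships a much longer one).
STOP_WORDS = {"the", "a", "and", "was", "is", "of", "to", "it", "i"}

def preprocess(text):
    """Lower-case, keep only English letters and spaces, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z ]", " ", text)
    # split() also collapses any extra white space.
    return [token for token in text.split() if token not in STOP_WORDS]

def document_term_matrix(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

docs = ["The pizza was GREAT!!", "Great service, and the sushi is good."]
vocab, dtm = document_term_matrix(docs)
```

Summing the columns of such a matrix gives the overall word frequencies that a word cloud displays.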

From the word cloud we can see the common words people use to describe a restaurant, e.g. good, food, great, place, just, like, and service. In addition, we can see food names such as salad, pizza, cheese, and sushi. We can also see that chicken is a very common food in the US. All in all, the word cloud gives information that makes sense.

1.5 Topic Model Fitting

We fit an LDA model to the document-term matrix using the LDA function in the topicmodels package. We set the number of topics to 10. The plot is shown in Figure 1.2, where ten words in each topic are presented. From Figure 1.2 we can tell that people mention food, place, service, good, and great in many topics, which is to be expected. In Figure 1.2, the words in topic i are followed by i. For example, in topic 1 we have food 1, great 1, and so on.

Figure 1.2 The topics of the sampled Restaurant

1.6 Comparison of Topics for Two Chinese Restaurants

We randomly select two Chinese restaurants: CR1 with business_id -3WVw1TNQbPBzaKCaQQ1AQ and CR2 with business_id -mz0zr0dw6zasg7_ah1r8a. We carry out the same procedure as above and obtain the LDA-based topic plots shown in Figure 1.3 and Figure 1.4.

Figure 1.3 The topics of the first Chinese Restaurant CR1

Figure 1.4 The topics of the second Chinese Restaurant CR2

1.7 Discussion on the Topics for the Two Chinese Restaurants

1.7.1 Similarity

Both topic 2 for CR1 and topic 3 for CR2 contain good, dish, order, beef and place. This is not surprising because beef is very common in the USA. It is very likely that good-tasting dishes containing beef are often ordered in both restaurants. The topics for both restaurants contain China and Chinese, as well as other common words such as food, good, place, chicken, and dish.

1.7.2 Difference

The major words of the topics depend on the names and menus of the restaurants. In the topics of restaurant CR1, people are clearly talking about chili and spiciness, since the restaurant is called China Chili and probably serves a lot of spicy food. However, in restaurant CR2, fried, egg, roll and pork appear often because the restaurant is called Sing High and serves barbecued pork slices, egg rolls, and fried won ton.

CHAPTER 2 SUMMARY OF TASK 2: CUISINE CLUSTERING AND MAP CONSTRUCTION

In this chapter, we mine the data set to construct cuisine maps that give a visual picture of the landscape of different types of cuisines and their similarities. The cuisine map can help users understand which cuisines are available and how they relate to each other, which supports the discovery and exploration of unfamiliar cuisines. The cuisine map is built from the categories and customer reviews of restaurants in the Yelp data.

2.1 Tools Used

R version
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

2.2 Major Packages

reshape2_1.4.1 plyr_1.8.3 ggplot2_1.0.1 scales_0.2.5 HSAUR_1.3-7 cluster_2.0.3 corrplot_0.73 proxy_ tm_0.6-2 NLP_0.1-8 jsonlite_

2.3 Data Import

First, we read yelp_academic_dataset_business.json into variable BUSINESS and yelp_academic_dataset_review.json into variable REVIEW using the jsonlite package.

Second, we select all the restaurants from BUSINESS by finding the entries that have Restaurants in the categories column. We denote this selected data set as RESTAURANTS.

Third, we merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW on the business_id column. This also eliminates the entries in REVIEW that are not for restaurants.

2.4 Data Preprocess

First, we search for the most popular cuisines by counting the frequency of cuisine names in the categories column. We pick the top 50 to build the cuisine map:

    [1] "American (New)" "American (Traditional)" "Nightlife" "Bars"
    [5] "Mexican" "Italian" "Breakfast & Brunch" "Pizza"
    [9] "Steakhouses" "Sandwiches" "Burgers" "Sushi Bars"
    [13] "Japanese" "Chinese" "Seafood" "Buffets"
    [17] "Fast Food" "Thai" "Asian Fusion" "Mediterranean"
    [21] "French" "Cafes" "Sports Bars" "Barbeque"
    [25] "Pubs" "Coffee & Tea" "Vietnamese" "Delis"
    [29] "Vegetarian" "Lounges" "Greek" "Wine Bars"
    [33] "Desserts" "Bakeries" "Gluten-Free" "Diners"
    [37] "Indian" "Korean" "Salad" "Chicken Wings"
    [41] "Hot Dogs" "Tapas Bars" "Arts & Entertainment" "Southern"
    [45] "Tapas/Small Plates" "Middle Eastern" "Hawaiian" "Vegan"
    [49] "Gastropubs" "Dim Sum"

Then, we eliminate the entries that are not in the 50 categories from RESTAURANTS_REVIEW, randomly sample 10,000 entries from RESTAURANTS_REVIEW, and record them in data set restaurants_review. A relatively small data set is sampled because of limited time and computing capacity.

We then convert the data into a corpus using the tm package:

- We build a corpus based on the text column of restaurants_review.
- We convert all the words into lower case.
- We remove anything other than English letters or spaces.
- We remove stop words.
- We remove extra white space.
- We make the text fit the paper width, i.e. each line has at most 60 characters.

2.5 Similarity Matrix without IDF

First, we construct a document term matrix from the corpus prepared in section 2.4 using Term Frequency (TF). We do not apply Inverse Document Frequency (IDF) in constructing this document term matrix.

Second, we calculate the similarity matrix from the document term matrix as 1 minus the cosine distance, i.e. the cosine similarity, and plot it in Figure 2.1.

As can be seen from Figure 2.1, the similarity value is between 0 and 1. The similarity of a cuisine to itself is 1, as expected. We can observe many sets of cuisines that are very similar to each other, which is consistent with common sense. To name a few:

- American (New), American (Traditional), Nightlife, and Bars
- Italian and Pizza
- Delis and Sandwiches
- Fast Food and Burgers
- Cafes and Breakfast & Brunch
- Japanese and Sushi Bars
- Mediterranean, Greek, and Middle Eastern
- Vegetarian and Gluten-Free
- Chinese, Asian Fusion, and Dim Sum
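The similarity computation of this section can be sketched in a few lines. The example is in Python rather than the report's R, and the three cuisine "documents" are hypothetical toy strings; only the measure itself (cosine similarity on raw term frequencies) follows the text.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical review text, already cleaned and concatenated per cuisine.
cuisine_text = {
    "Italian": "pizza pasta cheese pizza",
    "Pizza": "pizza cheese crust pizza",
    "Sushi Bars": "sushi roll fish rice",
}

# Raw term frequencies, no IDF weighting.
tf = {cuisine: Counter(text.split()) for cuisine, text in cuisine_text.items()}

names = list(tf)
similarity = {(x, y): cosine_similarity(tf[x], tf[y]) for x in names for y in names}
```

Because term counts are never negative, every entry lies between 0 and 1, and each cuisine has similarity 1 with itself, which matches the two properties noted for Figure 2.1.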

Figure 2.1 Similarity matrix without IDF

2.6 Similarity Matrix with IDF

The results presented in the previous section make sense: the similarity values between similar cuisines are indeed higher than those between dissimilar cuisines. However, the difference is not very pronounced. Therefore, we use IDF to enhance the difference. We prepare another document term matrix using TF-IDF and calculate the similarity matrix with the same method (1 minus cosine distance). The similarity matrix is shown in Figure 2.2.

Figure 2.2 Similarity matrix using IDF

As can be seen from Figure 2.2, the similarity values between cuisines that are actually similar to each other are clearly higher than the values between cuisines that have less in common. For example, Dim Sum is a type of Chinese food. In Figure 2.1, it appears to have high similarity to Japanese, Sushi Bars, and Seafood, and its similarity to Chinese is not noticeably higher than its similarity to those three. In Figure 2.2, however, the similarity of Dim Sum to Japanese, Sushi Bars, and Seafood is much weaker and its similarity to Chinese is enhanced.

For another example, based on Figure 2.1, Greek seems to be very similar to American (New), American (Traditional), Nightlife, and Bars, and its similarity to Mediterranean and Middle Eastern does not stand out. In Figure 2.2, the similarity between Greek, Mediterranean, and Middle Eastern is much easier to find.

2.7 Similarity Matrix with Clustering

We improved the similarity matrix by using TF-IDF in section 2.6. However, related cuisines are sometimes located far away from each other, so the cuisine map is not very handy to use. For instance, Middle Eastern, Mediterranean, and Greek are far away from each other in the cuisine maps shown in Figure 2.1 and Figure 2.2 although they are quite similar, and it takes some effort to spot this relationship. Therefore, we carry out hierarchical clustering and k-means clustering to make the relationships between similar cuisines easier to see.

2.7.1 Hierarchical Clustering

We first try hierarchical clustering. A heat map is plotted in Figure 2.3 to show the similarity. In Figure 2.3, the similarity relationship is very clear, since cuisines that are very similar to each other are located close together and cuisines that are different are far away from each other. For example, Middle Eastern, Mediterranean, and Greek are now in one cluster and next to each other. Many interesting clusters are formed, such as Japanese and Sushi Bars, and Fast Food and Burgers.
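To show how hierarchical clustering groups similar cuisines, here is a minimal agglomerative sketch in Python. Several points are assumptions rather than details from the report: the linkage (average linkage is used here), the stopping rule (a fixed number of clusters), and the similarity values, which are made-up toy numbers.

```python
def to_distance(names, similarity):
    """Turn pairwise similarities into a symmetric distance table (1 - similarity)."""
    dist = {a: {b: 0.0 for b in names} for a in names}
    for (a, b), s in similarity.items():
        dist[a][b] = dist[b][a] = 1.0 - s
    return dist

def agglomerative(names, dist, n_clusters):
    """Repeatedly merge the two clusters with the smallest average distance."""
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        return sum(dist[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

names = ["Greek", "Mediterranean", "Japanese", "Sushi Bars"]
# Hypothetical similarities, not values read from the report's figures.
toy_similarity = {
    ("Greek", "Mediterranean"): 0.90,
    ("Japanese", "Sushi Bars"): 0.95,
    ("Greek", "Japanese"): 0.20,
    ("Greek", "Sushi Bars"): 0.20,
    ("Mediterranean", "Japanese"): 0.20,
    ("Mediterranean", "Sushi Bars"): 0.20,
}
clusters = agglomerative(names, to_distance(names, toy_similarity), n_clusters=2)
```

Ordering the rows and columns of the similarity matrix by the resulting clusters is what places related cuisines next to each other in a heat map.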

Figure 2.3 Similarity matrix and hierarchical cluster

2.7.2 k-means Clustering

We also carry out k-means clustering on our data set, using the document term matrix based on TF-IDF. The results are shown in Figure 2.4. We set k = 5. The five clusters are presented in different colors and are listed in TABLE 2.1. The clustering result makes sense but is not as good as the result obtained by hierarchical clustering.

TABLE 2.1 Cluster list

Cluster 1: "Sushi Bars" "Japanese"

Cluster 2: "Mexican" "Breakfast Brunch" "Steakhouses" "Sandwiches" "Burgers" "Chinese"

Cluster 3: "Seafood" "Buffets" "Fast Food" "Thai" "Asian Fusion" "Mediterranean" "French" "Cafes" "Sports Bars" "Barbeque" "Pubs" "Coffee Tea" "Vietnamese" "Delis" "Vegetarian" "Lounges" "Greek" "Wine Bars" "Desserts" "Bakeries" "Gluten Free" "Diners" "Indian" "Korean" "Salad" "Chicken Wings" "Hotdogs" "Tapas Bars" "Arts Entertainment" "Southern" "Tapas Small Plates" "Middle Eastern" "Hawaiian" "Vegan" "Gastropubs" "Dim Sum"

Cluster 4: "Italian" "Pizza"

Cluster 5: "American (New)" "American (Traditional)" "Nightlife" "Bars"
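For readers unfamiliar with k-means, the assignment and update loop behind a cluster list like TABLE 2.1 can be sketched as below. This is a plain Python illustration of Lloyd's algorithm, not the R call used in the report; the deterministic initialization (first k points) and the two-dimensional toy points are assumptions made so the example is reproducible, whereas the report clusters TF-IDF vectors with k = 5.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with squared Euclidean distance."""
    # Deterministic initialization for the sketch: the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical 2-D feature vectors standing in for TF-IDF rows.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9)]
assignment, centroids = kmeans(points, k=2)
```

Unlike hierarchical clustering, k-means depends on the choice of k and on the starting centroids, which is one plausible reason a single large cluster such as Cluster 3 can appear.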

Figure 2.4 Similarity matrix and k-means cluster

2.8 Conclusions

In this chapter, we investigate the development of a cuisine map based on the categories and customer reviews in the Yelp data. Both TF and TF-IDF are used to build the document term matrix. Similarity matrices are obtained based on cosine distance and plotted in Figure 2.1 to Figure 2.4. We find that TF-IDF enhances the similarity between cuisines that are indeed similar and weakens the similarity between cuisines that have less in common. We also carry out hierarchical clustering and k-means clustering to make the cuisine map easier for the reader to use.
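The effect of IDF described above can be seen in a small sketch. It uses one common TF-IDF variant, term count times the natural log of (number of documents / document frequency); the exact weighting applied by the tm package may differ, so treat the formula as an assumption. The three toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term count by log(N / document frequency)."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in tokenized for t in set(doc))
    weighted = []
    for doc in tokenized:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Hypothetical cuisine documents sharing the generic words "good" and "food".
docs = ["good food dumpling", "good food sushi", "good food pizza"]
weights = tf_idf(docs)
```

Words that occur in every document, such as good and food here, receive weight 0, so they no longer inflate the similarity between unrelated cuisines, while distinctive words such as dumpling keep a positive weight.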

CHAPTER 3 SUMMARY OF TASK 3: DISH RECOGNITION

In this chapter, we investigate the mining of Chinese dish names from the Yelp review data on Chinese restaurants. We extract the reviews of Chinese restaurants from the original data set and identify dish names in Chinese cuisine by using TopMine [1] and SegPhrase [2].

3.1 Task 3.1: Manual Tagging

First, we revise the label file for Chinese cuisine manually. We remove false positive phrases that are not dish names, and we change false negative dish name phrases to a positive label. Second, we add more annotated phrases in the same format by searching the menus of Chinese restaurants.

3.2 Task 3.2: Mining Additional Dish Names

3.2.1 Corpus Preparation

- We import the data into R. We read yelp_academic_dataset_business.json into variable BUSINESS and yelp_academic_dataset_review.json into variable REVIEW using the jsonlite package.
- We select all the restaurants from BUSINESS by finding the entries that have Restaurants in the categories column. We denote this selected data set as RESTAURANTS.
- We merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW on the business_id column. This also eliminates the entries in REVIEW that are not for restaurants.

- We subset RESTAURANTS_REVIEW by selecting the entries with Chinese in the categories column and save the result into RESTAURANTS_REVIEW_CHINESE.
- We extract the text column of RESTAURANTS_REVIEW_CHINESE and save it into a .txt file, with each line being one review.

3.3 Dish Name Identify Using TopMine

3.3.1 Parameters

We keep the default values of the parameters, except that we set maxpattern to 6 since we believe that a dish name is likely to contain 1 to 6 words.

3.3.2 Opinion about the Result

We run the TopMine package and obtain more than 10k phrases. Some of the most frequent ones appear to be dish names, such as:

    dim sum          2849
    fried rice       2511
    egg rolls        1777
    orange chicken   1599

These are indeed typical Chinese dishes found in the US. If you have never been to a Chinese restaurant, you may want to go and try these dishes because they are apparently very popular in the US. My personal favorite is dim sum, which originally comes from Canton and Hong Kong.

However, there are still many frequent phrases that are not dish names, for example Chinese food (with frequency 2853) and Chinese restaurant (with frequency 2108). This is because the phrase mining algorithm, TopMine, is not designed specifically for dish name mining. These non-dish phrases are in fact used frequently in the reviews, so the algorithm is working as intended as a frequent phrase miner.
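Why generic phrases such as Chinese food rank so high becomes clear from a frequency-only view of phrase mining. The Python sketch below is a deliberately simplified stand-in, not the TopMine algorithm: it merely counts contiguous word sequences of 2 up to max_len words. The function name, the min_count threshold and the sample reviews are hypothetical; max_len = 6 echoes the maxpattern setting above.

```python
from collections import Counter

def frequent_phrases(reviews, max_len=6, min_count=2):
    """Count contiguous word sequences of 2..max_len words across all reviews."""
    counts = Counter()
    for review in reviews:
        words = review.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Most frequent first; drop phrases below the count threshold.
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]

reviews = [
    "the fried rice was good",
    "fried rice and egg rolls",
    "egg rolls and fried rice",
]
phrases = frequent_phrases(reviews)
```

A counter like this has no notion of what a dish is, so any phrase that reviewers repeat often will surface, which is the limitation discussed in this section.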

3.3.3 Improvement

Given the limitations of the tools we have, we re-prepare our corpus so that frequent phrases other than dish names are removed beforehand. We remove the word Chinese and the following words from the original corpus to improve the result of phrase mining: good, food, service, great, one, like, love, pretty, place, menu, ordered, order, best, try, nice, well, didnt, dont, ive, eat, back, also, got, always, come, people, get, will, can, really, just, time, little, us, meal, diner, first, table, definitely.

We remove these words because they appear quite often in the corpus, as shown in the word cloud in Figure 1.1, but are very unlikely to appear in a Chinese dish name. After this procedure, the results are much better: most of the top frequent phrases are dish names.

3.4 Dish Name Identify Using SegPhrase

3.4.1 Parameters

We prepare a label file and set the algorithm parameter AUTO_LABEL=1. The first part of the label file is the label list we revised manually in task 3.1. The second part comes from the result of TopMine: we select the 2k most frequent phrases in the TopMine result and replace the frequency with label 1. We then manually revise the labels by removing false positives.

3.4.2 Opinion about the Result

Using the label file and the algorithm package, we obtain a very good dish name list. Below are the top phrases in the list:

orange chicken, hot and sour soup, cashew chicken, sea bass, hot pot, kung pao chicken, brown rice, shaved ice, white rice, char siu, chow mein, won ton, steamed rice, fried rice, bok choy, sweet and sour pork

CHAPTER 4 SUMMARY OF TASK 4 & 5: POPULAR DISHES AND RESTAURANT RECOMMENDATION

In this chapter, we detect popular dishes of a specific cuisine (Chinese cuisine) and popular restaurants for a specific dish (orange chicken and fried rice). Popularity is measured by the frequency with which a dish appears in reviews. We also carry out some sentiment analysis based on the stars each dish or restaurant receives in reviews.

4.1 Data Preparation

4.1.1 Corpus

- Read yelp_academic_dataset_business.json into variable BUSINESS and yelp_academic_dataset_review.json into variable REVIEW using the jsonlite package.
- Select all the restaurants from BUSINESS by finding the entries that have Restaurants in the categories column. We denote this selected data set as RESTAURANTS.
- Merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW on the business_id column. This also eliminates the entries in REVIEW that are not for restaurants.
- Subset RESTAURANTS_REVIEW by selecting the entries with Chinese in the categories column and save the result into CHINESE_REVIEW.
- Extract the text column of CHINESE_REVIEW as the corpus.
- Convert the corpus into ASCII encoding.
- Strip extra whitespace from the corpus.
- Remove punctuation marks from the corpus.
- Remove numbers from the corpus.

4.1.2 Dish List

We use the top 500 dish names from the dish mining results obtained in Task 3. We read the txt file (each line is a dish name) into R using the function readLines.

4.2 Tools and Packages

R version ( )
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack

4.2.1 Attached Base Packages

stats, graphics, grDevices, utils, datasets, methods, base

4.2.2 Other Attached Packages

dplyr_0.4.3, tm_0.6-2, NLP_0.1-8, ggplot2_1.0.1, jsonlite_0.9.16, qdap_2.2.2, RColorBrewer_1.1-2, qdapTools_1.3.1, qdapRegex_0.5.0, qdapDictionaries_

4.3 Task 4: Popular Dishes

In this section, we detect the top 100 most popular dishes in Chinese cuisine.

4.3.1 Popularity Analysis

The popularity of a dish is defined as the number of customer reviews in which the dish appears. If a dish name appears more than once in the same review, it is counted only once.

We obtain a data frame with m rows and (n + 3) columns, where m is the number of reviews and n is the number of dishes. Each row represents an individual review. In each row, the first n columns are indicators for the n dishes: if a dish name appears in the review, the value in the corresponding column is 1; otherwise it is 0. The (n + 1)st column is the number of stars given in the review, the (n + 2)nd column is the name of the corresponding restaurant, and the (n + 3)rd column is the overall star rating of the restaurant.

4.3.2 Sentiment Analysis

A frequently ordered (mentioned) dish is not necessarily tasty. We use the stars given by reviewers as an indicator of the tastiness of the dishes mentioned in their reviews. For example, if a reviewer mentions fried rice and orange chicken in a review and gives five stars for the experience, then fried rice and orange chicken both earn a tastiness of 5 from this review. We sum the stars each dish earns from all reviewers to obtain its overall tastiness. The overall tastiness is then normalized into a range of 1 to [...].

4.3.3 Illustration

The results are presented in Figure 4.1, where the x-axis shows the top 100 popular dish names and the y-axis shows the corresponding frequency-based popularity. We use color to show the tastiness of the dishes. There is a strong correlation: tastier dishes tend to be ordered (mentioned) more often, which makes sense in practice.
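The popularity and tastiness definitions of sections 4.3.1 and 4.3.2 can be sketched as follows. This is a Python illustration with hypothetical reviews, not the report's R code; simple substring matching on lower-cased text is an assumption, and the final normalization step is omitted because its upper bound is not available here.

```python
def dish_stats(reviews, dishes):
    """Per dish: number of reviews mentioning it, and total stars of those reviews."""
    popularity = {d: 0 for d in dishes}
    star_total = {d: 0 for d in dishes}
    for review in reviews:
        text = review["text"].lower()
        for dish in dishes:
            # Presence test only: repeated mentions in one review count once.
            if dish in text:
                popularity[dish] += 1
                star_total[dish] += review["stars"]
    return popularity, star_total

# Hypothetical reviews; field names follow the Yelp review records.
reviews = [
    {"text": "Fried rice and more fried rice", "stars": 5},
    {"text": "Orange chicken with fried rice", "stars": 3},
    {"text": "Just soup", "stars": 4},
]
popularity, star_total = dish_stats(reviews, ["fried rice", "orange chicken"])
```

Note that a total-stars score grows with the number of mentions, so some correlation between this tastiness measure and popularity is built into the definition.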

Figure 4.1 Illustration for popular dish names

4.4 Task 5: Popular Restaurants

In this part, we mine the popular restaurants for a specific dish. Without loss of generality, we use orange chicken and fried rice as two examples because they are two of the most popular dishes in Chinese cuisine, as shown in Figure 4.1. Other dish names can be used for this task, since the method and code we use to obtain the results in this section are intended to be general.

4.4.1 Popularity Analysis

We group the data frame obtained in task 4 by restaurant and calculate the total count of each dish for each restaurant. We use the total count as the popularity of the restaurant with respect to a dish. For example, for the restaurant Panda Express, the total count of orange chicken is 145 while the total count of fried rice is 87. For the restaurant Chino Bandido, the total count of orange chicken is 36 while the total count of fried rice is 406. Thus Panda Express is more popular for its orange chicken, whereas Chino Bandido is more popular for its fried rice.
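The group-by step can be sketched in Python as below. The rows are hypothetical per-review records with 0/1 dish indicators, in the spirit of the data frame from task 4; the restaurant names and counts are made up and do not reproduce the figures quoted above.

```python
from collections import defaultdict

def restaurant_popularity(rows, dish):
    """Sum a dish's per-review indicator by restaurant, most popular first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["restaurant"]] += row[dish]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical rows: one per review, 1 if the dish is mentioned, else 0.
rows = [
    {"restaurant": "Restaurant A", "orange chicken": 1, "fried rice": 0},
    {"restaurant": "Restaurant A", "orange chicken": 1, "fried rice": 1},
    {"restaurant": "Restaurant B", "orange chicken": 0, "fried rice": 1},
    {"restaurant": "Restaurant B", "orange chicken": 0, "fried rice": 1},
    {"restaurant": "Restaurant B", "orange chicken": 1, "fried rice": 1},
]
orange_chicken_ranking = restaurant_popularity(rows, "orange chicken")
fried_rice_ranking = restaurant_popularity(rows, "fried rice")
```

Taking the first 100 entries of such a ranking would give the restaurants to plot for a dish.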

4.4.2 Sentiment Analysis

A restaurant may serve a lot of orange chicken or fried rice, but this could be due to the population in its area or to its low prices. We want to know whether customers are happy after having its orange chicken or fried rice, i.e. how tasty these dishes are. We use the overall stars of the restaurant as the measurement.

4.4.3 Illustration

The results are presented in Figure 4.2 and Figure 4.3. The x-axis represents the top 100 restaurants that serve orange chicken and fried rice, respectively, and the y-axis represents the popularity of the corresponding restaurants. We use color to show the tastiness of the dishes.

Figure 4.2 Illustration for popular restaurants for orange chicken

Figure 4.3 Illustration for popular restaurants for fried rice

4.5 Conclusions

We believe that the figures provided above can be a good guide for people who want to try Chinese food. They can find the most popular dishes in Figure 4.1 and find which restaurants serve the best orange chicken and fried rice in Figure 4.2 and Figure 4.3.

CHAPTER 5 SUMMARY OF TASK 6: HYGIENE PREDICTION

In this chapter, we predict whether a set of restaurants will pass the public health inspection tests, given the corresponding Yelp text reviews along with some additional information such as the locations of these restaurants and the cuisines they offer. Two text representation techniques are used: Unigram and Topic Model. Two learning algorithms are used: Logistic Regression and Random Forest. Additional features such as Categories, Stars, and Zipcode are also used.

5.1 Tools Used

R version
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

5.1.1 Packages

topicmodels_0.2-2 qdap_2.2.2 qdapTools_1.3.1 qdapRegex_0.5.0 qdapDictionaries_1.0.6 tm_0.6-2 NLP_0.1-8 quanteda_ randomForest_ caret_

5.2 Text Preprocessing

We preprocess the review text as follows, using the tm package:

- Convert the text into ASCII encoding.
- Strip extra whitespace from the text.
- Remove punctuation marks from the text.
- Remove numbers from the text.

5.3 Training Method 1: Logistic Regression

5.3.1 Text Representation Techniques

5.3.1.1 Unigram

First, we obtain word frequencies from the reviews in the training data and select the top words. Here we set the number of top words to 301 and [...].

Second, we use the counts of the frequent words in the reviews of each restaurant as its text-based features. Package qdap is used.

5.3.2 Additional Features Used

Stars, Zipcode, and Categories

5.3.3 Learning Algorithm

Logistic Regression

5.3.4 Results Analysis

The results are presented in TABLE 5.1, where Score is the score given by the Coursera grader.

TABLE 5.1 Prediction Score obtained by Logistic Regression

             # of unigram features    Additional Features               Score
Scheme 1     [...]                    Stars and Zipcode                 [...]
Scheme 2     [...]                    Stars, Zipcode, and Categories    [...]
Scheme 3     [...]                    Stars and Zipcode                 [...]
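The unigram representation of section 5.3.1.1 can be sketched as follows: pick the most frequent words in the training reviews, then describe each restaurant by how often those words occur in its reviews. The Python code is an illustration only (the report used the qdap package in R), the function names are hypothetical, and the training texts are made up.

```python
from collections import Counter

def top_words(training_texts, n):
    """The n most frequent words across all training reviews."""
    counts = Counter(w for text in training_texts for w in text.lower().split())
    return [word for word, _ in counts.most_common(n)]

def unigram_features(text, vocabulary):
    """Counts of each vocabulary word in one restaurant's review text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical training reviews (already cleaned as in the preprocessing step).
training_texts = ["great food great service", "great food", "dirty table"]
vocabulary = top_words(training_texts, n=2)
features = unigram_features("great food but dirty table and great staff", vocabulary)
```

The resulting count vector would be concatenated with the additional features (Stars, Zipcode, Categories) before training a classifier; words outside the selected vocabulary, such as dirty here, contribute nothing.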

From TABLE 5.1 we can observe the following. (1) The score is lower when the additional feature Categories is used. This is probably because some categories in the testing data set do not appear in the training data set. (2) When more unigram features (frequent words) are used, the score is lower. This is probably due to overfitting.

5.4 Training Method 2: Random Forest

5.4.1 Text Representation Techniques

5.4.1.1 Unigram

First, we obtain word frequencies from the reviews in the training data and select the top words. Here we set the number of top words to 841 and [...].

Second, we use the counts of the frequent words in the reviews of each restaurant as its text-based features. Package qdap is used.

5.4.1.2 Topic Model

First, we mine 10, 50 and 100 topics from the training data. Second, we count the words that belong to the topics in a restaurant's reviews and use the counts as text-based features. Package topicmodels is used.

5.4.2 Additional Features Used

Stars, Zipcode, and Categories

5.4.3 Learning Algorithm

Random Forest. Packages caret and randomForest are used.

5.4.4 Results Analysis

We use two text representation techniques and different numbers of features. The results are shown in TABLE 5.2 and TABLE 5.3, respectively.

In TABLE 5.2, we observe the following. (1) Results are improved by using the additional feature Categories. (2) More unigram features improve the result. It seems that a large number of unigram features does not cause overfitting in these two cases. More tests were not carried out because more features would result in unbearable training time.

TABLE 5.2 Prediction Score obtained by Random Forest & Unigram

             # of unigram features    Additional Features               Score
Scheme 1     [...]                    Stars, Zipcode, and Categories    [...]
Scheme 2     [...]                    Stars, Zipcode, and Categories    [...]
Scheme 3     [...]                    Stars, Zipcode                    [...]

In TABLE 5.3, we observe that more topics do not necessarily mean a better result: overfitting occurs when the number of topics becomes larger.

TABLE 5.3 Prediction Score obtained by Random Forest & Topic Model

             # of topics    Additional Features               Score
Scheme 1     10             Stars, Zipcode, and Categories    [...]
Scheme 2     50             Stars, Zipcode, and Categories    [...]
Scheme 3     100            Stars, Zipcode, and Categories    [...]

5.5 Method Comparison

From the results we can tell that logistic regression tends to overfit even with small numbers of features, whereas random forest is less prone to overfitting. Overall, random forest provides slightly better results than logistic regression, but it takes much more computing time.

Comparing the results in TABLE 5.2 and TABLE 5.3, we observe that the topic model method on average gives results similar to unigram, whereas its best result does not outperform unigram. The reason could be as follows. On one hand, the topic model reduces the dimension of the features and enhances the important features. On the other hand, we may lose some information useful for prediction during the dimension reduction.

CHAPTER 6 USEFULNESS OF RESULTS

In this chapter, we introduce the useful results obtained through the data mining capstone.

6.1 Cuisine Maps

In Chapter 2, we build several cuisine maps that show the similarity between 50 different cuisines.

Usefulness for Customers

These maps can be very useful for customers who want to explore new cuisines. For instance, according to the cuisine map in Figure 2.2, Mediterranean, Greek, and Middle Eastern are three very similar cuisines. People who like one of them can use the cuisine map to discover the other two.

Usefulness for Restaurant Owners

These maps can also benefit restaurant owners who want to expand their businesses. They can choose to open new restaurants next to or far away from certain types of restaurants. For example, the owner of a cafe may want to open a new cafe next to a restaurant that specializes in breakfast and brunch, since the two are very similar according to the cuisine map and people like to grab a cup of coffee before or after breakfast.

6.2 Dish Recognizer

We recognize dish names in task 3, as introduced in Chapter 3. This is useful for people who want to open restaurants: it is very helpful to know which dishes are served in a certain cuisine before opening a restaurant of that cuisine.

6.3 Popular Dishes Detection

We detect the top 100 popular Chinese dishes and their corresponding tastiness in task 4, as introduced in Chapter 4. This is useful for people who like Chinese food or want to try it, because they can find the most popular and tasty dishes and avoid ordering dishes that are less well received. The result is also useful to owners of Chinese restaurants and to people who want to start one. For them, offering more popular dishes is likely to bring more customers and hence more profit.

6.4 Restaurant Recommendation

We recommend the top 100 restaurants that serve orange chicken and fried rice in task 5, as presented in Chapter 4. This is useful for customers who want to try these two dishes.

6.5 Hygiene Prediction

The hygiene prediction results from task 6 (Chapter 5) help customers select clean restaurants and avoid restaurants with poor hygiene.

CHAPTER 7 NOVELTY OF EXPLORATION

7.1 Hierarchical Clustering in Cuisine Map Development

When we build the cuisine map with clustering, hierarchical clustering is used, as shown in Figure 2.3. The hierarchical relations between cuisines are shown together with the similarity matrix. This helps users find clusters according to their own needs. Instead of fixing the number of clusters beforehand, we allow users to choose how many clusters they want, or simply to find cuisines that are connected by the hierarchical links.

7.2 TopMine Output Used as the Input for SegPhrase in Dish Recognition

In recognizing dish names for Chinese cuisine, we use the output of TopMine as the input for SegPhrase so that SegPhrase has a more comprehensive labeled dish list. The first part of the labeled list is the one we revise manually in task 3.1. The second part of the list comes from the result of TopMine. This method turns out to be very effective and earns a score of 12 out of 10 from the grader.

7.3 Top Frequent Unigram Terms and Topic Model are Used in Hygiene Prediction

In training hygiene prediction models, we use two text representation techniques: unigram and topic model. For unigram, we first detect the top N most frequent terms instead of using all the terms in the corpus, and then use the counts of these N words in the reviews as features. For topic model, we first mine topics from customer reviews, and then use the counts of the words that belong to the topics as features. Both methods are effective according to the grader: an F1 score of 0.56 is obtained using the top term counts as features, and an F1 score of 0.55 is obtained using the topic model.
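For readers unfamiliar with the F1 score reported above, the following Python sketch shows the standard binary definition, F1 = 2PR / (P + R), where P is precision and R is recall. The report does not say which class the grader treats as positive or whether any averaging is applied, so the choice of positive label and the toy labels below are assumptions.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = fails inspection, 0 = passes.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)
```

With two true positives, one false positive, and one false negative, precision and recall are both 2/3, so the F1 score of the toy example is also 2/3.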

CHAPTER 8 CONTRIBUTION OF NEW KNOWLEDGE

8.1 Some Advantages of Random Forest over Logistic Regression

In carrying out task 6, we train both a logistic regression model and a random forest with the same number of features and compare the results. The advantages of random forest over logistic regression that we found during the experiments are described below.

Random Forest is Less Prone to Overfitting than Logistic Regression

We found that random forest provides better results as more features are included, without showing overfitting, though we do not carry out experiments with more than 1500 features. Logistic regression, however, shows signs of overfitting when fewer than 1500 features are used. This suggests that random forest is less prone to overfitting than logistic regression.

Logistic Regression is not Good at Handling Missing Feature Value

When we use logistic regression as the prediction algorithm and restaurant categories as a feature, warnings occur because some restaurant categories that appear in the testing data do not appear in the training data, which leads to worse predictions. Random forest, on the other hand, seems able to cope with this situation and even provides better predictions when restaurant categories are used as a feature.
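One common workaround for category levels that appear only in the testing data is to fix the set of levels from the training data and encode any unseen level as an all-zero vector. The sketch below illustrates that idea in Python; it is not something the report did, the function names and category values are hypothetical, and it assumes one category per restaurant even though Yelp businesses can carry several.

```python
def fit_categories(train_values):
    """Record the category levels seen in training, in a fixed (sorted) order."""
    return sorted(set(train_values))


def one_hot(value, levels):
    """Encode one value; a level unseen in training becomes an all-zero row."""
    return [1 if value == level else 0 for level in levels]


# Hypothetical training categories.
train_categories = ["Chinese", "Pizza", "Chinese", "Cafes"]
levels = fit_categories(train_categories)

known = one_hot("Pizza", levels)       # level seen in training
unseen = one_hot("Ethiopian", levels)  # level seen only at test time
```

Because the encoding width is fixed by the training data, the model never receives a column it was not trained on, at the cost of treating every unseen category identically.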

CHAPTER 9 IMPROVEMENT TO BE DONE

Several things can be done to improve this project. First, web-based tools can be developed for interactive illustration of the results. Second, an updating algorithm can be developed to refresh the results efficiently when more data become available, instead of carrying out the data mining from scratch. Third, location-based restaurant and dish recommendation should be developed, which would be more helpful for customers in specific places.

10 REFERENCES

[1] El-Kishky, Ahmed, et al. "Scalable topical phrase mining from text corpora." Proceedings of the VLDB Endowment, 8.3 (2014).

[2] Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren, and Jiawei Han. "Mining Quality Phrases from Massive Text Corpora." Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15), Melbourne, Australia, May (* equally contributed).


More information

INTRO TO TEXT MINING: BAG OF WORDS. What is text mining?

INTRO TO TEXT MINING: BAG OF WORDS. What is text mining? INTRO TO TEXT MINING: BAG OF WORDS What is text mining? Intro to Text Mining: Bag of Words What is text mining? The process of distilling actionable insights from text Intro to Text Mining: Bag of Words

More information

Ideas for group discussion / exercises - Section 3 Applying food hygiene principles to the coffee chain

Ideas for group discussion / exercises - Section 3 Applying food hygiene principles to the coffee chain Ideas for group discussion / exercises - Section 3 Applying food hygiene principles to the coffee chain Activity 4: National level planning Reviewing national codes of practice and the regulatory framework

More information

Growth in early yyears: statistical and clinical insights

Growth in early yyears: statistical and clinical insights Growth in early yyears: statistical and clinical insights Tim Cole Population, Policy and Practice Programme UCL Great Ormond Street Institute of Child Health London WC1N 1EH UK Child growth Growth is

More information

Which of the following are resistant statistical measures? 1. Mean 2. Median 3. Mode 4. Range 5. Standard Deviation

Which of the following are resistant statistical measures? 1. Mean 2. Median 3. Mode 4. Range 5. Standard Deviation Which of the following are resistant statistical measures? 1. Mean 2. Median 3. Mode 4. Range 5. Standard Deviation For the variable number of parking tickets in the past year would you expect the distribution

More information

Biologist at Work! Experiment: Width across knuckles of: left hand. cm... right hand. cm. Analysis: Decision: /13 cm. Name

Biologist at Work! Experiment: Width across knuckles of: left hand. cm... right hand. cm. Analysis: Decision: /13 cm. Name wrong 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 right 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 score 100 98.6 97.2 95.8 94.4 93.1 91.7 90.3 88.9 87.5 86.1 84.7 83.3 81.9

More information

A Note on H-Cordial Graphs

A Note on H-Cordial Graphs arxiv:math/9906012v1 [math.co] 2 Jun 1999 A Note on H-Cordial Graphs M. Ghebleh and R. Khoeilar Institute for Studies in Theoretical Physics and Mathematics (IPM) and Department of Mathematical Sciences

More information

PISA Style Scientific Literacy Question

PISA Style Scientific Literacy Question PISA Style Scientific Literacy Question The dodo was a large bird, roughly the size of a swan. It has been described as heavily built or even fat. It was flightless, but is believed to have been able to

More information

Please sign and date here to indicate that you have read and agree to abide by the above mentioned stipulations. Student Name #4

Please sign and date here to indicate that you have read and agree to abide by the above mentioned stipulations. Student Name #4 The following group project is to be worked on by no more than four students. You may use any materials you think may be useful in solving the problems but you may not ask anyone for help other than the

More information

IMAGE B BASE THERAPY. I can identify and give a straightforward description of the similarities and differences between texts.

IMAGE B BASE THERAPY. I can identify and give a straightforward description of the similarities and differences between texts. I can identify and give a straightforward description of the similarities and differences between texts. BASE THERAPY Breaking down the skill: Identify to use skimming and scanning skills to locate parts

More information

Analysis of Things (AoT)

Analysis of Things (AoT) Analysis of Things (AoT) Big Data & Machine Learning Applied to Brent Crude Executive Summary Data Selecting & Visualising Data We select historical, monthly, fundamental data We check for correlations

More information

Grapes of Class. Investigative Question: What changes take place in plant material (fruit, leaf, seed) when the water inside changes state?

Grapes of Class. Investigative Question: What changes take place in plant material (fruit, leaf, seed) when the water inside changes state? Grapes of Class 1 Investigative Question: What changes take place in plant material (fruit, leaf, seed) when the water inside changes state? Goal: Students will investigate the differences between frozen,

More information

EFFECT OF TOMATO GENETIC VARIATION ON LYE PEELING EFFICACY TOMATO SOLUTIONS JIM AND ADAM DICK SUMMARY

EFFECT OF TOMATO GENETIC VARIATION ON LYE PEELING EFFICACY TOMATO SOLUTIONS JIM AND ADAM DICK SUMMARY EFFECT OF TOMATO GENETIC VARIATION ON LYE PEELING EFFICACY TOMATO SOLUTIONS JIM AND ADAM DICK 2013 SUMMARY Several breeding lines and hybrids were peeled in an 18% lye solution using an exposure time of

More information

Flo s Chinese Asian Restaurant Review. Rachel Marlin. Recently I had the pleasure of attending Flo s Chinese Asian restaurant.

Flo s Chinese Asian Restaurant Review. Rachel Marlin. Recently I had the pleasure of attending Flo s Chinese Asian restaurant. Flo s Chinese Asian Restaurant Review Rachel Marlin Recently I had the pleasure of attending Flo s Chinese Asian restaurant. I had never dined here before. Flo s is a small local restaurant that has two

More information

Pavilion Organizer - BRAZIL

Pavilion Organizer - BRAZIL Pavilion Organizer - BRAZIL With the new comers or those who are looking for a Japanese partner so that promotion can be more adequate, I think that Foodex is important. It is the best place for our food

More information

Building Reliable Activity Models Using Hierarchical Shrinkage and Mined Ontology

Building Reliable Activity Models Using Hierarchical Shrinkage and Mined Ontology Building Reliable Activity Models Using Hierarchical Shrinkage and Mined Ontology Emmanuel Munguia Tapia 1, Tanzeem Choudhury and Matthai Philipose 2 1 Massachusetts Institute of Technology 2 Intel Research

More information

Using Growing Degree Hours Accumulated Thirty Days after Bloom to Help Growers Predict Difficult Fruit Sizing Years

Using Growing Degree Hours Accumulated Thirty Days after Bloom to Help Growers Predict Difficult Fruit Sizing Years Using Growing Degree Hours Accumulated Thirty Days after Bloom to Help Growers Predict Difficult Fruit Sizing Years G. Lopez 1 and T. DeJong 2 1 Àrea de Tecnologia del Reg, IRTA, Lleida, Spain 2 Department

More information

Instruction (Manual) Document

Instruction (Manual) Document Instruction (Manual) Document This part should be filled by author before your submission. 1. Information about Author Your Surname Your First Name Your Country Your Email Address Your ID on our website

More information

GLOBALIZATION UNIT 1 ACTIVATE YOUR KNOWLEDGE LEARNING OBJECTIVES

GLOBALIZATION UNIT 1 ACTIVATE YOUR KNOWLEDGE LEARNING OBJECTIVES UNIT GLOBALIZATION LEARNING OBJECTIVES Key Reading Skills Additional Reading Skills Language Development Making predictions from a text type; scanning topic sentences; taking notes on supporting examples

More information

Appendix A. Table A.1: Logit Estimates for Elasticities

Appendix A. Table A.1: Logit Estimates for Elasticities Estimates from historical sales data Appendix A Table A.1. reports the estimates from the discrete choice model for the historical sales data. Table A.1: Logit Estimates for Elasticities Dependent Variable:

More information

Learning Connectivity Networks from High-Dimensional Point Processes

Learning Connectivity Networks from High-Dimensional Point Processes Learning Connectivity Networks from High-Dimensional Point Processes Ali Shojaie Department of Biostatistics University of Washington faculty.washington.edu/ashojaie Feb 21st 2018 Motivation: Unlocking

More information

appetizer choices commodities cuisine culture ethnicity geography ingredients nutrition pyramid religion

appetizer choices commodities cuisine culture ethnicity geography ingredients nutrition pyramid religion Four Goodness Sake: Lesson for Fourth Grade Purpose To help students develop awareness that food preferences and cooking styles may be based upon geographic, ethnic, and/or religious/family beliefs, but

More information

AJAE Appendix: Testing Household-Specific Explanations for the Inverse Productivity Relationship

AJAE Appendix: Testing Household-Specific Explanations for the Inverse Productivity Relationship AJAE Appendix: Testing Household-Specific Explanations for the Inverse Productivity Relationship Juliano Assunção Department of Economics PUC-Rio Luis H. B. Braido Graduate School of Economics Getulio

More information

2017 Summary of changes to rules for World Coffee In Good Spirits Championship

2017 Summary of changes to rules for World Coffee In Good Spirits Championship 2017 Summary of changes to rules for World Coffee In Good Spirits Championship To take effect in Budapest WCIGS 2017 For internal use only not to be used in replacement of the WCIGS Rules. Please refer

More information