DATA MINING CAPSTONE FINAL REPORT
- Colin Preston
ABSTRACT

This report summarizes the tasks accomplished for the Data Mining Capstone. The tasks are based on Yelp review data, mainly for restaurants. Six tasks are accomplished.

The first task is to visualize customer review text for all restaurants. A word cloud of frequent words is plotted, and topics are detected from the review text for all restaurants. In addition, a topic comparison for two Chinese restaurants is provided and visualized.

The second task is to build a cuisine map based on the similarity between cuisines, using customer review text. The top fifty cuisines are found first and included in the cuisine map. Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), and clustering methods, i.e. hierarchical clustering and k-means clustering, are used to build the similarity matrix, heat map, and cuisine map.

The third task is to recognize dish names from the customer review text of a certain cuisine. Chinese cuisine is chosen for this task. A labeled dish name list of Chinese cuisine is revised manually. Then two algorithms, i.e. TopMine and SegPhrase, are used to mine a comprehensive Chinese dish list based on the review text for Chinese restaurants and the labeled dish name list.

The fourth and fifth tasks are to detect popular dishes and to recommend good restaurants for certain dishes. Again, Chinese cuisine is chosen for these tasks. 700 dish names from task 3 are used as a pool of Chinese dishes. The top 100 most popular dishes and their corresponding tastiness are found by mining the customer review text and review scores, i.e. stars. We also recommend the top 100 most popular restaurants for two popular Chinese dishes, i.e. orange chicken and fried rice.
The sixth task is to predict whether a set of restaurants will pass the public health inspection tests, given the corresponding Yelp text reviews along with some additional information such as the locations of these restaurants and the cuisines they offer.

In addition, in this report we highlight the following: (1) the most useful data mining results produced through these specific data mining tasks and the people who might benefit from such results; (2) the novel ideas and methods explored to carry out the tasks; (3) new knowledge people can learn from the project activities, particularly through the experiments.

The report is organized as follows. Chapters 1 to 5 introduce the six tasks: Chapters 1 to 3 address tasks 1 to 3, Chapter 4 addresses tasks 4 and 5, and Chapter 5 addresses task 6. Chapter 6 introduces the useful results. Chapter 7 presents the novel methods used in carrying out the tasks. Chapter 8 summarizes the new knowledge discovered throughout the capstone.
TABLE OF CONTENTS

CHAPTER 1 SUMMARY OF TASK 1: EXPLORATION OF DATA SET
  1.1 Tools Used
  1.2 Major Packages
  1.3 Data Import
  1.4 Data Preprocess
  1.5 Topic Model Fitting
  1.6 Comparison of Topics for Two Chinese Restaurants
  1.7 Discussion on the Topics for the Two Chinese Restaurants
    1.7.1 Similarity
    1.7.2 Difference
CHAPTER 2 SUMMARY OF TASK 2: CUISINE CLUSTERING AND MAP CONSTRUCTION
  2.1 Tools Used
  2.2 Major Packages
  2.3 Data Import
  2.4 Data Preprocess
  2.5 Similarity Matrix without IDF
  2.6 Similarity Matrix with IDF
  2.7 Similarity Matrix with Clustering
    2.7.1 Hierarchical Clustering
    2.7.2 k-means Clustering
  2.8 Conclusions
CHAPTER 3 SUMMARY OF TASK 3: DISH RECOGNITION
  3.1 Task 3.1: Manual Tagging
  3.2 Task 3.2: Mining Additional Dish Names
    3.2.1 Corpus Preparation
  3.3 Dish Name Identification Using TopMine
    3.3.1 Parameters
    3.3.2 Opinion about the Result
    3.3.3 Improvement
  3.4 Dish Name Identification Using SegPhrase
    3.4.1 Parameters
    3.4.2 Opinion about the Result
CHAPTER 4 SUMMARY OF TASK 4 & 5: POPULAR DISHES AND RESTAURANT RECOMMENDATION
  4.1 Data Preparation
    4.1.1 Corpus
    4.1.2 Dish List
  4.2 Tools and Packages
    4.2.1 Attached Base Packages
    4.2.2 Other Attached Packages
  4.3 Task 4: Popular Dishes
    4.3.1 Popularity Analysis
    4.3.2 Sentiment Analysis
    4.3.3 Illustration
  4.4 Task 5: Popular Restaurants
    4.4.1 Popularity Analysis
    4.4.2 Sentiment Analysis
    4.4.3 Illustration
  4.5 Conclusions
CHAPTER 5 SUMMARY OF TASK 6: HYGIENE PREDICTION
  5.1 Tools Used
    5.1.1 Packages
  5.2 Text Preprocessing
  5.3 Training Method 1: Logistic Regression
    5.3.1 Text Representation Techniques
      5.3.1.1 Unigram
    5.3.2 Additional Features Used
    5.3.3 Learning Algorithm
    5.3.4 Results Analysis
  5.4 Training Method 2: Random Forest
    5.4.1 Text Representation Techniques
      5.4.1.1 Unigram
      5.4.1.2 Topic Model
    5.4.2 Additional Features Used
    5.4.3 Learning Algorithm
    5.4.4 Results Analysis
  5.5 Method Comparison
CHAPTER 6 USEFULNESS OF RESULTS
  6.1 Cuisine Maps
    6.1.1 Usefulness for Customers
    6.1.2 Usefulness for Restaurant Owners
  6.2 Dish Recognizer
  6.3 Popular Dishes Detection
  6.4 Restaurant Recommendation
  6.5 Hygiene Prediction
CHAPTER 7 NOVELTY OF EXPLORATION
  7.1 Hierarchical Clustering in Cuisine Map Development
  7.2 TopMine Output Used as the Input for SegPhrase in Dish Recognition
  7.3 Top Frequent Unigram Terms and Topic Model are Used in Hygiene Prediction
CHAPTER 8 CONTRIBUTION OF NEW KNOWLEDGE
  Some Advantages of Random Forest over Logistic Regression
    Random Forest is Less Prone to Overfitting than Logistic Regression
    Logistic Regression is not Good at Handling Missing Feature Value
CHAPTER 9 IMPROVEMENT TO BE DONE
REFERENCES
List of Figures

Figure 1.1 Word cloud
Figure 1.2 The topics of the sampled Restaurant
Figure 1.3 The topics of the first Chinese Restaurant CR1
Figure 1.4 The topics of the second Chinese Restaurant CR2
Figure 2.1 Similarity matrix without IDF
Figure 2.2 Similarity matrix using IDF
Figure 2.3 Similarity matrix and hierarchical cluster
Figure 2.4 Similarity matrix and k-means cluster
Figure 4.1 Illustration for popular dish names
Figure 4.2 Illustration for popular restaurants for orange chicken
Figure 4.3 Illustration for popular restaurants for fried rice
List of Tables

TABLE 2.1 Cluster list
TABLE 5.1 Prediction Score obtained by Logistic Regression
TABLE 5.2 Prediction Score obtained by Random Forest & Unigram
TABLE 5.3 Prediction Score obtained by Random Forest & Topic Model
CHAPTER 1 SUMMARY OF TASK 1: EXPLORATION OF DATA SET

In this chapter, we explore the Yelp data set. In particular, we mine customer reviews of restaurants in order to find topics. We mine the topics with the Latent Dirichlet Allocation (LDA) model and plot the topics in a circular tree for visualization. In addition, we mine and compare the topics of two Chinese restaurants.

1.1 Tools Used

R version
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

1.2 Major Packages

jsonlite_ tm_0.6-2 topicmodels_0.2-2 igraph_

1.3 Data Import

First, we read yelp_academic_dataset_business.json into variable BUSINESS and yelp_academic_dataset_review.json into variable REVIEW using the jsonlite package.

Second, we select all the restaurants from BUSINESS by finding the entries that have Restaurants in column categories. We denote this selected data set as RESTAURANTS.

Third, we merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW by the business_id column. This also eliminates the entries in REVIEW that are not for restaurants.
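The import and merge steps above can be illustrated with a minimal sketch. The report carried them out in R with jsonlite; the version below is a Python illustration only, and the function names, the assumption that each file holds one JSON object per line, and the `business_stars` field name are ours rather than the report's.

```python
import json


def read_json_lines(path):
    """Read a file that stores one JSON object per line (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def restaurants_with_reviews(businesses, reviews):
    """Keep restaurant businesses and join each review to its business."""
    restaurants = {
        b["business_id"]: b
        for b in businesses
        if "Restaurants" in (b.get("categories") or [])
    }
    merged = []
    for review in reviews:
        business = restaurants.get(review["business_id"])
        if business is None:
            continue  # review of a non-restaurant business: dropped by the join
        merged.append({
            **review,
            "name": business.get("name"),
            "categories": business.get("categories"),
            # Renamed so it does not collide with the review's own stars.
            "business_stars": business.get("stars"),
        })
    return merged
```

Keeping the business rating under a separate key preserves both the per-review stars and the overall stars of the restaurant, which later chapters use for different purposes.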
Then, we randomly select 10,000 samples from RESTAURANTS_REVIEW and record them in data set restaurants_review. A relatively small data set is sampled due to limits on time and computer capacity.

1.4 Data Preprocess

We then convert the data into a corpus by using the tm package:

- We build a corpus based on the text column of restaurants_review.
- We convert all the words into lower case.
- We remove anything other than English letters or spaces.
- We remove stop words.
- We remove extra white space.
- We make the text fit the paper width, i.e. each line has at most 60 characters.
- We heuristically complete stemmed words.
- We construct a document-term matrix.
- We plot the word cloud to see what the major words are.

Figure 1.1 Word cloud
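The core cleaning steps listed above (lower-casing, removing non-letters, removing stop words, collapsing white space) and the word counts behind a word cloud can be sketched as follows. This is a Python illustration, not the tm pipeline used in the report; the tiny stop-word set is a placeholder we chose for the example, and stemming and line wrapping are omitted.

```python
import re
from collections import Counter

# Illustrative stop-word list only; the tm package ships a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of", "i"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case, keep only letters and spaces, and drop stop words."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)      # normalize newlines and tabs to spaces
    text = re.sub(r"[^a-z ]", "", text)   # remove everything but letters and spaces
    return [word for word in text.split() if word not in stop_words]


def word_frequencies(reviews):
    """Count how often each remaining word occurs across all reviews."""
    counts = Counter()
    for review in reviews:
        counts.update(preprocess(review))
    return counts
```

For example, `word_frequencies(reviews).most_common(50)` would give the words to draw largest in a word cloud.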
From the word cloud we can see the common words people use to describe a restaurant, e.g. good, food, great, place, just, like, and service. In addition, we can see food names such as salad, pizza, cheese, and sushi. We can also see that chicken is a very common food in the US. All in all, the word cloud shows a lot of information that makes sense.

1.5 Topic Model Fitting

We fit an LDA model to the document-term matrix using the LDA function in the topicmodels package. We set the number of topics to 10. The plot is shown in Figure 1.2, where ten words in each topic are presented. From Figure 1.2 we can tell that people mention food, place, service, good, and great in many topics, which is to be expected. In Figure 1.2, the words in topic i have i after them. For example, in topic 1, we have food 1, great 1, and so on.

Figure 1.2 The topics of the sampled Restaurant
1.6 Comparison of Topics for Two Chinese Restaurants

We randomly select two Chinese restaurants: CR1 with business_id - 3WVw1TNQbPBzaKCaQQ1AQ and CR2 with business_id -mz0zr0dw6zasg7_ah1r8a. We carry out the same procedure as above and obtain the LDA-based topic plots shown in Figure 1.3 and Figure 1.4.

Figure 1.3 The topics of the first Chinese Restaurant CR1
Figure 1.4 The topics of the second Chinese Restaurant CR2

1.7 Discussion on the Topics for the Two Chinese Restaurants

1.7.1 Similarity

Both topic 2 for CR1 and topic 3 for CR2 contain good, dish, order, beef, and place. This is not surprising because beef is very common in the USA. It is very likely that good-tasting dishes containing beef are often ordered in both restaurants.

The topics for both restaurants contain China and Chinese, as well as other common words such as food, good, place, chicken, and dish.

1.7.2 Difference

The major words of the topics depend on the names and menus of the restaurants. In the topics of restaurant CR1, people are talking about chili and spiciness, since the restaurant is called China Chili and probably serves a lot of spicy food. In restaurant CR2, however, fried, egg, roll, and pork appear often, because the second restaurant is called Sing High and serves barbecued pork slices, egg roll, and fried won ton.
CHAPTER 2 SUMMARY OF TASK 2: CUISINE CLUSTERING AND MAP CONSTRUCTION

In this chapter, we mine the data set to construct cuisine maps that visualize the landscape of different types of cuisines and their similarities. The cuisine map can help users understand what cuisines are available and how they are related, which facilitates the discovery and exploration of unfamiliar cuisines. The cuisine map is built from the categories and customer reviews of restaurants in the Yelp data.

2.1 Tools Used

R version
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

2.2 Major Packages

reshape2_1.4.1 plyr_1.8.3 ggplot2_1.0.1 scales_0.2.5 HSAUR_1.3-7 cluster_2.0.3 corrplot_0.73 proxy_ tm_0.6-2 NLP_0.1-8 jsonlite_

2.3 Data Import

First, we read yelp_academic_dataset_business.json into variable BUSINESS and yelp_academic_dataset_review.json into variable REVIEW using the jsonlite package.

Second, we select all the restaurants from BUSINESS by finding the entries that have Restaurants in column categories. We denote this selected data set as RESTAURANTS.

Third, we merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW by the business_id column. This also eliminates the entries in REVIEW that are not for restaurants.
2.4 Data Preprocess

First, we search for the most popular cuisines by counting the frequency of cuisine names in the categories column. We pick the top 50 to build the cuisine map:

 [1] "American (New)" "American (Traditional)" "Nightlife" "Bars"
 [5] "Mexican" "Italian" "Breakfast & Brunch" "Pizza"
 [9] "Steakhouses" "Sandwiches" "Burgers" "Sushi Bars"
[13] "Japanese" "Chinese" "Seafood" "Buffets"
[17] "Fast Food" "Thai" "Asian Fusion" "Mediterranean"
[21] "French" "Cafes" "Sports Bars" "Barbeque"
[25] "Pubs" "Coffee & Tea" "Vietnamese" "Delis"
[29] "Vegetarian" "Lounges" "Greek" "Wine Bars"
[33] "Desserts" "Bakeries" "Gluten-Free" "Diners"
[37] "Indian" "Korean" "Salad" "Chicken Wings"
[41] "Hot Dogs" "Tapas Bars" "Arts & Entertainment" "Southern"
[45] "Tapas/Small Plates" "Middle Eastern" "Hawaiian" "Vegan"
[49] "Gastropubs" "Dim Sum"

Then, we eliminate the entries that are not in the 50 categories from RESTAURANTS_REVIEW, randomly sample 10,000 entries from RESTAURANTS_REVIEW, and record them in data set restaurants_review. A relatively small data set is sampled due to limits on time and computer capacity.

We then convert the data into a corpus by using the tm package:

- We build a corpus based on the text column of restaurants_review.
- We convert all the words into lower case.
- We remove anything other than English letters or spaces.
- We remove stop words.
- We remove extra white space.
- We make the text fit the paper width, i.e. each line has at most 60 characters.
2.5 Similarity Matrix without IDF

First, we construct a document-term matrix from the corpus prepared in section 2.4 using Term Frequency (TF). We do not apply Inverse Document Frequency (IDF) in constructing this document-term matrix.

Second, we calculate the similarity matrix from the document-term matrix as 1 minus the cosine distance, and plot the similarity matrix in Figure 2.1.

As can be seen from Figure 2.1, the similarity value is between 0 and 1. The similarity of a cuisine to itself is 1, as expected. We can observe many sets of cuisines that are very similar to each other, which is consistent with common sense. To name a few:

- American (New), American (Traditional), Nightlife, and Bars
- Italian and Pizza
- Delis and Sandwiches
- Fast Food and Burgers
- Cafes and Breakfast & Brunch
- Japanese and Sushi Bars
- Mediterranean, Greek, and Middle Eastern
- Vegetarian and Gluten-Free
- Chinese, Asian Fusion, and Dim Sum
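As an illustration of the TF-based similarity described in this section, the sketch below computes cosine similarity (1 minus the cosine distance) between raw term-frequency vectors, one per cuisine. It is written in Python rather than the R packages the report used; the function names and the input layout (a dict mapping each cuisine to the list of tokens pooled from its reviews) are assumptions for the example.

```python
import math
from collections import Counter


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * tf_b.get(term, 0) for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(cuisine_tokens):
    """Return cuisine names (sorted) and their pairwise similarity matrix."""
    names = sorted(cuisine_tokens)
    tf = {name: Counter(cuisine_tokens[name]) for name in names}
    matrix = [[cosine_similarity(tf[a], tf[b]) for b in names] for a in names]
    return names, matrix
```

Because term counts are non-negative, every entry lies between 0 and 1, and each cuisine has similarity 1 with itself, matching the properties noted for Figure 2.1.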
Figure 2.1 Similarity matrix without IDF

2.6 Similarity Matrix with IDF

The results presented in the previous section make a lot of sense. The similarity values between similar cuisines are indeed higher than those between not-so-similar or very different cuisines. However, the difference is not very pronounced. Therefore, we use IDF to enhance the difference.

We prepare another document-term matrix using TF-IDF and calculate the similarity matrix with the same method (cosine distance). The similarity matrix is shown in Figure 2.2.
Figure 2.2 Similarity matrix using IDF

As can be seen from Figure 2.2, the similarity values between cuisines that are actually similar to each other are significantly higher than the values between cuisines that have less in common.

For example, Dim Sum is a type of Chinese food. Based on Figure 2.1, it appears to have high similarity to Japanese, Sushi Bars, and Seafood, and its similarity to Chinese is not significantly higher than its similarity to those three. According to Figure 2.2, however, the similarity of Dim Sum to Japanese, Sushi Bars, and Seafood is much weaker, and its similarity to Chinese is enhanced.
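The effect of IDF weighting can be sketched as below: a term that occurs in the reviews of every cuisine (such as a generic word like "food") receives weight zero, so only distinguishing terms contribute to the similarity. This Python sketch uses the plain weighting tf × log(N / df), which is an assumption on our part; the weighting applied by the tm package in the report may differ in log base and normalization.

```python
import math
from collections import Counter


def tf_idf_vectors(cuisine_tokens):
    """Weight each cuisine's term counts by inverse document frequency.

    Each cuisine is treated as one document; N is the number of cuisines
    and df is the number of cuisines whose reviews contain the term.
    """
    n_docs = len(cuisine_tokens)
    tf = {name: Counter(tokens) for name, tokens in cuisine_tokens.items()}
    doc_freq = Counter(term for counts in tf.values() for term in counts)
    return {
        name: {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        for name, counts in tf.items()
    }
```

Feeding these weighted vectors into a cosine similarity in place of the raw counts gives the kind of sharper contrast described for Figure 2.2.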
For another example, based on Figure 2.1, Greek seems to be very similar to American (New), American (Traditional), Nightlife, and Bars, and its similarity to Mediterranean and Middle Eastern does not look very pronounced. Based on Figure 2.2, the similarity between Greek, Mediterranean, and Middle Eastern is much easier to find.

2.7 Similarity Matrix with Clustering

We improved the similarity matrix by using TF-IDF in section 2.6. However, related cuisines are sometimes located far away from each other, which makes the cuisine map less handy to use. For instance, Middle Eastern, Mediterranean, and Greek are far away from each other in the cuisine maps shown in Figure 2.1 and Figure 2.2, even though they are quite similar. It takes considerable effort to spot this relationship by eye. Therefore, we carry out hierarchical clustering and k-means clustering to make the relationships between similar cuisines easier to see.

2.7.1 Hierarchical Clustering

We first try hierarchical clustering. A heat map is plotted in Figure 2.3 to show the similarity. In Figure 2.3 the similarity relationship is very clear, since cuisines that are very similar to each other are located close together and cuisines that are different are far away from each other. For example, Middle Eastern, Mediterranean, and Greek are now in one cluster and are next to each other. Many interesting clusters are formed, such as Japanese and Sushi Bars, and Fast Food and Burgers.
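To make the idea of hierarchical clustering on a similarity matrix concrete, here is a minimal agglomerative sketch in Python. It uses average linkage on the distance 1 − similarity; the report does not state which linkage or which R function was used, so both the linkage choice and the function name are assumptions for illustration.

```python
def hierarchical_clusters(names, similarity, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    `similarity` is a square matrix aligned with `names`; the distance
    between two clusters is the average of (1 - similarity) over all
    pairs of members (average linkage, an assumed choice).
    """
    clusters = [[i] for i in range(len(names))]

    def average_distance(a, b):
        pairs = [(i, j) for i in a for j in b]
        return sum(1.0 - similarity[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > max(n_clusters, 1):
        x, y = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: average_distance(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [[names[i] for i in cluster] for cluster in clusters]
```

Reordering the rows and columns of the heat map so that members of the same cluster are adjacent is what places related cuisines next to each other.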
Figure 2.3 Similarity matrix and hierarchical cluster

2.7.2 k-means Clustering

We also carry out k-means clustering on our data set, using the document-term matrix based on TF-IDF. The results are shown in Figure 2.4. We set k = 5. The five clusters are presented using different colors and are listed in TABLE 2.1. The clustering result makes sense but is not as good as the result obtained by hierarchical clustering.
TABLE 2.1 Cluster list

Cluster 1: "Sushi Bars" "Japanese"
Cluster 2: "Mexican" "Breakfast Brunch" "Steakhouses" "Sandwiches" "Burgers" "Chinese"
Cluster 3: "Seafood" "Buffets" "Fast Food" "Thai" "Asian Fusion" "Mediterranean" "French" "Cafes" "Sports Bars" "Barbeque" "Pubs" "Coffee Tea" "Vietnamese" "Delis" "Vegetarian" "Lounges" "Greek" "Wine Bars" "Desserts" "Bakeries" "Gluten Free" "Diners" "Indian" "Korean" "Salad" "Chicken Wings" "Hotdogs" "Tapas Bars" "Arts Entertainment" "Southern" "Tapas Small Plates" "Middle Eastern" "Hawaiian" "Vegan" "Gastropubs" "Dim Sum"
Cluster 4: "Italian" "Pizza"
Cluster 5: "American (New)" "American (Traditional)" "Nightlife" "Bars"
Figure 2.4 Similarity matrix and k-means cluster

2.8 Conclusions

In this chapter, we investigate the development of a cuisine map based on the categories and customer reviews in the Yelp data. Both TF and TF-IDF are used to build the document-term matrix. Similarity matrices are obtained based on cosine distance and plotted in Figure 2.1 to Figure 2.4. We find that TF-IDF enhances the similarity between cuisines that are indeed similar and weakens the similarity between cuisines that have less in common. We also carry out hierarchical clustering and k-means clustering to make the cuisine map easier for the reader to use.
CHAPTER 3 SUMMARY OF TASK 3: DISH RECOGNITION

In this chapter, we investigate the mining of Chinese dish names from the Yelp review data on Chinese restaurants. We subset the reviews of Chinese restaurants from the original data set and identify dish names in Chinese cuisine by using TopMine [1] and SegPhrase [2].

3.1 Task 3.1: Manual Tagging

First, we revise the label file for Chinese cuisine manually. We remove false positives, i.e. phrases labeled as dish names that are not dish names, and we relabel false negatives, i.e. dish names labeled as non-dish phrases, as positive.

Second, we add more annotated phrases in the same format by searching menus from Chinese restaurants.

3.2 Task 3.2: Mining Additional Dish Names

3.2.1 Corpus Preparation

- We import the data into R. We read yelp_academic_dataset_business.json into variable BUSINESS and yelp_academic_dataset_review.json into variable REVIEW using the jsonlite package.
- We select all the restaurants from BUSINESS by finding the entries that have Restaurants in column categories. We denote this selected data set as RESTAURANTS.
- We merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW by the business_id column. This also eliminates the entries in REVIEW that are not for restaurants.
- We subset RESTAURANTS_REVIEW by selecting the entries with Chinese in column categories and save the result into RESTAURANTS_REVIEW_CHINESE.
- We extract the text column of RESTAURANTS_REVIEW_CHINESE and save it into a .txt file with one review per line.

3.3 Dish Name Identification Using TopMine

3.3.1 Parameters

We keep the default values of the parameters, except that we set maxPattern to 6, since we believe that a dish name is likely to contain 1 to 6 words.

3.3.2 Opinion about the Result

We run the TopMine package and obtain more than 10k phrases. Some of the most frequent ones appear to be dish names, such as

dim sum 2849
fried rice 2511
egg rolls 1777
orange chicken 1599

These are indeed typical Chinese dishes found in the US. If you have never been to a Chinese restaurant, you may want to go and try these dishes, because they are apparently very popular in the US. My personal favorite is dim sum, which is originally from Canton and Hong Kong.

However, there are still many frequent phrases that are not dish names, for example Chinese food (with frequency 2853) and Chinese restaurant (with frequency 2108). This is because the phrase mining algorithm, TopMine, is not designed specifically for dish name mining. These non-dish phrases are used frequently in the reviews and are indeed frequent phrases, which means the algorithm works as intended as a general phrase miner.
3.3.3 Improvement

Given the limitations of the tools we have, we re-prepare our corpus so that the frequent phrases other than dish names are removed beforehand. Specifically, we remove the word Chinese and the following words from the original corpus to improve the result of phrase mining: good, food, service, great, one, like, love, pretty, place, menu, ordered, order, best, try, nice, well, didnt, dont, ive, eat, back, also, got, always, come, people, get, will, can, really, just, time, little, us, meal, diner, first, table, definitely. We remove these words because they appear quite often in the corpus, as shown in the word cloud in Figure 1.1, but are very unlikely to appear in a Chinese dish name.

After this procedure, the results are much better: most of the top frequent phrases are dish names.

3.4 Dish Name Identification Using SegPhrase

3.4.1 Parameters

We prepare a label file and set the algorithm parameter AUTO_LABEL=1. The first part of the label file is the label file we revised manually in task 3.1. The second part comes from the result of TopMine: we select the 2k most frequent phrases in the TopMine result and replace the frequency with label 1. We then manually revise the label file by removing false positives.

3.4.2 Opinion about the Result

Using the label file and the algorithm package, we obtain a very good dish name list. Below are the top phrases in the list.

orange chicken
hot and sour soup
cashew chicken
sea bass
hot pot
kung pao chicken
brown rice
shaved ice
white rice
char siu
chow mein
won ton
steamed rice
fried rice
bok choy
sweet and sour pork
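The label preparation described in section 3.4.1 can be sketched as follows: manually revised labels are kept, and the most frequent TopMine phrases are appended with label 1. This is a Python illustration under stated assumptions; the function name, the in-memory inputs, and the tab-separated `phrase<TAB>label` output format are ours and are not confirmed by the report, and the final manual removal of false positives is not automated here.

```python
def build_label_lines(manual_labels, topmine_phrases, top_n=2000):
    """Combine manual labels with the top-N most frequent mined phrases.

    manual_labels:   dict phrase -> 0/1 from the manually revised label file
    topmine_phrases: dict phrase -> frequency from the phrase-mining output
    """
    labels = dict(manual_labels)  # manual decisions take precedence
    ranked = sorted(topmine_phrases.items(), key=lambda item: item[1], reverse=True)
    for phrase, _frequency in ranked[:top_n]:
        # The frequency is replaced by label 1 unless a manual label exists.
        labels.setdefault(phrase, 1)
    return [f"{phrase}\t{label}" for phrase, label in labels.items()]
```

Letting manual labels win means a frequent non-dish phrase such as "chinese food" stays negative if it was already tagged as such by hand.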
CHAPTER 4 SUMMARY OF TASK 4 & 5: POPULAR DISHES AND RESTAURANT RECOMMENDATION

In this chapter, we detect popular dishes of a specific cuisine (Chinese cuisine) and popular restaurants for specific dishes (orange chicken and fried rice). Popularity is measured by how frequently a dish appears in reviews. We also carry out some sentiment analysis based on the stars each dish or restaurant receives in reviews.

4.1 Data Preparation

4.1.1 Corpus

- Read yelp_academic_dataset_business.json into variable BUSINESS and yelp_academic_dataset_review.json into variable REVIEW using the jsonlite package.
- Select all the restaurants from BUSINESS by finding the entries that have Restaurants in column categories. We denote this selected data set as RESTAURANTS.
- Merge RESTAURANTS and REVIEW into RESTAURANTS_REVIEW by the business_id column. This also eliminates the entries in REVIEW that are not for restaurants.
- Subset RESTAURANTS_REVIEW by selecting the entries with Chinese in column categories and save the result into CHINESE_REVIEW.
- Take the text column of CHINESE_REVIEW as the corpus.
- Convert the corpus into ASCII encoding.
- Strip extra whitespace from the corpus.
- Remove punctuation marks from the corpus.
- Remove numbers from the corpus.
4.1.2 Dish List

We used the top 500 dish names from the dish mining results obtained in Task 3. We read the txt file (each line is a dish name) into R using the function readLines.

4.2 Tools and Packages

R version ( )
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack

4.2.1 Attached Base Packages:

stats, graphics, grDevices, utils, datasets, methods, base

4.2.2 Other Attached Packages:

dplyr_0.4.3, tm_0.6-2, NLP_0.1-8, ggplot2_1.0.1, jsonlite_0.9.16, qdap_2.2.2, RColorBrewer_1.1-2, qdapTools_1.3.1, qdapRegex_0.5.0, qdapDictionaries_

4.3 Task 4: Popular Dishes

In this section, we detect the top 100 most popular dishes in Chinese cuisine.

4.3.1 Popularity Analysis

The popularity of a dish is defined as the number of customer reviews in which the dish appears. If a dish name appears more than once in the same review, it is counted only once.

We obtain a data frame with m rows and (n + 3) columns, where m is the number of reviews and n is the number of dishes. Each row represents an individual review. In each row, the first n columns are the indicators for the n dishes: if a dish name appears in the review, the value in the corresponding column is 1, otherwise it is 0. The (n + 1)st column is the stars given in the review, the (n + 2)nd column is the name of the corresponding restaurant, and the (n + 3)rd column is the overall star rating of the restaurant.

4.3.2 Sentiment Analysis

A frequently ordered (mentioned) dish is not necessarily tasty. We use the stars given by reviewers in their reviews as an indicator of the tastiness of the dishes mentioned in those reviews. For example, if a reviewer mentions fried rice and orange chicken in a review and gives five stars for the experience, then fried rice and orange chicken both earn a tastiness of 5 from this review. We sum the stars each dish earns from the reviewers as its overall tastiness. Then the overall tastiness is normalized into a range of 1 to

4.3.3 Illustration

The results are presented in Figure 4.1, where the x-axis is the top 100 popular dish names and the y-axis is the corresponding frequency-based popularity. We use color to show the tastiness of the dishes. There is a strong correlation: tastier dishes tend to be ordered (mentioned) more often, which makes sense in practice.
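The popularity count and the star-based tastiness described above can be sketched in a few lines. This is a Python illustration rather than the report's R code; matching a dish by plain substring search is a simplification we chose, and the target range of the rescaling is left as parameters because the upper bound is not legible in the source.

```python
def dish_popularity_and_tastiness(reviews, dishes):
    """Count reviews mentioning each dish and sum the stars of those reviews.

    reviews: list of dicts with "text" and "stars"
    dishes:  list of lower-case dish names
    """
    popularity = {dish: 0 for dish in dishes}
    total_stars = {dish: 0 for dish in dishes}
    for review in reviews:
        text = review["text"].lower()
        for dish in dishes:
            if dish in text:  # binary indicator: at most one count per review
                popularity[dish] += 1
                total_stars[dish] += review["stars"]
    return popularity, total_stars


def rescale(values, low, high):
    """Min-max rescale a dict of scores into [low, high] (bounds are assumptions)."""
    smallest, largest = min(values.values()), max(values.values())
    if largest == smallest:
        return {key: low for key in values}
    scale = (high - low) / (largest - smallest)
    return {key: low + (value - smallest) * scale for key, value in values.items()}
```

Note that summing stars makes tastiness grow with the number of mentions, so a correlation between popularity and this tastiness measure is partly built into the definition.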
Figure 4.1 Illustration for popular dish names

4.4 Task 5: Popular Restaurants

In this part, we mine the popular restaurants for a specific dish. Without loss of generality, we use orange chicken and fried rice as two examples, because they are two of the most popular dishes in Chinese cuisine, as shown in Figure 4.1. Other dish names can be used for this task, since the method and code we use to obtain the results in this section are intended to be general.

4.4.1 Popularity Analysis

We group the data frame obtained in task 4 by restaurant and calculate the total count of each dish for each restaurant. We use the total count as the popularity of the restaurant with respect to a dish. For example, for restaurant Panda Express, the total count of orange chicken is 145 while the total count of fried rice is 87. For restaurant Chino Bandido, the total count of orange chicken is 36 while the total count of fried rice is 406. Thus Panda Express is more popular for its orange chicken, whereas Chino Bandido is more popular for its fried rice.
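The grouping step can be illustrated with a short sketch that counts, per restaurant, the reviews mentioning a given dish and ranks restaurants by that count. It is a Python illustration, not the report's dplyr code; the function name and the review layout (dicts with "name" and "text") are assumptions for the example.

```python
from collections import Counter


def restaurant_popularity(reviews, dish):
    """Rank restaurants by the number of reviews that mention `dish`."""
    counts = Counter()
    for review in reviews:
        if dish in review["text"].lower():
            counts[review["name"]] += 1
    # most_common() returns (restaurant, count) pairs, highest count first.
    return counts.most_common()
```

Calling `restaurant_popularity(reviews, "orange chicken")[:100]` would give the top 100 restaurants for that dish in the sense used in this section.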
4.4.2 Sentiment Analysis

A restaurant may serve a lot of orange chicken or fried rice, but this could be due to the population in its area or its low prices. We want to know whether customers are happy after having its orange chicken or fried rice, i.e. how tasty the orange chicken and fried rice are. We use the overall stars of the restaurant as the measurement.

4.4.3 Illustration

The results are presented in Figure 4.2 and Figure 4.3. The x-axis represents the top 100 restaurants that serve orange chicken and fried rice, respectively, and the y-axis represents the popularity of the corresponding restaurants. We use color to show the tastiness of the dishes.

Figure 4.2 Illustration for popular restaurants for orange chicken
Figure 4.3 Illustration for popular restaurants for fried rice

4.5 Conclusions

We believe that the figures provided above can be a good guide for people who want to try Chinese food. They can find the most popular dishes in Figure 4.1 and find which restaurants serve the best orange chicken and fried rice in Figure 4.2 and Figure 4.3.
CHAPTER 5 SUMMARY OF TASK 6: HYGIENE PREDICTION

In this chapter, we predict whether a set of restaurants will pass the public health inspection tests, given the corresponding Yelp text reviews along with some additional information such as the locations of these restaurants and the cuisines they offer. Two text representation techniques are used: Unigram and Topic Model. Two learning algorithms are used: Logistic Regression and Random Forest. Additional features are used, such as Categories, Stars, and Zipcode.

5.1 Tools Used

R version
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

5.1.1 Packages

topicmodels_0.2-2 qdap_2.2.2 qdapTools_1.3.1 qdapRegex_0.5.0 qdapDictionaries_1.0.6 tm_0.6-2 NLP_0.1-8 quanteda_ randomForest_ caret_

5.2 Text Preprocessing

We preprocess the review text as follows. Package tm is used.

- Convert the text into ASCII encoding.
- Strip extra whitespace from the text.
- Remove punctuation marks from the text.
- Remove numbers from the text.
5.3 Training Method 1: Logistic Regression

5.3.1 Text Representation Techniques

5.3.1.1 Unigram

First, we obtain word frequencies from the reviews in the training data and select the top k words. Here we set k = 301 and frequent

Second, we use the counts of the frequent words in the reviews of each restaurant as its text-based features. Package qdap is used.

5.3.2 Additional Features Used

Stars, Zipcode, and Categories

5.3.3 Learning Algorithm

Logistic Regression

5.3.4 Results Analysis

The results are presented in TABLE 5.1, where Score is the score given by the Coursera grader.

TABLE 5.1 Prediction Score obtained by Logistic Regression

            # of unigram features    Additional Features               Score
Scheme 1                             Stars and Zipcode
Scheme 2                             Stars, Zipcode, and Categories
Scheme 3                             Stars and Zipcode
From TABLE 5.1 we can observe the following. (1) The score is lower when the additional feature Categories is used. This is probably because some categories in the testing data set do not appear in the training data set. (2) When more unigram features (frequent words) are used, the score is lower. This is probably because of overfitting.

5.4 Training Method 2: Random Forest

5.4.1 Text Representation Techniques

5.4.1.1 Unigram

First, we obtain word frequencies from the reviews in the training data and select the top k words. Here we set k = 841 and frequent

Second, we use the counts of the frequent words in the reviews of each restaurant as its text-based features. Package qdap is used.

5.4.1.2 Topic Model

First, we mine 10, 50, and 100 topics from the training data. Second, we count the words that belong to the topics in a restaurant's reviews and use the counts as text-based features. Package topicmodels is used.

5.4.2 Additional Features Used

Stars, Zipcode, and Categories
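The unigram representation used with both learning algorithms can be sketched as follows: a vocabulary of the k most frequent training words is fixed first, and each restaurant's review text is then turned into a vector of counts over that vocabulary. This is a Python illustration, not the qdap-based code of the report; the function names and whitespace tokenization are assumptions.

```python
from collections import Counter


def top_k_words(train_texts, k):
    """Return the k most frequent words in the (preprocessed) training texts."""
    counts = Counter(word for text in train_texts for word in text.split())
    return [word for word, _count in counts.most_common(k)]


def unigram_features(text, vocabulary):
    """Count how often each vocabulary word occurs in one restaurant's text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]
```

Because the vocabulary is built from the training data only, words that appear solely in the test reviews are ignored, which keeps the feature vectors of training and test restaurants the same length.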
5.4.3 Learning Algorithm

Random Forest. Packages caret and randomForest are used.

5.4.4 Results Analysis

We use two text representation techniques and different numbers of features. The results are shown in TABLE 5.2 and TABLE 5.3, respectively.

In TABLE 5.2, we observe the following. (1) Results are improved by using the additional feature Categories. (2) More unigram features improve the result. It seems that a large number of unigram features does not cause overfitting in these two cases. More tests are not carried out because more features would result in unbearable training time.

TABLE 5.2 Prediction Score obtained by Random Forest & Unigram

            # of unigram features    Additional Features               Score
Scheme 1                             Stars, Zipcode, and Categories
Scheme 2                             Stars, Zipcode, and Categories
Scheme 3                             Stars, Zipcode

In TABLE 5.3, we observe that more topics do not necessarily lead to a better result: overfitting occurs when the number of topics becomes larger.
TABLE 5.3 Prediction Score obtained by Random Forest & Topic Model

            # of topics    Additional Features               Score
Scheme 1    10             Stars, Zipcode, and Categories
Scheme 2    50             Stars, Zipcode, and Categories
Scheme 3    100            Stars, Zipcode, and Categories

5.5 Method Comparison

From the results we can tell that logistic regression tends to overfit even with a small number of features, whereas random forest is less prone to overfitting. Overall, random forest provides slightly better results than logistic regression, but it takes much more computing time.

Comparing the results in TABLE 5.2 and TABLE 5.3, we observe that the topic model method on average gives results similar to unigram, whereas its best result does not outperform unigram. The reason could be as follows. On one hand, the topic model reduces the dimension of the features and enhances the important features. On the other hand, some information useful for prediction may be lost during the dimension reduction.
CHAPTER 6 USEFULNESS OF RESULTS

In this chapter, we present the useful results obtained through the data mining capstone.

6.1 Cuisine Maps
In Chapter 2, we build several cuisine maps that show the similarity between 50 different cuisines.

Usefulness for Customers
These maps can be very useful for customers who want to explore new cuisines. For instance, according to the cuisine map in Figure 2.2, Mediterranean, Greek, and Middle Eastern are three very similar cuisines. A customer who likes one of them can use the map to discover the other two.

Usefulness for Restaurant Owners
These maps can also benefit restaurant owners who want to expand their businesses. They can choose to open a new restaurant next to, or far away from, certain other restaurants. For example, the owner of a cafe may want to open a new cafe next to a restaurant that specializes in breakfast and brunch, since the two are very similar according to the cuisine map and people often like to grab a cup of coffee before or after breakfast.

6.2 Dish Recognizer
We recognize dishes in task 3, as described in Chapter 3. This is useful for business owners who want to open restaurants: it helps to know which dishes are served in a certain cuisine before opening a restaurant of that cuisine.
6.3 Popular Dishes Detection
In task 4 (Chapter 4), we detect the top 100 most popular Chinese dishes together with their tastiness. This is useful for people who like Chinese food or want to try it, because they can find the most popular and tasty dishes and avoid ordering dishes that are less well received. The result is also useful to owners of Chinese restaurants and to business owners who want to start one: offering more popular dishes is likely to bring more customers and hence more profit.

6.4 Restaurant Recommendation
In task 5 (Chapter 4), we recommend the top 100 restaurants that serve orange chicken and fried rice. This is useful for customers who want to try these two dishes.

6.5 Hygiene Prediction
This result helps customers choose clean restaurants and avoid restaurants with poor hygiene.
CHAPTER 7 NOVELTY OF EXPLORATION

7.1 Hierarchical Clustering in Cuisine Map Development
When we build the cuisine map with clustering, hierarchical clustering is used, as shown in Figure 2.3. The hierarchical relations between cuisines are shown together with the similarity matrix. This helps users find clusters that suit their own needs. Instead of fixing the number of clusters beforehand, we let users choose how many clusters they want, or simply find cuisines that are connected by the hierarchical links.

7.2 TopMine Output Used as the Input for SegPhrase in Dish Recognition
In recognizing dish names for Chinese cuisine, we use the output of TopMine as the input for SegPhrase so that SegPhrase has a more comprehensive labeled dish list. The first part of the labeled list is the one we revise manually in task 3.1. The second part of the list comes from the result of TopMine. This method turns out to be very effective and earns a score of 12 out of 10 from the grader.

7.3 Top Frequent Unigram Terms and Topic Model are Used in Hygiene Prediction
In training hygiene prediction models, we use two text representation techniques: unigram and topic model. For unigram, we first detect the top N most popular terms instead of using all the terms in the corpus, and then use the counts of these N words in the reviews as features. For topic model, we first mine topics from customer reviews, and then use the word counts in the topics as features. Both methods are effective according to the grader: an F1 score of 0.56 is obtained using the top term counts as features, and an F1 score of 0.55 is obtained using the topic model.
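For readers unfamiliar with the F1 score reported in Section 7.3, the following sketch shows the standard binary definition, the harmonic mean of precision and recall. The report does not state exactly how the grader computes F1 or which class is treated as positive, so the function, its `positive` parameter, and the toy labels are assumptions for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 marks the positive class.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is F1.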
CHAPTER 8 CONTRIBUTION OF NEW KNOWLEDGE

8.1 Some Advantages of Random Forest over Logistic Regression
In carrying out task 6, we train both a logistic regression model and a random forest with the same number of features and compare the results. Below are some advantages of random forest over logistic regression found during the experiments.

Random Forest is Less Prone to Overfitting than Logistic Regression
We found that random forest keeps giving better results as more features are included, without showing overfitting, although we did not carry out experiments with more than 1500 features. Logistic regression, in contrast, shows signs of overfitting when fewer than 1500 features are used. This suggests that random forest is less prone to overfitting than logistic regression.

Logistic Regression is not Good at Handling Missing Feature Value
When we use logistic regression as the prediction algorithm and restaurant categories as a feature, warnings occur because some restaurant categories that appear in the testing data do not appear in the training data, and the prediction result becomes worse. Random forest, on the other hand, seems able to cope with this situation and even gives better predictions when restaurant categories are used as a feature.
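The unseen-category problem described above can be illustrated with a small sketch. It is not the approach taken in the report; it shows one common workaround, assumed here for illustration, in which indicator features are built only from categories seen in training and unseen test categories are ignored rather than raising warnings. The function names and the sample categories are hypothetical.

```python
def fit_category_levels(training_categories):
    """Collect the sorted set of category labels seen in the training data."""
    levels = set()
    for cats in training_categories:
        levels.update(cats)
    return sorted(levels)


def encode_categories(cats, levels):
    """One indicator per known level; labels unseen in training are ignored."""
    present = set(cats)
    return [1 if level in present else 0 for level in levels]


# Toy data (hypothetical): a list of category labels per restaurant.
train = [["Chinese", "Dim Sum"], ["Cafes"], ["Chinese"]]
test = [["Chinese", "Szechuan"], ["Food Trucks"]]

levels = fit_category_levels(train)
encoded_test = [encode_categories(cats, levels) for cats in test]

# Categories that occur only in the test data and therefore carry no signal.
unseen = sorted({c for cats in test for c in cats} - set(levels))
```

Listing `unseen` before training gives an early indication of how much of the Categories feature a model fitted on the training data cannot use.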
CHAPTER 9 IMPROVEMENT TO BE DONE

Several things can be done to improve this project. First, web-based tools can be developed for interactive illustration of the results. Second, an updating algorithm can be developed so that the results are updated efficiently when more data become available, instead of carrying out the data mining from scratch. Third, location-based restaurant and dish recommendation can be developed, which would be more helpful for customers in specific places.
REFERENCES

[1] Ahmed El-Kishky, et al. "Scalable topical phrase mining from text corpora." Proceedings of the VLDB Endowment, 8.3 (2014).
[2] Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren, and Jiawei Han. "Mining Quality Phrases from Massive Text Corpora." Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15), Melbourne, Australia, May. (* equally contributed)
More informationLearning Connectivity Networks from High-Dimensional Point Processes
Learning Connectivity Networks from High-Dimensional Point Processes Ali Shojaie Department of Biostatistics University of Washington faculty.washington.edu/ashojaie Feb 21st 2018 Motivation: Unlocking
More informationappetizer choices commodities cuisine culture ethnicity geography ingredients nutrition pyramid religion
Four Goodness Sake: Lesson for Fourth Grade Purpose To help students develop awareness that food preferences and cooking styles may be based upon geographic, ethnic, and/or religious/family beliefs, but
More informationAJAE Appendix: Testing Household-Specific Explanations for the Inverse Productivity Relationship
AJAE Appendix: Testing Household-Specific Explanations for the Inverse Productivity Relationship Juliano Assunção Department of Economics PUC-Rio Luis H. B. Braido Graduate School of Economics Getulio
More information2017 Summary of changes to rules for World Coffee In Good Spirits Championship
2017 Summary of changes to rules for World Coffee In Good Spirits Championship To take effect in Budapest WCIGS 2017 For internal use only not to be used in replacement of the WCIGS Rules. Please refer
More information