What Makes a Cuisine Unique?

Sunaya Shivakumar
sshivak2@illinois.edu

ABSTRACT
There are many different national and cultural cuisines around the world, but what makes each of them unique? We try to answer that question by using a multinomial logistic regression model to learn and predict cuisine styles. Furthermore, we attempt to identify fusion-cuisines, which are born of two or more distinct cuisine styles, using k-means clustering. Results demonstrate that cuisines are too diverse to predict with very high accuracy, which may have led to the creation of fusion-cuisines.

Keywords
Machine learning; text mining; cuisines; classification; logistic regression; k-means.

INTRODUCTION
We humans have always been creative with the food that we eat, combining many distinct ingredients in ingenious ways to create unique dishes and recipes. Historically, many food ingredients were native only to specific regions of the world. Recipes that make frequent use of these region-specific ingredients have come to characterize different national, regional, and cultural cuisine styles. For example, Indian recipes make heavy use of ingredients like cumin, coriander, and turmeric, while olive oil, parmesan cheese, and basil are more characteristic of the […] cuisine. There is an obvious relation between ingredients and cuisines, one that can be used to predict cuisines. However, recipes, which are unique combinations of these ingredients, can be more useful in identifying a cuisine style.

One goal of this project is to train a classification model that can predict and identify the cuisine style of any given recipe. As we explore our dataset in the next section, we learn that some sets of ingredients are used in more than one cuisine type, an observation from which we can guess that these similar cuisines can be easily merged to create fusion-cuisines.
We try to solidify this train of thought by using k-means to cluster our dataset of recipes. After inspecting the results, we conclude with a discussion of future work and applications of classifying cuisines and fusion-cuisines.

DATASET
For this project, we use the raw recipe data that is publicly available as part of the recipe collection by Ahn et al., 2011 [1]. The dataset was obtained by crawling epicurious.com [2], a large digital food platform, and comprises 13408 recipes with 350 unique ingredients and 26 different cuisines. Each recipe is represented as a cuisine style and the list of ingredients it comprises.

Preliminary analysis of this dataset shows us the distribution of cuisines and recipe ingredients. Figure 1, a bar plot of recipes per cuisine, shows the cuisine distribution of the dataset, and Figure 2, a plot of the 15 most frequent ingredients, shows the ingredient distribution. We find that the […] cuisine clearly dominates the dataset with 4988 recipes, followed by the […] cuisine with 1715 recipes and the […] cuisine with 1176 recipes. Examining the frequency distribution of the ingredients, we find that some ingredients, like garlic, butter, and egg, are ubiquitous: they occur in many recipes and across cuisines, as is evident in Table 1.

Figure 1. Bar plot showing the cuisine frequency distribution of the recipe dataset; […] recipes make up nearly 40% of the dataset. (Axis: cuisine. Surviving axis labels: African, Chinese, Creole, German, Greek, Indian, Irish, Japanese, Jewish, Mediterranean, MiddleEastern, Moroccan, Portuguese, Russian, Scandinavian, Scottish, SoulFood, South, Southwestern, Thai, Vietnamese.)

Figure 2. Top 15 ingredients and their frequency distribution in the recipe dataset; note how some ingredients dominate the ingredient space of the dataset. (Axes: word, freq.)

Figure 3. Top 15 ingredients and their frequency distribution in the recipe dataset, after calculating their tf-idf weights; this reduces existing bias towards any single ingredient. (Axes: word, freq.)

Table 1. Top ingredients of the top 5 cuisines in the dataset; ubiquitous ingredients are frequently used in many cuisines.
Columns: Cuisine; Top 7 Ingredients.
butter, egg, wheat, olive oil, garlic, […], olive oil, garlic, tomato, parmesan cheese, […], butter, egg; soy sauce, ginger, garlic, rice, […], vegetable oil, cayenne; butter, egg, wheat, olive oil, garlic, […], cayenne, […], garlic, tomato, cilantro, corn, olive oil.

To overcome the dominating effect of these commonly occurring ingredients, and to determine the ingredients that distinguish the different cuisines, we calculate the term-frequency inverse-document-frequency (tf-idf) [3] weights of individual ingredients. Tf-idf weights reflect how important or unique an ingredient is to a single recipe in a corpus of recipes, by contrasting the relative frequency of an ingredient in recipes of a particular cuisine with its frequency in recipes of other cuisines. This weighting proves to be somewhat more helpful in predicting the cuisine style of a recipe.

$$\mathrm{tf}(i, r) = \text{number of occurrences of ingredient } i \text{ in a recipe } r$$

$$\mathrm{idf}(i) = \log\left(\frac{\text{total number of recipes}}{\text{number of recipes where ingredient } i \text{ appears}}\right)$$

$$\text{tf-idf}(i, r) = \mathrm{tf}(i, r) \times \mathrm{idf}(i)$$

To gain some insight that can help us cluster the dataset, we visualize the structure of the recipe-ingredient space in our recipe corpus using t-distributed Stochastic Neighbor Embedding (t-SNE) [4]. t-SNE is a dimensionality reduction technique that is very helpful for visualizing high-dimensional data. Figure 4 is a scatterplot mapping of the recipe-ingredient space and cuisines in 2 dimensions. As evident from the figure, the […], […], […], and […] cuisines cover a wide variety of the ingredient space and also overlap with other cuisines; cuisines like Chinese, Japanese, Indian, and Thai are seen in distinct groups that occur close to each other. These observations give us a clue as to what we can expect when we perform k-means clustering of our dataset.

PREDICTION TASK
For this task, we aim to predict the cuisine style of a recipe, given a list of the ingredients used.

Methods
The data is first converted to a document-term matrix, where each document is a recipe and each term is a distinct ingredient. Each row of this matrix represents a recipe and each column an ingredient, with values of 1 or 0 to indicate the ingredient's presence or absence. We also calculate the tf-idf scores of the ingredients in each recipe to create a second matrix. To remove any bias from the original row order, we randomly shuffle these matrices by rows. The shuffled data is then divided into a training set (80%) and a validation set (20%). We use a multinomial logistic regression model [5] to classify our data into 26 classes (cuisines). We evaluate the model by applying it to the feature vector of each recipe in the validation set and measuring the accuracy of its predictions.
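The feature construction described in the Methods above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this project (the references point to R packages): the toy recipes, the function names, and the fixed random seed are all hypothetical. It builds the binary recipe-ingredient matrix, weights it with idf(i) = log(total recipes / recipes containing ingredient i), and makes a shuffled 80/20 split.

```python
import math
import random

# Hypothetical toy corpus: (cuisine label, ingredient list) pairs.
RECIPES = [
    ("Indian", ["cumin", "coriander", "turmeric", "garlic"]),
    ("Thai", ["garlic", "cilantro", "fish", "rice"]),
    ("Greek", ["olive_oil", "garlic", "tomato", "lemon_juice"]),
    ("Indian", ["cumin", "turmeric", "rice", "butter"]),
    ("Greek", ["olive_oil", "tomato", "butter", "egg"]),
]


def build_vocabulary(recipes):
    """Return the sorted list of distinct ingredients (the matrix columns)."""
    return sorted({ing for _, ingredients in recipes for ing in ingredients})


def presence_matrix(recipes, vocab):
    """One row per recipe, one column per ingredient, values 1 or 0."""
    rows = []
    for _, ingredients in recipes:
        present = set(ingredients)
        rows.append([1 if ing in present else 0 for ing in vocab])
    return rows


def idf_weights(matrix):
    """idf(i) = log(total recipes / recipes in which ingredient i appears)."""
    n_recipes = len(matrix)
    doc_freq = [sum(column) for column in zip(*matrix)]
    return [math.log(n_recipes / df) for df in doc_freq]


def tfidf_matrix(matrix, idf):
    """tf-idf(i, r) = tf(i, r) * idf(i); tf is 0 or 1 in this sketch."""
    return [[tf * weight for tf, weight in zip(row, idf)] for row in matrix]


def shuffled_split(rows, labels, train_fraction=0.8, seed=0):
    """Shuffle rows and labels together, then split into train/validation."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)  # fixed seed is an assumption
    cut = int(train_fraction * len(order))
    train = [(rows[k], labels[k]) for k in order[:cut]]
    valid = [(rows[k], labels[k]) for k in order[cut:]]
    return train, valid


vocab = build_vocabulary(RECIPES)
labels = [cuisine for cuisine, _ in RECIPES]
binary = presence_matrix(RECIPES, vocab)
idf = idf_weights(binary)
weighted = tfidf_matrix(binary, idf)
train, valid = shuffled_split(weighted, labels)
```

In this toy corpus garlic appears in three of the five recipes and cumin in two, so garlic receives the smaller idf weight, which is the intended damping of ubiquitous ingredients.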

Figure 4. Visualizing the recipe-ingredient space: a scatterplot mapping of recipes and cuisines in a 2-dimensional space using t-SNE. Each dot represents a single recipe and the different colors represent different cuisines.

Probability of prediction
To better understand the cuisine prediction performance of our model, we plot the accuracies with which our logistic regression model predicts the cuisine type of each recipe in the validation set (Figure 5). We can observe that Indian and […] recipes were, on average, the most correctly identified. This can be explained by the large number of recipes from these cuisines in the dataset, and by how distinct […] and Indian food ingredients can be when compared to the rest.

Figure 5. Box plot of the accuracies with which the cuisine type of each recipe was predicted; on average, Indian and […] recipes are the most accurately identified. (Axis: Cuisine.)

Table 2. Performance accuracies of our multinomial logistic regression model for 26 classes (cuisines).
Logistic Regression Model    Accuracy (Validation Set)
No tf-idf                    0.5706
With tf-idf                  0.5878

Results
The performance of our classification model is recorded in Table 2. The model performs better when trained on the recipe-ingredient matrix with tf-idf weights, but only slightly (0.5878 versus 0.5706).

CLUSTERING TASK
The objective of this task is to identify and understand cuisine and fusion-cuisine clusters.

Methods
We use the same recipe-ingredient matrix that was computed for the previous task, but we strip away the cuisine labels associated with the recipes. We then cluster our data using the k-means clustering algorithm, which uses Euclidean distances between the feature vectors of the recipes in our data.
Knowing that there are 26 different cuisine types in our dataset, a naive way to cluster the data is to run k-means using 26 centers. We then do a frequency analysis of every cluster to find the most common ingredients used. Mapping the recipes in each cluster back to our original data, we can also assess the top cuisines and/or fusion-cuisines that can be seen in a cluster of recipes. Subsequently, we try to determine the ideal number of clusters by using the Elbow method (Figure 6).
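The clustering procedure and the Elbow method can be sketched as follows. This is an illustrative, standard-library implementation of plain Lloyd's k-means with Euclidean distance, not the routine used for the reported results; the 2-dimensional toy points, the function names, the iteration cap, and the seed are assumptions made for the example. For each candidate number of clusters it records the within-groups sum of squares, the quantity plotted in an elbow plot.

```python
import random

# Hypothetical toy feature vectors forming three loose groups.
POINTS = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.8, 5.3),
    (9.9, 0.2), (10.1, 0.0), (10.0, 0.4),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's k-means; returns (assignments, centers, within-groups SS)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
    wss = sum(squared_distance(p, centers[a])
              for p, a in zip(points, assignments))
    return assignments, centers, wss


def elbow_curve(points, k_values):
    """Within-groups sum of squares for each candidate number of clusters."""
    return {k: kmeans(points, k)[2] for k in k_values}


curve = elbow_curve(POINTS, range(1, 6))
for k, wss in curve.items():
    print(k, round(wss, 3))
```

The elbow is read off by eye as the value of k after which the curve flattens. Because k-means depends on its random initialization, a fuller version would rerun each k with several seeds and keep the lowest sum of squares.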

Figure 6. Elbow plot to determine the ideal number of clusters for our dataset; the elbow occurs when the number of clusters = 7. (Axes: Number of Clusters, Within groups sum of squares.)

We can notice a slight elbow in the plot, indicating that 7 clusters is a good choice for k-means on our data. Therefore, we run k-means using 7 centers to cluster our data.

Results
Sample clusters are tabulated in Tables 3 and 4. The first run of k-means yielded 26 clusters that were mostly groupings of a very high number of recipes from a single cuisine: […], […], […], etc. Very few clusters had nearly equal distributions of recipes from two cuisines ([…]-[…], etc.). This clustering behavior tells us that most recipes in a single cuisine make heavy use of the same ingredients, perhaps in different combinations and measures. We can also observe that some clusters are very sweet, some savory, and some spicy, telling us that cuisines around the world use the same basic ingredients to create meals and desserts.

Table 3. A sample of 3 clusters from the first run of k-means clustering using 26 centers; note the distinctiveness of the ingredient space.
Cluster 1
  Cuisine distribution: […] (64%), […] (8%), […] (5%)
  Top ingredients: Egg, wheat, butter, milk, […], vanilla, cinnamon, cane molasses, milk fat, lemon juice, cocoa, vegetable oil, nutmeg, ginger, lard
Cluster 2
  Cuisine distribution: […] (44%), […] (14%), Indian (10%)
  Top ingredients: Garlic, ginger, vegetable oil, cayenne, soy sauce, rice, scallion, vinegar, pepper, cilantro, coriander, […], sesame oil, fish
Cluster 3
  Cuisine distribution: […] (26%), […] (26%), South (8%)
  Top ingredients: Cayenne, […], garlic, tomato, cilantro, cumin, olive oil, corn, bell pepper, lime juice, pepper, oregano, vinegar, scallion, beef

The second run of k-means clustering produced 7 clusters. These clusters had more equal distributions of the top two cuisines, which we take as an indication of a possible fusion-cuisine style.
Of the 7 clusters, 4 had distributions like […]-[…], […]-[…], […]-[…], and […]-[…], which is a strong indication of how diverse […] cuisine can be, and how it has been influenced by other major cuisine styles. This is also consistent with our dataset of recipes from Epicurious having an […]-centric view of world cuisines, as these recipes are collected from publications like Bon Appétit. The other 3 clusters had interesting and exotic distributions like […]-[…], […]-Chinese, and Japanese-Mediterranean. These were great examples of how well two distinct cuisines can work together to create appealing fusion-cuisines.

Table 4. A sample of 3 clusters from the second run of k-means clustering using 7 centers.
Cluster 1
  Cuisine distribution: […] (36%), […] (31%)
  Top ingredients: Garlic, olive oil, […], tomato, cayenne, wheat, egg, butter, vegetable oil, cilantro, black pepper, corn, parsley, […], milk
  Sample recipes in cluster:
    [1] "tomato olive_oil lemon cayenne garlic bell_pepper olive"
    [2] "olive_oil wheat cheese corn cayenne oregano"
    [3] "coriander tomato shallot avocado lime_juice garlic"
Cluster 2
  Cuisine distribution: […] (32%), […] (24%)
  Top ingredients: Garlic, olive oil, […], tomato, black pepper, vinegar, parsley, butter, wheat, egg, bread, chicken, basil, pepper, beef, thyme, cheese
  Sample recipes in cluster:
    [1] "olive_oil wheat yeast fish bell_pepper oregano"
    [2] "butter cheese goat_cheese macaroni black_pepper"
    [3] "tomato olive_oil chicken_broth garlic bread"
Cluster 3
  Cuisine distribution: Japanese (43%), Mediterranean (40%)
  Top ingredients: Olive oil, garlic, soy sauce, […], rice, vinegar, vegetable oil, scallion, tomato, wine, ginger, wheat, bell pepper, egg, lemon juice, parsley, fish, cayenne, sake, barley, honey, beef, potato
  Sample recipes in cluster:
    [1] "beef sake soy_sauce scallion chive vegetable_oil wine"
    [2] "kohlrabi mandarin_peel olive_oil pepper sesame_seed potato pea wine"
    [3] "olive_oil vinegar lamb red_wine fennel garlic lemon wheat"
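The cluster inspection behind Tables 3 and 4 (mapping each cluster back to the cuisine labels, then listing cuisine shares and frequent ingredients) can be sketched like this. The toy assignments and recipes, the function names, and in particular the 10-percentage-point threshold used to flag a possible fusion-cuisine are hypothetical choices for illustration; the text above only speaks of "more equal distributions of the top two cuisines" and gives no numeric cut-off.

```python
from collections import Counter

# Hypothetical cluster assignments, cuisine labels, and recipes (toy data).
ASSIGNMENTS = [0, 0, 0, 0, 1, 1, 1, 1, 1]
CUISINES = ["Japanese", "Japanese", "Mediterranean", "Mediterranean",
            "Indian", "Indian", "Indian", "Indian", "Thai"]
RECIPES = [
    ["soy_sauce", "sake", "scallion"], ["soy_sauce", "rice", "olive_oil"],
    ["olive_oil", "garlic", "tomato"], ["olive_oil", "lemon", "garlic"],
    ["cumin", "coriander", "turmeric"], ["cumin", "garlic", "turmeric"],
    ["cumin", "rice", "cayenne"], ["turmeric", "cumin", "ginger"],
    ["fish", "cilantro", "rice"],
]


def cluster_profiles(assignments, cuisines, recipes, top_n=3):
    """For each cluster, report cuisine shares and most frequent ingredients."""
    profiles = {}
    for cluster in sorted(set(assignments)):
        members = [i for i, a in enumerate(assignments) if a == cluster]
        cuisine_counts = Counter(cuisines[i] for i in members)
        ingredient_counts = Counter(
            ing for i in members for ing in recipes[i])
        profiles[cluster] = {
            # (cuisine, share of the cluster), largest share first
            "cuisine_shares": [(name, count / len(members))
                               for name, count in cuisine_counts.most_common()],
            "top_ingredients": [ing for ing, _ in
                                ingredient_counts.most_common(top_n)],
        }
    return profiles


def looks_like_fusion(shares, max_gap=0.10):
    """Assumed heuristic: the top two cuisine shares differ by <= max_gap."""
    if len(shares) < 2:
        return False
    return shares[0][1] - shares[1][1] <= max_gap


profiles = cluster_profiles(ASSIGNMENTS, CUISINES, RECIPES)
for cluster, profile in profiles.items():
    flag = looks_like_fusion(profile["cuisine_shares"])
    print(cluster, profile["cuisine_shares"], profile["top_ingredients"], flag)
```

A threshold like this only flags candidates; judging whether a cluster is a meaningful fusion-cuisine still requires looking at its ingredients and sample recipes, as done in Table 4.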

FUTURE WORK AND APPLICATIONS
We classify recipes by cuisine style using a multinomial logistic regression model. It would be helpful to run this classification experiment with different models to find the best one. Classifying recipes can be used to learn more about the cuisine style of a dish just by looking at the ingredients used, or even by looking at a picture of the dish. Recipe and restaurant suggestions can be made by using user preferences observed over time.

The dataset that we use for this project lists recipes as a set of ingredients accompanied by a cuisine label. This dataset can be enhanced by adding another feature to the recipes: the name of the recipe. Equipped with this additional information, we can suggest recipes based on user input, that is, a set of ingredients. This can be very useful when trying to decide what to cook given a restricted set of ingredients, or even when trying to decide what to eat at a restaurant. Topic modelling can be used to learn generative models of recipe and ingredient distributions, to later produce new recipes to try, or recipes similar to existing ones, so that users can enjoy a new dish that has their favorite ingredients.

CONCLUSION
In this project we use machine learning approaches to classify recipes into distinct cuisines. We also demonstrate initial attempts to understand the recipe-ingredient space and learn more about fusion-cuisines. A multinomial logistic regression classifier is used to classify a recipe, or a list of ingredients, into one of 26 cuisine types, with validation accuracies of about 57% to 59%. This is followed by k-means clustering and analysis to learn more about fusion-cuisines. Results from this approach show some promise: clusters containing nearly even distributions of two different cuisines inform us about possible fusion-cuisine styles. Future work on this subject can show whether this can be observed more clearly with larger datasets.

REFERENCES
1. Y.-Y. Ahn, S. E. Ahnert, J. P. Bagrow, and A.-L. Barabási. Flavor network and the principles of food pairing. Scientific Reports, 1(196), 2011. http://www.nature.com/articles/srep00196
2. http://www.epicurious.com/
3. https://cran.r-project.org/web/packages/tm/tm.pdf
4. L. J. P. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, Nov 2008. http://lvdmaaten.github.io/tsne/
5. https://cran.r-project.org/web/packages/maxent/maxent.pdf