Predicting Wine Quality

March 8, 2016
Ilker Karakasoglu

Problem description: You have been retained as a statistical consultant for a wine co-operative, and have been asked to analyze these data. Each row represents data on a particular Portuguese wine, and the columns are attributes. The last column is the response, quality, which is a quantitative (integer) score between 0 (very bad) and 10 (excellent) assigned by wine experts (in our data no wine scored lower than 3 or higher than 9). Your clients are interested in predicting the quality score based on the attributes. They would also like to get some sense of which attributes are more important for this task, and their role in the prediction procedure. The file wine.test.ho.csv consists of 1300 wines where the quality score is omitted. Use your model to predict the quality score for each of these wines.

Solution: We first visualize the data to get a better understanding of it. Below is a pairplot in which all variables and the output, quality, are plotted against each other. Red wines are colored green and white wines blue. We observe two things. First, white and red wines do not have exactly the same attribute distributions. For many variables there are clear distinctions between them, and the two groups even look separable. Therefore, it may make more sense to predict wine quality separately for reds and whites. Second, there are strong correlations among certain variables. Some of these correlated variables can be left out.

To see the correlations more clearly, a heatmap of correlations is shown below. Warm colors indicate a positive correlation, while cold colors indicate a negative one. In the ideal case of all variables being uncorrelated with each other, we would get a white map. Here, we observe correlations of varying strength among the variables. For instance, there is a very strong positive correlation between total sulfur dioxide and free sulfur dioxide. Similarly, density is positively correlated with fixed acidity and residual sugar. On the other hand, it has a strong negative correlation with alcohol. All these correlations make intuitive sense. Strong correlations mean that the correlated variables should be handled carefully in a learning model. Possibly, some of them can be dropped, depending on which ones carry the highest importance.

Next, we visualize the distribution of the outcome variable, wine quality. We see a strong concentration on the average scores, 5 and 6. This unbalanced distribution is a challenge for a learning model, since most predictions would center around 5 and 6, making other scores harder to predict.
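The correlation screening described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic stand-ins built to mimic the reported signs (density falling with alcohol, rising with residual sugar), and the helper `strongly_correlated_pairs` and its 0.5 threshold are hypothetical choices, not part of the original analysis.

```python
import numpy as np

# Synthetic stand-in data (assumption): these columns only mimic the
# relationships described in the text and are NOT the wine measurements.
rng = np.random.default_rng(0)
n = 500
alcohol = rng.normal(10.5, 1.2, n)
residual_sugar = rng.normal(5.0, 2.0, n)
# Density is constructed to fall with alcohol and rise with sugar.
density = (1.0 - 0.002 * alcohol + 0.0004 * residual_sugar
           + rng.normal(0.0, 0.0005, n))

names = ["alcohol", "residual_sugar", "density"]
X = np.column_stack([alcohol, residual_sugar, density])

# Pearson correlation matrix; rowvar=False treats columns as variables.
corr = np.corrcoef(X, rowvar=False)


def strongly_correlated_pairs(corr, names, threshold=0.5):
    """Return (name_i, name_j, r) for every pair with |r| >= threshold.

    Hypothetical helper; the threshold is an illustrative choice.
    """
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


pairs = strongly_correlated_pairs(corr, names)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Pairs flagged this way are candidates for closer inspection, not automatic removal; which member of a pair to drop would still depend on the importance analysis.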

Having seen the basic properties of the data, we continue by developing a learning model for quality prediction. We prefer regression to classification because quality is inherently ordered. After regressing, we round the predicted scores to the nearest integer. We split the training set into a smaller training set and a validation set, and use the validation set to estimate our performance on the real test set.

The algorithm we choose is Support Vector Machines for Regression (SVR). SVR has four parameters to choose. The first is the kernel. We use a radial kernel (rbf). The kernel trick is a powerful method that implicitly maps the input data into a higher-dimensional space without increasing the computational cost. We prefer the radial kernel over the other popular choice, the linear kernel, after seeing that the radial kernel performs better on this data set.

The other three parameters are gamma, epsilon and C. Each affects the bias-variance tradeoff. Higher values of gamma make the radial kernel more localized: the kernel does not extend far across the data points, but instead only covers the neighborhood of the given observation. The higher the gamma, the less bias but the more variance we get. Epsilon sets the width of the epsilon-insensitive tube: predictions that deviate from the observed training targets by at most epsilon are not penalized. The higher the epsilon, the less variance we get. C is the cost parameter, which is positive and controls the tradeoff between model complexity and the degree to which deviations greater than epsilon are tolerated. Unlike epsilon, a higher C penalizes such deviations more heavily, which reduces bias but increases variance.
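To make the roles of gamma and epsilon concrete, the following is a minimal NumPy sketch of the radial kernel, the epsilon-insensitive loss, and the rounding step. The function names are hypothetical, and clipping the rounded scores to the 0 to 10 range is an added assumption based on the score scale in the problem description.

```python
import numpy as np


def rbf_kernel(x, z, gamma):
    """Radial kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the excess outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float)
                      - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)


def to_quality_score(y_pred, low=0, high=10):
    """Round to the nearest integer, then clip to [low, high] (assumption)."""
    return np.clip(np.rint(y_pred), low, high).astype(int)


x, z = [0.0, 0.0], [1.0, 1.0]
# The similarity between two fixed points shrinks as gamma grows,
# i.e. each observation influences a smaller neighborhood.
for gamma in (0.01, 0.1, 1.0):
    print(f"gamma={gamma}: k(x, z) = {rbf_kernel(x, z, gamma):.3f}")

losses = epsilon_insensitive_loss([5, 6, 7], [5.05, 6.5, 5.0], epsilon=0.1)
scores = to_quality_score([4.4, 5.6, 11.2, -0.3])
```

In this sketch the first residual (0.05) lies inside the tube and costs nothing, while the other two are charged only for the part that exceeds epsilon.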

The radial kernel is nonlinear and flexible, so it may give rise to overfitting. To avoid this problem, the other parameters (gamma, epsilon and C) must be chosen carefully. We do this using cross validation. Since there are 3 parameters, we search over a grid to choose the best tuple. We employ scikit-learn's GridSearchCV function, which does an exhaustive search to find the tuple giving the highest cross validation score. 5-fold cross validation is used. We test the values C: [0.1, 1, 3, 10], gamma: [0.001, 0.01, 0.1, 1], epsilon: [0.01, 0.1, 1]. The tuple giving the highest cross validation score is {epsilon: 0.1, C: 10, gamma: 0.01}.

We run 3 different regressions: for reds, for whites and for the two combined. CV scores indicate that separate regressions for reds and whites give better results. Still, the performance on the combined set is close. In the combined regression we use the color of the wine as a dummy variable. We obtain a validation set score of RMS = 0.71.

Next we assess the variable importance. To do that, scikit-learn's built-in feature_importances_ attribute is used, which ranks the variables in terms of their significance. The relative importance of a variable is assessed by the variance it produces in the data. We plot 3 importance charts for the 3 regressions. In the combined one we see that the color is not significant at all. This explains why the combined dataset performs so close to reds and whites alone. Alcohol and sulphates are the most important variables. Density and pH are not very important, and they were previously found to be correlated with other variables. Therefore, they are good candidates to be dropped from the feature space.

A helpful analysis would be the confusion matrix rather than the RMS to assess the model (although we have not implemented it yet). As previously shown on the histogram, the outcome variable is heavily concentrated around 5 and 6. Therefore, it is likely that scores other than 5 and 6 will have lower recall and precision rates. F1 scores and ROC curves could be helpful to summarize the precision and recall considerations.
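The tuning and evaluation steps described above might look roughly like the sketch below. It runs on synthetic stand-in data rather than the wine files, and several details are assumptions not stated in the report: the 75/25 split, the random seeds, GridSearchCV's default scoring (R² for SVR), and the clipping of rounded predictions to 0 to 10. Only the parameter grid and the 5-fold setting are taken from the text.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 4 standardized features and an
# integer "quality" concentrated around 5 and 6. NOT the wine data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y_cont = 5.5 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 0.5, 300)
y = np.clip(np.rint(y_cont), 3, 9).astype(int)

# Hold out a validation set (the 75/25 ratio is an illustrative choice).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Grid values as listed in the text; 5-fold cross validation.
param_grid = {
    "C": [0.1, 1, 3, 10],
    "gamma": [0.001, 0.01, 0.1, 1],
    "epsilon": [0.01, 0.1, 1],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print("best tuple:", search.best_params_)

# Round predictions to integer scores before evaluating.
y_pred = np.clip(np.rint(search.predict(X_val)), 0, 10).astype(int)
rmse = float(np.sqrt(np.mean((y_val - y_pred) ** 2)))

# Confusion matrix over all possible scores, with per-score recall.
labels = np.arange(0, 11)
cm = confusion_matrix(y_val, y_pred, labels=labels)
row_totals = cm.sum(axis=1)
recall = np.divide(np.diag(cm), row_totals,
                   out=np.zeros(len(labels)), where=row_totals > 0)
for score, r, total in zip(labels, recall, row_totals):
    if total > 0:
        print(f"score {score}: recall = {r:.2f} ({total} wines)")
```

Reading the per-score recall alongside the single RMS figure shows whether the rarer scores are being predicted at all, which is the concern raised about the concentration around 5 and 6.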