Modeling Wine Quality Using Classification and Regression. Mario Wijaya MGT 8803 November 28, 2017


Motivation (slide 1)
Quality
  - How to assess it? What makes a good-quality wine?
  - Good or bad wine? Subjective? (wine taster)
Who cares?
  - Consumers
  - Wine industry
Data science: regression and classification
Goal
  - Predict the quality of a given wine
  - Classify whether a wine is good or bad

Dataset (slide 2)
Consists of
  - White wine: 4898 samples
  - Red wine: 1599 samples
Variables
  - Fixed acidity, volatile acidity, etc.
  - Quality
  - GoodBad: quality > 5 is class 1, quality <= 5 is class 0
Potential problems
  - Class imbalance
  - Bias
  - High variance
Solutions
  - Oversample the underrepresented class
  - Downsample the overrepresented class
  - Upweight underrepresented classes in the loss function
  - Normalize the data for classification and regression (SGD)
Source of dataset: UCI (https://archive.ics.uci.edu/ml/datasets/wine+quality)
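The labeling rule and the "upweight underrepresented classes" idea on this slide can be sketched in a few lines. This is an illustration only: the quality scores below are made up, the helper names are hypothetical, and the weighting formula is the common "balanced" heuristic (n_samples / (n_classes * class_count)), which the slides do not confirm was the one used.

```python
from collections import Counter


def to_good_bad(quality):
    """Slide rule: quality > 5 is class 1 (good), quality <= 5 is class 0 (bad)."""
    return 1 if quality > 5 else 0


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rarer classes get larger weights, so their errors count more in a
    weighted loss function.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical quality scores, not taken from the UCI files.
scores = [3, 5, 5, 6, 6, 6, 7, 8]
labels = [to_good_bad(q) for q in scores]
weights = balanced_class_weights(labels)

print(labels)   # class labels derived from the scores
print(weights)  # the minority class (0) gets the larger weight
```

In scikit-learn, many classifiers accept `class_weight="balanced"`, which applies the same heuristic without computing the weights by hand.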

General Strategy (slide 3)
  - Train: 10-fold cross-validation
  - Validation: tune parameters and find the optimal parameters
  - Test: prediction accuracy or R^2
Tools: Python 3 with the scikit-learn package; Matplotlib and Seaborn (plots and visualization)
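A minimal sketch of this train / validate / test flow in scikit-learn follows. The data is synthetic (`make_classification` stands in for the 11 wine variables), and the estimator, the parameter grid, and the 80/20 split are assumptions chosen for illustration, not values from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data: 11 features, binary good/bad label.
X, y = make_classification(n_samples=400, n_features=11, n_informative=5,
                           random_state=0)

# Hold out a test set that is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 10-fold cross-validation on the training set selects the parameters.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8]},  # assumed grid
    cv=10,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
# GridSearchCV refits the best model on all training data by default,
# so score() reports its accuracy on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(best_params, round(test_accuracy, 3))
```

For the regression models, the same pattern applies with `scoring="r2"`.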

Models & Challenges (slide 4)
Regression
  - Multiple linear regression
  - Stochastic gradient descent
  - Ridge
  - Lasso
  - Decision tree
Classification
  - SVM
  - K-nearest neighbors
  - Decision tree classification
Challenges
  - Used PCA for dimension reduction: 11 variables mapped to 2 dimensions
  - Find optimal parameters (SVM: C, gamma, etc.)
  - Find a model that generalizes
  - Prevent overfitting: K-fold cross-validation
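The PCA step (11 variables mapped to 2 dimensions) might look like the sketch below. The input is random synthetic data, and standardizing before PCA is an assumption made here so that no variable dominates because of its units; the slides do not say how the data was scaled for PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for 200 wines described by 11 physicochemical variables.
X = rng.normal(size=(200, 11))

# Put every variable on a comparable scale before projecting.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                           # (n_samples, 2)
print(pca.explained_variance_ratio_.sum())  # share of variance kept by 2 components
```

The two columns of `X_2d` can then be used as the x and y axes of a scatter plot, for example to draw a classifier's decision regions.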

Quick Lecture (slide 5)
Regression
  - Ridge: L2 penalty
  - Lasso: L1 penalty
  - Decision tree
Classification
  - KNN
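For reference, the two penalties can be written out as follows. This is the standard textbook form rather than anything shown on the slide; the symbols (lambda for the penalty strength, x_i for the feature vector of sample i, p for the number of features) are notation introduced here, and libraries such as scikit-learn scale the terms slightly differently and call the strength `alpha`.

```latex
% Ordinary least squares
\hat{\beta}_{\text{OLS}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2

% Ridge: adds an L2 penalty, shrinking coefficients toward zero
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}

% Lasso: adds an L1 penalty, which can set some coefficients exactly to zero
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \left| \beta_j \right|
```

With lambda = 0 both reduce to ordinary least squares; larger lambda means stronger shrinkage.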

Correlation Matrix (slide 6)
  - Look for pairs of highly correlated features
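One way to scan a correlation matrix for highly correlated feature pairs is sketched below with NumPy. The data is synthetic, the column names are placeholders, and the 0.8 cutoff is an arbitrary assumption; the slides only show the matrix as a plot.

```python
import numpy as np


def high_correlation_pairs(X, names, threshold=0.8):
    """Return (name_i, name_j, r) for every column pair with |r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)  # nearly a copy of a
c = rng.normal(size=500)            # independent of both
X = np.column_stack([a, b, c])

pairs = high_correlation_pairs(X, ["a", "b", "c"])
print(pairs)
```

With a pandas DataFrame, `df.corr()` gives the same matrix, and Seaborn's `heatmap` can plot it.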

Regression Results (slide 7)
Correlation matrix: look for pairs of highly correlated features
Multiple linear regression
  - Y = X1*beta1 + X2*beta2 + ... + Xn*betaN + E
  - R^2 = 0.325: pretty bad!
SGD: R^2 = 0.323
Lasso and Ridge: equally bad
Interaction terms, with high p-value terms removed: bad
Forward selection: not good either
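A comparison like the one on this slide could be run roughly as follows. The data is synthetic (`make_regression`), so the scores will not match the 0.325 and 0.323 reported above, and the `alpha` values and SGD settings are assumptions made for the sketch.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge, SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 11 features, noisy linear target.
X, y = make_regression(n_samples=500, n_features=11, n_informative=4,
                       noise=50.0, random_state=0)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),   # L2 penalty, assumed strength
    "lasso": Lasso(alpha=1.0),   # L1 penalty, assumed strength
    "sgd": SGDRegressor(max_iter=2000, tol=1e-4, random_state=0),
}

r2 = {}
for name, model in models.items():
    # Scaling inside the pipeline keeps each CV fold free of leakage;
    # SGD in particular needs normalized inputs to converge well.
    pipe = make_pipeline(StandardScaler(), model)
    r2[name] = cross_val_score(pipe, X, y, cv=10, scoring="r2").mean()

for name, score in r2.items():
    print(f"{name}: mean 10-fold R^2 = {score:.3f}")
```

When all four linear models land on nearly the same R^2, as reported on the slide, the limit is more likely the linear form or the features than the choice of fitting method.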


Classification - SVM (slide 9)
Figures: prediction accuracy with the RBF kernel (varying C and gamma); prediction accuracy with the linear kernel (varying C)
  - Normalize the data to (0, 1)
  - Vary the parameters C and gamma
  - 10-fold cross-validation
  - Find the model that gives the lowest error rate, i.e. the highest accuracy
  - ~83% prediction accuracy, but the linear kernel is clearly better in this case, judging from the support vectors drawn
  - How do you draw 11 dimensions in 2 dimensions? PCA
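The SVM search described here (normalize to (0, 1), vary C and gamma, 10-fold cross-validation) could be set up along these lines. The data is synthetic and the grid values are assumptions; the slides do not list the C and gamma values that were tried.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine data: 11 features, binary label.
X, y = make_classification(n_samples=300, n_features=11, n_informative=5,
                           random_state=0)

# MinMaxScaler maps each feature to the (0, 1) range, fitted per CV fold.
pipe = Pipeline([("scale", MinMaxScaler()), ("svm", SVC())])

# The linear kernel has only C; the RBF kernel has C and gamma.
param_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.1, 1, 10]},
    {"svm__kernel": ["rbf"], "svm__C": [0.1, 1, 10],
     "svm__gamma": [0.01, 0.1, 1]},
]

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X, y)

best_kernel = search.best_params_["svm__kernel"]
best_cv_accuracy = search.best_score_
print(search.best_params_, round(best_cv_accuracy, 3))
```

`search.cv_results_` holds the mean accuracy for every parameter combination, which is what an accuracy-versus-C (and gamma) plot like the ones on the slide would be drawn from.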

Classification - KNN (slide 10)
  - Ad-hoc rule of thumb: K = sqrt(# of samples), reported as ~99
  - Use 10-fold CV to determine the error rate, and use it to find the best K
  - K = 40 to K = 100: not much difference
  - Higher K gives smoother decision boundaries
  - Relatively good for classification
  - Overfits easily, so be careful!
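Choosing K by cross-validated error rate can be sketched as below. The data is synthetic, and the candidate K values and the (0, 1) scaling step are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the wine data: 11 features, binary label.
X, y = make_classification(n_samples=400, n_features=11, n_informative=5,
                           random_state=0)

candidate_ks = [1, 5, 10, 20, 40]  # assumed candidates
error_rate = {}
for k in candidate_ks:
    # KNN is distance based, so features are scaled to a common range first.
    model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    accuracy = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    error_rate[k] = 1.0 - accuracy

# Pick the K with the lowest mean 10-fold error rate.
best_k = min(error_rate, key=error_rate.get)
print(error_rate)
print("best K:", best_k)
```

Very small K (for example K = 1) tends to memorize the training data, which is the overfitting risk the slide warns about; larger K averages over more neighbors and gives smoother boundaries.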

Classification - Decision Tree (slide 11)
  - Recursively find the label
  - Used the Gini index for splitting
  - Other methods: information gain (entropy)
  - 88% prediction accuracy
  - Also tried with testing data
  - Need to set the depth, otherwise the tree will overfit
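The two splitting criteria named on this slide can be computed by hand, as in the sketch below. The labels and the candidate split are made up; a tree learner evaluates many such splits and keeps the one with the lowest weighted impurity.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index: 1 - sum of squared class proportions (0 for a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits; information gain is the drop in entropy."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Hypothetical node with 4 bad (0) and 4 good (1) wines, and one candidate split.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

gini_gain = gini(parent) - weighted_impurity(left, right, gini)
info_gain = entropy(parent) - weighted_impurity(left, right, entropy)
print(gini_gain, info_gain)
```

In scikit-learn, `DecisionTreeClassifier(criterion="gini", max_depth=...)` uses the first criterion, `criterion="entropy"` the second, and `max_depth` is the depth limit that guards against overfitting.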

Conclusion & Discussion (slide 12)
Conclusion
  - Several classification algorithms work well with the dataset
  - Regression performs badly; more work may be needed to determine which features to keep
  - Data science can counter the subjectivity of a wine taster's verdict when answering the quality question
Discussion
  - If a good regression model can be found, a Python-based application can be built for interactivity
  - Need to understand the dataset well and find optimal parameters

Modeling Wine Quality (summary)
Ran several regression algorithms:
  - Ordinary least squares (linear regression)
  - Ridge
  - Lasso
  - Stochastic gradient descent
  - Forward selection
  - Decision tree
Created several classification models to predict whether the quality of a given wine is good or bad:
  - K-nearest neighbors
  - SVM
  - Decision tree classification
Used PCA for dimensionality reduction
Mario Wijaya