Wine Rating Prediction

Ke Xu (kexu@), Xixi Wang (xixiwang@)

Abstract — In this project, we predict the rating points of wines based on historical reviews from experts. The wine data is scraped from WineEnthusiast [1]; we use price, wine variety, and several winery-location-related attributes as training features, and output a predicted rating for each wine. Since the desired output is a real-valued rating, we focus on exploring a variety of linear regression models, and also explore one neural network model. Our best model achieves a mean square error (MSE) of 5.711 on the testing data.

I. INTRODUCTION

The overall goal is to build models that predict the rating of a wine on a scale of 1-100. The initial idea was to provide personalized wine recommendations based on historical expert reviews, similar to Winc [2], but empowering users with the freedom to choose among recommendations rather than blindly trusting Winc to send them the choices it provides. However, no personal review dataset is available; only public ratings are accessible via websites like WineEnthusiast [1]. We therefore treated the website's group of experts as one person and simplified the problem to predicting the experts' rating.

The input to our models is {price, variety, {winery, country, region}} and the label is the rating points. We then use a variety of linear regression models and one type of neural network model to output a predicted rating on a scale of 1-100.

The rest of the paper is organized as follows: Section II describes related work. Section III describes the dataset and the features used for prediction. Section IV describes the models applied to the dataset. Section V discusses the results of using the models for prediction. Section VI summarizes the insights gained from this project and future work.

II. RELATED WORK

We surveyed the existing related work. From the input-data perspective, [4] and [5] use chemical attributes as features. [6], [7] and [8] use statistics derived from review text, such as the number of reviews, review time, and number of adjectives. [6] also uses metadata such as the age of the wine and its variety. While these are certainly good input features for predicting wine ratings, the underlying assumption is that tasting experience dominates the wine rating. We would argue that other aspects, such as winery and price, can change one's expectation of a wine and thus affect its rating.

From the model perspective, [4] and [5] treat this as a classification problem, while [6], [7] and [8] take a regression approach. Besides various versions of linear and logistic regression, other methods such as Support Vector Machines, Linear Discriminant Analysis [5], and Random Forests [6] are also used.

We started by converting this problem into a classification problem, dividing the 1-100 point scale into 4 categories as shown in Table I. We tried a logistic regression model to predict the rating category, but the model did not perform well: many data points lie near the category boundaries, so predictions are easily off by one category. We therefore concluded that regression is a better way to tackle this problem. We went with the commonly used linear regression models and later tried a neural network, which the existing work had not explored.
TABLE I
CLASSIFICATION BUCKETS

  Original rating points | Point bucket
  95-100                 | 4
  90-94                  | 3
  85-89                  | 2
  below 85               | 1

Compared with the results from the earlier efforts with linear regression models, our models perform better, showing that the features we used, such as winery location and wine price, are critical factors in determining wine ratings. Exploring other models, such as Support Vector Machines and Linear Discriminant Analysis, is left as future work.
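For concreteness, the Table I bucketing corresponds to a mapping like the following (our own Python sketch; the authors' code for this abandoned classification experiment is not published):

```python
def to_bucket(points: int) -> int:
    """Map a 1-100 rating to the four classification buckets of Table I."""
    if points >= 95:
        return 4
    if points >= 90:
        return 3
    if points >= 85:
        return 2
    return 1
```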

III. DATASET AND FEATURES

A. Source Data

Our source data consists of 150,000 wine review data points scraped from WineEnthusiast [1]; the dataset is available on Kaggle [3]. The original columns we used are:

- Price: the cost of a bottle of the wine.
- Variety: the type of grapes used to make the wine (e.g., Pinot Noir).
- Winery: the winery where the wine was produced.
- Country: the country the wine is from.
- Province: the province or state the wine is from.
- Region: the wine-growing area within a province or state (e.g., Napa).
- Points: rating points on a 1-100 scale.

Table II shows examples of the original data. Fig. 1 shows the distribution of rating points; for the dataset we use, points always lie in the range 80-100.

TABLE II
SOURCE DATA EXAMPLE

  country | province   | region_1       | winery                    | variety            | price | points
  US      | California | Napa Valley    | Heitz                     | Cabernet Sauvignon | 235.0 | 96
  US      | California | Knights Valley | Macauley                  | Sauvignon Blanc    | 90.0  | 96
  France  | Burgundy   | Chablis        | Domaine Gérard Duplessis  | Chardonnay         | 45.0  | 91

[Fig. 1: Distribution of rating points]

TABLE III
REVIEW POINT STATISTICS

  Metric             | Processed | Training | Testing
  Max                | 100       | 100      | 100
  Min                | 80        | 80       | 80
  Mean               | 88.20     | 88.21    | 88.16
  Median             | 88        | 88       | 88
  Standard deviation | 3.29      | 3.29     | 3.29

Table III shows statistics of the points in the processed, training, and testing datasets; all three have very similar statistics.

B. Feature Engineering & Data Processing

There are several location-related columns in the original data. Treating them as separate input features and letting the regression model figure out their relationship would be inefficient and unnecessary, so we combined country, province, and region to produce a new feature: location.

To train linear models, we preprocessed the input by converting the string-format features into categorical features, using one-hot encoding [9]. To improve data quality, we removed duplicate data points and filtered out data points with any empty feature. Finally, to ensure that we have enough training data for a given value of a feature, we filtered out rarely seen values, defined as values with fewer than 10 occurrences in the whole dataset. After these steps, roughly 30,000 data points remain; as shown in Fig. 1, their distribution of rating points is similar to the original data. We then used 70% of the data as the training set and 30% as the testing set. The label used for training is the original rating points from the experts, ranging from 80 to 100. A sketch of this pipeline follows.
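A minimal sketch of the preprocessing in Python with pandas and scikit-learn; the file and column names follow the Kaggle dataset [3], and the specific calls are our assumption rather than the authors' published code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("winemag-data_first150k.csv")  # Kaggle wine-reviews dump [3]

# Combine the location-related columns into a single feature (Section III-B).
df["location"] = df["country"] + "/" + df["province"] + "/" + df["region_1"]

df = df[["price", "variety", "winery", "location", "points"]]
df = df.drop_duplicates().dropna()

# Drop rarely seen values: fewer than 10 occurrences in the whole dataset.
for col in ["variety", "winery", "location"]:
    counts = df[col].value_counts()
    df = df[df[col].isin(counts[counts >= 10].index)]

# One-hot encode the string features [9]; price stays numeric.
X = pd.get_dummies(df.drop(columns="points"),
                   columns=["variety", "winery", "location"])
y = df["points"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
```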

IV. METHODS

A. Linear regression

We began with the basic linear regression approach introduced in class,

$\hat{y} = X\theta$,

where $X$ is the matrix of input features with an intercept term, $\theta$ is the vector of weights associated with each feature, and $\hat{y}$ is the vector of predicted ratings. However, without any regularization, the model had a strong tendency to overfit: it produced about 1,000 outliers whose predicted values were either extremely large or extremely small. Exploring the reason behind this, we observed that the learned coefficients were quite large, so we added regularization to the model. We tried three regularization techniques:

- Lasso (L1): $\min_\theta \|y - X\theta\|_2^2 + \lambda_1\|\theta\|_1$. We tried $\lambda_1 \in \{0.1, 1, 10, 100\}$; the best result comes with $\lambda_1 = 1$.
- Ridge (L2): $\min_\theta \|y - X\theta\|_2^2 + \lambda_2\|\theta\|_2^2$. We used the Cholesky solver for Ridge, which obtains the closed-form solution. We tried $\lambda_2 \in \{0.1, 1, 10, 100\}$; the best result comes with $\lambda_2 = 1$.
- Elastic Net (L1 and L2): $\min_\theta \|y - X\theta\|_2^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2$. The best result comes with $\lambda_1 = 0.5$ and $\lambda_2 = 0.25$.

Neither the Lasso model nor the Elastic Net model performed well on our dataset; the Ridge regression model provided the most reliable predictions while avoiding overfitting. Detailed evaluation results are shown in Table IV.

B. Neural Network

Even though the Ridge regression model gave us the best result so far, one of our assumptions is that the correlation between our input features and the rating is often not linear. To explore other potentially well-performing models, we tried a Multi-Layer Perceptron (MLP), a class of feedforward artificial neural network. The MLP is sensitive to feature scaling, so we performed extra data processing, normalizing the price values into [0, 1] to be compatible with the other, categorical feature values. We tuned the model over the following parameter settings:

- Activation functions: logistic, the logistic sigmoid function $f(x) = 1/(1 + e^{-x})$; and relu, the rectified linear unit $f(x) = \max(0, x)$.
- Solvers for weight optimization: L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno), an optimizer in the family of quasi-Newton methods; like the original BFGS, it uses an estimate of the inverse Hessian matrix to steer its search through variable space, but where BFGS stores a dense $n \times n$ approximation to the inverse Hessian ($n$ being the number of variables), L-BFGS stores only a few vectors that represent the approximation implicitly. SGD (stochastic gradient descent), which performs a weight update for each training example $x_i$ and label $y_i$. Adam (Adaptive Moment Estimation), an optimization algorithm that can be used instead of classical SGD to update weights iteratively from the training data; classical SGD maintains a single learning rate for all weight updates that does not change during training, whereas Adam maintains a learning rate for each weight and adapts it separately as learning unfolds.
- Neurons per hidden layer: 30 / 50 / 100 / 200 / 500.
- Hidden layers: 1 / 2 / 3 / 5.
- Max iterations: 100 / 200 / 300 / 500.

The best-performing network consists of two fully connected hidden layers with 100 neurons each, with ReLU as the activation function and L-BFGS optimizing the squared loss for at most 200 iterations. A sketch of these models follows.
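The models could be instantiated with scikit-learn as follows, using the hyperparameters reported above (the library choice and exact calls are our assumption, not the authors' published code):

```python
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.neural_network import MLPRegressor

# X_train, y_train come from the preprocessing sketch above. For the MLP, the
# price column would additionally be rescaled into [0, 1] (e.g. MinMaxScaler).
models = {
    "basic": LinearRegression(),
    "lasso": Lasso(alpha=1.0),                     # best lambda_1 = 1
    "ridge": Ridge(alpha=1.0, solver="cholesky"),  # closed-form solution
    # scikit-learn parameterizes Elastic Net via alpha/l1_ratio, which does
    # not map one-to-one onto the lambda_1/lambda_2 formulation above.
    "elastic_net": ElasticNet(alpha=1.0, l1_ratio=0.5),
    "mlp": MLPRegressor(hidden_layer_sizes=(100, 100), activation="relu",
                        solver="lbfgs", max_iter=200),
}
for model in models.values():
    model.fit(X_train, y_train)
```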
V. RESULTS & DISCUSSION

A. Visualizing Labels vs. Predictions

Visualizing labels vs. predictions on the test data shows whether there are outliers, whether we are under- or overestimating, and how far the predictions are from the ground truth in general. In Figs. 2, 3, 4 and 5 we plot the test dataset with the label on the x axis and the prediction on the y axis; the red line shows where a perfect prediction would lie.

[Fig. 2: Basic linear regression performance]
[Fig. 3: Linear regression with Ridge performance]
[Fig. 4: Linear regression with Lasso performance]
[Fig. 5: Neural network performance]

As shown in Fig. 2, without regularization we get outliers with huge predicted values, so large that the perfect-prediction line looks almost flat on the chart; this also explains the large errors in Table IV. Fig. 3 shows the result after adding Ridge regularization and Fig. 4 the result with Lasso: both clearly show that regularization works much better. There are no longer huge outliers, although a few predictions exceed 100. Comparing the two, Ridge produces fewer and smaller outliers and predicts better in the lower range (labels below 85), whereas Lasso consistently overestimates in the lower range and is more loosely gathered in the higher range (above 95). Fig. 5 shows the result of the neural network; it has a pattern similar to Ridge, with a few outliers around 95.
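The plots in Figs. 2-5 can be reproduced as a scatter of labels against predictions plus the $y = x$ line; a minimal matplotlib sketch, reusing names from the earlier sketches:

```python
import matplotlib.pyplot as plt

# Label-vs-prediction scatter for one model (the layout of Figs. 2-5).
y_pred = models["ridge"].predict(X_test)
plt.scatter(y_test, y_pred, s=5, alpha=0.3)
plt.plot([80, 100], [80, 100], color="red")  # perfect prediction: y = x
plt.xlabel("label (expert rating points)")
plt.ylabel("predicted rating")
plt.show()
```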

B. Quantifying Quality by Metrics

We use three metrics to evaluate and compare model performance: the $R^2$ score, mean square error (MSE), and median absolute error (MAE).

$R^2$ is a statistical measure of how close the data are to the fitted regression line. It is calculated as

$R^2 = 1 - \frac{u}{v}$, where $u = \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2$ and $v = \sum_{i=1}^{m} \left(y^{(i)} - \frac{1}{m}\sum_{j=1}^{m} y^{(j)}\right)^2$,

and $\hat{y}^{(i)}$ is the value predicted by the model. MSE is the sum, over all data points, of the squared difference between the predicted and actual label values, divided by the number of data points. MAE is the median of the absolute differences between the predicted and actual label values.

From these metrics, listed in Table IV, the best-performing model is the linear regression with Ridge regularization, followed very closely by the neural network. Several observations and the corresponding conclusions are listed below.
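These three metrics correspond to scikit-learn's r2_score, mean_squared_error, and median_absolute_error; a short evaluation loop, reusing names from the earlier sketches:

```python
from sklearn.metrics import r2_score, mean_squared_error, median_absolute_error

# models, X_test, y_test come from the earlier sketches.
for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"{name}: R2={r2_score(y_test, y_pred):.3f}  "
          f"MSE={mean_squared_error(y_test, y_pred):.3f}  "
          f"MAE={median_absolute_error(y_test, y_pred):.3f}")
```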

TABLE IV
EVALUATION RESULTS FOR ALL MODELS

                                       Training               Testing
  Model                            R^2    MSE    MAE      R^2        MSE       MAE
  Basic linear regression         0.537  5.035  1.468  -1.38E+17  1.48E+18   1.602
  Linear regression w/ Lasso      0.225  8.428  2.086   0.239     8.163      2.044
  Linear regression w/ Ridge      0.535  5.060  1.481   0.468     5.711      1.593
  Linear regression w/ Elastic Net  0.225  8.427  2.078   0.240     8.159      2.046
  Neural network                  0.522  5.144  1.488   0.466     5.860      1.610

- The training error is only slightly better than the testing error for the two best-performing models, Ridge regression and the neural network. This indicates that these models do not overfit, thanks to the selected regularization mechanisms and the feature-engineering work.
- The error on the training data shows that wine ratings cannot be perfectly predicted from price, variety, winery, and location alone. Combining our features with data used in related work, such as chemical attributes, age of the wine, and review statistics, may therefore give better results.

VI. CONCLUSION & FUTURE WORK

We explored several different models and tuned different parameters for each, with one common observation: for all models, the performance on both the training and testing datasets is not as good as expected. This indicates that the review points cannot be perfectly predicted from the current features. In the future, to improve the quality of the results, we would like to:

- add features such as acidity, alcohol by volume, the age of the wine, and the reviewers to the input set;
- explore other models, such as Support Vector Machines and Random Forests;
- investigate neural network parameter tuning further to obtain better results.

VII. CONTRIBUTIONS

We both worked on all parts of the project.

REFERENCES

[1] WineEnthusiast Ratings. http://www.winemag.com/?s=&drink_type=wine
[2] Winc. https://www.winc.com/member-benefits
[3] Kaggle wine review dataset. https://www.kaggle.com/zynicide/wine-reviews
[4] Amelia Lemionet, Yi Liu, Zhenxiang Zhou. Predicting Quality of Wine Based on Chemical Attributes. CS 229 project, 2015. http://cs229.stanford.edu/proj2015/245_report.pdf
[5] Eric Sebastian Soto. Using Chemical Data to Predict Wine Ratings. CS 229 project, 2012. http://cs229.stanford.edu/proj2012/Soto-UsingChemicalDatatoPredictWineRatings.pdf
[6] Fan Chao, Pengbo Li, Renxiang Yan. Predicting Review Rating for Wine Recommendation. https://cseweb.ucsd.edu/~jmcauley/cse190/reports/fa15/020.pdf
[7] Dominic Rossi. Predicting Wine Ratings Using Linear Models. https://cseweb.ucsd.edu/~jmcauley/cse255/reports/wi15/Dominic_Rossi.pdf
[8] Benjamin Braun, Robert Timpe. Text-Based Rating Predictions from Beer and Wine Reviews. https://cseweb.ucsd.edu/~jmcauley/cse255/reports/wi15/Benjamin%20Braun_Robert%20Timpe.pdf
[9] Using Categorical Data with One-Hot Encoding. https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding