Wine-Tasting by Numbers: Using Binary Logistic Regression to Reveal the Preferences of Experts


When you need to understand situations that seem to defy data analysis, you may be able to use techniques such as binary logistic regression. This article details how wine-tasting data and binary logistic regression yielded insight into the factors that were important to a panel of experienced wine-tasters. The analysis illustrates that even factors that seem hard to measure, such as taste preferences, can be assessed with statistics if you choose the right analysis.

In this article, we will take a very unusual look at wine tasting. Although tastes vary from person to person and are probably unique (De gustibus non est disputandum: "In matters of taste, there can be no disputes"), some wines are better than others, and most people would probably recognize a good wine from a bad one.

We are interested in using statistics to understand whether a wine that has, for instance, more sulphates or more chlorides would taste better. Based on that understanding, it could be possible to make a better wine. We will consider several variables, such as acidity, sulphur dioxide, and percentage of alcohol.

We have data from a panel of oenologists who tasted several types of white and red wines and provided a binary assessment of quality, good (1) or poor (0), for each. Here are the variables in our data set:

Variable          Details                                    Units
Type              red or white                               N/A
pH                acidity (below 7) or alkalinity (over 7)   N/A
Density           density                                    grams/cubic centimeter
Sulphates         potassium sulfate                          grams/liter
Alcohol           percentage alcohol                         % volume
Residual sugar    residual sugar                             grams/liter
Chlorides         sodium chloride                            grams/liter
Free SO2          free sulphur dioxide                       milligrams/liter
Total SO2         total sulphur dioxide                      milligrams/liter
Fixed acidity     tartaric acid                              grams/liter
Volatile acidity  acetic acid                                grams/liter
Citric acid       citric acid                                grams/liter

Our goal is to identify which of these many variables have a significant effect on wine quality.

Preliminary Graphical Analysis

Even very simple graphs can provide good indications of which variables might be important, and help us understand the structure of our data set. The bar chart below describes the relationship between type of wine (white or red) and the panel's binary quality responses. The panel tasted more white wines than red, and since there is a larger proportion of 1 ratings for white wines, we can infer that the panel seems to prefer white wines.
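The comparison behind the bar chart can be sketched in a few lines of Python. This is only an illustration: the records below are hypothetical, not the panel's data, and the function name proportion_good is an assumption introduced here.

```python
from collections import Counter

# Hypothetical (wine_type, quality) records; quality is 1 = good, 0 = poor.
# These values are made up for illustration and are not the panel's data.
ratings = [
    ("white", 1), ("white", 1), ("white", 0), ("white", 1),
    ("white", 1), ("white", 0), ("red", 1), ("red", 0),
    ("red", 0), ("red", 1),
]

def proportion_good(records):
    """Return {wine_type: share of wines rated good (1)}."""
    totals = Counter(wine_type for wine_type, _ in records)
    good = Counter(wine_type for wine_type, quality in records if quality == 1)
    return {wine_type: good[wine_type] / totals[wine_type] for wine_type in totals}

shares = proportion_good(ratings)
for wine_type, share in sorted(shares.items()):
    print(f"{wine_type}: {share:.2f} rated good")
```

Comparing shares rather than raw counts matters here because the panel tasted more white wines than red.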

This is interesting information, and something we might want to consider later, but our primary objective is to evaluate the effects of pH, density, sulphates, alcohol, residual sugar, and other factors on wine quality. Do some of these variables have a significant effect on quality? If so, which ones?

We are interested in identifying variables for which there is a large change between a good wine and a bad one. These variables might be good predictors of a good wine. The boxplots below illustrate the distribution of the variables according to good or poor wine quality.

We can clearly see that we really do have a lot of variables to consider, and using graphs to select variables that have a noticeable effect on wine quality is far from easy.

Using Regression to Analyze Binary Taste Data

Regression analysis lets us see how multiple factors affect an outcome, so it would seem to be an ideal method to look at the wine-tasting variables. However, recall that our panel simply rated each wine as either high- or low-quality. This means we have binary rather than continuous response data, so we need to proceed with caution: using standard regression or ANOVA to analyze a binary response is generally not a good idea.

Because binary data follow a binomial distribution rather than a normal, bell-shaped distribution, standard regression may produce predicted probabilities that are negative or larger than 100%. We might also get an unnecessarily complex model in which some spurious interactions seem to be significant.

In addition, the variance of binary data is not constant: for a proportion p, it is proportional to p(1 - p). When the average proportion is close to 0 or to 1, the variability gets smaller, since binary data are bounded by the upper (1) and lower (0) limits. Therefore, effects that seem to be larger for factor-specific settings might be due not to interactions with other factors, but to nonconstant variance.

Fortunately, there's a simple solution: since we have binary response data, we simply need to use binary logistic regression.

Principal Components Analysis

Before jumping into a regression analysis, we can use a Principal Components (multivariate) Analysis to detect collinearity or correlation among the variables. Identifying variables that are highly collinear (which can make one of the variables almost redundant in some cases) can help us select the best possible binary logistic regression model.

To understand whether some variables are correlated with one another, we could use a standard correlation analysis (Stat > Basic Statistics > Correlation in Minitab), but a loading plot from a Principal Components Analysis offers a very clear visual illustration of these correlations. Such a plot is more explicit and shows whether some correlated variables might be grouped together.

In Minitab, go to Stat > Multivariate > Principal Components, enter the variables, select Graphs, and check Loading Plot. The loading plot for our data is described below.
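Outside Minitab, the standard correlation analysis mentioned above is easy to sketch. The Python below is not a Principal Components Analysis and does not produce a loading plot; it only flags strongly correlated pairs. The measurements, the 0.8 threshold, and the function names are assumptions chosen for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements for five wines (not the article's data).
wines = {
    "free_so2": [10.0, 25.0, 30.0, 45.0, 60.0],
    "total_so2": [40.0, 90.0, 110.0, 150.0, 200.0],
    "alcohol": [12.5, 9.8, 11.0, 10.2, 12.8],
}

def collinear_pairs(data, threshold=0.8):
    """List variable pairs whose absolute correlation reaches the threshold."""
    names = sorted(data)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(data[first], data[second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs

flagged = collinear_pairs(wines)
print(flagged)
```

A pair flagged this way is a candidate for the kind of redundancy that a loading plot shows as two lines running in nearly the same direction.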

The loading plot from the Principal Components Analysis shows that free SO2 and total SO2 are highly collinear: the lines for these variables run in the same direction on the graph and are very close to one another. Fixed acidity and chlorides also seem to be highly collinear.

Because of these strong collinearities, different models (that include different variables) may be equally acceptable in terms of prediction. This needs to be considered once a final model has been selected.

Full Model Regression Analysis

A standard practice in regression analysis is to start with the full model, one that includes all of the potentially significant factors for which you collected data. In this case, we begin the analysis by including all variables and all interactions between those variables and type of wine.

Then we eliminate the term with the highest p-value. Since we know some variables are highly collinear and could influence one another, we eliminate only one term at a time, then rerun the regression using the reduced model.

Ultimately, this iterative process leads us to the model below. It is quite complex, with many significant Wine-Type*variable interactions:

Most of the factors and interactions that remain in the model are statistically significant (p-values < 0.05). You might note that Alcohol and Free SO2 both have high p-values, making them candidates for elimination, but since these terms are included in significant interactions, they should remain in the model.

With 15 terms, this model is far too difficult to understand and explain, but it does give us a clue about how we can delve deeper into these data to better understand which factors contribute most to good-tasting wine.

We have 5 significant interactions involving type in our model. This indicates that the effects of some variables differ significantly between red and white wines. Remember also that our panel seemed to have a preference for white over red wines.

Perhaps we should consider separate models for white and red wines. This would eliminate the need to include interactions between type of wine and other variables, which would greatly simplify the models.

Regression Model for White Wines

We'll analyze the white wine data first. As before, we'll start from the full model and eliminate one factor at a time according to its p-value. This leads us to the following model:

This model includes only 6 terms, and the variables that remain in the model all have low p-values (less than or very close to 0.05). This model is easier to interpret since there are no interactions. Density, for example, seems to have a negative effect on taste because it has a negative coefficient, while pH has a positive effect.

But how do we know this model is acceptable? Goodness-of-fit tests help us assess model adequacy. See the output from Minitab below:

The p-values for all three goodness-of-fit tests are well over 0.05, so we cannot reject the hypothesis that this model is adequate. That's encouraging.

Another thing we can look at is the number of concordant and discordant pairs in our model. The proportion of concordant and discordant pairs is a measure of the level of agreement between the model predictions and the observations (in other words, how well the model reflects the observed data).

The proportion of concordant pairs is high. Again, this is encouraging.

Another way to validate the model is to see how well the observed data match the model's predicted probabilities. The standardized Delta graph checks for large differences between the predicted probabilities based on our model and the observed probabilities. The graph below shows that we do have some outliers, but on the whole it looks reasonable.
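The concordant and discordant pair counts discussed above can be illustrated with a short Python sketch. It follows the usual definition: every good wine is paired with every poor wine, and a pair is concordant when the model gives the good wine the higher predicted probability. The observed ratings and predicted probabilities below are hypothetical, and pair_summary is a name introduced here.

```python
def pair_summary(observed, predicted):
    """Count concordant, discordant and tied (good, poor) pairs."""
    good = [p for y, p in zip(observed, predicted) if y == 1]
    poor = [p for y, p in zip(observed, predicted) if y == 0]
    concordant = discordant = tied = 0
    for p_good in good:
        for p_poor in poor:
            if p_good > p_poor:
                concordant += 1   # model ranks the good wine higher
            elif p_good < p_poor:
                discordant += 1   # model ranks the poor wine higher
            else:
                tied += 1
    return {
        "concordant": concordant,
        "discordant": discordant,
        "tied": tied,
        "total": len(good) * len(poor),
    }

# Hypothetical panel ratings and model probabilities (not the article's data).
observed = [1, 1, 1, 0, 0, 0]
predicted = [0.9, 0.7, 0.4, 0.6, 0.3, 0.4]

summary = pair_summary(observed, predicted)
print(summary)
print(f"concordant share: {summary['concordant'] / summary['total']:.2f}")
```

The nested loop is quadratic in the number of wines, which is fine for a tasting panel but would need a rank-based approach for very large data sets.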

Regression Model for Red Wines

To create a model for the red wines, we followed the same process used to analyze the white-wine data, iteratively eliminating variables one at a time from the full model:

With only two factors, the model is fairly simple and small. We still need to look at the goodness-of-fit tests, however.
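To make the shape of a two-factor binary logistic model concrete, here is a minimal Python sketch. It assumes the two factors are alcohol and fixed acidity, uses hypothetical balanced data rather than the panel's ratings, and fits by plain gradient ascent, which is a simplification and not how Minitab estimates the model. It does not compute p-values or goodness-of-fit tests.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, learning_rate=0.1, iterations=5000):
    """Fit logit(p) = b0 + b1*x1 + b2*x2 by gradient ascent on the mean log-likelihood."""
    n = len(rows)
    coefs = [0.0] * (len(rows[0]) + 1)
    for _ in range(iterations):
        gradient = [0.0] * len(coefs)
        for row, label in zip(rows, labels):
            features = [1.0] + list(row)
            p = sigmoid(sum(c * f for c, f in zip(coefs, features)))
            for j, f in enumerate(features):
                gradient[j] += (label - p) * f / n
        coefs = [c + learning_rate * g for c, g in zip(coefs, gradient)]
    return coefs

# Hypothetical red wines: (alcohol % volume, fixed acidity g/L, quality).
wines = (
    [(10.0, 7.0, 1)] + [(10.0, 7.0, 0)] * 3
    + [(10.0, 9.0, 1)] * 2 + [(10.0, 9.0, 0)] * 2
    + [(13.0, 7.0, 1)] * 2 + [(13.0, 7.0, 0)] * 2
    + [(13.0, 9.0, 1)] * 3 + [(13.0, 9.0, 0)]
)

# Center the predictors so the gradient steps are well scaled.
mean_alcohol = sum(w[0] for w in wines) / len(wines)
mean_acidity = sum(w[1] for w in wines) / len(wines)
rows = [(a - mean_alcohol, c - mean_acidity) for a, c, _ in wines]
labels = [q for _, _, q in wines]

b0, b_alcohol, b_acidity = fit_logistic(rows, labels)

def predict(alcohol, acidity):
    """Predicted probability that a wine is rated good."""
    z = b0 + b_alcohol * (alcohol - mean_alcohol) + b_acidity * (acidity - mean_acidity)
    return sigmoid(z)

print(f"alcohol coefficient: {b_alcohol:.3f}, acidity coefficient: {b_acidity:.3f}")
print(f"P(good | 13% alcohol, 9 g/L acidity) = {predict(13.0, 9.0):.2f}")
```

Because the model works on the log-odds scale, its predictions always stay between 0 and 1, which is the property that standard regression cannot guarantee for a binary response.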

The Pearson and deviance tests are good, but the p-value of the Hosmer-Lemeshow test is low. This suggests we might have an issue with the adequacy of this model.

Once again, we'll create a standardized Delta graph to help validate the model.

The graph indicates that we have an outlier in row 34, which might be causing the goodness-of-fit issue. To see if that's the case, we can eliminate row 34 and rerun the whole analysis.

The new analysis, without data point 34, yields a very similar model. This revised model has the same variables, but slightly different coefficients:

This time the p-values are high for all goodness-of-fit tests, so we do not have a model adequacy issue:

Now let's look at what Minitab tells us about concordant and discordant pairs:
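The idea of hunting for an outlying row, as was done with row 34, can be illustrated with Pearson residuals. This Python sketch is a simplified stand-in: Minitab's delta statistics also take leverage into account, so the values are related but not identical. The data and the cutoff of 2 are assumptions made for illustration.

```python
import math

def pearson_residuals(observed, predicted):
    """Pearson residual for each 0/1 observation: (y - p) / sqrt(p * (1 - p))."""
    return [
        (y - p) / math.sqrt(p * (1.0 - p))
        for y, p in zip(observed, predicted)
    ]

# Hypothetical ratings and fitted probabilities (not the article's data).
observed = [1, 0, 1, 0, 1]
predicted = [0.8, 0.2, 0.6, 0.3, 0.05]

residuals = pearson_residuals(observed, predicted)

# Flag rows (numbered from 1) whose residual is large in absolute value.
# The cutoff of 2.0 is an arbitrary choice for this sketch.
flagged_rows = [i for i, r in enumerate(residuals, start=1) if abs(r) > 2.0]

for i, r in enumerate(residuals, start=1):
    print(f"row {i}: residual {r:+.2f}")
print("flagged rows:", flagged_rows)
```

A flagged row is a wine the panel rated very differently from what the model expected, and is worth inspecting before deciding whether to exclude it and refit.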

The Minitab output above shows that the proportion of concordant pairs is high. Moreover, the Delta Beta graph of residuals does not reveal any major outlying observations:

Drawing Conclusions from the Regression Analyses

Now that we have models for the red and white wines, we can see what the data tell us about the wine characteristics that influenced our panel's ratings. For example, this scatterplot summarizes the relationship between the variables for red wines:

The scatterplot indicates that red wines with a larger alcohol percentage and larger fixed acidity content receive higher quality ratings.

Testing the Regression Model

The data set we used to build our models was just part of a larger data set that we had divided in two: a training data set to build our models, and a testing data set to validate them. Once we had our final models, we used the testing data to validate them.

When we compared the models' predictions for the new data with the actual panel results from the testing data set, we found 152 concordant results and 48 discordant results, that is, 76% agreement across 200 test cases. Considering how difficult it is to analyze personal tastes, this is a very good result!

So when you need to understand situations that, at least on the surface, defy data analysis, why not dig a little deeper by using techniques such as binary logistic regression? You can use an approach similar to the one we used with this wine-tasting data to analyze marketing or sales data, to better understand customer preferences, and to gain insight into factors that are important even if, like taste preferences, they seem hard to measure.

Bruno Scibilia