STAT 5302 Applied Regression Analysis. Hawkins


Homework 3 sample solution

1. MinnLand data

newdata <- subset(minnland, year == 2010)
fit1 <- with(newdata, lm(acreprice ~ region - 1))
summary(fit1)

                     Estimate Std. Error t value Pr(>|t|)
regionNorthwest       1509.20      67.86   22.24   <2e-16 ***
regionWest Central    3179.42      72.87   43.63   <2e-16 ***
regionCentral         4074.77      72.33   56.33   <2e-16 ***
regionSouth West      4554.48      82.00   55.54   <2e-16 ***
regionSouth Central   4791.19      76.40   62.72   <2e-16 ***
regionSouth East      4570.69      93.57   48.85   <2e-16 ***

Residual standard error: 1330 on 1817 degrees of freedom
Multiple R-squared: 0.8918, Adjusted R-squared: 0.8914
F-statistic: 2496 on 6 and 1817 DF, p-value: < 2.2e-16

This initial no-intercept call gives the 2010 mean land price in each of the six regions. These range from $1509 in the Northwest to $4791 in the South Central. The regression output also gives a standard error for each of these means, something that we did not draw attention to in class.

The fit with the intercept gives

fit2 <- with(newdata, lm(acreprice ~ region))
summary(fit2)

                     Estimate Std. Error t value Pr(>|t|)
(Intercept)           1509.20      67.86   22.24   <2e-16 ***
regionWest Central    1670.22      99.58   16.77   <2e-16 ***
regionCentral         2565.57      99.18   25.87   <2e-16 ***
regionSouth West      3045.28     106.44   28.61   <2e-16 ***
regionSouth Central   3282.00     102.18   32.12   <2e-16 ***
regionSouth East      3061.50     115.58   26.49   <2e-16 ***

Residual standard error: 1330 on 1817 degrees of freedom
Multiple R-squared: 0.4541, Adjusted R-squared: 0.4526
F-statistic: 302.3 on 5 and 1817 DF, p-value: < 2.2e-16

In this output, the intercept, $1509, is the mean price in the reference group, Northwest. The coefficients of the subsequent terms give the difference between the mean price in each region and this reference group. So, for example, the Central region had a mean of $4074 in fit1. This is (4074 - 1509) = $2565 higher than the reference group, a value that we see as the coefficient of Central in the second fit. R also reports a standard error, a t value and a P value for each coefficient.
These reflect the test of equality of the means of Central and Northwest; the enormous t value, 25.87, shows that the two regions are very significantly different. Changing the reference group using

region2 <- with(newdata, relevel(region, "Central"))
fit3 <- lm(newdata$acreprice ~ region2)
summary(fit3)

gives

                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            4074.77      72.33  56.334  < 2e-16 ***
region2Northwest      -2565.57      99.18 -25.867  < 2e-16 ***
region2West Central    -895.35     102.68  -8.720  < 2e-16 ***
region2South West       479.71     109.34   4.387 1.21e-05 ***
region2South Central    716.43     105.21   6.810 1.32e-11 ***
region2South East       495.92     118.27   4.193 2.88e-05 ***

Residual standard error: 1330 on 1817 degrees of freedom
Multiple R-squared: 0.4541, Adjusted R-squared: 0.4526
F-statistic: 302.3 on 5 and 1817 DF, p-value: < 2.2e-16

This fit and the preceding one have much in common: the last three summary lines of output are identical, reflecting that this is the same fit expressed differently. The two sets of coefficients are also equivalent. Take three groups: Northwest, with a mean of $1509; Central, with a mean of $4075; and, say, South West, with a mean of $4554. In the earlier fit the South West coefficient was (4554 - 1509) = $3045, but changing the reference group to Central changes the coefficient to (4554 - 4075) = $479. Note also that all the other regions differ significantly from Central.

fit4 <- with(newdata, lm(acreprice ~ region + productivity))
summary(fit4)

                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          -126.95     238.99  -0.531    0.595
regionWest Central   1273.41     188.08   6.771 2.28e-11 ***
regionCentral        2452.59     205.84  11.915  < 2e-16 ***
regionSouth West     1990.04     191.79  10.376  < 2e-16 ***
regionSouth Central  2039.90     196.00  10.408  < 2e-16 ***
regionSouth East     1844.25     218.45   8.442  < 2e-16 ***
productivity           38.66       3.10  12.471  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1087 on 921 degrees of freedom
  (895 observations deleted due to missingness)
Multiple R-squared: 0.3864, Adjusted R-squared: 0.3824
F-statistic: 96.66 on 6 and 921 DF, p-value: < 2.2e-16

This regression shows that, in aggregate across the regions, productivity has a highly significant effect on the land price, but the continued significance of the region coefficients shows that productivity is not the root cause of the pricing differences.

fit5 <- with(newdata, lm(acreprice ~ region + productivity + region:productivity))
summary(fit5)

Use fit5 to get the regression of acreprice on productivity separately in each region. Adding up the coefficients, we get:

Region          Intercept   Slope
Northwest           347.2   42.78
West Central        344.8   52.23
Central            2286.9   39.23
South West          166.8   67.74
South Central      3675.2   14.43
South East         1206.1   45.35

(As a crude direct check on this, running

chekker <- subset(newdata, region == "Central")
fit6 <- with(chekker, lm(acreprice ~ productivity))
summary(fit6)

gives

             Estimate Std. Error t value Pr(>|t|)
(Intercept)   2286.89     748.26   3.056  0.00282 **
productivity    39.23      10.89   3.601  0.00048 ***)

The Type I analysis of variance of fit5, using anova(fit5), gives

Analysis of Variance Table

Response: acreprice
                     Df     Sum Sq   Mean Sq F value    Pr(>F)
region                5  501717598 100343520  90.008 < 2.2e-16 ***
productivity          1  183858826 183858826 164.920 < 2.2e-16 ***
region:productivity   5   67569201  13513840  12.122 2.167e-11 ***
Residuals           916 1021187412   1114833

In this problem it is not particularly helpful; what it does tell us is that the effect of productivity differs enormously between regions.
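The "adding up" works because, under R's default treatment coding, the interaction model fits a separate line in each region. Writing \(\hat\beta_0\) for the intercept, \(\hat\beta_r\) for a region coefficient, \(\hat\beta_p\) for productivity and \(\hat\beta_{r:p}\) for the corresponding interaction coefficient, the fitted line in region \(r\) is

```latex
\widehat{\text{acreprice}}
  = \left(\hat\beta_0 + \hat\beta_r\right)
  + \left(\hat\beta_p + \hat\beta_{r:p}\right)\,\text{productivity}
```

with \(\hat\beta_r = \hat\beta_{r:p} = 0\) in the reference region. This is why the Central row of the table agrees with the separate Central-only fit used as a check.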

2. Rolling the two steps into one, we get the scatterplot of Length against Age with the smooth added. The plot did not look wonderful before adding the smooth, and this impression is reinforced once the smooth is added: the relationship seems to decelerate as we get to the higher ages.

fit6 <- with(lakemary, lm(Length ~ Age))
summary(fit6)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   62.649      5.755   10.89   <2e-16 ***
Age           22.312      1.537   14.51   <2e-16 ***

Residual standard error: 12.51 on 76 degrees of freedom
Multiple R-squared: 0.7349, Adjusted R-squared: 0.7314
F-statistic: 210.7 on 1 and 76 DF, p-value: < 2.2e-16

Quite a strong regression.

fit7 <- update(fit6, ~ . + factor(Age))
summary(fit7)

(1 not defined because of singularities)

             Estimate Std. Error t value Pr(>|t|)
(Intercept)    43.400      9.645   4.500 2.56e-05 ***
Age            21.100      2.710   7.786 3.84e-11 ***
factor(Age)2   14.829      7.845   1.890 0.062767 .
factor(Age)3   28.195      6.932   4.067 0.000120 ***
factor(Age)4   26.029      7.539   3.452 0.000934 ***
factor(Age)5   17.225      9.802   1.757 0.083124 .
factor(Age)6       NA         NA      NA       NA

Residual standard error: 11.06 on 72 degrees of freedom
Multiple R-squared: 0.8035, Adjusted R-squared: 0.7899
F-statistic: 58.9 on 5 and 72 DF, p-value: < 2.2e-16

This fit does not look particularly happy. If the regression were truly linear in Age, then adding factor(Age) to the fit6 model should not produce anything significant. But in fact we see striking significances at both ages 3 and 4, in the middle of the age range. This is our indicator that the visual impression of a decelerating trend was right, and that the linear model does not hold.

Now adding the square and cube of age:

lakemary <- transform(lakemary, Agesq = Age^2, Agecu = Age^3)
fit8 <- update(fit6, ~ . + Agesq)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   13.622     11.016   1.237     0.22
Age           54.049      6.489   8.330 2.81e-12 ***
Agesq         -4.719      0.944  -4.999 3.67e-06 ***

Residual standard error: 10.91 on 75 degrees of freedom
Multiple R-squared: 0.8011, Adjusted R-squared: 0.7958
F-statistic: 151.1 on 2 and 75 DF, p-value: < 2.2e-16

Agesq is highly significant in this regression, confirming the lack of fit of the linear model.

fit9 <- update(fit8, ~ . + Agecu)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.8101    21.7690   0.451  0.65356
Age          58.1936    21.3868   2.721  0.00811 **
Agesq        -6.0358     6.5417  -0.923  0.35918
Agecu         0.1279     0.6284   0.203  0.83930
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.98 on 74 degrees of freedom
Multiple R-squared: 0.8012, Adjusted R-squared: 0.7932
F-statistic: 99.44 on 3 and 74 DF, p-value: < 2.2e-16

But Agecu is not significant in this model. So if we want a polynomial model of degree no higher than cubic, a quadratic does it. The sequential analysis of variance is our most compact and efficient way of deciding what degree of polynomial to use.
anova(fit9)

Analysis of Variance Table

Response: Length
          Df Sum Sq Mean Sq  F value    Pr(>F)
Age        1  32966   32966 273.6152 < 2.2e-16 ***
Agesq      1   2972    2972  24.6687 4.247e-06 ***
Agecu      1      5       5   0.0414    0.8393
Residuals 74   8916     120

Trimming from the bottom: we don't need a cube, but we do need a square. So the model would be

Length = 13.6 + 54.05 Age - 4.72 Age^2

3. The first step was to recode the variables. This is not strictly necessary (we could fit the QRSM using the original units, as the book does), but I believe using the coded units makes the interpretation easier. So:

cakes <- transform(cakes, x1 = (x1 - 35)/2, x2 = (x2 - 350)/10)
cakes <- transform(cakes, x1sq = x1^2, x2sq = x2^2, prod = x1*x2)

The QRSM gives

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.0700     0.1750  46.103 5.41e-11 ***
x1            0.7351     0.1516   4.849 0.001273 **
x2            0.9639     0.1516   6.359 0.000219 ***
x1sq         -0.6275     0.1578  -3.977 0.004079 **
x2sq         -1.1950     0.1578  -7.574 6.46e-05 ***
prod         -0.8325     0.2144  -3.883 0.004654 **

Residual standard error: 0.4288 on 8 degrees of freedom
Multiple R-squared: 0.9487, Adjusted R-squared: 0.9167
F-statistic: 29.6 on 5 and 8 DF, p-value: 5.864e-05

Turning to the questions: There is indeed significant curvature; both squared terms are highly significant. There is an interaction between the two factors. The flat spot is a maximum, as both squared-term coefficients are negative. To estimate the location of the flat spot, we set the two partial derivatives of the fitted surface to zero, which gives the linear system

1.255 x1 + 0.8325 x2 = 0.7351
0.8325 x1 + 2.39 x2 = 0.9639

with solution x1 = 0.414, x2 = 0.259. Plugging this in, we get a predicted y at the flat spot of 8.35. For completeness, a contour plot of the fitted QRSM was also drawn (in R, though contour plots are not discussed in the text).
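A step not asked for here, but handy for interpretation: since the flat spot is located in coded units, it can be mapped back to the original measurement scale by inverting the coding given above, x1 = (x1 - 35)/2 and x2 = (x2 - 350)/10. Writing \(X_1^\ast, X_2^\ast\) for the flat spot on the original scale,

```latex
X_1^{\ast} = 35 + 2(0.414) \approx 35.8, \qquad
X_2^{\ast} = 350 + 10(0.259) \approx 352.6
```

so the maximum of the fitted surface sits just above the center point (35, 350) of the design.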
