Homework 3 sample solution
STAT 5302 Applied Regression Analysis. Hawkins

1. MinnLand data

newdata <- subset(minnland, year == 2010)
fit1 <- with(newdata, lm(acreprice ~ region - 1))
summary(fit1)

                     Estimate Std. Error t value Pr(>|t|)
regionnorthwest       1509.20      67.86   22.24   <2e-16 ***
regionwest Central    3179.42      72.87   43.63   <2e-16 ***
regioncentral         4074.77      72.33   56.33   <2e-16 ***
regionsouth West      4554.48      82.00   55.54   <2e-16 ***
regionsouth Central   4791.19      76.40   62.72   <2e-16 ***
regionsouth East      4570.69      93.57   48.85   <2e-16 ***

Residual standard error: 1330 on 1817 degrees of freedom
Multiple R-squared: 0.8918, Adjusted R-squared: 0.8914
F-statistic: 2496 on 6 and 1817 DF, p-value: < 2.2e-16

This initial no-intercept call gives the 2010 mean land price in each of the six regions. These range from $1509 in the Northwest to $4791 in the South Central. The regression output also gives a standard error for each of these means, something that we did not draw attention to in class.

The fit with the intercept gives

fit2 <- with(newdata, lm(acreprice ~ region))
summary(fit2)

                     Estimate Std. Error t value Pr(>|t|)
(Intercept)           1509.20      67.86   22.24   <2e-16 ***
regionwest Central    1670.22      99.58   16.77   <2e-16 ***
regioncentral         2565.57      99.18   25.87   <2e-16 ***
regionsouth West      3045.28     106.44   28.61   <2e-16 ***
regionsouth Central   3282.00     102.18   32.12   <2e-16 ***
regionsouth East      3061.50     115.58   26.49   <2e-16 ***

Residual standard error: 1330 on 1817 degrees of freedom
Multiple R-squared: 0.4541, Adjusted R-squared: 0.4526
F-statistic: 302.3 on 5 and 1817 DF, p-value: < 2.2e-16

In this output, the intercept, $1509, is the mean price in the reference group, Northwest. The coefficients of the subsequent terms give the difference between the mean price in each region and this reference group. So, for example, the Central region had a mean of $4074 in fit1. This is ($4074 - $1509) = $2565 higher than the reference group, a value that we see as the coefficient of Central in the second fit. For each of these differences R also reports a standard error, a t and a P value.
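The relation between the two parameterizations can be verified on a small simulated example (toy data, not the MinnLand data): the no-intercept fit returns the group means, and the default fit returns the first group's mean plus differences from it.

```r
# Toy check (simulated data) of the two parameterizations used above.
set.seed(1)
g  <- factor(rep(c("A", "B", "C"), each = 10))
mu <- c(10, 20, 35)[as.integer(g)]     # true group means
y  <- mu + rnorm(30)
m1 <- lm(y ~ g - 1)   # no intercept: coefficients are the group means
m2 <- lm(y ~ g)       # intercept = mean of A; gB, gC = differences from A
stopifnot(all.equal(unname(coef(m1)), unname(tapply(y, g, mean))),
          all.equal(unname(coef(m2)[-1]),
                    unname(coef(m1)[-1] - coef(m1)[1])))
```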
These reflect the test of equality of each region's mean with that of Northwest; the enormous t value for Central, 25.87, reflects that those two regions are highly significantly different.

Changing the reference group using
region2 <- with(newdata, relevel(region, "Central"))
fit3 <- lm(newdata$acreprice ~ region2)
summary(fit3)

gives

                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            4074.77      72.33  56.334  < 2e-16 ***
region2northwest      -2565.57      99.18 -25.867  < 2e-16 ***
region2west Central    -895.35     102.68  -8.720  < 2e-16 ***
region2south West       479.71     109.34   4.387 1.21e-05 ***
region2south Central    716.43     105.21   6.810 1.32e-11 ***
region2south East       495.92     118.27   4.193 2.88e-05 ***

Residual standard error: 1330 on 1817 degrees of freedom
Multiple R-squared: 0.4541, Adjusted R-squared: 0.4526
F-statistic: 302.3 on 5 and 1817 DF, p-value: < 2.2e-16

This fit and the preceding one have much in common: the last three summary lines of output are identical, reflecting that this is the same fit expressed differently. The two sets of coefficients are also equivalent. Taking three groups (Northwest, with a mean of $1509; Central, with a mean of $4075; and, say, South West, with a mean of $4554), in the earlier fit the South West coefficient was ($4554 - $1509) = $3045, but changing the reference group to Central changes the coefficient to ($4554 - $4075) = $479. Note that all of the other regions differ significantly from Central.

fit4 <- with(newdata, lm(acreprice ~ region + productivity))
summary(fit4)

                     Estimate Std. Error t value Pr(>|t|)
(Intercept)           -126.95     238.99  -0.531    0.595
regionwest Central    1273.41     188.08   6.771 2.28e-11 ***
regioncentral         2452.59     205.84  11.915  < 2e-16 ***
regionsouth West      1990.04     191.79  10.376  < 2e-16 ***
regionsouth Central   2039.90     196.00  10.408  < 2e-16 ***
regionsouth East      1844.25     218.45   8.442  < 2e-16 ***
productivity            38.66       3.10  12.471  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1087 on 921 degrees of freedom
  (895 observations deleted due to missingness)
Multiple R-squared: 0.3864, Adjusted R-squared: 0.3824
F-statistic: 96.66 on 6 and 921 DF, p-value: < 2.2e-16

This regression shows that, in aggregate across the regions, productivity has a highly significant effect on land price, but the continued significance of the region coefficients shows that productivity is not the root cause of the pricing differences.

fit5 <- with(newdata, lm(acreprice ~ region + productivity + region:productivity))
summary(fit5)

Use fit5 to get the regression of acreprice on productivity separately in each region. Adding up the coefficients, we get:
Region         Intercept   Slope
Northwest          347.2   42.78
West Central       344.8   52.23
Central           2286.9   39.23
South West         166.8   67.74
South Central     3675.2   14.43
South East        1206.1   45.35

(As a crude direct check on this, running

chekker <- subset(newdata, region == "central")
fit6 <- with(chekker, lm(acreprice ~ productivity))
summary(fit6)

gives

              Estimate Std. Error t value Pr(>|t|)
(Intercept)    2286.89     748.26   3.056  0.00282 **
productivity     39.23      10.89   3.601  0.00048 ***)

The Type I analysis of variance of fit5 using anova(fit5) gives

Analysis of Variance Table

Response: acreprice
                     Df     Sum Sq   Mean Sq F value    Pr(>F)
region                5  501717598 100343520  90.008 < 2.2e-16 ***
productivity          1  183858826 183858826 164.920 < 2.2e-16 ***
region:productivity   5   67569201  13513840  12.122 2.167e-11 ***
Residuals           916 1021187412   1114833

In this problem the sequential ordering is not particularly important; the key line is the interaction, which tells us that the effect of productivity differs enormously between regions.
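The "adding up the coefficients" step can be illustrated on simulated data (toy groups A and B, not the MinnLand regions): with a full interaction, the reference coefficients plus a group's offsets reproduce that group's own separate regression, exactly as the chekker check does above.

```r
# Toy illustration: coefficient sums from an interaction model equal the
# intercept and slope of a separate within-group fit.
set.seed(1)
d <- data.frame(g = factor(rep(c("A", "B"), each = 20)),
                x = runif(40, 0, 10))
d$y <- ifelse(d$g == "A", 2 + 3 * d$x, 5 + 1 * d$x) + rnorm(40)
m <- lm(y ~ g + x + g:x, data = d)
b <- coef(m)                           # (Intercept), gB, x, gB:x
intB <- b["(Intercept)"] + b["gB"]     # group B intercept, by adding up
slpB <- b["x"] + b["gB:x"]             # group B slope, by adding up
# Direct check: fit group B on its own, as with chekker above
mB <- lm(y ~ x, data = subset(d, g == "B"))
stopifnot(all.equal(unname(intB), unname(coef(mB)[1])),
          all.equal(unname(slpB), unname(coef(mB)[2])))
```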
2. Rolling the two steps into one, we get

[Scatterplot of length against Age for the Lake Mary data, with a smooth added.]

The plot before adding the smooth did not look wonderful, and this is reinforced when we add the smooth: the relationship seems to decelerate as we get to the higher ages.

fit6 <- with(lakemary, lm(length ~ Age))
summary(fit6)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   62.649      5.755   10.89   <2e-16 ***
Age           22.312      1.537   14.51   <2e-16 ***

Residual standard error: 12.51 on 76 degrees of freedom
Multiple R-squared: 0.7349, Adjusted R-squared: 0.7314
F-statistic: 210.7 on 1 and 76 DF, p-value: < 2.2e-16

Quite a strong regression.

fit7 <- update(fit6, ~ . + factor(age))
summary(fit7)

(1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    43.400      9.645   4.500 2.56e-05 ***
Age            21.100      2.710   7.786 3.84e-11 ***
factor(age)2   14.829      7.845   1.890 0.062767 .
factor(age)3   28.195      6.932   4.067 0.000120 ***
factor(age)4   26.029      7.539   3.452 0.000934 ***
factor(age)5   17.225      9.802   1.757 0.083124 .
factor(age)6       NA         NA      NA       NA

Residual standard error: 11.06 on 72 degrees of freedom
Multiple R-squared: 0.8035, Adjusted R-squared: 0.7899
F-statistic: 58.9 on 5 and 72 DF, p-value: < 2.2e-16

This fit does not look particularly happy. If the regression were truly linear in Age, then adding factor(age) to the fit6 model should not produce anything significant. But in fact we see striking significances at both ages 3 and 4, in the middle of the age range. This is our indicator that the visual impression of a decelerating trend was right, and the linear model does not hold.

Now adding the square and cube of age:

lakemary <- transform(lakemary, Agesq = Age^2, Agecu = Age^3)
fit8 <- update(fit6, ~ . + Agesq)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   13.622     11.016   1.237     0.22
Age           54.049      6.489   8.330 2.81e-12 ***
Agesq         -4.719      0.944  -4.999 3.67e-06 ***

Residual standard error: 10.91 on 75 degrees of freedom
Multiple R-squared: 0.8011, Adjusted R-squared: 0.7958
F-statistic: 151.1 on 2 and 75 DF, p-value: < 2.2e-16

Agesq is highly significant in this regression, confirming the lack of fit of the linear model.

fit9 <- update(fit8, ~ . + Agecu)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.8101    21.7690   0.451  0.65356
Age          58.1936    21.3868   2.721  0.00811 **
Agesq        -6.0358     6.5417  -0.923  0.35918
Agecu         0.1279     0.6284   0.203  0.83930
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.98 on 74 degrees of freedom
Multiple R-squared: 0.8012, Adjusted R-squared: 0.7932
F-statistic: 99.44 on 3 and 74 DF, p-value: < 2.2e-16

But Agecu is not significant in this model. So if we want a polynomial no higher than cubic, a quadratic does it. The sequential analysis of variance is our most compact and efficient way of deciding what degree polynomial to use.
anova(fit9)
Analysis of Variance Table

Response: Length
          Df Sum Sq Mean Sq  F value    Pr(>F)
Age        1  32966   32966 273.6152 < 2.2e-16 ***
Agesq      1   2972    2972  24.6687 4.247e-06 ***
Agecu      1      5       5   0.0414    0.8393
Residuals 74   8916     120

Trimming from the bottom, we don't need a cube, but we do need a square. So the model would be

Length = 13.6 + 54.05 Age - 4.72 Age^2

3. The first step was to recode the variables. This is not strictly necessary (we could fit the QRSM using the original units, as the book does), but I believe using the coded units makes the interpretation easier. So:

cakes <- transform(cakes, x1 = (x1-35)/2, x2 = (x2-350)/10)
cakes <- transform(cakes, x1sq = x1^2, x2sq = x2^2, prod = x1*x2)

The QRSM gives

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.0700     0.1750  46.103 5.41e-11 ***
x1            0.7351     0.1516   4.849 0.001273 **
x2            0.9639     0.1516   6.359 0.000219 ***
x1sq         -0.6275     0.1578  -3.977 0.004079 **
x2sq         -1.1950     0.1578  -7.574 6.46e-05 ***
prod         -0.8325     0.2144  -3.883 0.004654 **

Residual standard error: 0.4288 on 8 degrees of freedom
Multiple R-squared: 0.9487, Adjusted R-squared: 0.9167
F-statistic: 29.6 on 5 and 8 DF, p-value: 5.864e-05

Turning to the questions: There is indeed significant curvature; both squared terms are highly significant. There is an interaction between the two factors. The flat spot is a maximum, as both squared-term coefficients are negative. To estimate its location, we set both partial derivatives of the fitted surface to zero and solve

1.255 x1 + 0.8325 x2 = 0.7351
0.8325 x1 + 2.39 x2 = 0.9639,

giving x1 = 0.414 and x2 = 0.259. Plugging this in, we get a predicted y at the flat spot of 8.35. For completeness, here is a contour plot of the fitted QRSM (done in R, but not discussed in the text).
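As a numerical check, the two-equation flat-spot system can be solved directly in R (a sketch using only the printed coefficient estimates):

```r
# Flat spot of the fitted QRSM, in coded units.  Setting both partial
# derivatives of
#   yhat = 8.07 + 0.7351 x1 + 0.9639 x2 - 0.6275 x1^2 - 1.195 x2^2 - 0.8325 x1 x2
# to zero gives a 2x2 linear system.
A  <- matrix(c(2 * 0.6275, 0.8325,
               0.8325,     2 * 1.1950), nrow = 2, byrow = TRUE)
b  <- c(0.7351, 0.9639)
xs <- solve(A, b)                      # c(0.414, 0.259) to three decimals
yhat <- 8.0700 + 0.7351 * xs[1] + 0.9639 * xs[2] -
  0.6275 * xs[1]^2 - 1.1950 * xs[2]^2 - 0.8325 * xs[1] * xs[2]
round(yhat, 2)                         # 8.35, the value at the flat spot
```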
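The original figure is not reproduced here; a sketch that rebuilds such a contour plot from the printed coefficients (the grid range and plotting choices are my own, not from the solution) would be:

```r
# Contour plot of the fitted QRSM over the coded units, rebuilt from the
# printed coefficient estimates; the plotting range is an assumption.
qrsm <- function(x1, x2) {
  8.0700 + 0.7351 * x1 + 0.9639 * x2 -
    0.6275 * x1^2 - 1.1950 * x2^2 - 0.8325 * x1 * x2
}
x1 <- seq(-2, 2, length.out = 101)
x2 <- seq(-2, 2, length.out = 101)
z  <- outer(x1, x2, qrsm)
contour(x1, x2, z, xlab = "x1 (coded)", ylab = "x2 (coded)")
points(0.414, 0.259, pch = 19)   # the flat spot found above
```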