To: Professor Roger Bohn & Hyeonsu Kang Subject: Big Data, Assignment April 13th. From: xxxx (anonymized) Date: 4/11/2016


Data Preparation:
1. Split the trany variable into Manual, which takes the value 1 or 0 (1 = manual, 0 = automatic), and Speed, which takes the categories 3-speed, 4-speed, and 5-speed.
2. Convert tcharger and scharger into dummy variables that take the value 1 or 0 (1 = turbocharged, 0 = not turbocharged; likewise for scharger).
3. Convert displ from liters to gallons.
4. Convert the year variable into a dummy variable, after 2014, which indicates whether the car was made before or after 2014.
5. Generate an interaction term, after 2014 * Manual.

Part A:

Summary: Our target is MPG. In addition to including the factors that we think are important to MPG, we tried several model variations, such as adding the interaction term manual ## after 2014 and removing variables one at a time. Our conclusion is that the variables included in Model 1 explain MPG best, and the interpretation of the result makes sense in the real world.

Model 1: Based on my understanding of the topic, I believe several important factors affect the MPG of a car: number of cylinders, engine displacement, drive type, fuel type, manual or automatic transmission, year made, and whether the car is turbocharged or supercharged.
Unit: all relevant variables in the model are expressed in gallons.
Target: comb08
Input: cylinders, displ_gal, drive, fueltype1, manual, speed, year, tcharger, scharger
Other parameters: seed: 424565, partition: 70/30/0
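The five steps listed under Data Preparation can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the memo: the record layout (a plain dict), the rule that an empty tcharger or scharger field means "not present", and the reading of "after 2014" as a model year strictly greater than 2014 are all assumptions.

```python
LITERS_PER_US_GALLON = 3.785411784


def prepare(record):
    """Return the derived fields for one vehicle record (a plain dict)."""
    # Step 1: split trany, e.g. "Manual 5-spd" -> Manual = 1, Speed = "5-spd".
    kind, _, speed = record["trany"].partition(" ")
    manual = 1 if kind.lower() == "manual" else 0

    # Step 2: 0/1 dummies. Assumption: an empty field means "not present".
    tcharger = 1 if record.get("tcharger") else 0
    scharger = 1 if record.get("scharger") else 0

    # Step 3: engine displacement from liters to US gallons.
    displ_gal = round(record["displ"] / LITERS_PER_US_GALLON, 2)

    # Step 4: Assumption: "after 2014" means model year strictly above 2014.
    after_2014 = 1 if record["year"] > 2014 else 0

    # Step 5: interaction term after 2014 * Manual.
    return {
        "manual": manual,
        "speed": speed,
        "tcharger": tcharger,
        "scharger": scharger,
        "displ_gal": displ_gal,
        "after_2014": after_2014,
        "manual_after_2014": manual * after_2014,
    }


# Hypothetical example record, not taken from the data set.
example = {"trany": "Manual 5-spd", "tcharger": "T", "scharger": "",
           "displ": 2.0, "year": 2015}
print(prepare(example))
```

A 2.0-liter engine comes out as 0.53 gallons, which matches the two-decimal displacement levels that appear in Appendix A.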

MPG = -240.35
      - 0.44 * # of cylinders
      - 7.25 * engine displacement
      - {0.58 if 4-Wheel Drive, 1.58 if 4-Wheel or All-Wheel Drive, -0.08 if All-Wheel Drive, -2.07 if Front-Wheel Drive, 1.77 if Part-time 4-Wheel Drive, 0.17 if Rear-Wheel Drive}
      - {5.19 if fueltype1 is Midgrade, 7.85 if Natural Gas, 6.99 if Premium, 7.1 if Regular}
      + {0.69 if manual, 0 if automatic}
      + 0.13 * year
      - {1.19 if turbocharged, 0 if not}
      - {0.84 if supercharged, 0 if not}

The training output shows that all variables included in the model are significant except the drive-type levels Rear-Wheel Drive and All-Wheel Drive. In addition, number of cylinders, engine displacement, drive type, fuel type, turbocharged, and supercharged are all negatively associated with MPG, while year and manual transmission are positively associated with MPG.

Evaluation: The result makes sense: as technology becomes more advanced, cars made more recently have higher MPG. Also, manual cars on average have higher MPG than automatic cars. The Pseudo R-square is 0.80, which means that about 80% of the variation in MPG is explained by the predictors.

Figure 1

Model 2: Far more cars in 2014 are automatic than manual, but that may not be true at the start of the data set. We therefore generated the after 2014 variable, which is 1 if the car was produced after 2014 and 0 if it was produced before 2014. We then created the interaction term after 2014 * manual to see whether there is an additional effect for manual cars produced after 2014.
Unit: all relevant variables in the model are expressed in gallons.
Target: comb08
Input: cylinders, displ_gal, drive, fueltype1, manual, year after 2014, tcharger, scharger, year after 2014##manual
Other parameters: seed: 4245346, partition: 70/30/0

MPG = 25.98
      - 0.46 * # of cylinders
      - 6.24 * engine displacement
      + {1.35 if 4-Wheel Drive, -0.35 if 4-Wheel or All-Wheel Drive, 1.66 if All-Wheel Drive, 3.43 if Front-Wheel Drive, -0.39 if Part-time 4-Wheel Drive, 0.17 if Rear-Wheel Drive}
      + {0.55 if manual, 0 if automatic}
      + {3.28 if year after 2014, 0 if before 2014}
      - {0.46 if turbocharged, 0 if not}
      - {0.68 if supercharged, 0 if not}
      + {-0.19 if manual after 2014, 0 if not manual after 2014}

Evaluation: As Figure 2 shows, the Pseudo R-square is 0.66, which is lower than in Model 1. In addition, reading the regression equation for Model 2 above, the interpretation does not make as much sense as that of Model 1. In particular, it is hard to explain why the coefficient on the interaction term manual after 2014 is negative.

Figure 2:
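One way to read the interaction coefficient in Model 2 is to compare the manual-versus-automatic gap before and after 2014. The sketch below uses only the three coefficients quoted in the Model 2 equation (0.55, 3.28, and -0.19); the function name is ours and purely illustrative.

```python
# Coefficients copied from the Model 2 equation.
MANUAL = 0.55
AFTER_2014 = 3.28
MANUAL_AFTER_2014 = -0.19


def transmission_year_effect(manual, after_2014):
    """MPG contribution of the manual, after-2014 and interaction terms."""
    return (MANUAL * manual
            + AFTER_2014 * after_2014
            + MANUAL_AFTER_2014 * manual * after_2014)


# Manual minus automatic, holding the period fixed.
gap_before = transmission_year_effect(1, 0) - transmission_year_effect(0, 0)
gap_after = transmission_year_effect(1, 1) - transmission_year_effect(0, 1)

print(f"manual premium before 2014: {gap_before:.2f} MPG")
print(f"manual premium after 2014:  {gap_after:.2f} MPG")
```

Under this reading, the negative interaction does not say that manual cars made after 2014 have lower MPG than automatics; it says that the estimated manual premium shrinks from 0.55 to 0.36 MPG.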

Part B:

Summary: The target is Manual. We compared different models to find the one with the highest prediction accuracy, based on the confusion matrix. We conclude that Model 2, which includes Speed and vclass, is the best model: it correctly predicts 92% of the validation dataset.

Model 1:
Unit: all relevant variables in the model are expressed in gallons.
Target: Manual
Input: city08, Co2TailpipeGpm, cylinders, displ_gal, drive, fueltype1, year, highway08, tcharger, scharger
Other parameters: seed: 424346, partition: 70/30/0

Manual (0,1) = 113.72 + β1 * city MPG + β2 * Co2TailpipeGpm + β3 * # of cylinders + β4 * displacement in gallons + β5 * drive type + β6 * fuel type + β7 * year + β8 * highway MPG + β9 * turbocharged + β10 * supercharged

Coefficients Table:

Tables 1 and 2 show that the model correctly predicts 7003 automatic cases, which is 63% of the validation data, and 947 manual cases, which is 9% of the validation data. The remaining 28% of cases are misclassified.

Table 1: Confusion Matrix of Model 1 Prediction on Validation dataset (count)

                    Predicted
Actual        Automatic    Manual
Automatic          7003       441
Manual             2716       947

Table 2: Confusion Matrix of Model 1 Prediction on Validation dataset (proportion of validation data)

                    Predicted
Actual        Automatic    Manual    Error Rate
Automatic          0.63      0.04          0.06
Manual             0.24      0.09          0.74

Model 2: We include the Speed and vclass variables and drop drive and fueltype1.
Unit: all relevant variables in the model are expressed in gallons.
Target: Manual
Input: city08, Co2TailpipeGpm, cylinders, displ_gal, Speed, vclass, year after 2014, highway08, tcharger, scharger
Other parameters: seed: 4245346, partition: 70/30/0

Manual (0,1) = 515.82 + β1 * city MPG + β2 * Co2TailpipeGpm + β3 * # of cylinders + β4 * displacement in gallons + β5i * speed type + β6i * EPA vehicle size class + β7 * year + β8 * highway MPG + β9 * turbocharged + β10 * supercharged

*As the coefficient table shows, the Speed variable is insignificant in the model. We therefore tried excluding Speed; however, the confusion matrix then showed a dramatic drop in accuracy, so we decided to keep it in the model.

Tables 3 and 4 show that the model correctly predicts 6698 automatic cases, which is 62% of the validation data, and 3227 manual cases, which is 30% of the validation data. The remaining 8% of cases are misclassified. The accuracy rate is higher than that of the previous model.

Table 3: Confusion Matrix of Model 2 Prediction on Validation dataset (count)

                    Predicted
Actual        Automatic    Manual
Automatic          6698       442
Manual              474      3227

Table 4: Confusion Matrix of Model 2 Prediction on Validation dataset (proportion of validation data)

                    Predicted
Actual        Automatic    Manual    Error Rate
Automatic          0.62      0.04          0.06
Manual             0.04      0.30          0.13
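The accuracy and error-rate figures quoted for the two classifiers can be recomputed from the counts in Tables 1 and 3. The sketch below is a plain-Python check of that arithmetic; the function and key names are ours, not Rattle's.

```python
def summarize(auto_as_auto, auto_as_manual, manual_as_auto, manual_as_manual):
    """Accuracy and per-class error rates from a 2x2 confusion matrix."""
    total = auto_as_auto + auto_as_manual + manual_as_auto + manual_as_manual
    return {
        # Share of all validation cases on the diagonal.
        "accuracy": (auto_as_auto + manual_as_manual) / total,
        # Share of actual automatics predicted as manual.
        "automatic_error": auto_as_manual / (auto_as_auto + auto_as_manual),
        # Share of actual manuals predicted as automatic.
        "manual_error": manual_as_auto / (manual_as_auto + manual_as_manual),
    }


model_1 = summarize(7003, 441, 2716, 947)   # counts from Table 1
model_2 = summarize(6698, 442, 474, 3227)   # counts from Table 3

for name, stats in (("Model 1", model_1), ("Model 2", model_2)):
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Model 1 comes out at about 72% accuracy (63% + 9%) and Model 2 at about 92%. The gain comes almost entirely from the manual class, whose error rate falls from 0.74 to 0.13.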

Part C:

Summary: After running a linear regression and classifying cars by transmission type, we want a model that captures how the input variables interact with one another in determining the combined MPG for fuel type 1. Including these interaction terms should give a more accurate model of the relationship between the inputs and the target variable. Specifically, the age of the car should be interacted with the input variables, because cars are engineered to become more efficient over time. For our working model, we grouped the years into decades, with the variable names eighties, nineties, twothousand, and twoten. These variables flag model years in the 1980s, 1990s, 2000s, and 2010s, respectively.

Model without interaction terms
Unit: all relevant variables in the model are expressed in gallons.
Target: comb08
Input: cylinders, displ_gal, drive, fueltype1, manual, year dummies, tcharger, scharger
Other parameters: seed: 12345, partition: 70/30/0

Before creating models with interaction terms, we ran a model with the year dummy variables we had created. We use the results from this model as the baseline against which the interaction models are compared. The Rattle output is shown in Appendix A, and the graphical results are shown below:

A) Graphical depiction of the data distribution

B) Predicted vs. Observed Model

Model with year##manual

Next, we included the interaction between the year dummies and whether the car has a manual transmission. This is an important interaction to observe because we hypothesize that cars manufactured in more recent years are less likely to have a manual transmission. This pattern could therefore be correlated with patterns observed in MPG. The model is as follows:

comb08 = β0 + city08 + β1 * co2tailpipegpm + β2 * cylinders + β3 * displ_ga + β4 * drive + ... + βK * (eighties * manual) + βK+1 * (nineties * manual) + βK+2 * (twothousand * manual) + βK+3 * (twoten * manual)

When the model was run in Rattle, it produced the following results:

1A) Predicted vs. Observed, Model 1

The summary of the linear regression model is shown in Appendix B.
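As a rough illustration of how the decade dummies and the manual interaction columns could be built, consider the sketch below. The decade boundaries (1980 to 1989, and so on) and the helper names are assumptions; the column names follow the ones that appear in Appendix B.

```python
# Assumed decade boundaries: each dummy covers ten model years.
DECADE_STARTS = {"eighties": 1980, "nineties": 1990,
                 "twothousand": 2000, "twoten": 2010}


def decade_dummies(year):
    """One 0/1 flag per decade for a given model year."""
    return {name: int(start <= year < start + 10)
            for name, start in DECADE_STARTS.items()}


def manual_by_decade(manual, year):
    """Interaction columns manual_eighties ... manual_twoten for one car."""
    return {f"manual_{name}": manual * flag
            for name, flag in decade_dummies(year).items()}


# Hypothetical cars as (manual, year) pairs, not taken from the data set.
cars = [(1, 1987), (0, 1995), (1, 2003), (1, 2015)]
for manual, year in cars:
    dummies = decade_dummies(year)
    terms = manual_by_decade(manual, year)
    # The four dummies always sum to 1, and the four interactions to manual.
    print(year, sum(dummies.values()), sum(terms.values()) == manual)
```

Because the four decade dummies sum to 1 for every car, and the four interaction columns sum to manual, one column in each set is redundant once the intercept and manual are in the model. That is a plausible reason why the regression output reports twoten and manual_twoten as NA ("not defined because of singularities").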

The model's Pseudo R-squared is 0.9952, which shows that the observed points fit the predicted model very well. The regression analysis also shows that two of the interaction terms are statistically significant, implying that they have some effect on the target variable.

Model with year##pv4

In addition, we created an interaction term between the 4-door passenger volume (pv4) and year. This is an important interaction to observe because we hypothesize that cars with greater volume have lower fuel efficiency, as the car needs to move more weight.

comb08 = β0 + city08 + β1 * co2tailpipegpm + β2 * cylinders + β3 * displ_ga + β4 * drive + ... + βK * (eighties * pv4) + βK+1 * (nineties * pv4) + βK+2 * (twothousand * pv4) + βK+3 * (twoten * pv4)

Rattle produced the following results:

2A) Predicted vs. Observed, Model 2

The summary of the linear regression model is shown in Appendix C.

While the Pseudo R-squared stayed the same, including this interaction term changed the β coefficients of the model. However, none of the interaction coefficients is statistically significant, implying that the interaction between pv4 and year does not have an effect on the target variable.

Model with year##co2tailpipegpm

For our third interaction term, we interacted tailpipe CO2 in grams/mile with year. This is an important interaction to observe because newer cars, which face stricter emission standards, tend to emit less tailpipe CO2. We want to see whether this decrease in CO2 emissions is also associated with better MPG.

comb08 = β0 + city08 + β1 * co2tailpipegpm + β2 * cylinders + β3 * displ_ga + β4 * drive + ... + βK * (eighties * co2tailpipegpm) + βK+1 * (nineties * co2tailpipegpm) + βK+2 * (twothousand * co2tailpipegpm) + βK+3 * (twoten * co2tailpipegpm)

Rattle produced the following results:

3A) Predicted vs. Observed, Model 3

The summary of the linear regression model is shown in Appendix D.

Once again, the R-squared value stayed the same, reflecting that the observed data points fit the predicted model well. In this model, all interaction terms are highly statistically significant, implying that all of them have an effect on the target variable.

Other Issues: In addition to the inputs provided in the vehicle data from the U.S. Department of Energy, it would be useful to know whether the car has air conditioning and other technological systems (such as a sound system), given that they also consume energy and would therefore influence MPG. Furthermore, it would be useful to include the total weight of the car, because we would hypothesize that heavier cars consume more energy to move.

Appendix A

Call:
lm(formula = comb08 ~ ., data = crs$dataset[crs$train, c(crs$input, crs$target)])

Residuals:
     Min       1Q   Median       3Q      Max
-1.19066 -0.23276  0.01995  0.22896  1.57990

Coefficients: (26 not defined because of singularities)
                                               Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)                               -2.4217340467  2.3400979427   -1.035  0.300733
city08                                     0.6090828411  0.0020068185  303.507  < 2e-16
co2tailpipegpm                            -0.0016837384  0.0002421873   -6.952  3.69e-12
cylinders                                  0.0300722922  0.0045195960    6.654  2.92e-11
displ_ga0.24                               0.4415180039  0.5490757508    0.804  0.421341
displ_ga0.26                               0.6227234984  0.4299244537    1.448  0.147505
displ_ga0.29                              -0.1740191334  0.4461602370   -0.390  0.696512
displ_ga0.32                               0.1405816565  0.4333594153    0.324  0.745638
displ_ga0.34                              -0.0026512686  0.4300561851   -0.006  0.995081
displ_ga0.37                               0.0061692889  0.4314898060    0.014  0.988593
displ_ga0.40                               0.0449831282  0.4293255099    0.105  0.916554
displ_ga0.42                              -0.0990142384  0.4291427527   -0.231  0.817530
displ_ga0.45                               0.1177813083  0.4324885784    0.272  0.785368
displ_ga0.48                              -0.1993026091  0.4292233469   -0.464  0.642413
displ_ga0.50                              -0.1008289525  0.4296291619   -0.235  0.814453
displ_ga0.53                              -0.2615914370  0.4291570218   -0.610  0.542168
displ_ga0.55                              -0.2063038773  0.4317562447   -0.478  0.632779
displ_ga0.58                              -0.2883648985  0.4294400469   -0.671  0.501915
displ_ga0.61                              -0.3707779357  0.4295014980   -0.863  0.387995
displ_ga0.63                              -0.2755870628  0.4294260912   -0.642  0.521037
displ_ga0.66                              -0.3702465319  0.4294215081   -0.862  0.388587
displ_ga0.69                              -0.3942615069  0.4300460895   -0.917  0.359262
displ_ga0.71                              -0.3662314277  0.4299430321   -0.852  0.394326
displ_ga0.74                              -0.4958097993  0.4298438425   -1.153  0.248731
displ_ga0.77                              -0.2756814633  0.4305727069   -0.640  0.522005
displ_ga0.79                              -0.4257032573  0.4296508322   -0.991  0.321787
displ_ga0.82                              -0.5448379662  0.4304759811   -1.266  0.205646
displ_ga0.85                              -0.4290285873  0.4299730824   -0.998  0.318385
displ_ga0.87                              -0.4644220712  0.4301054966   -1.080  0.280248
displ_ga0.90                              -0.4865367761  0.4301716993   -1.131  0.258054
displ_ga0.92                              -0.4261846287  0.4297778113   -0.992  0.321383
displ_ga0.95                              -0.4628637816  0.4299440871   -1.077  0.281684
displ_ga0.98                              -0.4371734614  0.4300367836   -1.017  0.309356
displ_ga1.00                              -0.4831604006  0.4299472848   -1.124  0.261123
displ_ga1.03                              -0.3905848852  0.4304664591   -0.907  0.364229
displ_ga1.06                              -0.4237980866  0.4299797115   -0.986  0.324328
displ_ga1.08                              -0.4822379361  0.4325727330   -1.115  0.264941
displ_ga1.11                              -0.4340042361  0.4302495595   -1.009  0.313116
displ_ga1.14                              -0.4231416487  0.4299315141   -0.984  0.325023
displ_ga1.16                              -0.4430726521  0.4308181492   -1.028  0.303751
displ_ga1.19                              -0.3896138506  0.4326336659   -0.901  0.367830
displ_ga1.22                              -0.4248368319  0.4304720307   -0.987  0.323697
displ_ga1.24                              -0.5291894460  0.4307254930   -1.229  0.219234
displ_ga1.27                              -0.4309839131  0.4309622989   -1.000  0.317296
displ_ga1.29                              -0.4178189496  0.4304350810   -0.971  0.331712
displ_ga1.32                              -0.3807172994  0.4303375289   -0.885  0.376330
displ_ga1.37                              -0.2529379476  0.4306861984   -0.587  0.557014
displ_ga1.40                              -0.4809927855  0.4304726387   -1.117  0.263852
displ_ga1.43                              -0.2798629557  0.4309188752   -0.649  0.516050
displ_ga1.45                              -0.3153015091  0.4317434422   -0.730  0.465215
displ_ga1.48                              -0.3455303936  0.4314751423   -0.801  0.423248
displ_ga1.51                              -0.3225660130  0.4304322452   -0.749  0.453623
displ_ga1.53                              -0.2072795606  0.4315957953   -0.480  0.631045
displ_ga1.56                              -0.0967387965  0.4310866243   -0.224  0.822443
displ_ga1.59                              -0.4388297383  0.4310778680   -1.018  0.308696
displ_ga1.61                              -0.3501673163  0.4406162348   -0.795  0.426783
displ_ga1.64                              -0.5796730668  0.4307297986   -1.346  0.178382
displ_ga1.66                              -0.2513822254  0.4477250739   -0.561  0.574485
displ_ga1.69                              -0.7554028830  0.4368398324   -1.729  0.083778
displ_ga1.72                              -0.4257060716  0.4328929054   -0.983  0.325421
displ_ga1.74                              -1.1910209672  0.4409356695   -2.701  0.006915
displ_ga1.77                              -0.0750071583  0.4342276733   -0.173  0.862860
displ_ga1.80                               0.2361059868  0.4319065934    0.547  0.584617
displ_ga1.85                              -0.6952471462  0.4524661118   -1.537  0.124411
displ_ga1.96                               0.0407839627  0.4635198799    0.088  0.929887
displ_ga2.11                              -0.3882762779  0.4452706766   -0.872  0.383217
displ_ga2.19                              -0.7556153140  0.4575766517   -1.651  0.098682
displ_ga2.22                              -0.6185814792  0.4499721729   -1.375  0.169234
displ                                                NA            NA       NA  NA
drive4-wheel Drive                        -0.0318667866  0.0300924054   -1.059  0.289627
drive4-wheel or All-Wheel Drive            0.0282168216  0.0246319951    1.146  0.251999
driveall-wheel Drive                      -0.0518695383  0.0273701671   -1.895  0.058089
drivefront-wheel Drive                     0.0248945676  0.0241373636    1.031  0.302377
drivepart-time 4-Wheel Drive              -0.1024583698  0.0458330699   -2.235  0.025396
driverear-wheel Drive                     -0.0032680520  0.0231572398   -0.141  0.887773
engid                                     -0.0000007213  0.0000001799   -4.009  6.13e-05
CA.model                                   0.0056180687  0.0158867061    0.354  0.723617
fuelcost08                                -0.0021230274  0.0005455026   -3.892  9.97e-05
fueltype1midgrade Gasoline                -0.4740609076  0.0610182925   -7.769  8.21e-15
fueltype1natural Gas                      -0.4105540877  0.0661723638   -6.204  5.58e-10
fueltype1premium Gasoline                 -0.0967315827  0.0378983052   -2.552  0.010704
fueltype1regular Gasoline                 -0.2426262231  0.0223865307  -10.838  < 2e-16
highway08                                  0.3151744603  0.0018576810  169.660  < 2e-16
pv4                                       -0.0000105232  0.0000821307   -0.128  0.898049
tranyauto (AV-S8)                          0.2557313011  0.4858385280    0.526  0.598635
tranyauto (AV)                             0.5001211883  0.4341691465    1.152  0.249372
tranyautomatic (A1)                       -0.0743128639  0.5422835873   -0.137  0.891003
tranyautomatic (A6)                        0.0635311951  0.3844732984    0.165  0.868755
tranyautomatic (AM5)                      -0.1882752770  0.4222429051   -0.446  0.655678
tranyautomatic (AV-S6)                     0.2151677048  0.3670105420    0.586  0.557699
tranyautomatic (AV)                        0.3233230596  0.4856389045    0.666  0.505565
tranyautomatic (S4)                        0.1838921008  0.3446767571    0.534  0.593678
tranyautomatic (S5)                        0.1996004214  0.3438718695    0.580  0.561617
tranyautomatic (S6)                        0.1653573036  0.3436987787    0.481  0.630442
tranyautomatic (S7)                        0.0969365960  0.3446460811    0.281  0.778510
tranyautomatic (S8)                        0.1219030812  0.3440830930    0.354  0.723129
tranyautomatic (S9)                        0.1442612937  0.3644300188    0.396  0.692216
tranyautomatic (variable gear ratios)      0.2237698798  0.3439189922    0.651  0.515280
tranyautomatic 3-spd                       0.1890650062  0.3438178835    0.550  0.582394
tranyautomatic 4-spd                       0.1766820992  0.3437139971    0.514  0.607230
tranyautomatic 5-spd                       0.1655396398  0.3437267565    0.482  0.630093
tranyautomatic 6-spd                       0.1258491044  0.3438357719    0.366  0.714357
tranyautomatic 6spd                        0.5999999532  0.4856950238    1.235  0.216715
tranyautomatic 7-spd                       0.1399613149  0.3438959979    0.407  0.684021
tranyautomatic 8-spd                       0.0935405364  0.3452224470    0.271  0.786426
tranyautomatic 9-spd                       0.1689719222  0.3489465213    0.484  0.628224
tranymanual 3-spd                          0.2150273413  0.3468771160    0.620  0.535333
tranymanual 4-spd                          0.2194354147  0.3439336887    0.638  0.523469
tranymanual 5-spd                          0.2169655996  0.3437309840    0.631  0.527911
tranymanual 6-spd                          0.1721377975  0.3436955587    0.501  0.616486
tranymanual 7-spd                          0.2832647875  0.3486179040    0.813  0.416492
Manual                                               NA            NA       NA  NA
TransmissionAutomatic                                NA            NA       NA  NA
TransmissionManual                                   NA            NA       NA  NA
Speed(A6)                                            NA            NA       NA  NA
Speed(AM5)                                           NA            NA       NA  NA
Speed(AV-S6)                                         NA            NA       NA  NA
Speed(AV-S8)                                         NA            NA       NA  NA
Speed(AV)                                            NA            NA       NA  NA
Speed(S4)                                            NA            NA       NA  NA
Speed(S5)                                            NA            NA       NA  NA
Speed(S6)                                            NA            NA       NA  NA
Speed(S7)                                            NA            NA       NA  NA
Speed(S8)                                            NA            NA       NA  NA
Speed(S9)                                            NA            NA       NA  NA
Speed(variable                                       NA            NA       NA  NA
Speed3-spd                                           NA            NA       NA  NA
Speed4-spd                                           NA            NA       NA  NA
Speed5-spd                                           NA            NA       NA  NA
Speed6-spd                                           NA            NA       NA  NA
Speed6spd                                            NA            NA       NA  NA
Speed7-spd                                           NA            NA       NA  NA
Speed8-spd                                           NA            NA       NA  NA
Speed9-spd                                           NA            NA       NA  NA
VClassLarge Cars                          -0.0085732664  0.0133195371   -0.644  0.519801
VClassMidsize Cars                        -0.0218198510  0.0092506680   -2.359  0.018345
VClassMidsize Station Wagons              -0.0219022280  0.0216235051   -1.013  0.311123
VClassMidsize-Large Station Wagons        -0.0311604611  0.0179830179   -1.733  0.083149
VClassMinicompact Cars                    -0.0089238577  0.0161247146   -0.553  0.579976
VClassMinivan - 2WD                       -0.0219167856  0.0252889401   -0.867  0.386140
VClassMinivan - 4WD                       -0.0091248337  0.0583729806   -0.156  0.875783
VClassSmall Pickup Trucks                  0.0071505635  0.0215443880    0.332  0.739968
VClassSmall Pickup Trucks 2WD              0.0430138488  0.0241627597    1.780  0.075060
VClassSmall Pickup Trucks 4WD              0.0468520499  0.0328094214    1.428  0.153303
VClassSmall Sport Utility Vehicle 2WD      0.0509802021  0.0263436609    1.935  0.052978
VClassSmall Sport Utility Vehicle 4WD      0.0616203600  0.0257132723    2.396  0.016563
VClassSmall Station Wagons                -0.0097006935  0.0130871864   -0.741  0.458558
VClassSpecial Purpose Vehicle             -0.3308800415  0.3439988129   -0.962  0.336128
VClassSpecial Purpose Vehicle 2WD          0.0331181023  0.0202052971    1.639  0.101210
VClassSpecial Purpose Vehicle 4WD          0.0195015867  0.0276706212    0.705  0.480956
VClassSpecial Purpose Vehicles            -0.0118675557  0.0160625099   -0.739  0.460014
VClassSpecial Purpose Vehicles/2wd        -0.2031794718  0.2430615501   -0.836  0.403209
VClassSpecial Purpose Vehicles/4wd         0.4869481677  0.3439006274    1.416  0.156801
VClassSport Utility Vehicle - 2WD         -0.0315782968  0.0147310961   -2.144  0.032071
VClassSport Utility Vehicle - 4WD          0.0121376849  0.0150854216    0.805  0.421060
VClassStandard Pickup Trucks              -0.0162506804  0.0154491830   -1.052  0.292865
VClassStandard Pickup Trucks 2WD           0.0210839112  0.0173209310    1.217  0.223521
VClassStandard Pickup Trucks 4WD           0.0295671913  0.0195013908    1.516  0.129492
VClassStandard Pickup Trucks/2wd          -0.2161035755  0.2433116583   -0.888  0.374455
VClassStandard Sport Utility Vehicle 2WD  -0.1296779192  0.0358232486   -3.620  0.000295
VClassStandard Sport Utility Vehicle 4WD  -0.0136021005  0.0280767608   -0.484  0.628063
VClassSubcompact Cars                     -0.0196238297  0.0096562414   -2.032  0.042140
VClassTwo Seaters                          0.0223171480  0.0141514439    1.577  0.114803
VClassVans                                 0.0148544860  0.0174657506    0.850  0.395060
VClassVans Passenger                       0.1142571793  0.2430979585    0.470  0.638356
VClassVans, Cargo Type                     0.0341493669  0.0243140042    1.405  0.160179
VClassVans, Passenger Type                 0.0335305680  0.0274663308    1.221  0.222178
after.2014                                 0.0451878412  0.0112822924    4.005  6.21e-05
year                                       0.0008192440  0.0010650260    0.769  0.441768
yousavespend                              -0.0003376612  0.0001114336   -3.030  0.002447
tcharger                                  -0.0310274840  0.0088460455   -3.507  0.000453
scharger                                   0.0015965937  0.0183814730    0.087  0.930784
decade                                     0.0031447466  0.0011324864    2.777  0.005493
Eighties                                   0.0700271077  0.0235398159    2.975  0.002934
Nineties                                   0.0143929768  0.0140589847    1.024  0.305960
Twothousand                                          NA            NA       NA  NA
Twoten                                               NA            NA       NA  NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3431 on 24306 degrees of freedom
  (1463 observations deleted due to missingness)
Multiple R-squared: 0.9952, Adjusted R-squared: 0.9952
F-statistic: 3.394e+04 on 150 and 24306 DF, p-value: < 2.2e-16

==== ANOVA ====

Analysis of Variance Table

Response: comb08
                    Df  Sum Sq Mean Sq       F value  Pr(>F)
city08               1  584032  584032  4962188.4905  < 2.2e-16 ***
co2tailpipegpm       1    6217    6217    52818.0944  < 2.2e-16 ***
cylinders            1     814     814     6919.1595  < 2.2e-16 ***
displ_ga            64    1566      24      207.9411  < 2.2e-16 ***
drive                6     410      68      580.4267  < 2.2e-16 ***
engid                1     346     346     2939.5246  < 2.2e-16 ***
CA.model             1       7       7       58.8118  1.800e-14 ***
fuelcost08           1      18      18      153.2497  < 2.2e-16 ***
fueltype1            4     434     109      922.6904  < 2.2e-16 ***
highway08            1    5227    5227    44412.6484  < 2.2e-16 ***
pv4                  1       1       1        6.0192  0.014158 *
trany               27      10       0        3.2908  1.680e-08 ***
VClass              33      13       0        3.3870  1.836e-10 ***
after.2014           1       3       3       27.1192  1.929e-07 ***
year                 1       0       0        4.1220  0.042340 *
yousavespend         1       1       1        8.9761  0.002738 **
tcharger             1       1       1        9.8312  0.001718 **
scharger             1       0       0        0.0800  0.777252
decade               1       0       0        0.3279  0.566930
Eighties             1       3       3       21.4492  3.652e-06 ***
Nineties             1       0       0        1.0481  0.305960
Residuals        24306    2861       0
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "\n"
Time taken: 1.81 secs
Rattle timestamp: 2016-04-12 17:30:01 seungkookang

Appendix B

Call:
lm(formula = comb08 ~ ., data = crs$dataset[crs$train, c(crs$input, crs$target)])

Residuals:
     Min       1Q   Median       3Q      Max
-1.17706 -0.24987  0.05533  0.25276  2.55808

Coefficients: (2 not defined because of singularities)
                         Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)          3.1954909050  0.4467419732    7.153  8.73e-13 ***
city08               0.6427607886  0.0014711425  436.913  < 2e-16 ***
co2tailpipegpm      -0.0012357748  0.0000838742  -14.734  < 2e-16 ***
cylinders            0.0077209069  0.0033522699    2.303  0.021276 *
displ               -0.0016039293  0.0047443164   -0.338  0.735310
engid                0.0000003545  0.0000001654    2.143  0.032145 *
camodel             -0.0010733426  0.0159615113   -0.067  0.946387
fuelcost08          -0.0015585928  0.0003494274   -4.460  8.22e-06 ***
highway08            0.3308656241  0.0014524238  227.802  < 2e-16 ***
pv4                 -0.0000042527  0.0000663520   -0.064  0.948897
manual               0.0562400288  0.0120178053    4.680  2.89e-06 ***
after2014            0.0254553553  0.0096299110    2.643  0.008214 **
yousavespend        -0.0002894438  0.0000702708   -4.119  3.82e-05 ***
tcharger             0.0264131146  0.0076658533    3.446  0.000571 ***
scharger             0.0009885123  0.0178359165    0.055  0.955802
eighties             0.0542186073  0.0110431385    4.910  9.18e-07 ***
nineties             0.0294131844  0.0097203029    3.026  0.002481 **
twothousand          0.0314295099  0.0087056403    3.610  0.000307 ***
twoten                         NA            NA       NA  NA
drive2              -0.0035878072  0.0014659451   -2.447  0.014394 *
vclass2              0.0011035693  0.0002602736    4.240  2.24e-05 ***
manual_eighties     -0.0510219812  0.0157425601   -3.241  0.001193 **
manual_nineties     -0.0413133618  0.0150074304   -2.753  0.005912 **
manual_twothousand  -0.0132139148  0.0150172224   -0.880  0.378913
manual_twoten                  NA            NA       NA  NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3596 on 24984 degrees of freedom
  (913 observations deleted due to missingness)
Multiple R-squared: 0.995, Adjusted R-squared: 0.9949
F-statistic: 2.238e+05 on 22 and 24984 DF, p-value: < 2.2e-16

==== ANOVA ====

Analysis of Variance Table

Response: comb08
                      Df Sum Sq Mean Sq      F value         Pr(>F)
city08                 1 620941  620941 4802791.6066      < 2.2e-16 ***
co2tailpipegpm         1   6545    6545   50626.4060      < 2.2e-16 ***
cylinders              1    793     793    6130.7913      < 2.2e-16 ***
displ                  1      4       4      31.7813 0.000000017441 ***
engid                  1    249     249    1924.2543      < 2.2e-16 ***
camodel                1     19      19     149.5094      < 2.2e-16 ***
fuelcost08             1     41      41     316.1082      < 2.2e-16 ***
highway08              1   7919    7919   61247.5166      < 2.2e-16 ***
pv4                    1      1       1       8.6551      0.0032645 **
manual                 1      5       5      35.7265 0.000000002301 ***
after2014              1      0       0       1.6807      0.1948397
yousavespend           1      2       2      16.6191 0.000045829379 ***
tcharger               1      1       1      10.3929      0.0012666 **
scharger               1      0       0       0.0015      0.9688293
eighties               1      1       1       6.0919      0.0135871 *
nineties               1      0       0       0.4910      0.4834671
twothousand            1      2       2      12.9609      0.0003187 ***
drive2                 1      0       0       3.6298      0.0567666 .
vclass2                1      2       2      17.0331 0.000036853093 ***
manual_eighties        1      1       1       6.2586      0.0123655 *
manual_nineties        1      1       1       7.9475      0.0048191 **
manual_twothousand     1      0       0       0.7743      0.3789125
Residuals          24984   3230       0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "\n"
Time taken: 0.16 secs
Rattle timestamp: 2016-04-12 19:02:48 seungkookang

Appendix C

Call:
lm(formula = comb08 ~ ., data = crs$dataset[crs$train, c(crs$input, crs$target)])

Residuals:
     Min       1Q   Median       3Q      Max
-1.17824 -0.24869  0.05627  0.25044  2.55458

Coefficients: (2 not defined because of singularities)
                        Estimate   Std. Error t value Pr(>|t|)
(Intercept)         3.2272922820 0.4468372799   7.223 5.25e-13 ***
city08              0.6427759517 0.0014738486 436.121  < 2e-16 ***
co2tailpipegpm     -0.0012329050 0.0000842435 -14.635  < 2e-16 ***
cylinders           0.0072352669 0.0033529847   2.158  0.03095 *
displ              -0.0024375377 0.0047468297  -0.514  0.60760
engid               0.0000002992 0.0000001650   1.814  0.06970 .
camodel            -0.0014902465 0.0159773616  -0.093  0.92569
fuelcost08         -0.0015790754 0.0003494875  -4.518 6.26e-06 ***
highway08           0.3308121388 0.0014528274 227.702  < 2e-16 ***
pv4                 0.0000669432 0.0001075188   0.623  0.53354
manual              0.0275425500 0.0052528650   5.243 1.59e-07 ***
after2014           0.0242161186 0.0096159931   2.518  0.01180 *
yousavespend       -0.0002939711 0.0000702829  -4.183 2.89e-05 ***
tcharger            0.0258103639 0.0076893961   3.357  0.00079 ***
scharger           -0.0001941384 0.0178426566  -0.011  0.99132
eighties            0.0431405710 0.0108952110   3.960 7.53e-05 ***
nineties            0.0212642384 0.0099760462   2.132  0.03306 *
twothousand         0.0299242747 0.0092199242   3.246  0.00117 **
twoten                        NA           NA      NA       NA
drive2             -0.0030513643 0.0014636570  -2.085  0.03710 *
vclass2             0.0010801695 0.0002602231   4.151 3.32e-05 ***
pv4_eighties       -0.0001779384 0.0001559518  -1.141  0.25389
pv4_nineties       -0.0000853722 0.0001412213  -0.605  0.54550
pv4_twothousand    -0.0000276421 0.0001340777  -0.206  0.83666
pv4_twoten                    NA           NA      NA       NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3597 on 24984 degrees of freedom
  (913 observations deleted due to missingness)
Multiple R-squared: 0.9949, Adjusted R-squared: 0.9949
F-statistic: 2.237e+05 on 22 and 24984 DF, p-value: < 2.2e-16

==== ANOVA ====

Analysis of Variance Table

Response: comb08
                   Df Sum Sq Mean Sq      F value         Pr(>F)
city08              1 620941  620941 4800204.7357      < 2.2e-16 ***
co2tailpipegpm      1   6545    6545   50599.1377      < 2.2e-16 ***
cylinders           1    793     793    6127.4891      < 2.2e-16 ***
displ               1      4       4      31.7642 0.000000017595 ***
engid               1    249     249    1923.2178      < 2.2e-16 ***
camodel             1     19      19     149.4288      < 2.2e-16 ***
fuelcost08          1     41      41     315.9380      < 2.2e-16 ***
highway08           1   7919    7919   61214.5276      < 2.2e-16 ***
pv4                 1      1       1       8.6504      0.0032729 **
manual              1      5       5      35.7073 0.000000002324 ***
after2014           1      0       0       1.6798      0.1949600
yousavespend        1      2       2      16.6102 0.000046046040 ***
tcharger            1      1       1      10.3873      0.0012705 **
scharger            1      0       0       0.0015      0.9688377
eighties            1      1       1       6.0886      0.0136123 *
nineties            1      0       0       0.4908      0.4835850
twothousand         1      2       2      12.9540      0.0003199 ***
drive2              1      0       0       3.6278      0.0568333 .
vclass2             1      2       2      17.0239 0.000037031459 ***
pv4_eighties        1      0       0       1.1323      0.2872931
pv4_nineties        1      0       0       0.3407      0.5594526
pv4_twothousand     1      0       0       0.0425      0.8366640

Residuals       24984   3232       0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "\n"
Time taken: 0.15 secs
Rattle timestamp: 2016-04-12 19:05:26 seungkookang
======================================================================

Appendix D

Call:
lm(formula = comb08 ~ ., data = crs$dataset[crs$train, c(crs$input, crs$target)])

Residuals:
     Min       1Q   Median       3Q      Max
-1.19842 -0.25012  0.05325  0.25507  2.60744

Coefficients: (2 not defined because of singularities)
                     Estimate   Std. Error t value Pr(>|t|)
(Intercept)      4.5214029735 0.4577515874   9.877  < 2e-16 ***
city08           0.6386045447 0.0015038259 424.653  < 2e-16 ***
co2tailpipegpm  -0.0020604433 0.0001067956 -19.293  < 2e-16 ***
cylinders        0.0141782988 0.0033792704   4.196 2.73e-05 ***
displ           -0.0014896710 0.0047308524  -0.315 0.752852
engid            0.0000007499 0.0000001676   4.475 7.68e-06 ***
camodel         -0.0012997322 0.0159072961  -0.082 0.934881
fuelcost08      -0.0022685356 0.0003528810  -6.429 1.31e-10 ***
highway08        0.3294072346 0.0014506165 227.081  < 2e-16 ***
pv4              0.0000027545 0.0000661010   0.042 0.966762
manual           0.0294139636 0.0052208719   5.634 1.78e-08 ***
after2014        0.0102421362 0.0096595491   1.060 0.289013
yousavespend    -0.0004340549 0.0000709699  -6.116 9.73e-10 ***
tcharger         0.0234235601 0.0076492885   3.062 0.002200 **
scharger         0.0095584364 0.0177890690   0.537 0.591051
eighties        -0.3483295443 0.0313131942 -11.124  < 2e-16 ***
nineties        -0.2663131469 0.0302353996  -8.808  < 2e-16 ***
twothousand     -0.1041053862 0.0296096084  -3.516 0.000439 ***
twoten                     NA           NA      NA       NA
drive2          -0.0038898558 0.0014577052  -2.668 0.007624 **
co2_eighties     0.0008298540 0.0000642096  12.924  < 2e-16 ***
co2_nineties     0.0006210918 0.0000619338  10.028  < 2e-16 ***
co2_twothousand  0.0003173050 0.0000618303   5.132 2.89e-07 ***
co2_twoten                 NA           NA      NA       NA
vclass2          0.0014707108 0.0002610411   5.634 1.78e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3583 on 24984 degrees of freedom
  (913 observations deleted due to missingness)
Multiple R-squared: 0.995, Adjusted R-squared: 0.995
F-statistic: 2.253e+05 on 22 and 24984 DF, p-value: < 2.2e-16

==== ANOVA ====

Analysis of Variance Table

Response: comb08
                   Df Sum Sq Mean Sq      F value    Pr(>F)
city08              1 620941  620941 4835985.1962 < 2.2e-16 ***
co2tailpipegpm      1   6545    6545   50976.3009 < 2.2e-16 ***
cylinders           1    793     793    6173.1631 < 2.2e-16 ***
displ               1      4       4      32.0010 1.558e-08 ***
engid               1    249     249    1937.5534 < 2.2e-16 ***
camodel             1     19      19     150.5427 < 2.2e-16 ***
fuelcost08          1     41      41     318.2929 < 2.2e-16 ***
highway08           1   7919    7919   61670.8173 < 2.2e-16 ***
pv4                 1      1       1       8.7149 0.0031591 **
manual              1      5       5      35.9735 2.028e-09 ***
after2014           1      0       0       1.6923 0.1933043
yousavespend        1      2       2      16.7340 4.314e-05 ***
tcharger            1      1       1      10.4647 0.0012183 **
scharger            1      0       0       0.0015 0.9687218
eighties            1      1       1       6.1340 0.0132673 *
nineties            1      0       0       0.4944 0.4819596
twothousand         1      2       2      13.0505 0.0003038 ***
drive2              1      0       0       3.6548 0.0559179 .
co2_eighties        1     11      11      82.7279 < 2.2e-16 ***
co2_nineties        1      9       9      66.6638 3.371e-16 ***
co2_twothousand     1      3       3      23.7731 1.091e-06 ***
vclass2             1      4       4      31.7422 1.780e-08 ***
Residuals       24984   3208       0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "\n"
Time taken: 0.14 secs
Rattle timestamp: 2016-04-12 19:07:00 seungkookang
======================================================================
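
As a sanity check on the Appendix D output, the summary statistics can be approximately recomputed from the "Sum Sq" column of its ANOVA table. The sketch below is not part of the original Rattle run. It is written in Python purely as arithmetic, the variable names are illustrative, and the inputs are the rounded integers printed in the table, so the results can only match the reported Multiple R-squared and F-statistic to about three significant figures.

```python
# Rounded "Sum Sq" values copied from the Appendix D ANOVA table.
# Every term in that table has Df = 1.
sum_sq = {
    "city08": 620941, "co2tailpipegpm": 6545, "cylinders": 793,
    "displ": 4, "engid": 249, "camodel": 19, "fuelcost08": 41,
    "highway08": 7919, "pv4": 1, "manual": 5, "after2014": 0,
    "yousavespend": 2, "tcharger": 1, "scharger": 0, "eighties": 1,
    "nineties": 0, "twothousand": 2, "drive2": 0, "co2_eighties": 11,
    "co2_nineties": 9, "co2_twothousand": 3, "vclass2": 4,
}
residual_ss = 3208   # "Residuals" row, Sum Sq
residual_df = 24984  # "Residuals" row, Df

model_ss = sum(sum_sq.values())  # sequential sums of squares add up to the model SS
model_df = len(sum_sq)           # one degree of freedom per term
total_ss = model_ss + residual_ss

# R-squared is the share of total variation explained by the model.
r_squared = model_ss / total_ss

# The overall F-statistic is model mean square over residual mean square.
f_statistic = (model_ss / model_df) / (residual_ss / residual_df)

print(f"model df     = {model_df}")
print(f"R-squared   ~= {r_squared:.4f}")
print(f"F-statistic ~= {f_statistic:.4g}")
```

The recomputed values should land close to the reported 0.995 and 2.253e+05 on 22 and 24984 DF. A large discrepancy would point to a transcription error in the table rather than to a problem with the model.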