Missing Data Treatments


Lindsey Perry
EDU7312: Spring 2012

Presentation Outline
- Types of Missing Data
- Listwise Deletion
- Pairwise Deletion
- Single Imputation Methods
  - Mean Imputation
  - Hot Deck Imputation
- Multiple Imputation
- Data Simulation

Types of Missing Data
- Missing Completely At Random (MCAR)
- Missing At Random (MAR)
- Missing Not At Random (MNAR)

Missing Completely At Random (MCAR)
- There is no relationship between the missing data and any variables.
- The probability of missingness is independent of all other variables.
- Every observation is as likely to be missing as any other observation.
- Most missing data treatments can be performed on datasets with data MCAR without introducing bias.
- Example: A student oversleeps and does not arrive in time to take the first section of a test.

Missing At Random (MAR)
- There is no relationship between the missing data and the variable on which the missingness occurs.
- However, the likelihood of missingness is related to another variable in the dataset.
- Examples:
  - Women report their weight on a survey less frequently than men.
  - One ethnicity reports income on a questionnaire less frequently than another ethnicity.

Missing Not At Random (MNAR)
- The probability of an observation being missing depends on the value of the variable that is missing.
- This is the most troublesome type of missing data and is often termed non-ignorable.
- Examples:
  - People who are poor are more likely not to report income on a survey.
  - Struggling readers are more likely to skip questions on a reading test.
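The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variables (`female`, `weight`), the sample size, and every missingness probability are assumptions chosen for the example, not values from the presentation.

```r
set.seed(1)
n <- 1000

# Hypothetical data: a fully observed covariate and a variable that will lose values
female <- rbinom(n, 1, 0.5)
weight <- rnorm(n, mean = 70 - 10 * female, sd = 10)

# MCAR: every value has the same 10% chance of being missing
mcar <- runif(n) < 0.10

# MAR: missingness depends only on the observed covariate, not on weight itself
mar <- runif(n) < ifelse(female == 1, 0.20, 0.05)

# MNAR: missingness depends on the (unobserved) value of weight itself
mnar <- runif(n) < plogis((weight - mean(weight)) / sd(weight) - 2)

weight_mcar <- ifelse(mcar, NA, weight)
weight_mar  <- ifelse(mar,  NA, weight)
weight_mnar <- ifelse(mnar, NA, weight)

# Compare the observed means with the true mean under each mechanism
c(true = mean(weight),
  mcar = mean(weight_mcar, na.rm = TRUE),
  mar  = mean(weight_mar,  na.rm = TRUE),
  mnar = mean(weight_mnar, na.rm = TRUE))
```

Under these assumptions, only the MCAR version is expected to leave the observed mean close to the true mean. In the MAR version the bias can be corrected by conditioning on `female`; in the MNAR version it cannot be corrected from the observed data alone.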

Listwise Deletion
- Process: if any observation is missing for a participant, delete all of the data for that participant.
- Listwise deletion assumes the data are MCAR.
- Pros:
  - Very easy procedure
- Cons:
  - Decreases the sample size and statistical power
  - Increases standard errors and widens confidence intervals

Listwise Deletion Example (before deletion):

  dv   iv1  iv2  iv3  iv4
  80   50   NA   NA   85
  95   45   53   100  75
  70   30   65   110  78
  NA   42   67   105  92
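As a minimal base-R sketch, listwise deletion of the example table amounts to keeping only the rows with no missing values. The data frame name `listwise_example` is made up for this illustration.

```r
# The example table, with NA marking missing observations
listwise_example <- data.frame(
  dv  = c(80, 95, 70, NA),
  iv1 = c(50, 45, 30, 42),
  iv2 = c(NA, 53, 65, 67),
  iv3 = c(NA, 100, 110, 105),
  iv4 = c(85, 75, 78, 92)
)

# complete.cases() flags rows that contain no NA in any column
complete_rows <- complete.cases(listwise_example)
listwise_result <- listwise_example[complete_rows, ]

print(listwise_result)
nrow(listwise_result)  # number of participants retained
```

`na.omit(listwise_example)` returns the same rows; `complete.cases()` is used here only to make the row selection explicit.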

Listwise Deletion Example (after deletion):

  dv   iv1  iv2  iv3  iv4
  95   45   53   100  75
  70   30   65   110  78

Pairwise Deletion
- Process: remove cases with missing data only when the missing value pertains to a particular calculation. This is also referred to as available case analysis.
- Pairwise deletion assumes the data are MCAR.
- Pros:
  - Retains more data than listwise deletion
- Cons:
  - Can introduce bias if data are not MCAR

Pairwise Deletion Example: If weight is not being used in the analysis, the cases where weight is missing would not be removed. If weight is a variable in the analysis, those cases would be removed.

Before deletion:

  dv   age  weight  height
  80   50   NA      58
  95   45   100     62
  70   30   110     NA
  110  NA   105     68

After deletion, for an analysis that uses weight:

  dv   age  weight  height
  95   45   100     62
  70   30   110     NA
  110  NA   105     68
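In base R, pairwise deletion is available for correlations through the `use = "pairwise.complete.obs"` argument of `cor()`. The sketch below applies it to the example table and counts how many cases each pair of variables actually uses. The object names are hypothetical.

```r
pairwise_example <- data.frame(
  dv     = c(80, 95, 70, 110),
  age    = c(50, 45, 30, NA),
  weight = c(NA, 100, 110, 105),
  height = c(58, 62, NA, 68)
)

# Cases kept for an analysis of dv and weight: only the row missing weight is dropped
dv_weight <- pairwise_example[complete.cases(pairwise_example[, c("dv", "weight")]), ]

# Number of cases available for every pair of variables
available <- !is.na(as.matrix(pairwise_example))
pair_n <- crossprod(available * 1)

# Each correlation is computed from the cases observed on that particular pair
pair_cor <- cor(pairwise_example, use = "pairwise.complete.obs")

pair_n
pair_cor
```

Because each cell of `pair_cor` can rest on a different subset of cases, the sample size varies from one statistic to the next, which is one reason pairwise results can be hard to interpret.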

Single Imputation Techniques
- Imputation: substituting a value for a missing observation
- Single imputation: each missing value is filled in with one plausible value
- Single imputation techniques:
  - Mean Imputation
  - Hot Deck Imputation

Mean Imputation
- This technique imputes the mean of a variable for the missing observations on that variable.
- Pros:
  - Retains sample size
- Cons:
  - Decreases standard deviation and standard errors
  - Creates narrower confidence intervals, increasing the probability of Type I errors

Mean Imputation Example (before imputation):

  dv   iv1  iv2  iv3  iv4
  80   50   NA   NA   86
  95   45   54   100  76
  70   30   65   110  78
  NA   43   67   105  92

Mean Imputation Example (after imputation):

  dv   iv1  iv2  iv3  iv4
  80   50   62   105  86
  95   45   54   100  76
  70   30   65   110  78
  82   43   67   105  92

  Means:
  82   42   62   105  83
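The example can be reproduced with a short base-R helper. This is a sketch: `impute_mean` and the data frame names are invented for the illustration, and the helper imputes the unrounded mean (81.67 for dv), whereas the table above shows it rounded to 82.

```r
mean_example <- data.frame(
  dv  = c(80, 95, 70, NA),
  iv1 = c(50, 45, 30, 43),
  iv2 = c(NA, 54, 65, 67),
  iv3 = c(NA, 100, 110, 105),
  iv4 = c(86, 76, 78, 92)
)

# Replace each NA in a vector with the mean of the observed values
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

mean_imputed <- as.data.frame(lapply(mean_example, impute_mean))

round(colMeans(mean_imputed))           # column means, rounded as in the table
sapply(mean_example, sd, na.rm = TRUE)  # spread of the observed values
sapply(mean_imputed, sd)                # spread after imputation
```

Comparing the last two lines shows the drawback listed above: the standard deviation shrinks for every column that received imputed values, because the imputed points sit exactly at the mean.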

Hot Deck Imputation
- Process: for each missing value, find an observation with similar values on the X variables and take its Y value. If multiple matching observations are found, the mean of their values is imputed. This can also be referred to as matching.
- Hot deck imputation uses the current dataset to find matches.
- Cold deck imputation uses an existing (external) dataset to find matches.
- Pros:
  - Retains size of dataset
- Cons:
  - Difficult to do when there are multiple variables with missing data
  - Reduces standard errors by underestimating the variability of the variable
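One simple way to implement the process described above is nearest-neighbour matching on a single X variable. The sketch below is an assumption about how "similar" might be defined (smallest absolute difference on `iv`), not the specific algorithm of any package. The helper `hot_deck_impute` and the small data frame are illustrative.

```r
hot_deck_example <- data.frame(
  dv = c(90, NA, 64, 100, 88, NA),
  iv = c(4, 3, 3.5, 5, 4, 6)
)

# For each missing y, borrow y from the donor(s) whose x is closest;
# if several donors tie, impute the mean of their y values
hot_deck_impute <- function(y, x) {
  donors <- which(!is.na(y))
  for (i in which(is.na(y))) {
    distance <- abs(x[donors] - x[i])
    nearest <- donors[distance == min(distance)]
    y[i] <- mean(y[nearest])
  }
  y
}

hot_deck_example$dv_imputed <- hot_deck_impute(hot_deck_example$dv,
                                               hot_deck_example$iv)
hot_deck_example
```

Donors are fixed before the loop starts, so an imputed value is never reused as a donor for another case.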

Hot Deck Imputation Example:

  Before imputation      After imputation
  dv    iv               dv    iv
  90    4                90    4
  NA    3                64    3
  64    3.5              64    3.5
  100   5                100   5
  88    4                88    4
  NA    6                100   6

Multiple Imputation
- Process: each missing value is replaced with multiple plausible values, which creates multiple possible datasets. The analysis results from these datasets are then pooled to produce one result.
  1. Impute: create multiple possible datasets
  2. Analyze: run the analysis on each dataset
  3. Pool: find the average of the estimates

Multiple Imputation
- There are multiple methods for computing the imputed values:
  - Predictive Mean Matching (pmm)
  - Bayesian Linear Regression (norm)
  - Logistic Regression (logreg)
  - Linear Discriminant Analysis (lda)
  - Random sample from observed values (sample)
  - Many others
- Pros:
  - Imputes multiple plausible values, which reduces the possibility of bias
- Cons:
  - Difficult to compute
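The pooling step can be sketched in base R using Rubin's rules: the pooled estimate is the average of the per-dataset estimates, and the pooled variance combines the average within-dataset variance with the between-dataset variance. The function name `pool_rubin` and the five slope estimates and standard errors below are hypothetical values made up for illustration.

```r
# Pool m estimates and their standard errors with Rubin's rules
pool_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  pooled_estimate <- mean(estimates)      # average of the m estimates
  within_var  <- mean(std_errors^2)       # average sampling variance
  between_var <- var(estimates)           # variance across imputed datasets
  total_var   <- within_var + (1 + 1 / m) * between_var
  c(estimate = pooled_estimate, se = sqrt(total_var))
}

# Hypothetical slope estimates from m = 5 imputed datasets
slopes    <- c(0.10, 0.14, 0.12, 0.09, 0.15)
slope_ses <- c(0.16, 0.17, 0.15, 0.18, 0.16)

pooled <- pool_rubin(slopes, slope_ses)
pooled
```

The pooled standard error is larger than the average of the individual standard errors because it also reflects the uncertainty introduced by the imputation itself. Single imputation methods ignore that component.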

Practice in R - Setting up Data
- Create this data frame in R and name it example:

  y     x
  5     1
  4     2
  4.5   3
  6     4
  7     5
  4.3   6
  5     NA
  2     NA
  6.7   NA
  8     8
  4     9
  6     10

- Run a regression with y as the DV and x as the IV.

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   4.6867     0.9870   4.748  0.00209 **
  x             0.1379     0.1615   0.854  0.42150
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 1.445 on 7 degrees of freedom
    (3 observations deleted due to missingness)
  Multiple R-squared: 0.09431, Adjusted R-squared: -0.03508
  F-statistic: 0.7289 on 1 and 7 DF, p-value: 0.4215

Practice in R - Listwise Deletion
- Listwise deletion:

    examplelistwise <- na.omit(example)

- Run a regression with y as the DV and x as the IV. The result is identical to the regression above, because lm() drops incomplete cases by default.

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   4.6867     0.9870   4.748  0.00209 **
  x             0.1379     0.1615   0.854  0.42150
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 1.445 on 7 degrees of freedom
  Multiple R-squared: 0.09431, Adjusted R-squared: -0.03508
  F-statistic: 0.7289 on 1 and 7 DF, p-value: 0.4215
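The setup and listwise steps can be written out as follows. This is a sketch of one way to do it in base R, with lowercase column names `y` and `x` assumed so that they match the `lm(y ~ x)` calls used later in the practice slides. The names `fit_default` and `fit_listwise` are illustrative.

```r
# Build the practice data frame; NA marks the three missing x values
example <- data.frame(
  y = c(5, 4, 4.5, 6, 7, 4.3, 5, 2, 6.7, 8, 4, 6),
  x = c(1, 2, 3, 4, 5, 6, NA, NA, NA, 8, 9, 10)
)

# Regression on the raw data: lm() silently drops the incomplete rows
fit_default <- lm(y ~ x, data = example)

# Explicit listwise deletion, then the same regression
examplelistwise <- na.omit(example)
fit_listwise <- lm(y ~ x, data = examplelistwise)

summary(fit_listwise)
round(coef(fit_listwise), 4)
```

Both fits use the same nine complete cases, which is why the two regression outputs above are identical apart from the note about deleted observations.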

Practice in R - Mean Imputation
- Mean imputation:

    library(Hmisc)                                   # provides impute(); package name is case-sensitive
    examplemean <- example                           # work on a copy of the data
    examplemean$x <- impute(examplemean$x, mean)     # replace missing x with the mean of observed x

- Run a regression with y as the DV and x as the IV.

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   4.4728     1.1004   4.065  0.00227 **
  x             0.1379     0.1857   0.743  0.47476
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 1.661 on 10 degrees of freedom
  Multiple R-squared: 0.05227, Adjusted R-squared: -0.0425
  F-statistic: 0.5516 on 1 and 10 DF, p-value: 0.4748

Practice in R - Hot Deck Imputation
- Hot deck imputation:

    library(rrp)                           # provides rrp.impute()
    examplehd <- rrp.impute(example)       # impute missing values by matching
    examplehdd <- examplehd$new.data       # extract the completed data frame

- Run a regression with y as the DV and x as the IV.

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   4.2215     0.8437   5.003 0.000535 ***
  x             0.2115     0.1528   1.384 0.196413
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 1.563 on 10 degrees of freedom
  Multiple R-squared: 0.1608, Adjusted R-squared: 0.07687
  F-statistic: 1.916 on 1 and 10 DF, p-value: 0.1964

Practice in R - Multiple Imputation
- Multiple imputation:

    library(mice)
    # No method for y (it has no missing values); predictive mean matching for x
    examplemi <- mice(example, meth = c("", "pmm"), maxit = 1)
    examplemi2 <- with(examplemi, lm(y ~ x))   # fit the regression on each imputed dataset
    mipooled <- pool(examplemi2)               # pool the estimates across datasets
    mipooled

- Pooled regression with y as the DV and x as the IV:

                     est        se          t       df   Pr(>|t|)
  (Intercept) 5.15015978 1.1108854 4.63608574 7.679074 0.00186596
  x           0.01100627 0.1777149 0.06193217 7.486365 0.95223815

Practice in R - Comparing Methods
[Figure not reproduced. Legend: Listwise: grey; Mean Imputation: black; Hot Deck: blue; Multiple Imputation: purple]

Simulation in R
- Population = 100,000
- Variables: DV, IV1, IV2, IV3
- Randomly sampled 5 subsets, n = 5,000 each
- Created 3 datasets from each subset with 5%, 10%, and 20% missingness on IV1
- Performed Listwise Deletion, Mean Imputation, Hot Deck Imputation, and Multiple Imputation on each dataset
- Calculated regression estimates
- Calculated Percent Relative Parameter Bias and Relative Standard Error Bias

[Design diagram not reproduced. Population (100,000) -> 5 subsets of 5,000 -> each subset with 5%, 10%, and 20% missingness -> each dataset treated with listwise deletion (LW), mean imputation (Mean), hot deck imputation (HD), and multiple imputation (MI)]
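The design can be sketched in base R as below. The presentation does not give the population model or the missingness mechanism, so the regression coefficients, the normal distributions, and the MCAR deletion used here are all assumptions for illustration. Only the listwise arm is shown; the other three treatments would slot into the same loop.

```r
set.seed(2012)
population_size <- 100000

# Hypothetical population: the coefficients below are assumed, not from the slides
population <- data.frame(
  iv1 = rnorm(population_size),
  iv2 = rnorm(population_size),
  iv3 = rnorm(population_size)
)
population$dv <- 1 + 0.5 * population$iv1 + 0.3 * population$iv2 +
  0.2 * population$iv3 + rnorm(population_size)

# Set a given proportion of iv1 to NA completely at random (assumed mechanism)
make_missing <- function(data, proportion) {
  drop <- sample(nrow(data), size = round(proportion * nrow(data)))
  data$iv1[drop] <- NA
  data
}

results <- list()
for (s in 1:5) {                                   # 5 random subsets
  subset_s <- population[sample(population_size, 5000), ]
  for (p in c(0.05, 0.10, 0.20)) {                 # 3 missingness levels
    with_missing <- make_missing(subset_s, p)
    fit <- lm(dv ~ iv1 + iv2 + iv3, data = na.omit(with_missing))  # listwise arm
    results[[length(results) + 1]] <- data.frame(
      subset   = s,
      missing  = p,
      term     = names(coef(fit)),
      estimate = unname(coef(fit)),
      se       = unname(summary(fit)$coefficients[, "Std. Error"])
    )
  }
}
listwise_results <- do.call(rbind, results)
head(listwise_results)
```

Each row of `listwise_results` holds one coefficient from one subset at one missingness level, which is the shape needed to summarise bias across the five subsets.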

Comparing Methods - PRPB
Percent Relative Parameter Bias (PRPB)
- Measures the amount of bias introduced under a specific set of conditions (e.g., missing data treatments)

  $\mathrm{PRPB} = 100 \times \dfrac{\bar{\hat{\theta}}_p - \theta_p}{\theta_p}$

  - $\bar{\hat{\theta}}_p$: mean of the pth parameter for x estimates
  - $\theta_p$: corresponding population parameter
- Produces a standardized metric to examine the size and direction of the bias
- Values above 5% or below -5% are considered unacceptable

Comparing Methods - PRPB

Listwise Deletion PRPB
         Intercept   IV1      IV2     IV3
  5%     -1.569      -0.064   2.640   -4.672
  10%    -1.602      -0.315   1.743   -2.645
  20%    -1.581      -0.243   3.823   -3.991

Hot Deck Imputation PRPB
         Intercept   IV1      IV2     IV3
  5%     -1.688      2.749    2.561   2.562
  10%    -1.700      5.856    0.525   3.288
  20%    -1.762      12.544   0.569   7.024

Mean Imputation PRPB
         Intercept   IV1      IV2     IV3
  5%     -1.723      -0.169   5.743   4.658
  10%    -1.462      -0.502   5.058   -11.168
  20%    -0.877      -0.771   5.454   -46.752

Multiple Imputation PRPB
         Intercept   IV1      IV2     IV3
  5%     -1.658      -0.281   3.331   0.692
  10%    -1.544      -0.046   2.142   -6.233
  20%    -1.519      -0.507   3.378   -7.736
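As a small worked sketch, PRPB can be computed from a set of estimates and the known population value, assuming the usual definition (mean estimate minus population value, divided by the population value, times 100). The function name `prpb` and the numbers are hypothetical.

```r
# Percent relative parameter bias for one parameter
prpb <- function(estimates, population_value) {
  100 * (mean(estimates) - population_value) / population_value
}

# Hypothetical IV1 slope estimates from five samples; assumed population slope of 0.5
iv1_estimates <- c(0.52, 0.49, 0.51, 0.50, 0.53)
iv1_bias <- prpb(iv1_estimates, population_value = 0.5)

iv1_bias            # positive values mean the parameter is overestimated
abs(iv1_bias) <= 5  # TRUE when the bias is within the +/-5% threshold
```

The sign of the result gives the direction of the bias, and dividing by the population value makes results comparable across parameters measured on different scales.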

Comparing Methods - PRPB
[Four figures not reproduced. Legend: 5%: grey; 10%: black; 20%: blue]

Comparing Methods - RSEB
Relative Standard Error Bias (RSEB)
- Measures the amount of bias in standard error estimates

  $\mathrm{RSEB} = 100 \times \dfrac{\overline{SE}_{\hat{\theta}} - SD_{\hat{\theta}}}{SD_{\hat{\theta}}}$

  - $\overline{SE}_{\hat{\theta}}$: mean of the standard errors of the intercepts
  - $SD_{\hat{\theta}}$: standard deviation of the intercepts
- Produces a standardized metric to examine the size and direction of the bias
- Values above 10% or below -10% are considered unacceptable

Comparing Methods - RSEB

Relative Standard Error Bias
         Listwise   Mean Imputation   Hot Deck Imputation   Multiple Imputation
  5%     82.47      102.55            85.38                 107.45
  10%    68.77      86.43             55.62                 39.48
  20%    51.54      39.62             7.06                  66.21
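RSEB can be sketched the same way, assuming the conventional definition: the mean of the estimated standard errors is compared with the empirical standard deviation of the estimates across replications. The function name `rseb` and the values are hypothetical and are not meant to reproduce the table above.

```r
# Relative standard error bias for one parameter
rseb <- function(std_errors, estimates) {
  empirical_sd <- sd(estimates)   # observed variability of the estimates
  100 * (mean(std_errors) - empirical_sd) / empirical_sd
}

# Hypothetical intercept estimates and their standard errors from five samples
intercepts    <- c(1.00, 1.10, 0.90, 1.05, 0.95)
intercept_ses <- c(0.08, 0.08, 0.08, 0.08, 0.08)

intercept_rseb <- rseb(intercept_ses, intercepts)
intercept_rseb
abs(intercept_rseb) <= 10  # TRUE when within the +/-10% threshold
```

A negative value would mean the reported standard errors understate the true sampling variability, which is the pattern expected from single imputation methods.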

Comparing Methods - RSEB
[Figure not reproduced. Legend: Listwise: grey; Mean Imputation: black; Hot Deck: blue; Multiple Imputation: purple]

Conclusions
- Prevent missing data.
- If data are missing, attempt to determine why they are missing.
- There is no silver bullet treatment method.

References
- Alemdar, M. (2009). A Monte Carlo study: The impact of missing data in cross-classification random effects models. Georgia State University. ProQuest Dissertations and Theses. http://search.proquest.com/docview/304890975?accountid=6667
- Allison, P. D. (2003). Missing data techniques for structural equation modeling. Journal of Abnormal Psychology, 112(4), 545-557.
- Batista, G. E. A. P. A., & Monard, M. C. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5), 519-533.
- Howell, D. C. (2008). The analysis of missing data. In Outhwaite, W. & Turner, S. (Eds.), Handbook of Social Science Methodology. London: Sage.
- Lynch, S. M. (2003). Missing data. Retrieved from http://www.princeton.edu/~slynch/soc504/data.pdf
- Scheffer, J. (2002). Dealing with missing data. Res. Lett. Inf. Math. Sci., 3, 153-160.