Victoria SAS Users Group
November 26, 2013
Missing value imputation in SAS: an intro to Proc MI and MIANALYZE
Sylvain Tremblay
SAS Canada Education
Copyright 2010 SAS Institute Inc. All rights reserved.

Thanks for having me in BC!

Missing Values
1978
The objective is to develop procedures that are useful in practice

Agenda
- Why should you care about missing values?
- Exploring missing data patterns
- Understanding missing data mechanisms
- Common imputation strategies
- Multiple Imputation
- References / Conclusion / Questions

Why should you care about missing values?
- SAS/STAT Procs: Complete Case Analysis (CCA)
  - Observations for which any variable used in the analysis is missing are deleted
- Impact of CCA:
  - Reduction in sample size
  - Standard errors and/or parameter estimates may be inadequately estimated

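The sample-size cost of complete case analysis can be illustrated with a small sketch. It is written in Python rather than SAS, the records are made up, and the variable names `age`, `income` and `score` are hypothetical. Any record with at least one missing analysis variable is dropped, which mirrors the CCA rule described above.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": 41, "income": None, "score": 6.4},
    {"age": None, "income": 48000, "score": 5.9},
    {"age": 29, "income": 61000, "score": None},
    {"age": 53, "income": 75000, "score": 8.2},
    {"age": 47, "income": None, "score": None},
]
analysis_vars = ["age", "income", "score"]


def complete_cases(rows, variables):
    """Keep only rows where every analysis variable is observed."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


kept = complete_cases(records, analysis_vars)
missing_cells = sum(r[v] is None for r in records for v in analysis_vars)
total_cells = len(records) * len(analysis_vars)

print(f"missing cells: {missing_cells} of {total_cells}")
print(f"rows kept by CCA: {len(kept)} of {len(records)}")
```

In this toy data set 5 of 18 cells (about 28%) are missing, yet CCA keeps only 2 of the 6 rows.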
Exploring missing data patterns
- Get to know the data: exploratory data analysis
- How much data are missing?
- Are there any patterns in the missing values?
- Are there a lot of missing values for certain variables?
- Is there a group of observations with very little information available?

Exploring missing data patterns
- Two types of missing data pattern: monotone and arbitrary

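A rough way to tabulate missing data patterns and to tell a monotone pattern from an arbitrary one is sketched below in Python. It uses the usual definition of monotone: once a variable is missing for a unit, every later variable is missing too. The check only considers the column order as given, and the function names and data are hypothetical.

```python
from collections import Counter


def missing_pattern(row):
    """Return a string such as 'XX.' where X = observed and . = missing."""
    return "".join("." if value is None else "X" for value in row)


def pattern_counts(rows):
    """Count how many rows share each missing data pattern."""
    return Counter(missing_pattern(row) for row in rows)


def is_monotone(rows):
    """True if no row has an observed value after a missing one (given column order)."""
    return all(".X" not in missing_pattern(row) for row in rows)


# Hypothetical data: three variables per unit, None marks a missing value.
monotone_rows = [(1.0, 2.0, 3.0), (1.5, 2.5, None), (0.7, None, None)]
arbitrary_rows = [(1.0, None, 3.0), (None, 2.0, 3.1), (1.2, 2.2, None)]

for name, rows in (("monotone example", monotone_rows), ("arbitrary example", arbitrary_rows)):
    print(name, dict(pattern_counts(rows)), "monotone:", is_monotone(rows))
```

A data set that fails this check in one column order may still be monotone after reordering the variables; the sketch does not search over orderings.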
Understanding missing data mechanisms
What is the process that generates the missing values?
- Missing At Random (MAR)
  - Given the observed data, the missingness mechanism does not depend on the unobserved data
  - Other variables in the dataset (but not the variable itself) can be used to predict missingness on a given variable
  - Example: in surveys, men may be more likely than women to decline to answer some questions
- Missing Completely At Random (MCAR)
  - Special case of MAR
  - The probability of an observation being missing does not depend on observed or unobserved measurements
  - Fairly strong assumption; relatively rare
  - Example: miscoded values, accidental loss of data
  - Under MCAR, the analysis of only those units with complete data (CCA) gives valid inferences
- Missing Not At Random (MNAR)
  - When neither MCAR nor MAR holds
  - Data are missing for a specific reason: the value of the unobserved variable itself predicts missingness
  - Example: certain questions on a questionnaire tend to be skipped deliberately by participants with certain characteristics

Understanding missing data mechanisms
Missing At Random (MAR)
- MAR is equivalent to saying that two units that share the same observed values have the same statistical behaviour on the other observations, whether observed or not.

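The three mechanisms can be compared with a small simulation, sketched here in Python under made-up assumptions: `x` is always observed, `y = x + noise` is subject to missingness, and the missingness probabilities (0.4, 0.8, 0.1) are arbitrary illustrative choices. For each mechanism the sketch reports how far the mean of `y` over all units is from the mean over the observed units only, first overall and then among units that share `x > 0`.

```python
import random
from statistics import mean

random.seed(2013)  # fixed seed so the simulation is repeatable
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]   # always observed
y = [xi + random.gauss(0, 1) for xi in x]    # subject to missingness


def observed_flags(mechanism):
    """Return one flag per unit: True means y is observed."""
    flags = []
    for xi, yi in zip(x, y):
        if mechanism == "MCAR":
            p_missing = 0.4                      # ignores both x and y
        elif mechanism == "MAR":
            p_missing = 0.8 if xi > 0 else 0.1   # depends on observed x only
        else:                                    # MNAR
            p_missing = 0.8 if yi > 0 else 0.1   # depends on y itself
        flags.append(random.random() >= p_missing)
    return flags


def gap(flags, keep=lambda xi: True):
    """Mean of y over all kept units minus the mean over the observed kept units."""
    everyone = [yi for xi, yi in zip(x, y) if keep(xi)]
    observed = [yi for xi, yi, seen in zip(x, y, flags) if keep(xi) and seen]
    return mean(everyone) - mean(observed)


results = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    flags = observed_flags(mechanism)
    results[mechanism] = (gap(flags), gap(flags, keep=lambda xi: xi > 0))
    overall, within = results[mechanism]
    print(f"{mechanism}: overall gap {overall:+.2f}, gap among x > 0 {within:+.2f}")
```

With these settings the overall gap should be close to zero only under MCAR, which is the situation where CCA gives valid inferences. Under MAR the overall gap is clearly nonzero, but the gap among units that share `x > 0` is close to zero. Under MNAR the gap persists even within that group.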
Common imputation strategies
Imputation: replace missing values with some other value
- Mean imputation
  - Replaces missing values with the sample mean
  - Assumes MCAR
  - Produces distributions that have far too many cases at the mean
  - Reduces the variance of the variable
  - Leads to biased estimates
- Conditional mean imputation
  - Uses the mean from cases that are similar to the case with the missing values
  - Assumes MAR
- Decision Tree imputation
  - Replaces missing values with predicted values from a regression analysis of the complete data
  - Shares similar problems with mean substitution

Common imputation strategies
Issues with these simple strategies (mean substitution, conditional mean imputation)
- The imputed values are completely determined by a model applied to the observed data: they contain no error
- This tends to reduce the variance and can distort relationships among variables

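The variance reduction caused by mean substitution is easy to see on a toy example. The Python sketch below uses made-up numbers: five observed values and five missing values that are all filled in with the observed mean.

```python
from statistics import mean, variance

observed = [4.0, 6.0, 8.0, 10.0, 12.0]   # hypothetical observed values
n_missing = 5                             # hypothetical number of missing values

fill = mean(observed)                     # every missing value gets the same number
imputed = observed + [fill] * n_missing

print("mean before:", mean(observed), "after:", mean(imputed))
print("variance before:", variance(observed), "after:", round(variance(imputed), 2))
```

The mean is unchanged at 8.0, but the sample variance falls from 10.0 to about 4.44, because the imputed values add cases at the mean without adding any spread.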
Multiple Imputation
Three-step process
1. Create a series of m imputed data sets by running an imputation model based on chosen variables and an imputation method
2. Carry out the analysis model on each of the imputed data sets
3. Combine the parameter estimates from each imputed data set to get a single final set of parameter estimates

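Step 3, the combining step, is commonly done with Rubin's rules, and PROC MIANALYZE handles it in SAS. The Python sketch below shows the arithmetic for a single parameter. The estimates and standard errors are made-up numbers standing in for the output of step 2, and the function name is hypothetical. It omits the degrees of freedom and confidence limits that a full analysis would also report.

```python
from statistics import mean, variance


def combine_estimates(estimates, std_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                        # combined point estimate
    within = mean([se ** 2 for se in std_errors])   # average within-imputation variance
    between = variance(estimates)                   # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between          # total variance
    return pooled, total ** 0.5


# Hypothetical results of fitting the same model to m = 5 imputed data sets.
estimates = [2.0, 2.2, 1.9, 2.1, 2.3]
std_errors = [0.30, 0.30, 0.30, 0.30, 0.30]

pooled_estimate, pooled_se = combine_estimates(estimates, std_errors)
print(f"pooled estimate: {pooled_estimate:.3f}, pooled standard error: {pooled_se:.3f}")
```

Here the pooled standard error (about 0.35) is larger than the 0.30 reported by each individual analysis, because the between-imputation variance carries the extra uncertainty due to the missing values.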
Multiple Imputation
Selecting the number of imputations (m)
- Historically, m was between 3 and 5
- Now (because of computing power), m should be
  - between 5 and 20 for low fractions of missing information
  - as large as 50 (or more) when the proportion of missing data is relatively high

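One quantity often used when reasoning about m is Rubin's relative efficiency of an estimate based on m imputations compared with one based on infinitely many, 1 / (1 + lambda / m), where lambda is the fraction of missing information. This formula does not appear on the slide and is brought in here as an assumption. The Python sketch below tabulates it for a few illustrative values.

```python
def relative_efficiency(fraction_missing_info, m):
    """Efficiency of m imputations relative to infinitely many: 1 / (1 + lambda / m)."""
    return 1 / (1 + fraction_missing_info / m)


for lam in (0.1, 0.3, 0.5, 0.7):
    row = ", ".join(f"m={m}: {relative_efficiency(lam, m):.3f}" for m in (3, 5, 20, 50))
    print(f"lambda={lam}: {row}")
```

For lambda = 0.5, five imputations give an efficiency of about 0.91 and fifty give about 0.99. Efficiency is only one consideration when choosing m.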
Multiple Imputation - Proc MI
Selected statements
- m = number of imputations
- Imputation methods
  - Markov Chain Monte Carlo (MCMC): generates pseudorandom draws from multidimensional probability distributions via Markov chains
    - Assumptions: arbitrary missing pattern, multivariate normal distribution
  - Monotone methods
    - Assumptions: monotone missing pattern

Multiple Imputation (MI)
In choosing the variables for the VAR statement, you should include
- Variables you want to impute
- Variables that are potentially related to the imputed variables
- Variables that are potentially related to the missingness of the imputed variables

Conclusion
- You should care about missing values!
- Explore missing data patterns
- Understand the missing data mechanism
- Select an imputation method that takes the missing data pattern into consideration
- If your dataset is too large for MI, an alternative is maximum likelihood estimation

Multiple Imputation (MI)
- MAR is the primary assumption of MI methods
- There is no standard statistical test to determine whether missing data are MAR
- MI is superior to single imputation (mean imputation, conditional mean imputation) because it takes into account the uncertainty about what the true values of the missing data are

References
- Multiple Imputation in SAS
  http://www.ats.ucla.edu/stat/sas/seminars/missing_data/part1.htm
- Multiple Imputation for Missing Data: Concepts and New Development
  http://www2.sas.com/proceedings/sugi25/25/st/25p267.pdf
- Knowledge (of your missing data) is power: handling missing values in your SAS dataset
  http://support.sas.com/resources/papers/proceedings12/319-2012.pdf

Questions?
THANK YOU!
Sylvain.Tremblay@sas.com