Missing Data: Part 2
Implementing Multiple Imputation in STATA and SPSS
Carol B. Thompson
Johns Hopkins Biostatistics Center
SON Brown Bag 4/24/13


Overview
- Reminder
- Steps in Multiple Imputation
- Implementation in STATA
- Implementation in SPSS

4/24/13 SON_BB_MissingData_part2_20130424

Statistical Analysis & Missingness
- Sampling process: from the population to which inference is to be made to the sample used for inference (assumed representative)
- Inference process: from the sample back to the population
- Part of the sample has missing data
- Is the sample with missing data still representative enough to make appropriate inferences to the population of interest?

Missingness Mechanisms
- Process by which observations become missing
- Mechanism types
  - Missing Completely at Random (MCAR)
  - Missing at Random (MAR)
  - Missing Not at Random (MNAR)
- Using Multiple Imputation
  - Mostly for MAR
  - Likely not for MNAR
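The three mechanisms can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not part of the original talk: the function name complete_case_mean, the variables x and y, the logistic missingness probabilities, and the sample size are all hypothetical, and the seed simply reuses the value from the later slides. It compares the mean of y over all cases with the mean over cases where y is observed.

```python
import math
import random
import statistics


def complete_case_mean(mechanism, n=20000, seed=29390):
    """Simulate y with missing values; return (mean of all y, mean of observed y)."""
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)        # covariate that is always observed
        y = x + rng.gauss(0, 1)    # variable that may become missing
        if mechanism == "MCAR":
            p_missing = 0.5                       # unrelated to the data
        elif mechanism == "MAR":
            p_missing = 1 / (1 + math.exp(-x))    # depends only on the observed x
        elif mechanism == "MNAR":
            p_missing = 1 / (1 + math.exp(-y))    # depends on the value of y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_y.append(y)
        if rng.random() >= p_missing:
            observed_y.append(y)
    return statistics.fmean(all_y), statistics.fmean(observed_y)


for mechanism in ("MCAR", "MAR", "MNAR"):
    full, observed = complete_case_mean(mechanism)
    print(f"{mechanism}: mean of all y = {full:.3f}, complete-case mean = {observed:.3f}")
```

In this toy setup the complete-case mean stays close to the full-data mean under MCAR and is pulled downward under MAR and MNAR, because larger values are more likely to be missing. Under MAR the shift is driven by the observed x, which an imputation model can use; under MNAR it depends on the unobserved y, which is why MI is likely not appropriate there.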

Multiple Imputation (MI)
- When used correctly, produces estimates that are:
  - Approximately unbiased, improving as sample size increases
  - Asymptotically normal when data are MAR
- Can construct CIs and p-values

MI (cont'd)
- Advantages
  - Can be used with virtually any kind of data, any kind of model, and with unmodified conventional software
- Disadvantages
  - Can be cumbersome to implement
  - Is easy to do wrong
  - Produces different estimates every time it is used (hopefully small differences)

MI Process
- Repeat the random imputation process more than once (5 times is generally enough)
- Each imputation process represents a random sample from the distribution of plausible values for the missing values
- Important for imputation processes to be independent: use a large number of iterations between each saved data set
- Analyze the data set from each imputation process as if there were no missing data

MI Process: Pooling Estimates
- Pooled estimate: mean of the estimates across the M imputed data sets
- Within-imputation variance: mean of the squared std errors
- Between-imputation variance: variance of the estimates
- Pooled std error: square root of (within-imputation variance + (1 + 1/M) x between-imputation variance)
- Can be used with any parameter
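The pooling steps can be written out in a few lines. This is a minimal sketch of Rubin's rules, which weight the variance of the estimates by (1 + 1/M). The function name pool_estimates and the three coefficients and standard errors are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math
import statistics


def pool_estimates(estimates, std_errors):
    """Pool M point estimates and their standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two imputations")
    pooled = statistics.fmean(estimates)                        # mean of the estimates
    within = statistics.fmean([se ** 2 for se in std_errors])   # mean of squared std errors
    between = statistics.variance(estimates)                    # variance of the estimates
    total = within + (1 + 1 / m) * between                      # total variance
    return pooled, math.sqrt(total)


# Hypothetical coefficients and standard errors from M = 3 imputed data sets.
estimate, std_error = pool_estimates([1.0, 1.2, 0.8], [0.5, 0.5, 0.5])
print(f"pooled estimate = {estimate:.3f}, pooled SE = {std_error:.3f}")
```

The pooled standard error is larger than any single-imputation standard error because it adds the disagreement between imputations to the average sampling variance. For odds ratios, the pooling would normally be done on the coefficient (log-odds) scale.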

MI Example (Howell, Part 2)

Additional Rules of Thumb (Allison)
- Dependent variable (DV) should always be included in the imputation regression analysis
- Impute missing values on the DV if:
  - There are auxiliary variables strongly correlated with the DV
- Don't impute the DV if:
  - No missing predictor data or auxiliary variables
  - No auxiliary variables and missing predictor data

Preparation
- Explore missing data patterns
- Determine missingness mechanism and appropriateness for MI
- Assign missing codes in data set to missing designation
  - ".", ".a" through ".z" in STATA
  - MISSING VALUES command in SPSS
- Determine variables to be included in MI process, not just those included in model

MI in STATA: Data Set
- Data set: use http://www.stata-press.com/data/r11/mheart5
- Fictional heart attack data; bmi and age missing
  - 12 cases with missing age
  - 28 cases with missing bmi
- Variables:
  - attack (binary, dependent variable)
  - smokes (binary)
  - age (continuous)
  - bmi (continuous)
  - female (binary)
  - hsgrad (binary)

MI in STATA: Set up/review
- Declare data to be mi
    mi set mlong
  mlong is most memory efficient
- Explore missing patterns
    mi misstable summarize (other options available)
- Register variables
    mi register type varlist
  - imputed: required
  - passive: variable that is a function of imputed variable(s)
  - regular: neither imputed nor passive
- Confirm mi data set up
    mi describe

MI in STATA: Imputation Step
- Set seed for reproducibility, either separately or in the mi impute command
    set seed 29390
- Create imputed data sets
    mi impute method, options
  Set up and options differ by method
    mi impute mvn age bmi = attack smokes hsgrad female, rseed(29390) add(10)
  Creates 10 imputed data sets with seed 29390 using multivariate normal regression
- The more missing data, the more imputations needed
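One common way to quantify how many imputations are needed is Rubin's approximate relative efficiency, 1 / (1 + gamma/M), where gamma is the fraction of missing information and M is the number of imputations. The sketch below is illustrative only: the function name relative_efficiency and the gamma values are hypothetical, and gamma is not the same thing as the percentage of missing cases.

```python
def relative_efficiency(fraction_missing_info, n_imputations):
    """Approximate efficiency of M imputations relative to infinitely many."""
    if not 0 <= fraction_missing_info <= 1:
        raise ValueError("fraction of missing information must be between 0 and 1")
    if n_imputations < 1:
        raise ValueError("need at least one imputation")
    return 1 / (1 + fraction_missing_info / n_imputations)


# Hypothetical fractions of missing information, evaluated at M = 5, 10, and 20.
for gamma in (0.1, 0.3, 0.5):
    row = ", ".join(f"M={m}: {relative_efficiency(gamma, m):.3f}" for m in (5, 10, 20))
    print(f"fraction of missing information {gamma}: {row}")
```

By this measure a handful of imputations already recovers most of the efficiency when little information is missing, and the gain from adding imputations grows as the fraction of missing information rises.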

MI in STATA: Imputation Step
- 11 data sets
  - Original data set (numbered as 0) with missing data
  - Imputed data sets (numbered as 1-10)
- Review imputed data sets
  - Show summary statistics for imputed variables
      mi xeq 0 1 3 6 10: summarize age bmi

MI in STATA: Estimation Step
- Run estimation model
    mi estimate, options: estimation command
  Always provides estimates as coefficients
    mi estimate: logistic attack smokes age bmi hsgrad female
- Get estimates in terms of odds ratios
    mi estimate, or
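The relationship between a pooled logit coefficient and the odds ratio reported with the or option can be sketched as follows. The function name odds_ratio_summary and the coefficient and standard error are hypothetical. The multiplier z = 1.96 is a large-sample assumption; MI software typically uses a t reference distribution with degrees of freedom that depend on the imputations.

```python
import math


def odds_ratio_summary(coef, std_error, z=1.96):
    """Convert a logit coefficient and its SE to an odds ratio, its SE, and a CI."""
    odds_ratio = math.exp(coef)
    or_std_error = odds_ratio * std_error        # delta-method approximation
    ci_low = math.exp(coef - z * std_error)      # interval built on the log-odds scale
    ci_high = math.exp(coef + z * std_error)
    return odds_ratio, or_std_error, ci_low, ci_high


# Hypothetical pooled coefficient and standard error.
odds_ratio, or_se, low, high = odds_ratio_summary(0.5, 0.2)
print(f"OR = {odds_ratio:.3f}, se = {or_se:.3f}, approx. 95% CI = ({low:.3f}, {high:.3f})")
```

Because the interval is exponentiated from the coefficient scale, it is not symmetric around the odds ratio.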

MI in STATA: Compare estimates

          Complete data only     M=5                    M=10                   M=20
          OR     se     p        OR     se     p        OR     se     p        OR     se     p
smokes    4.54   1.84   0        3.46   1.28   0.001    3.43   1.25   0.001    3.33   1.21   0.001
age       1.031  0.018  0.088    1.034  0.017  0.052    1.033  0.017  0.042    1.031  0.017  0.064
bmi       1.1    0.055  0.047    1.12   0.06   0.035    1.12   0.056  0.024    1.11   0.059  0.061
hsgrad    1.38   0.616  0.469    1.21   0.495  0.647    1.2    0.488  0.66     1.17   0.476  0.696
female    1.32   0.615  0.549    0.92   0.389  0.845    0.91   0.382  0.823    0.917  0.38   0.835

MI in SPSS: Data Set
- CancerHead_DCHowell_SPSS.xls
- Child behavior problems when parent has cancer
- All variables have missing data (value = -9)
- Variables:
  - SexP, SexChild (binary)
  - DeptP, DeptS (continuous)
  - AnxtP, AnxtS (continuous)
  - GSItP, GSItS (continuous)
  - Totbpt (continuous, dependent variable)

MI in SPSS: Set up/review
- Assign missing values for all variables
    MISSING VALUES SexP DeptP AnxtP GSItP DeptS AnxtS GSItS SexChild Totbpt (-9).
- Missing Value Analysis
  - Summary statistics: listwise (non-missing cases) and all cases
  - Missing patterns by variables
  - Menu: Analyze > Missing Values Analysis
  - Syntax: MVA
- Analysis of Missing Value Patterns
  - Menu: Analyze > Multiple Imputation > Analyze Patterns
  - Syntax: MULTIPLE IMPUTATION
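What the two set-up steps accomplish can be shown independently of SPSS: recode the sentinel value -9 to a true missing value, then count how often each missing-data pattern occurs. The sketch below is illustrative only. The variable names are taken from the example data set, but the four rows and the helper names recode_missing and missing_pattern are hypothetical.

```python
from collections import Counter

MISSING_CODE = -9  # sentinel used for missing values in the example data set

# Hypothetical rows; the variable names come from the slides, the values do not.
rows = [
    {"DeptP": 50, "AnxtP": 52, "Totbpt": 64},
    {"DeptP": -9, "AnxtP": 47, "Totbpt": 55},
    {"DeptP": 61, "AnxtP": -9, "Totbpt": -9},
    {"DeptP": -9, "AnxtP": 58, "Totbpt": 49},
]


def recode_missing(row, code=MISSING_CODE):
    """Return a copy of the row with the sentinel code replaced by None."""
    return {name: (None if value == code else value) for name, value in row.items()}


def missing_pattern(row):
    """Return the sorted tuple of variable names that are missing in the row."""
    return tuple(sorted(name for name, value in row.items() if value is None))


clean_rows = [recode_missing(row) for row in rows]
pattern_counts = Counter(missing_pattern(row) for row in clean_rows)

for pattern, count in pattern_counts.most_common():
    label = ", ".join(pattern) if pattern else "complete"
    print(f"{count} case(s) missing: {label}")
```

Recoding first matters: if -9 were left in place, it would be treated as a legitimate score in every summary and in the imputation model.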

MI in SPSS: Imputation Step
- Set seed for imputation (separate from imputation command)
    SET SEED 29390.
- Multiple Imputations
  - Menu: Analyze > Multiple Imputation > Impute Missing Values
  - Syntax:
      MULTIPLE IMPUTATION SexP DeptP AnxtP GSItP DeptS AnxtS GSItS SexChild Totbpt
        /IMPUTE METHOD=AUTO NIMPUTATIONS=5 MAXPCTMISSING=NONE
        /MISSINGSUMMARIES NONE
        /IMPUTATIONSUMMARIES MODELS DESCRIPTIVES
        /OUTFILE IMPUTATIONS=SPSSImputations.
  - Sets up the method, number of imputations, resulting summaries, and the data set in the SPSS session that will contain the imputations (here SPSSImputations; can also save to an SPSS file)

MI in SPSS: Imputation Step
- SPSSImputations includes variable Imputation_
  - 0 represents the original data set
  - 1-5 represent the imputed data sets
- Imputed values are highlighted in the SPSSImputations data set window
- Output shows summary statistics for the original data set, the imputed cases, and all data with imputed values, by imputation

MI in SPSS: Estimation Step
- Select analysis from Analyze menu
  - Can only be used with imputed data if the multiple-imputation icon shows next to the procedure
- Specify imputed data sets to be used in analysis
    DATASET ACTIVATE SPSSImputations.
    REGRESSION
      /DESCRIPTIVES MEAN STDDEV CORR SIG N
      /SELECT=Imputation_ GE 1
      /DEPENDENT Totbpt
      /METHOD=ENTER SexP DeptP AnxtP GSItP DeptS AnxtS GSItS SexChild.
- Shows summary statistics/analysis for original data, each imputation, and pooled estimates

References - SPSS
http://www.uvm.edu/~dhowell/statpages/more_stuff/missing_data/missingdataspss.html - Howell, DC. Multiple Imputation Using SPSS
http://www.gmw.rug.nl/~huisman/md/md5_imputation_2011.pdf - Huisman, M. Missing Data Session 5 Imputation (SPSS)
http://www.appliedmissingdata.com/spss_multiple_imputation.pdf - Enders, CK. Excerpt from Applied Missing Data Analysis (mostly for Mplus, some for SPSS)
http://www.unt.edu/rss/class/jon/spss_sc/module6/spss_m6_2.htm - University of North Texas, University IT. Part of SPSS workshop
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/20.0/en/client/manuals/ibm_spss_missing_values.pdf - SPSS Missing Values Manual for V20

References - STATA
http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt1.htm - UCLA Statistical Computing Seminars part 1 (using mi)
http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt2.htm - UCLA Statistical Computing Seminars part 2 (using chained equations with ice)
http://biostat.mc.vanderbilt.edu/wiki/pub/main/qingxiachen/mi_stata.pdf - Marchenko Y. 2009 UK Stata Users Group Meeting (v 11)
http://www.ssc.wisc.edu/sscc/pubs/stata_mi_intro.htm - Beginning of series of MI topics
STATA manual for Multiple-Imputation, available from Help menu > PDF documentation

**MI_SPSS_20130424.sps in C:\CBThompson\SON\Brown_Bag\Missing_Data_p2_20130424.
**Test MI from DC Howell - Multiple Imputation Using SPSS.
**Copy CancerHead-9.dat to Excel spreadsheet and relabel last few columns to be consistent with documentation.
**Data set contains variables related to child behavior problems among kids who have a parent with cancer.

GET DATA
  /TYPE=XLS
  /FILE='C:\CBThompson\SON\Brown_Bag\Missing_Data_p2_20130424\CancerHead_DCHowell_SPSS.xls'
  /SHEET=name 'Sheet1'
  /CELLRANGE=full
  /READNAMES=on
  /ASSUMEDSTRWIDTH=32767.
EXECUTE.
DATASET NAME DataSet1 WINDOW=FRONT.

**Add descriptors to variables.
VARIABLE LABELS
  SexP "Sex Parent" /
  DeptP "Parent's Depression T score" /
  AnxtP "Parent's Anxiety T score" /
  GSItP "Parent's Global Symptom Index T score" /
  DeptS "Spouse's Depression T score" /
  AnxtS "Spouse's Anxiety T score" /
  GSItS "Spouse's Global Symptom Index T score" /
  SexChild "Sex Child" /
  Totbpt "Total Behavior Problem T score for child".

**Assign missing values to variables.
MISSING VALUES SexP DeptP AnxtP GSItP DeptS AnxtS GSItS SexChild Totbpt (-9).

**Missing Values Analysis.
MVA VARIABLES=DeptP AnxtP GSItP DeptS AnxtS GSItS Totbpt SexP SexChild
  /MAXCAT=25
  /CATEGORICAL=SexP SexChild
  /MISMATCH PERCENT=5
  /TPATTERN PERCENT=1 DESCRIBE=DeptP AnxtP GSItP DeptS AnxtS GSItS Totbpt SexP SexChild
  /LISTWISE.

**Analyze Patterns of Missing Values.
MULTIPLE IMPUTATION SexP DeptP AnxtP GSItP DeptS AnxtS GSItS SexChild Totbpt
  /IMPUTE METHOD=NONE
  /MISSINGSUMMARIES OVERALL VARIABLES (MAXVARS=25 MINPCTMISSING=10) PATTERNS.

**Set Seed.
SET SEED 29390.

**Impute Missing Data Values - 5 imputations.
DATASET DECLARE SPSSImputations.
MULTIPLE IMPUTATION SexP DeptP AnxtP GSItP DeptS AnxtS GSItS SexChild Totbpt
  /IMPUTE METHOD=AUTO NIMPUTATIONS=5 MAXPCTMISSING=NONE
  /MISSINGSUMMARIES NONE
  /IMPUTATIONSUMMARIES MODELS DESCRIPTIVES
  /OUTFILE IMPUTATIONS=SPSSImputations.

**Regression Analysis on each imputation and pooled across-imputation estimates.
DATASET ACTIVATE SPSSImputations.
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /SELECT=Imputation_ GE 1
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI(95) R ANOVA CHANGE
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT Totbpt
  /METHOD=ENTER SexP DeptP AnxtP GSItP DeptS AnxtS GSItS SexChild.

capture log close
log using "C:\CBThompson\SON\Brown_Bag\Missing_Data_p2_20130424\MI_STATA_20130424.log", replace
**MI_STATA_20130424.do in C:\CBThompson\SON\Brown_Bag\Missing_Data_p2_20130424
****Based on STATA version 12*******
clear
set more off

*Sec A - "A really simple example" from STATA MI intro
**Fictional heart attack data set
use http://www.stata-press.com/data/r11/mheart5, clear

**Sec A.1 - data set
describe
misstable summarize

**Sec A.2 - Basic analysis -- excludes missing data
logit attack smokes age bmi hsgrad female
logistic attack smokes age bmi hsgrad female

**Sec A.3 - Set up data set for MI
preserve
mi set mlong
mi register imputed age bmi
mi misstable summarize

**Sec A.4 - set seed for reproducibility or include in mi impute command
***set seed 29390

**Sec A.5 - run imputation model with 10 imputations and check resulting imputed data
**impute with multivariate normal regression
mi impute mvn age bmi = attack smokes hsgrad female, add(10) rseed(29390)
mi describe
mi xeq 0 1 3 6 10: summarize age bmi

**Sec A.6 - run analysis model based on 10 sets of imputed values
mi estimate: logistic attack smokes age bmi hsgrad female
mi estimate, or
restore

**Sec A.7 - run imputation model with 5 imputations and then analysis model
**set seed 29390
preserve
mi set mlong
mi register imputed age bmi
mi impute mvn age bmi = attack smokes hsgrad female, add(5) rseed(29390)
mi estimate: logistic attack smokes age bmi hsgrad female
mi estimate, or
restore

**Sec A.8 - run imputation model with 20 imputations and then analysis model
**set seed 29390
preserve
mi set mlong
mi register imputed age bmi
mi impute mvn age bmi = attack smokes hsgrad female, add(20) rseed(29390)
mi estimate: logistic attack smokes age bmi hsgrad female
mi estimate, or
restore

**Sec B - continuous outcome data set
use http://www.stata-press.com/data/r11/mheart0, clear
generate lnbmi = ln(bmi)
mi set mlong
mi register imputed lnbmi
**impute with linear regression -- relies on normality of model
mi impute regress lnbmi attack smokes age hsgrad female, add(20) rseed(2232)
**bmi will be a function of the imputed lnbmi - thus needs to be registered as passive
mi register passive bmi
quietly mi passive: replace bmi = exp(lnbmi)
mi estimate, dots: logit attack smokes age bmi hsgrad female
mi estimate, or

log close