Missing value imputation in SAS: an intro to Proc MI and MIANALYZE

Victoria SAS Users Group, November 26, 2013
Missing value imputation in SAS: an intro to Proc MI and MIANALYZE
Sylvain Tremblay, SAS Canada Education
Copyright 2010 SAS Institute Inc. All rights reserved.

Thanks for having me in BC!

Missing Values
1978: The objective is to develop procedures that are useful in practice

Agenda
  Why should you care about missing values?
  Exploring missing data patterns
  Understanding missing data mechanisms
  Common imputation strategies
  Multiple Imputation
  References / Conclusion / Questions

Why should you care about missing values?
SAS/STAT procs use Complete Case Analysis (CCA)
  Observations with a missing value for any variable used in the analysis are deleted
Impact of CCA
  Reduction in sample size
  Standard errors and/or parameter estimates may be inadequately estimated
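
The sample-size cost of CCA can be seen on a small example. The following is a minimal Python sketch, not SAS code: the `rows` data set and the `complete_cases` helper are hypothetical and only mimic the listwise deletion described above.

```python
# Hypothetical toy data set; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": None, "income": 48000, "score": 6.4},
    {"age": 51, "income": None, "score": 8.0},
    {"age": 29, "income": 39000, "score": None},
    {"age": 45, "income": 61000, "score": 7.7},
]


def complete_cases(records, variables):
    """Keep only the records with no missing value among the analysis variables."""
    return [r for r in records if all(r[v] is not None for v in variables)]


kept = complete_cases(rows, ["age", "income", "score"])
print(f"{len(kept)} of {len(rows)} observations are left for the analysis")
```

Each variable is missing only one value in five, yet three of the five observations are dropped once all three variables enter the same analysis.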

Exploring missing data patterns
Get to know the data
  Exploratory data analysis
  How much data are missing?
  Are there any patterns in the missing values?
  Are there a lot of missing values for certain variables?
  Is there a group of obs with very little information available?
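
These questions can be answered with two small tabulations. The sketch below is in Python rather than SAS. The data, the helper names `missing_counts` and `missing_patterns`, and the "X" (observed) / "." (missing) notation are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy data set; None marks a missing value.
variables = ["age", "income", "score"]
rows = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": None, "income": 48000, "score": 6.4},
    {"age": 51, "income": None, "score": 8.0},
    {"age": 29, "income": 39000, "score": None},
    {"age": 45, "income": None, "score": None},
    {"age": None, "income": None, "score": None},
]


def missing_counts(records, variables):
    """Number of missing values per variable."""
    return {v: sum(r[v] is None for r in records) for v in variables}


def missing_patterns(records, variables):
    """Count observations per pattern: 'X' = observed, '.' = missing."""
    return Counter(
        "".join("." if r[v] is None else "X" for v in variables) for r in records
    )


counts = missing_counts(rows, variables)
patterns = missing_patterns(rows, variables)
print(counts)
for pattern, n in patterns.most_common():
    print(pattern, n)
```

The per-variable counts show which variables are hit hardest, and a pattern such as `...` flags observations that carry almost no information.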

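Exploring missing data patterns
Two types of missing data pattern
  Monotone: the variables can be ordered so that once a value is missing for an observation, all later variables are missing too
  Arbitrary: any other pattern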
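
Whether a pattern is monotone can be checked mechanically. The sketch below assumes the usual definition: for a given ordering of the variables, once a value is missing in an observation, every later variable is missing as well. The function name `is_monotone` and the data are hypothetical, and the check covers only the ordering it is given, not every possible ordering.

```python
def is_monotone(records, ordered_vars):
    """True if, in every record, no observed value follows a missing one."""
    for record in records:
        seen_missing = False
        for var in ordered_vars:
            if record[var] is None:
                seen_missing = True
            elif seen_missing:
                # An observed value after a missing one breaks monotonicity.
                return False
    return True


# Hypothetical longitudinal data: drop-out produces a monotone pattern.
dropout = [
    {"visit1": 5.1, "visit2": 5.4, "visit3": 5.9},
    {"visit1": 4.8, "visit2": 5.0, "visit3": None},
    {"visit1": 5.5, "visit2": None, "visit3": None},
]

# A skipped middle visit makes the pattern arbitrary.
skipped = dropout + [{"visit1": 5.0, "visit2": None, "visit3": 5.7}]

order = ["visit1", "visit2", "visit3"]
print(is_monotone(dropout, order))
print(is_monotone(skipped, order))
```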

Understanding missing data mechanisms
What is the process that generates the missing values?
  Missing At Random (MAR)
    Given the observed data, the missingness mechanism does not depend on the unobserved data
    Other variables in the dataset (but not the variable itself) can be used to predict missingness on a given variable
    Example: in surveys, men may be more likely than women to decline to answer some questions
  Missing Completely At Random (MCAR)
    Special case of MAR
    The probability of an observation being missing does not depend on observed or unobserved measurements
    Fairly strong assumption, relatively rare
    Example: miscoded values, accidental loss of data
    Under MCAR, the analysis of only those units with complete data (CCA) gives valid inferences
  Missing Not At Random (MNAR)
    When neither MCAR nor MAR holds
    Data that are missing for a specific reason
    The value of the unobserved variable itself predicts missingness
    Example: certain questions on a questionnaire tend to be skipped deliberately by participants with certain characteristics
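
A small simulation can make the difference between the two extreme mechanisms concrete. The Python sketch below is illustrative only: the income variable, the seed, the 30% and 60% missingness rates and the helper names are all assumptions. It contrasts MCAR with one simple MNAR mechanism and leaves MAR out to stay short.

```python
import random
import statistics

rng = random.Random(2013)  # fixed seed so the sketch is reproducible

# Hypothetical variable: 10,000 simulated incomes (in thousands).
incomes = [rng.gauss(50, 10) for _ in range(10_000)]


def observe_mcar(values, p_missing):
    """MCAR: every value has the same chance of being lost."""
    return [v for v in values if rng.random() >= p_missing]


def observe_mnar(values, cutoff, p_missing_above):
    """MNAR: values above the cutoff are more likely to be withheld."""
    return [
        v for v in values
        if not (v > cutoff and rng.random() < p_missing_above)
    ]


true_mean = statistics.fmean(incomes)
mcar_mean = statistics.fmean(observe_mcar(incomes, 0.3))
mnar_mean = statistics.fmean(observe_mnar(incomes, 50, 0.6))

print(f"true mean          {true_mean:.2f}")
print(f"complete-case MCAR {mcar_mean:.2f}")
print(f"complete-case MNAR {mnar_mean:.2f}")
```

Under MCAR the complete-case mean stays close to the true mean, which matches the statement that CCA gives valid inferences in that case. Under this MNAR mechanism it is pulled downward, because high incomes are the ones that go unreported.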

Understanding missing data mechanisms
  Missing at Random (MAR)
    This is equivalent to saying that two units that share observed values have the same statistical behaviour on the other observations, whether observed or not.

Common imputation strategies
Imputation: replace missing values with some other value
  Mean imputation
    Replaces missing values with the sample mean
    Assumes MCAR
    Produces distributions that have far too many cases at the mean
    Reduces the variance of the variable, leading to biased estimates
  Conditional mean imputation
    Uses the mean from cases that are similar to the case with the missing values
    Assumes MAR
  Decision Tree imputation
    Replaces missing values with predicted values from a regression analysis of the complete data
    Shares similar problems with mean substitution

Common imputation strategies
Issues with these simple strategies (mean substitution, conditional mean imputation)
  The imputed values are completely determined by a model applied to the observed data: they contain no error
  This tends to reduce the variance and can distort relationships among variables
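
The variance reduction is easy to verify by hand. The following Python sketch uses a hypothetical variable with three missing values and plain mean imputation. It illustrates the effect and does not reproduce any SAS procedure.

```python
import statistics

# Hypothetical variable with three missing values (None).
values = [4.0, None, 6.0, 8.0, None, 10.0, 12.0, None]

observed = [v for v in values if v is not None]
mean_obs = statistics.fmean(observed)

# Mean imputation: every missing value is replaced by the observed mean.
imputed = [mean_obs if v is None else v for v in values]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(f"mean before/after:     {mean_obs:.2f} / {statistics.fmean(imputed):.2f}")
print(f"variance before/after: {var_observed:.2f} / {var_imputed:.2f}")
```

The mean is unchanged, but the sample variance shrinks: the imputed values add observations without adding any spread, so the sum of squared deviations stays the same while the denominator grows.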

Multiple Imputation
Three-step process
  1. Create a series of m imputed data sets by running an imputation model based on chosen variables and an imputation method
  2. Carry out the analysis model on each of the imputed data sets
  3. Combine the parameter estimates from each imputed data set to get a final single set of parameter estimates
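
Step 3 is commonly carried out with Rubin's combining rules: the pooled estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The Python sketch below applies those rules to one parameter. The function name `pool_rubin` and the five estimates are hypothetical, and the degrees-of-freedom calculation needed for confidence intervals is left out.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed data sets with Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)                     # pooled estimate
    within = statistics.fmean([se ** 2 for se in std_errors])  # average variance
    between = statistics.variance(estimates)                # spread across imputations
    total = within + (1 + 1 / m) * between
    return q_bar, total, math.sqrt(total)


# Hypothetical results of the same regression run on m = 5 imputed data sets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
std_errors = [0.2, 0.2, 0.2, 0.2, 0.2]

pooled, total_var, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate {pooled:.3f}, pooled standard error {pooled_se:.3f}")
```

The pooled standard error is larger than the 0.2 reported within each data set, because the between-imputation term carries the extra uncertainty due to the missing values.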

Multiple Imputation
Selecting the number of imputations (m)
  Historically, m was between 3 and 5
  Now (because of computing power), m should be
    between 5 and 20 for low fractions of missing information
    as large as 50 (or more) when the proportion of missing data is relatively high
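
One way to reason about m is Rubin's relative efficiency of an estimate based on m imputations compared with infinitely many, 1 / (1 + λ/m), where λ is the fraction of missing information. The Python sketch below tabulates it for a few values. The function name and the chosen values of λ and m are assumptions for illustration.

```python
def relative_efficiency(fraction_missing_info, m):
    """Rubin's relative efficiency of m imputations versus infinitely many."""
    return 1.0 / (1.0 + fraction_missing_info / m)


# Hypothetical fractions of missing information and numbers of imputations.
for lam in (0.1, 0.3, 0.5):
    row = ", ".join(
        f"m={m}: {relative_efficiency(lam, m):.3f}" for m in (3, 5, 20, 50)
    )
    print(f"lambda={lam}: {row}")
```

With little missing information a handful of imputations is already nearly fully efficient, while a high fraction of missing information benefits more from a larger m. Efficiency of the point estimate is only one criterion, which is part of why larger values of m are now suggested.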

Multiple Imputation - Proc MI
Selected statements
  m = number of imputations
Imputation methods
  Markov Chain Monte Carlo (MCMC): generates pseudorandom draws from multidimensional probability distributions via Markov chains
    Assumptions: arbitrary missing pattern, multivariate normal distribution
  Assumptions: monotone missing pattern

Multiple Imputation (MI)
In choosing the variables for the VAR statement, you should include
  Variables you want to impute
  Variables that are potentially related to the imputed variables
  Variables that are potentially related to the missingness of the imputed variables

Conclusion
  You should care about missing values!
  Explore missing data patterns
  Understand the missing data mechanism
  Select an imputation method that takes into consideration the missing data pattern
  If your dataset is too large for MI, an alternative is maximum likelihood estimation

Multiple Imputation (MI)
  MAR is the primary assumption of MI methods
  There is no standard statistical test to determine whether missing data are MAR
  MI is superior to single imputation (mean imputation, conditional mean imputation) because it takes into account the uncertainty about the true values of the missing data

References
  Multiple Imputation in SAS
    http://www.ats.ucla.edu/stat/sas/seminars/missing_data/part1.htm
  Multiple Imputation for Missing Data: Concepts and New Development
    http://www2.sas.com/proceedings/sugi25/25/st/25p267.pdf
  Knowledge (of your missing data) is power: handling missing values in your SAS dataset
    http://support.sas.com/resources/papers/proceedings12/319-2012.pdf

Questions?
THANK YOU!
Sylvain.Tremblay@sas.com