Imputation of multivariate continuous data with non-ignorable missingness


Imputation of multivariate continuous data with non-ignorable missingness Thais Paiva Jerry Reiter Department of Statistical Science Duke University NCRN Meeting Spring 2014 May 23, 2014 Thais Paiva, Jerry Reiter Imputation with non-ignorable missingness May 23, 2014 1 / 29

Outline
1. Introduction
2. Methodology
3. Simulation Study
4. Real Data application
5. Conclusions

Motivation: Adaptive Design
In an ongoing survey, decide whether to:
1. stop the data collection, or
2. invest in collecting more data.

Adaptive Design
1. If we decide to stop, impute the missing data based on the observed data.
- Respondents: data $D_R$, sample size $n_R$
- Non-respondents: data $D_{NR}$, sample size $n_{NR}$ (imputed)
- Total: $N = n_R + n_{NR}$

Adaptive Design
2. If we decide to continue, collect an extra wave and impute the remaining missing data.
- Respondents: data $D_R$, sample size $n_R$
- Follow-up sample: data $D_{FUS}$, sample size $n_{FUS}$
- Non-respondents: data $D_{NR}$, sample size $n_{NR}$ (imputed)
- Total: $N = n_R + n_{FUS} + n_{NR}$

Decision rule
How to decide whether to stop?
- Information measure: how different is the non-respondents' distribution from the respondents'?
- Cost measure: how much does it cost to collect more data, and what is the budget?

Missing Not At Random
Information measure: how different is the non-respondents' distribution from the respondents'?
We need to consider the hypothesis that the non-respondents are Missing Not At Random (MNAR).

Imputation under MNAR
Assume that the non-respondents have a different distribution than the respondents.
For now, we consider only unit non-response, but the method could be adapted to handle item non-response as well.

Methodology
Model for the observed data
- Continuous multivariate data
- The variables are likely correlated, with heavily skewed distributions
- The model has to be flexible enough to capture the distributional features of the data

Methodology
Model for the observed data
- Mixture of multivariate normal distributions
- Dirichlet Process prior to allow for more flexibility and better density estimation (Ishwaran and James, 2001)

Dirichlet Process Mixture Model
- $Y_n = \{y_1, \ldots, y_n\}$: $n$ complete $p$-dimensional observations. Assume each variable is standardized.
- $z_i \in \{1, \ldots, K\}$: component indicator of the $i$-th observation, with probability $\pi_k = P(z_i = k)$
- Each component $k$ follows a MVN distribution $N(\mu_k, \Sigma_k)$
Mixture model:
$y_i \mid z_i, \mu, \Sigma \sim N(y_i \mid \mu_{z_i}, \Sigma_{z_i})$
$z_i \mid \pi \sim \mathrm{Multinomial}(\pi_1, \ldots, \pi_K)$
Marginal mixture model:
$p(y_i \mid \mu, \Sigma, \pi) = \sum_{k=1}^{K} \pi_k \, N(y_i \mid \mu_k, \Sigma_k)$
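The two-stage mixture model and its marginal density can be illustrated with a minimal NumPy sketch. The function names, the two-component parameter values, and the seed are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def sample_mixture(rng, weights, means, covs, n):
    """Draw z_i ~ Multinomial(pi), then y_i | z_i ~ N(mu_{z_i}, Sigma_{z_i})."""
    weights = np.asarray(weights, dtype=float)
    z = rng.choice(len(weights), size=n, p=weights)
    y = np.empty((n, means.shape[1]))
    for k in range(len(weights)):
        idx = np.flatnonzero(z == k)
        if idx.size:
            y[idx] = rng.multivariate_normal(means[k], covs[k], size=idx.size)
    return y, z

def mvn_pdf(y, mean, cov):
    """Multivariate normal density N(y | mean, cov), one value per row of y."""
    p = mean.shape[0]
    diff = y - mean
    sol = np.linalg.solve(cov, diff.T).T
    quad = np.einsum("ij,ij->i", diff, sol)
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(-0.5 * (quad + logdet + p * np.log(2 * np.pi)))

def mixture_pdf(y, weights, means, covs):
    """Marginal density: sum_k pi_k N(y | mu_k, Sigma_k)."""
    return sum(w * mvn_pdf(y, m, c) for w, m, c in zip(weights, means, covs))

# Hypothetical two-component example in p = 2 dimensions.
rng = np.random.default_rng(0)
weights = np.array([0.7, 0.3])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2)])
y, z = sample_mixture(rng, weights, means, covs, n=200)
density = mixture_pdf(y, weights, means, covs)
```

Here `K` is fixed at 2 for readability; in the model above, `K` is the truncation level of the Dirichlet Process mixture.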

Prior specification
With conjugate priors (Kim et al., 2014), the posterior samples can be obtained using a Gibbs sampler (Ishwaran and James, 2001).
Components:
$\mu_k \mid \Sigma_k \sim N(\mu_0, h^{-1} \Sigma_k)$, with $\mu_0 = 0$ and $h = 1$
$\Sigma_k \sim \mathrm{IW}(f, \Phi)$, with degrees of freedom $f = p + 1$
$\Phi = \mathrm{diag}(\phi_1, \ldots, \phi_p)$, with $\phi_j \sim \mathrm{Gamma}(a_\phi, b_\phi)$ and $a_\phi = b_\phi = 0.25$
Stick-breaking representation for the weights:
$\pi_k = v_k \prod_{g<k} (1 - v_g)$ for $k = 1, \ldots, K$
$v_k \sim \mathrm{Beta}(1, \alpha)$ for $k = 1, \ldots, K-1$; $v_K = 1$
$\alpha \sim \mathrm{Gamma}(a_\alpha, b_\alpha)$, with $a_\alpha = b_\alpha = 0.25$
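The truncated stick-breaking construction of the weights can be sketched as follows. This is a minimal illustration, assuming a fixed concentration `alpha = 1.0` and truncation level `K = 10` (both hypothetical choices); in the model above, $\alpha$ has its own Gamma prior.

```python
import numpy as np

def stick_breaking_weights(rng, alpha, K):
    """pi_k = v_k * prod_{g<k} (1 - v_g), v_k ~ Beta(1, alpha), v_K = 1."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # truncation: the last stick takes all remaining mass
    # remaining[k] = prod_{g<k} (1 - v_g), with an empty product of 1 for k = 0
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)
pi = stick_breaking_weights(rng, alpha=1.0, K=10)
```

Setting $v_K = 1$ is what makes the `K` weights sum to one.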

Imputation under MNAR
MAR: generate imputed data from the posterior predictive distribution.
The mixture model ($\mu$, $\Sigma$, $\pi$) is fitted to the respondents' data $D_R$ and used unchanged to impute the non-respondents' data $D_{NR}$.

Imputation under MNAR
MNAR: generate imputed data from the altered posterior predictive distribution.
The mixture model ($\mu$, $\Sigma$) is fitted to the respondents' data $D_R$, but the weights $\pi$ are replaced by new weights that reflect a hypothesis for the non-respondents' pattern. The altered model is then used to impute the non-respondents' data $D_{NR}$.
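A minimal sketch of the difference between the two imputation schemes, assuming a single posterior draw of the component parameters is already available. The parameter values, the new weights, and the function name `impute_nonrespondents` are hypothetical; only the weights change between the MAR and MNAR calls.

```python
import numpy as np

def impute_nonrespondents(rng, means, covs, weights, n_nr):
    """Draw n_nr imputed records from a normal mixture with the given weights."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # guard against rounding in user input
    z = rng.choice(len(weights), size=n_nr, p=weights)
    out = np.empty((n_nr, means.shape[1]))
    for i, k in enumerate(z):
        out[i] = rng.multivariate_normal(means[k], covs[k])
    return out

rng = np.random.default_rng(1)
# Hypothetical posterior draw: a central cluster and a cluster in the tail.
means = np.array([[0.0, 0.0], [3.0, 3.0]])
covs = np.array([0.25 * np.eye(2), 0.25 * np.eye(2)])
pi_post = np.array([0.9, 0.1])  # weights estimated from respondents (MAR)
pi_new = np.array([0.2, 0.8])   # assumed weights for non-respondents (MNAR)

imp_mar = impute_nonrespondents(rng, means, covs, pi_post, 500)
imp_mnar = impute_nonrespondents(rng, means, covs, pi_new, 500)
```

With these assumed weights, the MNAR imputations concentrate in the tail cluster, while the MAR imputations mirror the respondents.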

Ranking the components
If MNAR is being considered, it is likely that the missing data will have more extreme values than the observed data. We need to give more weight to the clusters in the tails.
Rank the components based on the distance of their means to the origin, $\mu^\top \mu$:
- post-simulation
- only non-empty components are considered
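The ranking step can be sketched as below, assuming one posterior draw of the component means and the cluster allocations. The function name, the example values, and the choice to sort from closest to farthest are illustrative assumptions.

```python
import numpy as np

def rank_components(means, z):
    """Return indices of non-empty components, ordered by distance to the origin."""
    K = means.shape[0]
    counts = np.bincount(z, minlength=K)
    nonempty = np.flatnonzero(counts > 0)
    dist = np.linalg.norm(means[nonempty], axis=1)  # sqrt(mu' mu)
    return nonempty[np.argsort(dist, kind="stable")]

# Hypothetical draw with K = 4 components; component 2 has no observations.
means = np.array([[0.1, 0.0], [2.0, 2.0], [5.0, 5.0], [-1.0, 0.5]])
z = np.array([0, 0, 1, 3, 3, 1])
order = rank_components(means, z)
```

The last entries of `order` are the components farthest in the tails, which are the candidates for larger weights under an MNAR hypothesis. Since the variables are standardized, the origin is the center of the observed data.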

Changing the mixture weights
Many ways to choose the new weights $\pi$:
- set to fixed values;
- rescale based on the posterior samples;
- sample from a random distribution;
- incorporate information from auxiliary variables, etc.
With a moderate number of components: fix the new values for $\pi$.
As the number of components increases, this becomes harder: choose a subset of components.

Selecting posterior samples
Multiple Imputation: select $m$ samples from the MCMC iterations.
- If the cluster allocations are similar across the $m$ samples, specify overall probabilities and proceed with standard MI methods.
- Otherwise, summarize the samples by selecting the sample that has the largest posterior value (Fraley and Raftery, 2007).

Simulation Study
Toy example: the true complete data distribution can be recovered if the missing data mechanism is known.
Repeat 500 times:
1. Generate complete data (observed and missing)
2. Fit the mixture model to the observed data
3. Set $\pi$ to the true missing proportions
4. Generate $m = 5$ imputed data sets under MAR and MNAR
[Figure: scatter plot of $y_1$ versus $y_2$, distinguishing observed and missing points]

Toy example
Inference on:
- complete: complete original data sets (no missing data)
- observed: original data sets with just the observed responses
- MNAR: observed + multiply imputed data sets under MNAR (combining rules from Reiter (2003))
- MAR: observed + multiply imputed data sets under MAR (combining rules from Reiter (2003))
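The slides do not spell out the combining rules. The sketch below uses the rules as they are commonly stated for partially synthetic data following Reiter (2003): point estimate $\bar{q}_m$, variance $T_p = \bar{u}_m + b_m / m$, and degrees of freedom $\nu_p = (m - 1)(1 + \bar{u}_m / (b_m / m))^2$. Treat this as an assumption about what was applied, and the function name and input values as hypothetical.

```python
import numpy as np

def combine_partially_synthetic(q, u):
    """Combine m point estimates q and their variance estimates u.

    Returns (q_bar, T_p, df) with T_p = u_bar + b / m.
    """
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    m = q.shape[0]
    q_bar = q.mean()
    b = q.var(ddof=1)   # between-imputation variance
    u_bar = u.mean()    # average within-imputation variance
    T_p = u_bar + b / m
    # Degrees of freedom of the reference t distribution; infinite if b == 0.
    df = (m - 1) * (1.0 + u_bar / (b / m)) ** 2 if b > 0 else np.inf
    return q_bar, T_p, df

# Hypothetical estimates from m = 3 imputed data sets.
q_bar, T_p, df = combine_partially_synthetic([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

An interval estimate would then use a t distribution with `df` degrees of freedom around `q_bar`, with standard error `sqrt(T_p)`.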

Toy example
Inference on:
- Marginal means
- Linear regression coefficients
Coverage rates:
            $\bar{y}_1$   $\bar{y}_2$   $\hat{\beta}_0$   $\hat{\beta}_1$
complete    0.96          0.96          0.97              0.92
MNAR        0.99          0.99          0.96              0.89
observed    0.00          0.00          0.97              0.95
MAR         0.00          0.00          0.93              0.93

[Figure: five panels (Truth, Complete, Observed, MNAR, MAR) plotted over $y_1$ and $y_2$, with a color scale from 0.000 to 0.014]

Real Data application
Colombian Annual Manufacturing Survey in 1991 (N = 6609)
Variables: RVA (real value-added), RMU (real material used in products) and CAP (capital in real terms).
Missing data indicator: $R_i \sim \mathrm{Bern}(\theta_i)$, where $\theta_i = \mathrm{logit}^{-1}(\beta_0 + \beta_1 Y_i)$.
$\beta_0$ and $\beta_1$ are fixed such that the plants with larger quantities are more likely not to respond.
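The missingness mechanism above can be simulated with a short sketch. Several things here are assumptions, since the slides do not state them: $R_i = 1$ is taken to mean "missing", $Y_i$ is treated as a single standardized variable, and the values `beta0 = -1.0` and `beta1 = 1.5` are arbitrary.

```python
import numpy as np

def inv_logit(x):
    """Inverse logit: maps the real line to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def simulate_missingness(rng, y, beta0, beta1):
    """Draw R_i ~ Bern(theta_i) with theta_i = inv_logit(beta0 + beta1 * y_i)."""
    theta = inv_logit(beta0 + beta1 * y)
    r = (rng.random(y.shape[0]) < theta).astype(int)  # 1 = missing (assumed)
    return r, theta

rng = np.random.default_rng(2)
y = rng.normal(size=5000)  # stand-in for a standardized survey variable
r, theta = simulate_missingness(rng, y, beta0=-1.0, beta1=1.5)
```

With a positive `beta1`, units with larger values of `y` are more likely to be flagged as missing, so the observed units under-represent the upper tail.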

The values are log transformed and positively standardized.

Clusters from the iteration with maximum posterior with default priors

Imputed data from the top cluster only

Results with the default prior are not flexible enough.
Change the prior (fix the covariance matrices to enforce smaller clusters).


Conclusions
Imputation under MNAR:
- Flexible model that is able to capture different features of the data
- Under MNAR, the missing data distribution is unknown. The method works for different levels of prior information
- Interface to facilitate Sensitivity Analysis
Next steps: Adaptive Design
- Information measure: based on propensity scores to compare data sets imputed under different scenarios
- Cost function and Stopping rule

Thank you! tvp@stat.duke.edu

References
Fraley, C. and Raftery, A. E. (2007). Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification, 24(2):155-181.
Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453).
Kim, H. J., Reiter, J. P., Wang, Q., Cox, L. H., and Karr, A. F. (2014). Multiple imputation of missing or faulty values under linear constraints. (forthcoming) Journal of Business and Economic Statistics.
Reiter, J. P. (2003). Inference for partially synthetic, public use microdata sets. Survey Methodology, 29(2):181-188.