RELATIVE EFFICIENCY OF ESTIMATES BASED ON PERCENTAGES OF MISSINGNESS USING THREE IMPUTATION NUMBERS IN MULTIPLE IMPUTATION ANALYSIS

Nwakuya, M. T. (Ph.D)
Department of Mathematics/Statistics, University of Port Harcourt, P.M.B 5323, Port Harcourt, Rivers State, NIGERIA

Nwabueze Joy C.
Department of Statistics, Michael Okpara University of Agriculture, Umudike, P.M.B. 7267, Abia State, NIGERIA

ABSTRACT

Most researchers have faced the problem of estimation when data points are missing. They mostly adopt easy-to-implement procedures without considering the efficiency of their estimates. In this paper we looked at the relative efficiency of estimates in Multiple Imputation analysis, based on percentages of missing data, using 3 different imputation numbers (7, 5 and 3) on simulated data sets with 50%, 40%, 25% and 10% missing values. The variance for each percentage of missing values and each imputation number was computed using a proposed method. This proposed method was seen to yield lower variances compared to an existing method. The program was written and implemented in R. The pooled variance of the estimates was also computed for each percentage of missing values. The relative efficiencies were computed and compared among the 3 imputation numbers, and the total variances were compared using the paired-samples t-test in SPSS. From the results it was observed that when the missingness was 50%, the estimates from data sets imputed with imputation number 7 were most efficient when compared to estimates from data sets imputed with imputation numbers 5 and 3. When the missingness was 10% and 25%, the estimates from imputation number 5 were found to be most efficient, followed by estimates from imputation number 7 and then 3. The relative efficiency for 40% missingness compared among the 3 imputation numbers showed that estimates from imputation number 3 were most efficient.
Keywords: Multiple Imputation, Imputation Variance, Missing Values and Shrinkage Parameter.

INTRODUCTION

Missing data are defined as data values that should have been recorded but, for some reason, were not, Molenberghs G. and Verbeke G. (2005). Most researchers have faced the problem of missing quantitative data at some point in their work. Missing data is a potential source of bias in every analysis, according to the European Agency for the Evaluation of Medicinal Products (2001). Missing data leave us with the decision of how to analyse data when we do not have complete information from all informants. When information is missing in a sample, some researchers employ any easy-to-administer method without checking the efficiency of their estimates. This paper considers the relative efficiency of estimates from data imputed using 3 different imputation numbers in a multiple imputation analysis. We focus on data sets with different percentages of missing values. Multiple Imputation is a principled missing data method that provides valid statistical inferences under the Missing at Random condition, Rubin (1978), Tanner and Wong (1987), Rubin and Schenker (1986) and Schafer (1997). We applied a proposed shrinkage estimator in this analysis that yielded lower variances compared to ordinary least squares estimates. In this paper the missing data pattern applied is the multivariate non-monotone missing pattern; this is a situation where data points are missing randomly from more than one variable.

Progressive Academic Publishing, UK Page 63 www.idpublications.org

LITERATURE REVIEW

Missing data concept

There are three main missing data mechanisms described by Rubin (1976), namely: Missing Completely At Random (MCAR), when the probability of an observation being missing is independent of the responses; Missing At Random (MAR), a condition in which the probability that data are missing depends only on the observed values and not on the missing values, after controlling for the observed values; and Missing Not At Random (MNAR), where the probability of a measurement being missing depends on unobserved data. Dong and Peng (2013) stated that there are three patterns of missing data, namely: univariate, monotone and non-monotone (arbitrary) missing patterns. Suppose there are m variables denoted as $Y_1, Y_2, \ldots, Y_m$. A data set is said to have a univariate missing pattern if the missing data are in only one of the m variables; if they are in more than one variable, it is a multivariate missing pattern. A data set is said to have a monotone missing data pattern if the variables can be arranged in such a way that, when $Y_j$ is missing, $Y_{j+1}, \ldots, Y_m$ are missing as well. A non-monotone missing data pattern occurs when more than one of the m variables has missing data points in a random manner. Many researchers use ad hoc methods such as complete case analysis, available case analysis (pairwise deletion), or single-value imputation. Though these methods are easily implemented, they require assumptions about the data that rarely hold in practice, T. D. Pigott (2001).

Multiple Imputation

According to Rubin (1987), Multiple Imputation analysis involves three stages, namely: the missing values are filled in M times to generate M complete data sets; the M complete data sets are analyzed by using standard procedures; and the results from the M analyses are combined into a single inference. According to Carpenter J. R. and Kenward M. G. (2013), and also Van Buuren (2012), in order to reduce the effect of the simulation error we need to increase M (the number of imputations).

Estimators

Tony Ke (2012) gave an insight on measuring the goodness of an estimator. He said that, intuitively, an estimator is good if it is close to the unknown parameter of interest, that is, if the estimation error is small. In the context of estimating regression coefficients, Stein (1956) proposed a shrinkage estimator that dominates the ordinary least squares estimator. Anchoring on Stein's discovery, Ohtani (2009) compared shrinkage estimators and the OLS estimator for regression coefficients. Lebanon G. (2006) stated that the quality of two estimators can be compared by looking at the ratio of their MSEs; if the two estimators are unbiased, this is equivalent to the ratio of their variances, which is defined as the relative efficiency.

METHODOLOGY

Our motivation stems from the use of high imputation numbers in order to reduce the effect of simulation error in multiple imputation analysis, as proposed by Carpenter J. R. and Kenward M. G. (2013), and also from the regression coefficient estimator with a shrinkage parameter proposed by Ohtani K. (2009). We essentially restrict our data distribution to be normally distributed with multivariate non-monotone missingness.

Proposed method

The regression coefficient estimator proposed by Ohtani K. (2009) is given by

... (1)

Our proposed shrinkage estimator is obtained by introducing a parameter into equation (1).

Procedure

A program was written in R to implement this new approach. Data sets of five sample sizes, n = 30, 500, 1000, 5000 and 10000, were simulated, each with 10%, 25%, 40% and 50% missing values. The missing data points were imputed using imputation numbers 3, 5 and 7 for each sample size. The proposed estimator was applied in Multiple Imputation analysis to obtain the total imputation variances, which were lower than the ones from ordinary least squares estimates. We then applied the relative efficiency given by Lebanon G. (2006):

$$RE(\hat{\theta}_1, \hat{\theta}_2) = \frac{Var(\hat{\theta}_1)}{Var(\hat{\theta}_2)} \qquad (2)$$

where, if $RE(\hat{\theta}_1, \hat{\theta}_2) < 1$, then $\hat{\theta}_1$ has a lower variance and is thus more efficient than $\hat{\theta}_2$. The pooled variance is given by

$$S_p^2 = \frac{\sum_{i=1}^{k} (n_i - 1) S_i^2}{\sum_{i=1}^{k} (n_i - 1)} \qquad (3)$$

given that $n_i$ are the sample sizes, k = 5 (the number of sample sizes) and $S_i^2$ are the individual variances. We used the t-test for comparison of paired means in SPSS to compare the variances obtained from data sets imputed using the three imputation numbers.
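As a check on equation (3), the short Python sketch below pools the five proposed-method variances for imputation number 7 at 50% missingness taken from Table 3.1. It rests on two assumptions: that each variance is weighted by its degrees of freedom (n_i - 1), and that the five variances belong to sample sizes 10000, 5000, 1000, 500 and 30 in that order. Both are adopted here because together they reproduce the corresponding entry of Table 3.3. The function name `pooled_variance` is illustrative only; the authors' R program is not shown in the paper.

```python
def pooled_variance(sample_sizes, variances):
    """Pool k variances, weighting each by its degrees of freedom (n_i - 1)."""
    if len(sample_sizes) != len(variances):
        raise ValueError("need one variance per sample size")
    weights = [n - 1 for n in sample_sizes]
    return sum(w * v for w, v in zip(weights, variances)) / sum(weights)


# Assumed pairing of sample sizes with the Table 3.1 entries
# (proposed method, imputation number 7, 50% missingness).
sample_sizes = [10000, 5000, 1000, 500, 30]
variances_imp7_50 = [40190, 61293.45, 272058.4, 814832.6, 57207.16]

pooled_imp7_50 = pooled_variance(sample_sizes, variances_imp7_50)
print(round(pooled_imp7_50, 3))
```

Under these assumptions the result agrees with the pooled variance of 84,012.864 reported in Table 3.3 for imputation number 7 at 50% missingness.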

RESULTS

Table 3.1: Total imputation variances for each imputation number

                        Total variances from the proposed method   Total variances from the ordinary least squares method
Sample size  Missing    Imp. no. 7   Imp. no. 5   Imp. no. 3       Imp. no. 7   Imp. no. 5   Imp. no. 3
10000        50%        40190        43386.7      71266.93         40192.82     43389.95     71274.14
10000        40%        27242.11     23131.12     21023.61         27243.12     23131.57     21023.76
10000        25%        27054.68     24217.74     21881.38         27055.65     24218.32     21881.62
10000        10%        22490.76     22373.06     24115.34         22491.04     22373.32     24115.85
5000         50%        61293.45     68699.28     48008.12         61298.89     68699.28     48010.02
5000         40%        63499.24     55041.41     55699.91         63505.37     55045.45     55703.77
5000         25%        50019.25     50023.33     72463.67         50021.57     50025.71     72471.63
5000         10%        47941.48     45748.96     46393.13         47943.29     45750.1      46394.47
1000         50%        272058.4     300450.8     303103.9         272146.8     300578.2     303234.4
1000         40%        258556.4     290739.4     207539.1         258653.1     290889.3     207560.2
1000         25%        236626.8     251912       274405.7         236689.5     252001.3     274529.7
1000         10%        232796.9     231889.7     235831.4         232836.9     231928.5     235876.7
500          50%        814832.6     697774.5     465811           815943.8     698587.1     465820.6
500          40%        453141.4     476997.8     438336.6         453420       477322.4     438514.2
500          25%        409747.6     425971.1     365321.6         409887.2     426139.8     365329.1
500          10%        383069       378149.5     375767.8         3831012      378162.1     375775.4
30           50%        57207.16     52034.29     99600.58         58134.3      52985.4      101515.5
30           40%        12770.14     11665.19     19530.32         12779.86     11700.43     19565.53
30           25%        3288.36      3645.03      5580.666         3333.9       3704.12      5701.69
30           10%        1947.402     1991.24      1937.624         1947.3       1991.24      1937.59

Table 3.2: Comparison of the total imputation variances among the 3 imputation numbers (paired samples t-test)

                     Paired differences
Pair                 Mean        Std. Deviation  Std. Error Mean  95% CI Lower  95% CI Upper  t      df  Sig. (2-tailed)
VarImp7 - VarImp3    16107.738   81918.309       18317.491        -22231.2110   54446.686     .879   19  .390
VarImp5 - VarImp3    15111.189   58902.509       13171.002        -12456.034    42678.411     1.147  19  .265
VarImp7 - VarImp5    996.54910   29744.597       6651.0941        -12924.3519   14917.449     .150   19  .882
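The t statistics and confidence limits in Table 3.2 can be recomputed from the reported means and standard errors. The following is a minimal Python sketch, assuming the usual paired-samples formulas (t = mean difference / standard error; limits = mean difference ± critical value × standard error) and taking 2.093024 as the two-sided 95% critical value of Student's t with 19 degrees of freedom. The function name `paired_t_summary` is hypothetical, and this is a hand check rather than the SPSS procedure itself.

```python
T_CRIT_95_DF19 = 2.093024  # two-sided 95% critical value of Student's t, df = 19


def paired_t_summary(mean_diff, std_error, t_crit=T_CRIT_95_DF19):
    """Return (t statistic, lower limit, upper limit) for a paired difference."""
    t_stat = mean_diff / std_error
    half_width = t_crit * std_error
    return t_stat, mean_diff - half_width, mean_diff + half_width


# Mean difference and standard error of the mean, as reported in Table 3.2.
pairs = {
    "VarImp7 - VarImp3": (16107.738, 18317.491),
    "VarImp5 - VarImp3": (15111.189, 13171.002),
    "VarImp7 - VarImp5": (996.54910, 6651.0941),
}

results = {name: paired_t_summary(m, se) for name, (m, se) in pairs.items()}
for name, (t_stat, lower, upper) in results.items():
    print(f"{name}: t = {t_stat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

All three intervals contain zero, which is consistent with the reported two-tailed significance values (.390, .265 and .882) all exceeding 0.05.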

Table 3.3: Pooled variances for all percentages of missingness

Imputation number  50% missingness  40% missingness  25% missingness  10% missingness
7                  84,012.864       65,029.488       58,185.5107      53,755.9199
5                  86,360.069       63,010.16        57,884.7281      52,818.1006
3                  90,209.956       55,388.079       62,791.3112      54,233.491

Table 3.4: Relative efficiency for 50% missingness

VarImp7 & VarImp5 = 0.9728
VarImp7 & VarImp3 = 0.9313
VarImp5 & VarImp3 = 0.9573

Table 3.5: Relative efficiency for 40% missingness

VarImp7 & VarImp5 = 1.0321
VarImp7 & VarImp3 = 1.1741
VarImp5 & VarImp3 = 1.1376

Table 3.6: Relative efficiency for 25% missingness

VarImp7 & VarImp5 = 1.005
VarImp7 & VarImp3 = 0.9267
VarImp5 & VarImp3 = 0.9219

Table 3.7: Relative efficiency for 10% missingness

VarImp7 & VarImp5 = 1.0177
VarImp7 & VarImp3 = 0.9912
VarImp5 & VarImp3 = 0.9739
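The entries of Tables 3.4 to 3.7 follow from Table 3.3 by taking ratios of pooled variances, in the sense of equation (2). The Python sketch below recomputes them and picks the imputation number with the smallest pooled variance at each level of missingness. It assumes that each ratio is the pooled variance of the first-named imputation number divided by that of the second, so that a value below 1 favours the first; the names `pooled` and `relative_efficiency` are illustrative and not taken from the authors' program.

```python
from itertools import combinations

# Pooled variances from Table 3.3: pooled[missingness %][imputation number].
pooled = {
    50: {7: 84012.864, 5: 86360.069, 3: 90209.956},
    40: {7: 65029.488, 5: 63010.16, 3: 55388.079},
    25: {7: 58185.5107, 5: 57884.7281, 3: 62791.3112},
    10: {7: 53755.9199, 5: 52818.1006, 3: 54233.491},
}


def relative_efficiency(var_first, var_second):
    """Ratio of two variances; below 1 means the first estimator is more efficient."""
    return var_first / var_second


best = {}
for missing, by_m in pooled.items():
    # Pairs are generated in the order (7, 5), (7, 3), (5, 3).
    for m1, m2 in combinations([7, 5, 3], 2):
        ratio = relative_efficiency(by_m[m1], by_m[m2])
        print(f"{missing}% missing: VarImp{m1} & VarImp{m2} = {ratio:.4f}")
    # The most efficient imputation number has the smallest pooled variance.
    best[missing] = min(by_m, key=by_m.get)

print(best)
```

The resulting ranking matches the discussion: imputation number 7 at 50% missingness, 3 at 40%, and 5 at both 25% and 10%.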

DISCUSSION

We begin with the imputation variances. Looking at Table 3.1, we observe that the new imputation variance from our proposed method is lower than that from the ordinary least squares method. From the paired t-test in Table 3.2, we found no significant difference among the new total variances for the three imputation numbers. This goes to show that the reduction in the total variance was not due to an increase in the number of imputations but can be attributed to the improved method, irrespective of the number of imputations. From the relative efficiency results it was observed that when the missingness was 50%, the estimates from data sets imputed with imputation number 7 were most efficient when compared to estimates from data sets imputed with imputation numbers 5 and 3. When the missingness was 10% and 25%, the estimates from imputation number 5 were found to be most efficient, followed by estimates from imputation number 7 and then 3. The relative efficiency for 40% missingness compared among the 3 imputation numbers showed that estimates from imputation number 3 were most efficient.

CONCLUSIONS

In conclusion, our proposed method generally produced lower variances compared to the ordinary least squares method, and we observed that this reduction is not due to any increase in the number of imputations but to the new approach. We found that for large sample sizes with moderate missing values, imputation number 7 was most appropriate for achieving efficient estimates, while for low missing values imputation numbers 5 and 3 can be used.

REFERENCES

Carpenter J. R. and Kenward M. G. (2013), Multiple Imputation and its Application, John Wiley and Sons, Ltd., 37-73.
Dong Y. and Peng C. J. (2013), Principled Missing Data Methods for Researchers, SpringerPlus, 2:222. http://www.springerplus.com/content/2/1/222
European Agency for the Evaluation of Medicinal Products (2001), Evaluation of Medicines for Human Use. www.ema.europa.eu/ema/pages/includes/documents/open_document.jsp?...
Lebanon G. (2006), Efficiency and the Fisher Information, www.cc.gatech.edu/~lebanon/notes/efficiency.pdf
Molenberghs G. and Verbeke G. (2005), Models for Discrete Longitudinal Data, Springer-Verlag, NY, 567-578.
Ohtani K. (2009), Comparison of some shrinkage estimators and OLS estimator for regression coefficients under the Pitman nearness criterion: A Monte Carlo Study, Kobe University Economic Reviews, 55.
Pigott T. D. (2001), A Review of Methods of Missing Data, Educational Research & Evaluation, Taylor & Francis, 100-112, 353-383.
Rubin D. B. (1976), Inference and Missing Data, Biometrika, 63, 581-592.
Rubin D. B. (1978), Multiple Imputation in Sample Surveys: a phenomenological Bayesian approach to nonresponse. In Imputation and Editing of Faulty or Missing Survey Data. Washington DC: US Department of Commerce.
Rubin D. B. (1987), Multiple Imputation for Non-response in Surveys, John Wiley and Sons, New York, 546-550.

Rubin D. B. and Schenker N. (1986), Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse, Journal of the American Statistical Association, pp 97-102.
Schafer J. L. (1997), Analysis of Incomplete Multivariate Data, Chapman & Hall, London, pp 87-95.
Stein C. (1956), Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, Berkeley: University of California Press, 197-206.
Tanner and Wong (1987), The Calculation of Posterior Distributions by Data Augmentation, Journal of the American Statistical Association, 82, 528-550.
Tony Ke (2012), James Stein Estimator, www.ieor.berkeley.edu/~kete/uploads/1/2/4/0/12408873/js.pdf