

Missing Data Imputation Method Comparison in Ohio University Student Retention Database

A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Master of Science

Dyah A. Hening

November 2009

2009 Dyah A. Hening. All Rights Reserved.

This thesis titled Missing Data Imputation Method Comparison in Ohio University Student Retention Database by DYAH A. HENING has been approved for the Department of Industrial and Systems Engineering and the Russ College of Engineering and Technology by

David A. Koonce, Associate Professor of Industrial and Systems Engineering

Dennis Irwin, Dean, Russ College of Engineering and Technology

ABSTRACT

HENING, DYAH A., M.S., November 2009, Industrial and Systems Engineering
Missing Data Imputation Method Comparison in Ohio University Student Retention Database (74 pp.)
Director of Thesis: David A. Koonce

Ohio University has been conducting research on first-year-student retention to prevent dropouts (OU Office of Institutional Research, First-Year Students Retention, 2008). Yet the data sets contain more than 20% missing values, which can bias prediction. Missing data also affects the ability to generalize results to the target population. This study categorizes the missing data in each variable into one of three types: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). After the missing data are identified, appropriate methods of handling them are discussed. The proposed methods are validated through models that are developed and tested. The goal of this work is to explore methods of imputing missing data and to apply them to the Ohio University student retention dataset.

Approved: David A. Koonce, Associate Professor of Industrial and Systems Engineering

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
    Problem Statement
CHAPTER 2: BACKGROUND
    Literature Review
    Missing Values Categories
    Methods of Handling Missing Data
        Data deletion
        Single imputation
        Multiple imputation
CHAPTER 3: METHODOLOGY
    Summary of the Original Dataset
    Data Cleaning
    Summary of Imputation Method Procedure
CHAPTER 4: IMPUTATION METHODS COMPARISON
    Random Number Generation
    Data Deletion
    Imputation Result
        SAT
        ACTC
        ACTM
        High School Size
        High School Rank
    Accuracy Evaluation
CHAPTER 5: PREDICTION MODEL
CHAPTER 6: CONCLUSION AND DISCUSSION
    Recommendations for Future Research
REFERENCES

LIST OF TABLES

Table 1: Rubin's Multiple Imputation Efficiency
Table 2: Retention Dataset Variables (Roth, 2008)
Table 3: Variables and Number of Students with Null Values
Table 4: Original Dataset Distribution Analysis
Table 5: ANOVA Test Results for Equal Variance
Table 6: Mean Imputation Results for SAT
Table 7: Standard Deviation Imputation Results for SAT
Table 8: Mean Imputation Results for ACTC
Table 9: Standard Deviation Imputation Results for ACTC
Table 10: Mean Imputation Results for ACTM
Table 11: Standard Deviation Imputation Results for ACTM
Table 12: Mean Imputation Results for High School Size
Table 13: Standard Deviation Imputation Results for High School Size
Table 14: Mean Imputation Results for High School Rank
Table 15: Standard Deviation Imputation Results for High School Rank
Table 16: RMSE for SAT
Table 17: RMSE for ACTC
Table 18: RMSE for ACTM
Table 19: RMSE for High School Size
Table 20: RMSE for High School Rank
Table 21: Variables in predicting Fall Enrollment from Winter model
Table 22: Roth's Predicted Fall Enrollment from Winter Alternating Decision Tree vs. Actual Fall Enrollment
Table 23: Roth's Predicted Fall Enrollment from Winter Logistic Regression vs. Actual Fall Enrollment
Table 24: Roth's Predicted Fall Enrollment from Winter Linear Regression vs. Actual Fall Enrollment for Individual Cases
Table 25: Predicted Fall Enrollment from Winter Alternating Decision Tree vs. Actual Fall Enrollment
Table 26: Predicted Fall Enrollment from Winter Logistic Regression vs. Actual Fall Enrollment
Table 27: Predicted Fall Enrollment from Winter Linear Regression vs. Actual Fall Enrollment for Individual Cases

LIST OF FIGURES

Figure 1: OU Overall First-Year Student Retention
Figure 2: SAT Imputation Mean Comparison Results
Figure 3: SAT Imputation Standard Deviation Comparison Results
Figure 4: ACTC Imputation Mean Comparison Results
Figure 5: ACTC Imputation Standard Deviation Comparison Results
Figure 6: ACTM Imputation Mean Comparison Results
Figure 7: ACTM Imputation Standard Deviation Comparison Results
Figure 8: High School Size Imputation Mean Comparison Results
Figure 9: High School Size Imputation Standard Deviation Comparison Results
Figure 10: High School Rank Imputation Mean Comparison Results
Figure 11: High School Rank Imputation Standard Deviation Comparison Results
Figure 12: Winter Alternating Decision Tree Predicting Fall Enrollment

CHAPTER 1: INTRODUCTION

Quality is an important measure of success for many organizations. Total Quality Management (TQM) is a quality control framework for driving continuous improvement, and it has been implemented both by for-profit organizations and by higher education institutions. Yet the implementation of TQM in non-profit organizations has had very little impact, especially on education quality in higher education institutions (Koch, 2003). This limited impact stems from the special challenges and difficulties that higher education institutions face as non-profit, service-oriented organizations.

The most important phase of TQM implementation is customer identification. In higher education institutions, students are considered customers (Sirvanci, 2004). Customer identification leads to tailoring systems for customer satisfaction, which has been the focus of TQM implementation. Aware of the importance of students as their customers, many universities strive to provide quality education by putting serious effort into preventing student dropouts. One such institution is Ohio University (OU), which has been conducting research on first-year student retention in an effort to improve its quality of service to the student body. Still, the rate of student retention at OU has declined over the last six years, as can be seen in Figure 1 (OU Office, 2008).

Figure 1: OU Overall First-Year Student Retention.

Several research studies related to predicting student retention have been conducted. This research has led to theoretical models of student retention and has highlighted the significant factors affecting it. Tinto (1975) synthesized his theory from social psychology and the economics of education to describe the interaction between individuals and college institutions, and his model led to the development of an attrition model. This model involves a cost-versus-benefit analysis and includes factors covering students' family backgrounds, individual characteristics, past educational experiences, and goal commitment levels.

Problem Statement

In previous research, Roth (2008) developed a model to predict student retention based on datasets of OU student behavior. However, large numbers of values are missing, especially for attributes relevant to the prediction model. Roth used simple mean value imputation to fill in missing values. It can be easily shown that mean value imputation changes the distribution of a given variable. It should be pointed out that most predictive analysis techniques do not include a specific method for handling missing data, so it is necessary to address this problem. The current literature contains many proposed methods for dealing with missing data; yet these techniques are not applicable to every situation and pattern of missing values.

The purpose of this research is to develop a better understanding of how to handle missing data, especially when developing models to predict student attrition from the OU student retention datasets. A comparison of missing data imputation methods will provide evidence for the most appropriate method to use to fill in data for predicting student attrition.
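The claim that mean value imputation changes a variable's distribution can be demonstrated with a short sketch. This is an illustration with invented values, not the thesis data or code; Python is assumed here only because the thesis names no language.

```python
import random
import statistics

random.seed(42)

# Illustrative data, not the OU retention data: an SAT-like score variable
# with roughly 30% of its values deleted completely at random.
scores = [random.gauss(1000, 150) for _ in range(2000)]
observed = [s for s in scores if random.random() >= 0.3]

# Mean imputation: every deleted value is replaced by one constant.
n_missing = len(scores) - len(observed)
imputed = observed + [statistics.mean(observed)] * n_missing

# The mean is unchanged, but the spread of the variable shrinks --
# the distribution has changed shape.
print(round(statistics.stdev(observed), 1), round(statistics.stdev(imputed), 1))
```

Because every imputed case sits exactly at the center of the distribution, the standard deviation of the imputed variable is noticeably smaller than that of the observed values, which is the distortion Roth's mean imputation would introduce.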

CHAPTER 2: BACKGROUND

Literature Review

Missing data are a nuisance for statistical analysis. The main threat for institutional research is that missing data can undermine a study's internal validity. In addition, missing data may affect a study's external validity and limit its generalizability to the target population. Therefore, it is important to explore and identify ways to deal with missing data. According to Cohen et al. (2003), even when investigators employ conventionally appropriate strategies for coping with missing data, different approaches may lead to significantly different conclusions.

The following sections briefly discuss the missing data imputation methods commonly used and applied in this research. First, an overview of missing value categories is introduced. Next, the methods of handling missing values are discussed: data deletion, single imputation, and multiple imputation.

To address missing data appropriately, it is helpful to understand their types and characteristics. The most common cause of missing data is item non-response, which, according to Umbach (2005), can stem from a variety of sources: errors that emerge during coding or data entry, respondents' inability to answer the survey questions, or limitations of the study design.

Missing Values Categories

Gelman and Hill (2007) posit several reasons data may be missing. They group missing data into four types: missing completely at random (MCAR), missing at random (MAR), missingness that depends on unobserved predictors, and missingness that depends on the missing value itself. The last two types can also be considered missing not at random (MNAR). These categories describe the mechanism by which values become missing, not the missing values themselves.

MCAR occurs when every value of a variable has the same probability of being missing. In other words, values are missing randomly throughout the dataset, and there is no reason why any specific value is missing. An example would be a respondent deciding whether to answer a certain survey question by rolling a die and skipping the question whenever a certain number comes up. Non-response is quite common in sampling surveys, and in Gelman and Hill's study the non-response mechanism is assumed to be MCAR (for a more detailed explanation, see Rubin, 1987). When missing data are MCAR, no clue as to what the missing value should be can be derived from the other responses.

MAR, or missing at random, can be considered semi-MCAR. It occurs when the probability that a value is missing depends only on other observed variables, so that among units with the same observed data every value has the same probability of being missing. What distinguishes MAR from MCAR is that under MAR the missing variable can be predicted from other available data. When data are MAR, omitting cases with missing data is acceptable, provided the variables that drive the missingness are accounted for, because doing so limits the bias of the inferences.

The last type of missing data is missing not at random. MNAR can be subcategorized into missing values that depend on unobserved predictors and missing values that depend on the missing value itself. In these cases the likelihood of a value being missing depends on some value. A good example comes from medical studies: when a particular treatment causes discomfort to a patient, the likelihood of that patient dropping out of the study increases (Rubin, 1987).

Another consideration that Rubin (1987) built into his classification is whether the missing data are ignorable. By ignorable he means that the whole variable can be omitted or disregarded in the model building. In MAR cases, the missing data mechanism is ignorable when the affected variables are less important or less related to the model than other variables. This assumption has the same underlying philosophy as the causal framework, in which something can be ignored once sufficient evidence and information have been gathered. In such cases, a few weakly related variables can be omitted. For example, suppose we want to predict someone's athletic capability or performance. A variable recording favorite color would most likely be unrelated to the prediction model, so excluding it would probably have little negative effect on the model's accuracy.

In summary, the three main categories of missing data are MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random). Having examined the primary characteristics of missing data, methods for handling them are discussed next.
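The three mechanisms can be made concrete with a small simulation. This is a hypothetical sketch, not the thesis's own analysis; the variables and thresholds are invented for illustration.

```python
import random

random.seed(0)

# Hypothetical illustration: a score variable whose values are deleted
# under each of the three missingness mechanisms.
scores = [random.gauss(500, 100) for _ in range(1000)]
ages = [random.randint(18, 22) for _ in range(1000)]  # an observed covariate

# MCAR: every value has the same 20% chance of being missing.
mcar = [None if random.random() < 0.2 else s for s in scores]

# MAR: missingness depends only on another OBSERVED variable (age),
# never on the score itself.
mar = [None if (a > 20 and random.random() < 0.4) else s
       for s, a in zip(scores, ages)]

# MNAR: missingness depends on the unseen value itself --
# low scores are more likely to be withheld.
mnar = [None if (s < 450 and random.random() < 0.4) else s for s in scores]

def observed_mean(xs):
    obs = [x for x in xs if x is not None]
    return sum(obs) / len(obs)

# Under MCAR the observed mean stays near the true mean; under MNAR it is
# biased upward, because the analysis only ever sees the higher scores.
print(round(observed_mean(mcar)), round(observed_mean(mar)),
      round(observed_mean(mnar)))
```

Note that for MNAR the mechanism cannot be detected from the observed data alone, which is why the text later treats the MAR-versus-MNAR determination as a subjective task.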

Methods of Handling Missing Data

Data deletion

The simplest way to handle a missing value is to discard the observation containing it. However, in high-dimensional data sets (common in data mining), a significant portion of the observations may have missing values. In addition, discarding data can lead to biased estimates and larger standard errors due to the reduced sample size. According to Gelman and Hill (2007), the discarding-data approach can be divided into three categories: complete-case analysis, available-case analysis, and non-response weighting.

Complete-case analysis excludes any observation with a missing value in either the input or the output data. This method biases the analysis when the missing units differ systematically from the completely observed cases. Consider a study in which participants are less likely to report their weight if they are obese: deleting all observations with missing values would bias the study toward non-obese participants, since they are more likely to provide that data.

To maintain the size of the data set, and possibly remove the bias introduced by missing values, the missing data can be modeled and the missing values imputed. Yet it remains difficult to identify whether missing data are really missing at random (MAR) or depend on unobserved predictors or on the missing values themselves (MNAR). According to McKnight et al. (2007), the assumption of MAR versus MNAR is crucial to determining the best course of action. Since determining whether data are MNAR is a subjective task, researchers should check other studies, conduct follow-up surveys, interview participants, or re-measure the sample units. Although checking other references is the approach of choice, not every study has good references available.

Single imputation

A simple method for supplying missing values is single imputation. There are three types of single imputation, based on the type of replacement value: constant, randomly selected, and non-randomly derived values (McKnight et al., 2007). Constant substitution replaces missing values with constants, such as mean substitution (either the arithmetic mean or an estimated population mean), median substitution, or zero imputation. Random imputation, which uses random values, has two major variants: hot-deck and cold-deck imputation. Non-random imputation derives values from regression, conditional imputation, or data previously recorded from the same subject. Despite their differences, all single-imputation methods share one assumption: that the standard error of the estimate is low.

Constant substitution. Because of its ease and simplicity, constant replacement is the most widely used type of single imputation (McKnight et al., 2007). One constant replacement method is mean imputation, which fills in each missing value with the mean of the observed values. This method is less desirable because it underrepresents extreme values, biasing the analysis by yielding a variable with greater central tendency than should be expected. This invalidates the estimates of variance and covariance, affecting the internal validity of the work.

Another single imputation method is ML estimated mean substitution, which is based on the maximum likelihood (ML) algorithm. This method slightly improves on traditional mean imputation with respect to sensitivity to outliers, and it assumes the data are normally distributed. Although ML substitution provides an estimated population mean (µ) instead of the sample mean, it is still considered less desirable because substantial deviations from the assumed normal distribution yield poor estimates.

The third constant replacement method, median substitution, is used when the data are not normally distributed, in which case the distribution can be skewed, flat, or peaked and cannot be represented by the mean. Median imputation tends to produce larger standard errors, which is not optimal for avoiding type I errors; however, compared to the two previous constant imputation methods, it is better at reducing type II errors (McKnight et al., 2007).

The last common constant replacement method replaces missing values with 0 according to logical rules. If the missing data occur in the outcome variable and the probability of the predictors depends fully on recorded variables, the missing values can be modeled by adding an indicator parameter that takes the value 1 for recorded data and 0 for missing data. For example, in the Ohio University student retention datasets, one data element is the accumulated GPA. If the current GPA is missing, the rule allows a substitution from the previous quarter's GPA; if no GPA was recorded in any previous quarter, the accumulated GPA value is set to zero.

Random imputation. Random imputation of a single variable is needed when more than a small fraction of the data is missing. It replaces missing values with randomly generated ones. These values can come from the available values in the current dataset, known as hot-deck imputation, or from a similar dataset containing matching variables, known as cold-deck imputation. In either case, the replacement values are generated from the available data.

According to McKnight et al. (2007), there are several strategies for hot-deck imputation. The first is simple random imputation: imputing each missing value with a value randomly drawn from the available data. If the missing data are MCAR, there is no way to infer what the missing value should be, but if the observed values occur in the same proportions as in the sampled population, drawing replacements from this distribution will not introduce bias into the variable. This approach is a good starting point for preliminary data analysis. A second strategy is hot-deck within adjustment cells: blocking on the relevant covariates and imputing the missing data with values randomly drawn from the available data within each cell. Yet another approach uses the nearest neighbor's value as the replacement, imputing the missing value from the closest matching case in the available data. For example, if the ethnicity of a participant is missing and the participant belongs to a group with a similar ethnicity, the missing value is imputed with that group's ethnicity.

Matching and hot-deck imputation assign each missing unit (y) a value from cases with similar predictor values (x) in the observed data. Matching can become challenging when the matching vectors must be built from a small amount of available data. To solve this problem, random imputation from the five closest resolved cases, or from other available information, can be used. One can also predict the missing values from several fully observed variables, then match the predicted values and impute them into the dataset. The most common problem with this method is that it underestimates the standard errors because variability is reduced: the missing data are imputed with values that already exist in the dataset. According to Seastrom, Kaufman, and Lee (2002), hot-deck imputation preserves the distribution of the original data and increases the variance relative to mean imputation. Consequently, according to Mundfrom and Whitcomb (1998), the estimated prediction accuracy depends heavily on which value happens to be randomly selected, since it varies from one selection to another. In their research, Mundfrom and Whitcomb ran 1,000 repetitions of hot-deck imputation and averaged the 1,000 results for each of the 99 entries to obtain the values used in their study.

Cold-deck imputation is similar to hot-deck imputation, but another set or sample is used to impute the data. Although this method is intended to solve the problems that occur in hot-deck imputation, it may still increase the probability of type I errors due to the small standard error (McKnight et al., 2007).
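A minimal hot-deck sketch may help; this is an assumed implementation, not code from the thesis, and the score values are invented. Each missing value is filled with a donor drawn at random from the observed values in the same dataset, so every imputed value already exists in the data.

```python
import random
import statistics

random.seed(7)

def hot_deck(values):
    """Replace each None with a random donor drawn from the observed values."""
    donors = [v for v in values if v is not None]
    return [random.choice(donors) if v is None else v for v in values]

data = [510, None, 480, 530, None, 495, 520, None, 505, 470]
filled = hot_deck(data)
print(filled)

# Contrast with mean imputation: mean imputation always shrinks the
# standard deviation, whereas hot-deck donors keep the observed spread.
donors = [v for v in data if v is not None]
mean_filled = [statistics.mean(donors) if v is None else v for v in data]
print(round(statistics.stdev(filled), 1), round(statistics.stdev(mean_filled), 1))
```

This also makes the drawback concrete: because donors are reused, repeated runs give different imputed values, which is why Mundfrom and Whitcomb averaged many repetitions.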

20 Nonrandom imputation. Nonrandom imputation can be divided into single condition and multiple conditions. Single condition methods consist of: conditional mean, last value carried forward, and next value carried backward. Multiple conditions are used when there are more than one single variable needed to provide more information for each missing value case. Conditional mean imputation is based on a single condition and uses a classification variable to estimate the mean to substitute the missing values. It emphasizes the relationship between the classification variable and the missing data. If the relationships are weak, the mean imputation resembles the method used in the hot- deck imputation. Last value carried forward (LVCF) replaces the missing data value with the previous available data, from the same subject or research participant under a certain time. This is based on the assumption that the most recent available observation is the best guess for subsequent missing values. To use this method, a prior observed value for the observation must be available. For example, if the academic record of a student is missing the GPA for a term, we would substitute the GPA of the most recent term. Next value carried backward (NVCB) uses a similar process as the LVCF, where the imputation of the missing values on the early observation can be filled behind with the next available data. The use of these methods is limited to a subject s own data that are observed continuously in a certain time period. Multiple condition nonrandom value imputation uses regression and error. A better result can be produced if the value for the missing variable can be predicted with a

21 regression against the observed cases. Random regression imputation uses a regression model to predict the missing values. This strategy uses uncertainty by adding the prediction error into the regression. Overall, although the single imputation method can be easily implemented, several of its weaknesses could lead to distortion of the variable distribution. This distortion could then lead to underestimation of the standard deviation, which in turn would result in underestimation of the standard errors, and thus increased type I errors. Multiple imputation Multiple imputation is a method of supplying multiple values for a missing value. By utilizing Markov Chain Monte Carlo (MCMC) simulation, multiple values can be generated (Mcknight et al., 2007). MCMC is using computer simulation of Markov chains where the posterior distribution of the statistical inference problem is the asymptotic( (Muller, 2003). The imputed values can be analyzed for mean and variation. These statistics can then be used to derive expected values and associated confidence intervals. Two common methods of multiple imputation using MCMC-method-derived Bayesian estimated values are routine multivariate imputation and iterative regression imputation. In routine multivariate imputation, a fitted multivariate model is built using all the variables containing missing values. The predictors (x) and the outcome (y) are considered vectors. This method has some difficulties, one of them being that much effort is required to set up a reasonable multivariate regression model. The t-distribution or multivariate normal distribution is commonly used for continuous outcomes, while the

multinomial distribution is used for discrete outcomes. According to Rubin (1987), the efficiency of an estimate (relative efficiency, in %) based on m imputations is

RE = 1 / (1 + γ/m)    (1)

where γ is the rate of missing information for the estimated quantity. The multiple imputation efficiencies for various values of m and γ are shown in Table 1.

Table 1: Rubin's Multiple Imputation Efficiency
m \ γ | 0.1 | 0.3 | 0.5 | 0.7 | 0.9
3     | 97  | 91  | 86  | 81  | 77
5     | 98  | 94  | 91  | 88  | 85
10    | 99  | 97  | 95  | 93  | 92
20    | 100 | 99  | 98  | 97  | 96

According to Rubin (1987), if the rate of missing information is not very high, there is little advantage to producing and analyzing more than a few imputed datasets. Because of the missing data, multiple imputation performance across the imputed data sets reflects statistical uncertainty. Rubin estimated the rate of missing information in order to provide diagnostic measures for the multiple imputation procedure that indicate how strongly the estimated quantity is influenced by missing data. The estimated rate of missing information, in its large-sample form, is

γ = r / (1 + r)    (2)

where

r = (1 + 1/m) B / Ū    (3)

is the relative increase in variance due to nonresponse, with B the between-imputation variance and Ū the within-imputation variance defined below. The rate of missing information γ and the number of imputations m determine the relative efficiency of the MI inference (Rubin, 1987). Multiple imputation has three steps: imputation, routine analysis, and parameter estimation from the results. The first step, the imputation process, is similar to single imputation. What makes multiple imputation different from single imputation is that there is no necessary restriction on which single imputation procedure to use (McKnight, McKnight, Sidani, & Figueredo, 2007). The values may be imputed using random normal values, hot-deck values, or MCMC-derived Bayesian estimates. However, it is recommended to use only one single imputation method within a multiple imputation run, since each single imputation method yields different results. The second step is to analyze the complete data sets produced by imputation. In the literature, there is no specific preference for the types of statistical analyses that can be performed on the multiply imputed datasets. The analyses used in this research are means, standard deviations, and variances. Following the statistical analysis, there are several steps in the parameter estimation to compute the overall standard errors. First, the within-imputation variance must be computed, which is the mean of the squared standard errors related to all the parameters of

interest in the statistical model. Each parameter estimate is referred to as Q_i. The within-imputation variance, referred to as Ū, is the average of the per-imputation variances (squared standard errors); it represents the variability that is calculated within each of the m imputations. The next step is to compute the between-imputation variance, referred to as B in Rubin's (1987) nomenclature. The formula for calculating B is given by Equation 4:

B = (1/(m − 1)) Σ_{i=1}^{m} (Q_i − Q̄)²    (4)

Between-imputation variance is essentially the sum of the squared deviations of each estimate Q_i from their average Q̄, divided by the number of imputed data sets minus 1. Next, the total variance is the sum of the within-imputation variance Ū and the between-imputation variance B. However, according to Rubin, the between-imputation variance needs to be weighted according to the number of imputations performed. Thus, the total variance T is calculated by:

T = Ū + (1 + 1/m) B    (5)
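The combining rules in Equations (1) through (5) can be verified numerically. The sketch below uses made-up estimates and standard errors from m = 5 hypothetical imputed datasets; the numbers are purely illustrative, not drawn from the retention data.

```python
import statistics

# Rubin's (1987) combining rules, Equations (1)-(5): within-imputation
# variance U_bar (mean of the squared standard errors), between-imputation
# variance B, total variance T, relative increase in variance r, rate of
# missing information gamma, and relative efficiency. The five estimates and
# standard errors below are made-up numbers for illustration only.

estimates = [10.2, 9.8, 10.5, 10.1, 9.9]     # Q_i from m = 5 imputed datasets
std_errors = [0.50, 0.52, 0.49, 0.51, 0.50]  # per-imputation standard errors

m = len(estimates)
q_bar = statistics.mean(estimates)
U_bar = statistics.mean(se ** 2 for se in std_errors)       # within-imputation variance
B = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)      # Equation (4)
T = U_bar + (1 + 1 / m) * B                                 # Equation (5)
r = (1 + 1 / m) * B / U_bar                                 # Equation (3)
gamma = r / (1 + r)                                         # Equation (2), large-sample form
efficiency = 1 / (1 + gamma / m)                            # Equation (1)

print(round(q_bar, 3), round(T, 5), round(gamma, 3), round(efficiency, 3))
```

For m = 5 and γ = 0.5, the same efficiency formula reproduces the 91% entry in Table 1.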
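For contrast with these multiple-imputation formulas, the single-condition LVCF and NVCB methods described at the start of this section amount to a forward or backward fill over a subject's time-ordered values. A minimal pure-Python sketch, where the function names are mine and None marks a missing value:

```python
# Last value carried forward (LVCF) and next value carried backward (NVCB)
# for one subject's time-ordered observations; None marks a missing value.

def lvcf(series):
    """Replace each missing value with the most recent observed value."""
    filled, last = [], None
    for v in series:
        if v is None:
            filled.append(last)   # stays None if nothing observed yet
        else:
            last = v
            filled.append(v)
    return filled

def nvcb(series):
    """Replace each missing value with the next observed value."""
    return list(reversed(lvcf(list(reversed(series)))))

# Example: a student's term GPAs with the third term missing.
gpas = [3.1, 3.4, None, 3.0]
print(lvcf(gpas))  # [3.1, 3.4, 3.4, 3.0]
print(nvcb(gpas))  # [3.1, 3.4, 3.0, 3.0]
```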

CHAPTER 3: METHODOLOGY

The purpose of this research is to develop a protocol for determining how to handle missing data, particularly in the student retention dataset. The dataset used in this research is the 2006 freshman class, provided by the Ohio University Office of Institutional Research. The following sections discuss the contents of the dataset, the procedure used to clean it by identifying the missing values, and a summary of the imputation method comparison procedure.

Summary of the Original Dataset

In order to conduct this research, a dataset from the Ohio University Office of Institutional Research containing admissions and involvement data for the 2006 freshman class was retrieved. This dataset was used by Roth (2008) to create a model predicting first-year Ohio University student attrition. The original data were retrieved from four sources: student applications to Ohio University, the Student Information System (SIS), the students' financial aid records, and a student involvement survey carried out by the Office of Institutional Research. The total number of variables after unification is 66, with 4061 students. Roth created a list of the variables included in the original dataset, which can be seen in Table 2. The table includes each variable's description, the variable type, the source of the data, and when in the school year timeline the variable becomes available.

Table 2: Retention Dataset Variables (Roth, 2008)

Table 2: (continued)

Table 2: (continued)

Table 2: (continued)
Key: Bin = Binary, Dec = Decimal, Int = Integer, Nom = Nominal; A = Application, S = SIS, F = Financial Aid Record, I = Involvement Survey

As can be seen in Table 2, the original data were retrieved from four sources: student applications to Ohio University, the Student Information System (SIS), the students' financial aid records, and a student involvement survey carried out by the Office of Institutional Research. Student applications contain students' demographics,

high school information, and standardized test scores. SIS, the second source, is a software program that Ohio University uses to manage student information; it provides students' registration information and academic records from past terms to the present. The third source is the students' financial aid records. These variables were entered into university databases through each student's Free Application for Federal Student Aid, or FAFSA. The last source was a student involvement survey conducted by the Office of Institutional Research. This survey was conducted at the end of Winter Quarter and provides information on students' attitudes and behaviors related to social and academic involvement in their first year at Ohio University.

Data Cleaning

The dataset from Institutional Research described above was received as one dataset that combined the information from all four sources, but it contained a large amount of missing data. Roth (2008) had taken several steps to clean the data, keeping the valid data in preparation for modeling student attrition. Her second step after data cleaning was data imputation, in which several simple imputation methods, such as mean and zero imputation, were utilized. Complete-case analysis, the elimination of any entry with a missing data point, was not utilized because the reduced sample size it produces can bias results (Gelman & Hill, 2007). In this research, the comparison of data imputation techniques is the main focus, in order to determine the best imputation technique to use in building the prediction model. Table 3 shows a summary of

the variables containing missing data points and the number of students missing information in each category.

Table 3: Variables and Number of Students with Null Values
Variable | # of missing observations | % of 4061 students | Comments
HS Size | 829 | 20.41% | taken
HS Percentile Rank | 830 | 20.43% | taken
HS GPA | 128 | 3.15% | ignored
State | 14 | 0.35% | ignored
County Code | 375 | 9.24% | ignored
ACT Composite | 488 | 12.02% | taken
ACT Math | 488 | 12.02% | taken
SAT Total | 1847 | 45.49% | taken
Expected Family Contribution | 1219 | 30.02% | ignored
Involvement Survey Variables (28 variables) | 815 | 20.07% | ignored

After the variables with missing values were identified, the number of student entries was analyzed. Comparing student enrollment statuses in the winter, spring, and sophomore fall quarters revealed stopout behavior, which reduced the number of student entries. A stopout student is one who demonstrates non-permanent attrition behavior, that is, who drops out one quarter only to return in a following quarter (Roth, 2008). Since the model addresses permanent attrition, 22 stopout students were removed from the dataset, resulting in 4,039 entries for the model.

Summary of Imputation Method Procedure

For the imputation method comparison, the five variables with the largest numbers of missing values in the original data were chosen. As can be seen in Table 3, Expected

Family Contribution has the highest rate of missing information, with 1219 missing values. This variable is considered missing not at random because the information was missing due to students not filling out a FAFSA, so it was excluded from this research. Another variable with many missing values that was excluded from the imputation method comparison is the Involvement Survey, with 815 missing values. The involvement survey was conducted at the end of Winter Quarter, which can be considered too late for a typical retention intervention, which usually takes place at the time of fall quarter pre-registration. The remaining variables with missing data were then chosen based on data available at the beginning of fall quarter 2007. The five variables with missing data chosen for the research are: SAT Total (1847 missing values), High School Percentile Rank (830 missing values), High School Size (829 missing values), ACT Composite (488 missing values), and ACT Math (488 missing values). For each of these variables, a new dataset of 10 replications with similar distributional characteristics was randomly generated. From each of these sets, values were removed according to MCAR. For each variable, the missing values were then imputed with each of five different methods: mean imputation, median imputation, zero value imputation, hot-deck imputation, and multiple imputation. After imputing the missing values, each set was analyzed for accuracy of the imputed values: each imputed value was compared to the removed value, and the mean and variance of each imputed variable set were compared to the original to determine whether they had been affected.
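The procedure just described, MCAR deletion followed by imputation and a mean/standard-deviation comparison, can be sketched for one variable. The lognormal parameters, the pure-Python implementation, and the method set below are illustrative stand-ins, not the thesis's MINITAB workflow:

```python
import random, statistics

# Sketch of the comparison loop for one variable: delete values completely at
# random (MCAR), impute with each simple method, then report the percentage
# change in mean and standard deviation relative to the complete data.
# The lognormal parameters are illustrative, not fitted values.

random.seed(2006)
complete = [random.lognormvariate(7.0, 0.13) for _ in range(4061)]  # SAT-like scale
missing_idx = set(random.sample(range(len(complete)), 1847))        # MCAR deletion
observed = [v for i, v in enumerate(complete) if i not in missing_idx]

mean_obs, median_obs = statistics.mean(observed), statistics.median(observed)
fill_value = {
    "mean": lambda: mean_obs,
    "median": lambda: median_obs,
    "zero": lambda: 0.0,
    "hot-deck": lambda: random.choice(observed),  # random observed donor
}

def impute(method):
    return [fill_value[method]() if i in missing_idx else v
            for i, v in enumerate(complete)]

for method in ("mean", "median", "zero", "hot-deck"):
    imp = impute(method)
    d_mean = abs(statistics.mean(imp) - statistics.mean(complete)) / statistics.mean(complete)
    d_sd = abs(statistics.stdev(imp) - statistics.stdev(complete)) / statistics.stdev(complete)
    print(f"{method:8s} mean delta {100 * d_mean:.2f}%  sd delta {100 * d_sd:.2f}%")
```

With roughly 45% of the values set to zero, zero imputation shifts the mean by about 45%, while mean imputation preserves the mean but shrinks the standard deviation, mirroring the pattern reported in the results below.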

Based on the comparison results, for each variable in the Ohio University student retention dataset, domain knowledge was used to classify the reasons for the missing values, and the best imputation method was implemented. Finally, a prediction model of student retention was built; this model was then compared with Roth's (2008) model for prediction accuracy.

CHAPTER 4: IMPUTATION METHODS COMPARISON

In this research, a new dataset with similar distributional characteristics was generated randomly, with 10 replications. Before a dataset could be generated, the distributional characteristics of the five variables needed to be determined. MINITAB was used to fit distributions to the variables; the results can be seen in Table 4.

Table 4: Original Dataset Distribution Analysis
Variable | MINITAB distribution
SAT | Lognormal
ACTC | Lognormal
ACTM | Beta
HsSize | 3-parameter lognormal
HsRank | 3-parameter Weibull

Random Number Generation

After fitting the distributions, the random numbers for each variable were generated using MINITAB. This random number dataset was tested to see whether it was valid for the research. The first test, ANOVA, checked for statistical differences between the original dataset and the replicated random number generated dataset; its purpose is to detect mean differences using variance. ARENA was considered for this research, but the dataset that ARENA generated did not pass the first test: for most of the generated values, comparison with the original data gave p-values less than 0.05 (except for ACTM), so the null hypothesis H0 of equal variance was rejected at α = 0.05.
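Generating a replicate dataset with similar distributional characteristics can be illustrated in a few lines. This sketch estimates lognormal parameters from the log of an "original" sample and redraws a sample of the same size; it is a pure-Python stand-in for the MINITAB fitting described above, with made-up parameters:

```python
import math, random, statistics

# Regenerate a variable with similar distributional characteristics:
# estimate lognormal parameters (mu, sigma of the log) from the observed
# sample, then draw a replicate of the same size. Parameters are illustrative.

random.seed(42)
original = [random.lognormvariate(7.0, 0.13) for _ in range(4061)]  # stand-in for SAT

logs = [math.log(v) for v in original]
mu_hat = statistics.mean(logs)     # lognormal location estimate
sigma_hat = statistics.stdev(logs) # lognormal scale estimate

replicate = [random.lognormvariate(mu_hat, sigma_hat) for _ in range(len(original))]

# The replicate should track the original's moments closely.
print(round(statistics.mean(original)), round(statistics.mean(replicate)))
```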

At the beginning of the research, ten replications were decided upon. The number of replications was then increased to 100, and the 10 replications with the highest Bartlett's p-values for the test of equal variance were chosen. This was done in MINITAB. The results of the test for equal variance, with the 10 highest p-values for each variable, can be seen in Table 5.

Table 5: ANOVA Test Results for Equal Variance (Bartlett's test p-values)
Random number set | SAT | ACTC | ACTM | HsSize | HsRank
RNM1 | 0.931 | 0.831 | 0.785 | 0.960 | 0.882
RNM2 | 0.994 | 0.703 | 0.918 | 0.905 | 0.965
RNM3 | 0.904 | 0.868 | 0.973 | 0.917 | 0.950
RNM4 | 0.985 | 0.838 | 0.953 | 0.889 | 0.927
RNM5 | 0.948 | 0.862 | 0.593 | 0.988 | 0.788
RNM6 | 0.825 | 0.961 | 0.991 | 0.886 | 0.985
RNM7 | 0.948 | 0.948 | 0.719 | 0.995 | 0.929
RNM8 | 0.996 | 0.706 | 0.631 | 0.842 | 0.957
RNM9 | 0.857 | 0.746 | 0.847 | 0.832 | 0.826
RNM10 | 0.837 | 0.755 | 0.822 | 0.861 | 0.964

Bartlett's test (Snedecor and Cochran, 1983) is used to test whether k samples have equal variances. Since the fitted distributions of the variables are non-normal, Bartlett's test was used due to its sensitivity to departures from the normal distribution. The null hypothesis (H0) is that the original data have the same variance as the randomly generated dataset. The significance level used is α = 0.05. The results in Table 5 show that the null hypothesis is not rejected, meaning these random number datasets show no statistically significant differences from the original data.
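Bartlett's statistic can be computed directly from its textbook formula. The sketch below applies it, as in the replication check, to an "original" sample and a regenerated one drawn from the same distribution; the sample sizes and normal parameters are illustrative.

```python
import math, random, statistics

# Bartlett's test statistic for equality of variances across k samples,
# computed from its textbook formula (Snedecor & Cochran, 1983).

def bartlett_stat(samples):
    k = len(samples)
    n = [len(s) for s in samples]
    N = sum(n)
    var = [statistics.variance(s) for s in samples]
    pooled = sum((ni - 1) * vi for ni, vi in zip(n, var)) / (N - k)
    num = (N - k) * math.log(pooled) - sum((ni - 1) * math.log(vi)
                                           for ni, vi in zip(n, var))
    corr = 1 + (sum(1 / (ni - 1) for ni in n) - 1 / (N - k)) / (3 * (k - 1))
    return num / corr

random.seed(7)
original = [random.gauss(1095, 145) for _ in range(500)]     # SAT-like sample
regenerated = [random.gauss(1095, 145) for _ in range(500)]  # same distribution

stat = bartlett_stat([original, regenerated])
# Under equal variances the statistic follows a chi-square with k - 1 df,
# so values near zero support the null hypothesis of equal variance.
print(round(stat, 3))
```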

Data Deletion

After the ten sets of random numbers were verified, the next step was data deletion according to the three characteristics of missing data: MCAR, MAR, and MNAR. The characteristic for each variable was determined based on the reason its data are missing. All the variables with missing data contain discrete data, except high school rank. The classification of each variable's characteristic is discussed next. The missing data in the ACTC, ACTM, and SAT variables are considered MAR because of the reason the values are missing. High school students can take the SAT, the ACT, or both, so the probability of missing values is quite high: for a student who takes only one test, the score for the other test is missing from the original dataset. Whether an SAT or ACT score is missing does not depend on the score itself; the missingness reflects which test was taken, not the result. Why a student took only one test, or both, is unclear, so the student's choice of test cannot be determined from the data. The missing values for high school rank and high school size have the characteristics of MCAR, or Missing Completely at Random: not all high school information was provided for the dataset, and the missing values in high school rank and high school size are not related to any variables in the dataset. After identifying the characteristic of each variable, data were then randomly deleted from each newly generated dataset according to the number of missing values. For SAT, the total deleted from each generated dataset was 1847. The total deleted

values for the other variables were: 830 values for High School Percentile Rank, 829 values for High School Size, and 488 values each for ACT Composite and ACT Math. In this research, the values were removed from the complete generated datasets using the MCAR method. MCAR was used because the variables are treated as independent and there is not enough information to establish whether they are MAR or MNAR; since the missingness was independent for each variable, MCAR was considered the appropriate deletion method.

Imputation Result

After the data were deleted, the imputation methods were applied. The tables that follow summarize each imputation result for each random number set. Two statistics are compared to the complete dataset: the mean and the standard deviation. The tables for each variable report percentage differences between the mean and standard deviation of the complete dataset and those of the imputed dataset. Because of the random factor, the numbers can differ unexpectedly from dataset to dataset; to avoid bias in the analysis, a total average difference was calculated to show the average difference across the randomly generated datasets. A small percentage difference indicates that a method can be considered an appropriate imputation method. It can be seen in the tables that zero imputation has a very large percentage difference for all variables; thus, zero imputation is excluded from the graphical summaries to present the results effectively and avoid misinterpretation.

SAT

Tables 6 and 7 show a summary of the SAT imputation results for the 10 sets. In the mean comparison, as expected, mean imputation has the lowest total percentage difference from the initial values. The second lowest total percentage difference belongs to multiple imputation (0.205%), followed by hot-deck imputation (0.265%). Although mean imputation has the lowest total average mean difference, it has a large standard deviation percentage difference. Multiple imputation and hot-deck imputation are superior to mean and median imputation at preserving the standard deviation; multiple imputation has the lowest standard deviation percentage difference, with a 1.61% average.

Table 6: Mean Imputation Results for SAT (delta % = percentage difference from the initial value)
Dataset | Initial mean | Mean imp. | delta % | Median | delta % | Zero | delta % | Hot-deck | delta % | MI | delta %
RN1 | 1093.3 | 1092.3 | 0.091% | 1087.6 | 0.521% | 595.53 | 45.529% | 1096.7 | 0.311% | 1094.6 | 0.119%
RN2 | 1100.8 | 1102.9 | 0.191% | 1099.7 | 0.100% | 601.27 | 45.379% | 1101.1 | 0.027% | 1100.04 | 0.069%
RN3 | 1095.4 | 1100.5 | 0.466% | 1095.3 | 0.009% | 599.99 | 45.226% | 1098.5 | 0.283% | 1099 | 0.329%
RN4 | 1096.2 | 1096.7 | 0.046% | 1092.7 | 0.319% | 597.88 | 45.459% | 1096.5 | 0.027% | 1096.88 | 0.062%
RN5 | 1095.5 | 1097.2 | 0.155% | 1091.6 | 0.356% | 598.17 | 45.398% | 1084 | 1.050% | 1085.88 | 0.878%
RN6 | 1091.6 | 1094.6 | 0.275% | 1085.5 | 0.559% | 596.78 | 45.330% | 1094.6 | 0.275% | 1094.62 | 0.277%
RN7 | 1099 | 1099.3 | 0.027% | 1096 | 0.273% | 599.35 | 45.464% | 1096 | 0.273% | 1098.56 | 0.040%
RN8 | 1095.2 | 1096.7 | 0.137% | 1093.2 | 0.183% | 597.93 | 45.404% | 1099.1 | 0.356% | 1096.98 | 0.163%
RN9 | 1098 | 1098 | 0.000% | 1094.8 | 0.291% | 598.6 | 45.483% | 1098.5 | 0.046% | 1097.36 | 0.058%
RN10 | 1095.8 | 1095.9 | 0.009% | 1090.9 | 0.447% | 597.45 | 45.478% | 1095.8 | 0.000% | 1096.42 | 0.057%
Average | | | 0.140% | | 0.306% | | 45.415% | | 0.265% | | 0.205%

Table 7: Standard Deviation Imputation Results for SAT (delta % = percentage difference from the initial value)
Dataset | Initial StDev | Mean imp. | delta % | Median | delta % | Zero | delta % | Hot-deck | delta % | MI | delta %
RN1 | 144.6 | 108.5 | 24.97% | 108.6 | 24.90% | 554.71 | 283.62% | 148.1 | 2.42% | 146.0941 | 1.03%
RN2 | 149.3 | 109.4 | 26.72% | 109.5 | 26.66% | 560.04 | 275.11% | 146 | 2.21% | 147.2277 | 1.39%
RN3 | 142.7 | 107.2 | 24.88% | 107.4 | 24.74% | 558.47 | 291.36% | 144.2 | 1.05% | 146.1817 | 2.44%
RN4 | 144 | 104.7 | 27.29% | 104.8 | 27.22% | 556.1 | 286.18% | 144.7 | 0.49% | 144.2399 | 0.17%
RN5 | 146.1 | 109.9 | 24.78% | 110.1 | 24.64% | 557.35 | 281.49% | 130.8 | 10.47% | 135.0646 | 7.55%
RN6 | 155 | 117.6 | 24.13% | 118.1 | 23.81% | 557.69 | 259.80% | 157.7 | 1.74% | 157.4017 | 1.55%
RN7 | 145.1 | 108.6 | 25.16% | 108.7 | 25.09% | 558.17 | 284.68% | 147.3 | 1.52% | 146.2845 | 0.82%
RN8 | 147.8 | 110.4 | 25.30% | 110.5 | 25.24% | 557.24 | 277.02% | 149.5 | 1.15% | 147.6167 | 0.12%
RN9 | 147.9 | 109.8 | 25.76% | 109.8 | 25.76% | 557.71 | 277.09% | 146.2 | 1.15% | 147.4026 | 0.34%
RN10 | 146.4 | 109.3 | 25.34% | 109.4 | 25.27% | 556.59 | 280.18% | 145.5 | 0.61% | 147.4545 | 0.72%
Average | | | 25.43% | | 25.33% | | 279.65% | | 2.28% | | 1.61%

In Figure 2, the mean difference for each imputation varies due to the randomness of the generated datasets. With mean imputation, small differences between the imputed mean and the complete dataset mean are expected. This is reflected in the low values mean imputation shows across the generated datasets, except in random number generated dataset 3 (RN3), which has a high mean difference compared to the other imputed datasets. Still, the overall performance of mean imputation, compared to the initial dataset mean, is superior to the other imputation methods.

Figure 2: SAT Imputation Mean Comparison Results

Figure 3: SAT Imputation Standard Deviation Comparison Results

Mean imputation and median imputation show similar behavior in the standard deviation comparison. This can be seen in Figure 3, where mean and median imputation perform poorly compared to hot-deck imputation and multiple imputation.

ACTC

Tables 8 and 9 show a summary of the ACTC imputation results for the 10 random number generated datasets. In Table 9, it can be seen that the multiple imputation method has the lowest average standard deviation difference compared to the other imputation methods. In the mean comparison, as expected, the mean

imputation method has the lowest percentage difference. Yet, the difference between the total average mean differences for multiple imputation and hot-deck imputation is only about 0.01%: multiple imputation has the second lowest total average mean difference, 0.084%, and the third lowest is hot-deck imputation, with 0.093%. Multiple imputation still outperformed the other imputation methods in standard deviation difference, having the lowest total standard deviation difference (0.41%).

Table 8: Mean Imputation Results for ACTC (delta % = percentage difference from the initial value)
Dataset | Initial mean | Mean imp. | delta % | Median | delta % | Hot-deck | delta % | MI | delta %
RN1 | 23.47 | 23.431 | 0.166% | 23.379 | 0.388% | 23.443 | 0.115% | 23.4256 | 0.189%
RN2 | 23.437 | 23.452 | 0.064% | 23.397 | 0.171% | 23.422 | 0.064% | 23.4548 | 0.076%
RN3 | 23.396 | 23.353 | 0.184% | 23.311 | 0.363% | 23.369 | 0.115% | 23.368 | 0.120%
RN4 | 23.549 | 23.529 | 0.085% | 23.465 | 0.357% | 23.502 | 0.200% | 23.5254 | 0.100%
RN5 | 23.354 | 23.355 | 0.004% | 23.312 | 0.180% | 23.371 | 0.073% | 23.3604 | 0.027%
RN6 | 23.442 | 23.456 | 0.060% | 23.401 | 0.175% | 23.424 | 0.077% | 23.453 | 0.047%
RN7 | 23.474 | 23.469 | 0.021% | 23.413 | 0.260% | 23.471 | 0.013% | 23.4592 | 0.063%
RN8 | 23.467 | 23.448 | 0.081% | 23.394 | 0.311% | 23.459 | 0.034% | 23.4404 | 0.113%
RN9 | 23.527 | 23.528 | 0.004% | 23.464 | 0.268% | 23.488 | 0.166% | 23.5216 | 0.023%
RN10 | 23.529 | 23.515 | 0.060% | 23.453 | 0.323% | 23.512 | 0.072% | 23.511 | 0.077%
Average | | | 0.073% | | 0.279% | | 0.093% | | 0.084%

Table 9: Standard Deviation Imputation Results for ACTC (delta % = percentage difference from the initial value)
Dataset | Initial StDev | Mean imp. | delta % | Median | delta % | Hot-deck | delta % | MI | delta %
RN1 | 3.486 | 3.268 | 6.25% | 3.271 | 6.17% | 3.49 | 0.11% | 3.477889 | 0.23%
RN2 | 3.451 | 3.234 | 6.29% | 3.238 | 6.17% | 3.44 | 0.32% | 3.449418 | 0.05%
RN3 | 3.565 | 3.333 | 6.51% | 3.335 | 6.45% | 3.528 | 1.04% | 3.54766 | 0.49%
RN4 | 3.579 | 3.358 | 6.17% | 3.362 | 6.06% | 3.555 | 0.67% | 3.575497 | 0.10%
RN5 | 3.443 | 3.207 | 6.85% | 3.209 | 6.80% | 3.415 | 0.81% | 3.422241 | 0.60%
RN6 | 3.48 | 3.259 | 6.35% | 3.262 | 6.26% | 3.46 | 0.57% | 3.460935 | 0.55%
RN7 | 3.498 | 3.292 | 5.89% | 3.296 | 5.77% | 3.491 | 0.20% | 3.502247 | 0.12%
RN8 | 3.577 | 3.344 | 6.51% | 3.347 | 6.43% | 3.56 | 0.48% | 3.54803 | 0.81%
RN9 | 3.554 | 3.317 | 6.67% | 3.321 | 6.56% | 3.517 | 1.04% | 3.526938 | 0.76%
RN10 | 3.528 | 3.326 | 5.73% | 3.33 | 5.61% | 3.521 | 0.20% | 3.540803 | 0.36%
Average | | | 6.32% | | 6.23% | | 0.54% | | 0.41%

Figure 4: ACTC Imputation Mean Comparison Results

As can be seen in Figure 4, the mean difference for each imputation varies due to the randomness of the random number generated datasets. It can be seen that median imputation has the highest mean difference compared to the other methods.

Figure 5: ACTC Imputation Standard Deviation Comparison Results

As can be seen in Figure 5, the standard deviation differences for all imputations tend to show similar behavior. Multiple imputation and hot-deck imputation have very low standard deviation differences compared to mean and median imputation. Figure 5 shows that in seven out of ten imputation results, multiple imputation has the smallest standard deviation difference compared to hot-deck, mean, and median imputation.

ACTM

Tables 10 and 11 show summaries of the ACTM imputation results for the ten random number generated datasets. As with the ACTC results, the multiple imputation method has the lowest average standard deviation difference compared to the other imputation methods. Again, mean imputation, as expected, has the lowest percentage difference in the mean comparison results, although the hot-deck imputation method has nearly the same low percentage difference (0.0572%, versus 0.0568% for mean imputation). Multiple imputation has the third lowest total average mean difference, 0.082%. Multiple imputation still outperformed the other

imputation methods in standard deviation difference, having the lowest total standard deviation difference (0.367%).

Table 10: Mean Imputation Results for ACTM (delta % = percentage difference from the initial value)
Dataset | Initial mean | Mean imp. | delta % | Median | delta % | Hot-deck | delta % | MI | delta %
RN1 | 22.924 | 22.922 | 0.009% | 22.931 | 0.031% | 22.923 | 0.004% | 22.9046 | 0.085%
RN2 | 22.898 | 22.923 | 0.109% | 22.932 | 0.148% | 22.911 | 0.057% | 22.911 | 0.057%
RN3 | 22.81 | 22.802 | 0.035% | 22.826 | 0.070% | 22.81 | 0.000% | 22.8052 | 0.021%
RN4 | 23.568 | 23.569 | 0.004% | 23.501 | 0.284% | 23.532 | 0.153% | 23.567 | 0.004%
RN5 | 22.86 | 22.885 | 0.109% | 22.899 | 0.171% | 22.869 | 0.039% | 22.909 | 0.214%
RN6 | 22.867 | 22.866 | 0.004% | 22.882 | 0.066% | 22.874 | 0.031% | 22.8638 | 0.014%
RN7 | 22.909 | 22.899 | 0.044% | 22.911 | 0.009% | 22.927 | 0.079% | 22.879 | 0.131%
RN8 | 22.942 | 22.95 | 0.035% | 22.956 | 0.061% | 22.948 | 0.026% | 22.9306 | 0.050%
RN9 | 22.881 | 22.839 | 0.184% | 22.859 | 0.096% | 22.846 | 0.153% | 22.8344 | 0.204%
RN10 | 22.82 | 22.828 | 0.035% | 22.848 | 0.123% | 22.827 | 0.031% | 22.8296 | 0.042%
Average | | | 0.0568% | | 0.106% | | 0.0572% | | 0.082%

Table 11: Standard Deviation Imputation Results for ACTM (delta % = percentage difference from the initial value)
Dataset | Initial StDev | Mean imp. | delta % | Median | delta % | Hot-deck | delta % | MI | delta %
RN1 | 4.035 | 3.772 | 6.518% | 3.772 | 6.518% | 4.028 | 0.173% | 4.02225 | 0.316%
RN2 | 4.059 | 3.796 | 6.479% | 3.796 | 6.479% | 4.035 | 0.591% | 4.048428 | 0.260%
RN3 | 4.012 | 3.759 | 6.306% | 3.76 | 6.281% | 4.004 | 0.199% | 4.014369 | 0.059%
RN4 | 3.826 | 3.547 | 7.292% | 3.552 | 7.162% | 3.742 | 2.196% | 3.774814 | 1.338%
RN5 | 4.039 | 3.779 | 6.437% | 3.779 | 6.437% | 4.014 | 0.619% | 4.039639 | 0.016%
RN6 | 4.036 | 3.802 | 5.798% | 3.803 | 5.773% | 4.054 | 0.446% | 4.04164 | 0.140%
RN7 | 4.052 | 3.808 | 6.022% | 3.808 | 6.022% | 4.054 | 0.049% | 4.054993 | 0.074%
RN8 | 3.992 | 3.765 | 5.686% | 3.765 | 5.686% | 4.019 | 0.676% | 4.014534 | 0.564%
RN9 | 4.041 | 3.786 | 6.310% | 3.787 | 6.286% | 4.038 | 0.074% | 4.033583 | 0.184%
RN10 | 4.071 | 3.789 | 6.927% | 3.79 | 6.902% | 4.026 | 1.105% | 4.041712 | 0.719%
Average | | | 6.378% | | 6.355% | | 0.613% | | 0.367%

As can be seen in Figure 6, as with the SAT and ACTC results, the mean difference for each imputation varies due to
the randomness from the random number generated

dataset. It can be seen that the peak value for median imputation occurred in RN5, with a higher mean difference than the other imputation methods. It can also be seen that multiple imputation, mean imputation, and hot-deck imputation perform comparably, although multiple imputation has three higher values than mean imputation and hot-deck imputation, occurring in RN5, RN7, and RN9.

Figure 6: ACTM Imputation Mean Comparison Results

Figure 7: ACTM Imputation Standard Deviation Comparison Results

Multiple imputation and hot-deck imputation show very small variance and standard deviation differences compared to mean and median imputation. Multiple imputation again shows better performance than the other imputation methods regarding the

variance difference: in seven out of ten imputation results, multiple imputation has the smallest standard deviation difference compared to hot-deck, mean, and median imputation.

High School Size

Tables 12 and 13 show summaries of the High School Size imputation results for the ten random number generated datasets. The multiple imputation method has the lowest average difference in both mean and standard deviation compared to the other imputation methods: it has the lowest total standard deviation difference (0.41%) and the lowest total average mean difference (0.24%). Unlike the previous variables' results, for High School Size the total average mean differences of the hot-deck and mean imputation methods (0.39% each) exceed that of multiple imputation by more than a 10% relative margin.

Table 12: Mean Imputation Results for High School Size (delta % = percentage difference from the initial value)
Dataset | Initial mean | Mean imp. | delta % | Median | delta % | Hot-deck | delta % | MI | delta %
RN1 | 275.99 | 275.13 | 0.312% | 272.15 | 1.391% | 276.77 | 0.283% | 276.398 | 0.148%
RN2 | 281.79 | 282.93 | 0.405% | 279.51 | 0.809% | 282.1 | 0.110% | 281.974 | 0.065%
RN3 | 280.72 | 282.57 | 0.659% | 280.1 | 0.221% | 281.52 | 0.285% | 281.956 | 0.440%
RN4 | 280.15 | 279.53 | 0.221% | 276.32 | 1.367% | 280.82 | 0.239% | 279.244 | 0.323%
RN5 | 278.78 | 279.23 | 0.161% | 271.63 | 2.565% | 280.48 | 0.610% | 280.23 | 0.520%
RN6 | 276.9 | 275.09 | 0.654% | 271.98 | 1.777% | 274.73 | 0.784% | 276.326 | 0.207%
RN7 | 277.01 | 276.03 | 0.354% | 273.42 | 1.296% | 275.35 | 0.599% | 276.922 | 0.032%
RN8 | 277.7 | 276.94 | 0.274% | 274.09 | 1.300% | 278.37 | 0.241% | 277.924 | 0.081%
RN9 | 276.14 | 275.44 | 0.253% | 271.98 | 1.506% | 276.4 | 0.094% | 276.270 | 0.047%
RN10 | 281.27 | 279.57 | 0.604% | 276.38 | 1.739% | 279.43 | 0.654% | 279.858 | 0.502%
Average | | | 0.39% | | 1.40% | | 0.39% | | 0.24%

Table 13: Standard Deviation Imputation Results for High School Size (delta % = percentage difference from the initial value)
Dataset | Initial StDev | Mean imp. | delta % | Median | delta % | Hot-deck | delta % | MI | delta %
RN1 | 140.97 | 126.25 | 10.44% | 126.4 | 10.34% | 142.86 | 1.34% | 141.990 | 0.72%
RN2 | 145.59 | 131.05 | 9.99% | 131.2 | 9.88% | 146.48 | 0.61% | 146.108 | 0.36%
RN3 | 143.15 | 128.9 | 9.95% | 129 | 9.88% | 143.65 | 0.35% | 144.212 | 0.74%
RN4 | 143.76 | 127.55 | 11.28% | 127.7 | 11.17% | 144.23 | 0.33% | 142.849 | 0.63%
RN5 | 171.32 | 153.16 | 10.60% | 153.9 | 10.17% | 170.5 | 0.48% | 171.7198 | 0.23%
RN6 | 141.65 | 124.79 | 11.90% | 124.94 | 11.80% | 138.79 | 2.02% | 140.915 | 0.52%
RN7 | 141.22 | 125.37 | 11.22% | 125.46 | 11.16% | 139.32 | 1.35% | 141.492 | 0.19%
RN8 | 140.46 | 124.77 | 11.17% | 124.92 | 11.06% | 141.67 | 0.86% | 140.635 | 0.12%
RN9 | 142.05 | 127.48 | 10.26% | 127.67 | 10.12% | 143.29 | 0.87% | 142.873 | 0.58%
RN10 | 142.49 | 127.12 | 10.79% | 127.33 | 10.64% | 142.02 | 0.33% | 142.556 | 0.05%
Average | | | 10.76% | | 10.62% | | 0.85% | | 0.41%

As can be seen in Figure 8, median imputation has the highest mean difference of the imputation methods except in one dataset, RN3.
It can be seen that median imputation has a higher mean difference than the other imputation methods, while multiple imputation performs better than the others owing to its lower mean difference. Again, in seven out of ten sets, the mean difference of the

multiple imputation method has lower values than those of hot-deck, mean, and median imputation.

Figure 8: High School Size Imputation Mean Comparison Results

Figure 9: High School Size Imputation Standard Deviation Comparison Results

As can be seen in Figure 9, the standard deviation differences for the imputation methods tend to show similar behavior. Multiple imputation and hot-deck imputation show very small variance and standard deviation differences compared to mean and median imputation. In Figure 9, multiple imputation and hot-deck imputation differ only slightly, except in RN6 and RN7, where hot-deck imputation has a higher variance difference than multiple imputation.

High School Rank

Tables 17 and 18 summarize the High School Rank imputation results for the 10 random-number-generated datasets. As Table 17 shows, the multiple imputation method has the lowest average mean difference of the imputation methods. Surprisingly, unlike the previous variable's results, multiple imputation outperformed the other methods only on mean difference; the lowest total standard deviation difference (0.57%) was achieved by hot-deck imputation.

Table 17: Mean Imputation Results for High School Rank

Dataset   Initial   Mean imputation       Median imputation     Hot-deck imputation   Multiple imputation
          mean      mean      delta %     mean      delta %     mean      delta %     mean       delta %
RN1       68.385    68.138    0.361%      68.208    0.259%      68.391    0.009%      68.3290    0.082%
RN2       68.453    68.594    0.206%      68.771    0.465%      68.5      0.069%      68.5378    0.124%
RN3       68.776    68.713    0.092%      68.937    0.234%      68.782    0.009%      68.7264    0.072%
RN4       68.543    68.553    0.015%      68.804    0.381%      68.589    0.067%      68.3862    0.229%
RN5       68.55     68.777    0.331%      68.922    0.543%      68.437    0.165%      68.5942    0.064%
RN6       68.174    67.861    0.459%      67.974    0.293%      67.831    0.503%      68.0838    0.132%
RN7       67.946    67.964    0.026%      68.026    0.118%      67.638    0.453%      67.8606    0.126%
RN8       68.468    68.576    0.158%      68.894    0.622%      68.368    0.146%      68.2856    0.266%
RN9       67.978    67.708    0.397%      68.077    0.146%      67.693    0.419%      67.7736    0.301%
RN10      68.339    68.379    0.059%      68.7      0.528%      68.498    0.233%      68.4236    0.124%
average                       0.210%                0.359%                0.207%                 0.152%

Table 18: Standard Deviation Imputation Results for High School Rank

Dataset   Initial   Mean imputation       Median imputation     Hot-deck imputation   Multiple imputation
          StDev     StDev     delta %     StDev     delta %     StDev     delta %     StDev      delta %
RN1       20.011    17.746    11.32%      17.744    11.33%      19.757    1.27%       19.81229   0.99%
RN2       19.667    17.572    10.65%      17.578    10.62%      19.623    0.22%       19.68115   0.07%
RN3       19.621    17.566    10.47%      17.566    10.47%      19.617    0.02%       19.60684   0.07%
RN4       19.611    17.444    11.05%      17.454    11.00%      19.651    0.20%       19.40444   1.05%
RN5       20.243    18.142    10.38%      18.148    10.35%      20.256    0.06%       20.1091    0.66%
RN6       19.856    17.706    10.83%      17.706    10.83%      19.863    0.04%       19.71822   0.69%
RN7       19.894    17.738    10.84%      17.741    10.82%      19.864    0.15%       19.77367   0.60%
RN8       19.778    17.55     11.27%      17.561    11.21%      19.526    1.27%       19.57847   1.01%
RN9       19.775    17.526    11.37%      17.536    11.32%      19.559    1.09%       19.60874   0.84%
RN10      19.728    17.486    11.36%      17.495    11.32%      19.456    1.38%       19.48139   1.25%
average                       10.95%                10.93%                0.57%                  0.73%

As can be seen in Figure 10, the mean for each imputation method varies because of the randomness in the generated datasets, and for High School Rank the trend is inconsistent. Still, median imputation has the highest average mean difference among the methods, producing the largest mean difference in six of the ten datasets. Hot-deck imputation has the three highest mean differences, in RN6, RN7 and RN9, which drives up its total average mean difference. Even so, the total average mean difference between multiple imputation and hot-deck imputation differs by only 0.055%.

Figure 10: High School Rank Imputation Mean Comparison Results

Figure 11: High School Rank Imputation Standard Deviation Comparison Results

As can be seen in Figure 11, the standard deviations for the imputation methods behave similarly. Multiple imputation and hot-deck imputation show small changes in variance and standard deviation compared to mean and median imputation. For High School Rank, hot-deck imputation has a lower standard deviation difference than the multiple imputation method.
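Hot-deck imputation, as compared above, fills each gap with a value taken from an observed donor record, which is why it preserves the variance so well in Tables 16 and 18. A minimal random-donor sketch (the donor-selection rule here is a simplifying assumption; donor matching within classes of similar records is also common):

```python
import random

def hot_deck_impute(values, seed=42):
    """Replace each None with a value drawn at random from the
    observed (non-missing) entries of the same variable."""
    rng = random.Random(seed)
    donors = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(donors) for v in values]

# Hypothetical High School Rank percentiles with two missing entries
ranks = [68.4, None, 70.1, 65.2, None, 69.0]
completed = hot_deck_impute(ranks)
# Every filled-in value is a genuinely observed value, so the spread
# of the completed variable stays close to the original.
```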

Accuracy Evaluation

Another factor considered in determining the better imputation method is accuracy, evaluated by calculating the root mean square error (RMSE). RMSE indicates how close the observed data points are to the model's predicted values and can also be interpreted as the standard deviation of the unexplained variance; a lower RMSE indicates a better fit. The RMSE results for each variable show that mean imputation and median imputation have the lowest RMSE values. Meanwhile, as expected, zero imputation has the highest RMSE value for all five imputed variables.

Table 19: RMSE for SAT Dataset

Dataset   Mean       Median     Zero       Hot deck   MI
RN1       145.8449   146.2818   1103.029   208.615    205.6632
RN2       148.0055   148.1602   1112.657   206.032    208.146
RN3       146.5739   147.0087   1110.019   205.1587   206.6825
RN4       141.4244   141.7982   1107.5     202.5508   203.3173
RN5       149.6505   150.2558   1108.685   184.9254   188.7998
RN6       144.3293   144.7058   1105.935   205.676    204.7344
RN7       146.9572   147.1231   1108.772   208.8785   205.9754
RN8       149.7519   149.9827   1107.515   207.905    209.3979
RN9       148.4060   148.6202   1109.018   207.6985   210.2003
RN10      149.5899   149.9967   1106.201   204.3395   208.8357
average   147.0534   147.3933   1107.933   204.1779   205.1753

For SAT, as can be seen in Table 19, mean imputation has the lowest average RMSE, 147.0534, followed by median imputation at 147.3933, a difference of only 0.34. Hot-deck imputation and multiple imputation also differ only slightly, by about 1.0. The gap between the mean/median pair and the hot-deck/multiple-imputation pair, however, is more than 30%.

Table 20: RMSE for ACTC Dataset

Dataset   Mean       Median     Zero       Hot deck   MI
RN1       3.644302   3.710585   23.97732   5.473296   5.344720
RN2       3.273201   3.308583   23.71060   4.629069   4.777564
RN3       3.559494   3.587604   23.72931   4.938764   5.001299
RN4       3.767871   3.782780   23.67215   4.986663   5.183174
RN5       3.398894   3.414350   23.57213   4.876550   4.762817
RN6       3.411701   3.451657   23.77464   4.817148   4.854754
RN7       3.472275   3.468831   23.45081   4.853594   4.771559
RN8       3.645583   3.675350   23.73250   5.189641   5.032933
RN9       3.657278   3.686206   23.74838   5.126874   5.039873
RN10      3.448265   3.499707   23.85488   4.922972   4.901295
average   3.527886   3.558565   23.72227   4.981457   4.966999

For ACTC, mean imputation has the lowest average RMSE, 3.528, followed by median imputation at 3.558, a difference of only about 0.03. Hot-deck imputation and multiple imputation also differ only slightly, by about 0.01. Unlike SAT, the absolute differences among mean, median, hot-deck and multiple imputation for ACTC are small. This is because ACTC has a small variance and a narrow range of imputed values, unlike SAT, which spans a much wider range.
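The RMSE used throughout this evaluation can be computed directly from the held-out true values and the values each method imputed for them; a minimal sketch with illustrative SAT-like numbers (not the thesis data):

```python
import math

def rmse(true_values, imputed_values):
    """Root mean square error between held-out true values and
    the values an imputation method filled in for them."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_values, imputed_values))
                     / len(true_values))

true_sat = [1100, 980, 1250, 1040]   # hypothetical deleted SAT values
mean_fill = [1092.5] * 4             # mean imputation: every gap gets the mean
zero_fill = [0] * 4                  # zero imputation
# Zero imputation's RMSE dwarfs mean imputation's, mirroring Table 19.
```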

Table 21: RMSE for ACTM Dataset

Dataset   Mean       Median     Zero       Hot deck   MI
RN1       4.047611   4.053687   23.00272   5.673674   5.695751
RN2       3.906588   3.907485   23.24658   5.588520   5.662923
RN3       4.022810   4.027315   23.16145   5.696562   5.635005
RN4       3.662750   3.676465   23.65739   4.840912   5.180227
RN5       4.020971   4.026297   23.10831   5.581549   5.653583
RN6       4.047098   4.055456   23.03783   5.770367   5.597270
RN7       3.929341   3.927885   23.33905   5.438366   5.528872
RN8       4.115823   4.117386   23.21386   5.562241   5.674367
RN9       3.935400   3.935963   23.24125   5.767703   5.646812
RN10      3.887522   3.906699   22.81052   5.439496   5.519812
average   3.957591   3.963464   23.18190   5.535939   5.579462

For ACTM, mean imputation again has the lowest average RMSE, 3.958, followed by median imputation at 3.963, a difference of only about 0.006. Hot-deck imputation and multiple imputation also differ only slightly, by about 0.04. As with ACTC, the absolute differences among mean, median, hot-deck and multiple imputation for ACTM are small. Similar to ACTC, this is because of the small variance and narrow range of the imputed ACTM values.

Table 22: RMSE for High School Size Dataset

Dataset   Mean       Median     Zero       Hot deck   MI
RN1       145.7964   146.2819   308.7433   208.5036   202.9797
RN2       162.9461   166.5604   327.9132   235.8396   231.2771
RN3       142.7093   143.0102   313.8998   203.3134   201.9144
RN4       147.7273   148.6765   317.2929   207.1692   205.8168
RN5       139.8328   139.9858   308.0965   198.6699   199.9979
RN6       141.1522   142.0475   309.9443   193.9123   198.6902
RN7       138.0055   138.9242   312.4548   196.3024   198.9634
RN8       134.6479   134.8777   302.7500   197.7474   194.6398
RN9       141.8623   142.5835   307.2276   207.0231   204.1218
RN10      133.6890   133.4804   299.3589   188.4526   195.9915
average   142.8369   143.6428   310.7681   203.6933   203.4392

For High School Size, as can be seen in Table 22, mean imputation again has the lowest average RMSE, 142.84, followed by median imputation at 143.64, a difference of only about 0.81. Hot-deck imputation and multiple imputation also differ only slightly, by about 0.25, with the lower value belonging to multiple imputation.

As for High School Rank (Table 23), mean imputation again has the lowest average RMSE, 19.900, followed by median imputation at 19.942, a difference of only about 0.04. Hot-deck imputation and multiple imputation differ by only about 0.08, with the lower value belonging to multiple imputation.

Table 23: RMSE for High School Rank Dataset

Dataset   Mean       Median     Zero       Hot deck   MI
RN1       19.86072   19.85057   71.17887   27.75653   28.49925
RN2       20.68130   20.71250   71.54717   28.51429   28.40525
RN3       19.98860   20.05777   70.49815   27.86728   27.83790
RN4       19.62640   19.67822   71.21997   28.49046   27.08294
RN5       20.39496   20.43671   71.35005   28.82778   28.36758
RN6       19.44133   19.42560   71.08541   27.40545   27.36615
RN7       20.04076   20.05728   70.61200   28.61733   28.24956
RN8       19.67122   19.72874   71.38076   27.47608   27.59758
RN9       19.12422   19.19393   70.26900   26.75153   27.51030
RN10      20.16606   20.28191   70.48331   28.31892   28.29455
average   19.89956   19.94232   70.96247   28.00257   27.92111

The accuracy results show that zero imputation is the poorest method. According to the RMSE results, mean imputation and median imputation achieved better accuracy than hot-deck and multiple imputation. Although mean and median imputation tend to center the distribution and shrink the variance and standard deviation, they still perform better on accuracy. Hot-deck and multiple imputation, by contrast, preserve the variance well, yet the RMSE results show they are less accurate than mean and median imputation.
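The finding that mean imputation wins on RMSE while shrinking the variance is no accident: among all constant fills, the sample mean of the held-out values minimizes squared error, so methods that add realistic variability (hot-deck donors, multiple-imputation draws) pay an RMSE penalty even as they preserve the distribution better. A small demonstration of that trade-off, with illustrative values:

```python
import math

def rmse_of_constant_fill(held_out, fill):
    """RMSE when every missing value is replaced by the same constant."""
    return math.sqrt(sum((t - fill) ** 2 for t in held_out) / len(held_out))

held_out = [22.0, 25.0, 19.0, 31.0, 24.0]   # hypothetical deleted ACT-like values
mean_fill = sum(held_out) / len(held_out)    # 24.2

# The mean beats any other single fill value on RMSE for this hold-out set.
candidates = [mean_fill, 19.0, 24.0, 31.0]
best_fill = min(candidates, key=lambda f: rmse_of_constant_fill(held_out, f))
```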

CHAPTER 5: PREDICTION MODEL

Roth (2008) used mean imputation and zero imputation to fill in missing values for her prediction model. For this research, the prediction model for freshman student retention, based on winter data, was built using hot-deck and multiple imputation to fill in the missing values. Based on the imputation results, each missing variable was filled with the method that produced the lowest mean and standard deviation differences. All variables were imputed with the multiple imputation approach except high school rank, for which the comparison showed that hot-deck imputation yielded lower standard deviation and mean differences. The model was built with linear regression, logistic regression and an alternating decision tree (AD Tree), using WEKA, MINITAB and Microsoft EXCEL software. To replicate Roth's approach, 31 variables were used to predict fall sophomore enrollment. The variables are listed in Table 24.

Table 24: Variables in predicting Fall Enrollment from Winter model

1 RACE CODE
2 SEX
3 HS SIZE
4 HSSIZE filled
5 HS PERCENTILE RANK
6 HS PERCENTILE RANK filled
7 HS GPA
8 HS GPA filled
9 STATE filled
10 COUNTY CODE
11 COUNTY CODE filled
12 ACTC
13 ACTC filled
14 ACTM
15 ACTM filled
16 SATTOTAL
17 SATTOTAL filled
18 FALL GPA
19 FALL COLLEGE
20 FALL MAJOR PROGRAM
21 FALL 2006 MAJOR CODE
22 FALL UNDECIDED
23 WINTER COLLEGE
24 WINTER MAJOR PROGRAM
25 WINTER 2007 MAJOR CODE
26 WINTER UNDECIDED
27 MAJOR CHANGE W
28 CHANGE OUT OF UNDECIDED
29 EXPECTED FAMILY CONTRIBUTION
30 FINANCIAL AID DATA HERE
31 GATEWAY

In order for MINITAB to run a linear regression model, some of the variables had to be transformed into regressible variables. In a similar model, Khajuria (2007) transformed nominal variables into a sparse array of binary variables. The models were developed using an initial 3,818 student entries and 31 variables. After the nominal variables were transformed into regressible variables, 940 input variables were entered into MINITAB and considered for inclusion in the retention prediction model. Tables 25, 26 and 27 show Roth's prediction results using only mean imputation and zero imputation.

Table 25: Roth's Predicted Fall Enrollment from Winter Alternating Decision Tree vs. Actual Fall Enrollment

                         PREDICTION
ACTUAL       Retention   Attrition   Total
Retention    3133        29          3162
Attrition    623         33          656
Total        3756        62          3818

Retention Accuracy: 83.41%
Attrition Accuracy: 53.23%
Overall Accuracy: 82.92%

Table 26: Roth's Predicted Fall Enrollment from Winter Logistic Regression vs. Actual Fall Enrollment

                         PREDICTION
ACTUAL       Retention   Attrition   Total
Retention    3020        142         3162
Attrition    601         55          656
Total        3621        197         3818

Retention Accuracy: 83.40%
Attrition Accuracy: 27.92%
Overall Accuracy: 80.54%
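The nominal-to-binary transformation applied above (31 raw variables expanding to 940 regressible inputs) amounts to one-hot encoding each categorical variable; a minimal sketch, with hypothetical category values rather than the thesis coding scheme:

```python
def one_hot_encode(records, column):
    """Expand one nominal column into 0/1 indicator columns, one per
    observed category, so a linear model can regress on it."""
    categories = sorted({r[column] for r in records})
    for r in records:
        value = r.pop(column)
        for c in categories:
            r[f"{column}={c}"] = 1 if value == c else 0
    return [f"{column}={c}" for c in categories]

# Hypothetical category values for illustration
students = [{"FALL COLLEGE": "Arts"},
            {"FALL COLLEGE": "Business"},
            {"FALL COLLEGE": "Arts"}]
new_columns = one_hot_encode(students, "FALL COLLEGE")
```

Applying this to every nominal variable, one indicator per observed category, is how a 31-variable dataset can grow to hundreds of regressible inputs.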

Table 27: Roth's Predicted Fall Enrollment from Winter Linear Regression vs. Actual Fall Enrollment for Individual Cases

                         PREDICTION
ACTUAL       Retention   Attrition   Total
Retention    3156        6           3162
Attrition    653         3           656
Total        3809        9           3818

Retention Accuracy: 82.86%
Attrition Accuracy: 33.33%
Overall Accuracy: 82.74%

However, because multiple imputation was used to fill in the missing values, the student entries were expanded fivefold, one copy per imputation replicate, growing from 3,818 to 19,090 entries. The forward selection regression model identified 824 of the expanded variables as significant indicators of retention. The prediction results of the alternating decision tree against actual fall enrollment are shown in Table 28.

Table 28: Predicted Fall Enrollment from Winter Alternating Decision Tree vs. Actual Fall Enrollment

                         PREDICTION
ACTUAL       Retention   Attrition   Total
Retention    15805       5           15810
Attrition    3270        10          3280
Total        19075       15          19090

Retention Accuracy: 82.86%
Attrition Accuracy: 66.67%
Overall Accuracy: 82.84%

From Table 28, it can be seen that the alternating decision tree built on the winter 2007 dataset predicted a student's fall 2007 retention status with 82.84% overall accuracy. The overall accuracy decreased from the previous model by 0.08%.

There were 19,075 retention predictions made, of which 82.86% were accurate. Attrition was predicted for just 15 student entries, and those predictions were accurate 66.67% of the time. This attrition result cannot be considered useful, however, since only 15 of the 19,090 total entries were predicted as attrition. The decision tree created using the winter 2007 data to predict fall 2007 enrollment is shown in Figure 12.
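The retention, attrition, and overall accuracies quoted from these tables follow directly from the confusion-matrix counts; the sketch below recomputes them from Table 28's counts (the cell naming is ours: the first letter is the actual class, the second the predicted class):

```python
def confusion_accuracies(rr, ra, ar, aa):
    """Per-class prediction accuracy and overall accuracy from 2x2 counts:
    rr = actual retention predicted as retention,
    ra = actual retention predicted as attrition,
    ar = actual attrition predicted as retention,
    aa = actual attrition predicted as attrition."""
    retention_acc = rr / (rr + ar)   # correct among retention predictions
    attrition_acc = aa / (ra + aa)   # correct among attrition predictions
    overall_acc = (rr + aa) / (rr + ra + ar + aa)
    return retention_acc, attrition_acc, overall_acc

# Counts from the winter-to-fall alternating decision tree (Table 28)
ret_acc, att_acc, overall = confusion_accuracies(15805, 5, 3270, 10)
# → about 82.86%, 66.67%, and 82.84%, matching the reported figures
```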

Figure 12: Winter Alternating Decision Tree Predicting Fall Enrollment