Missing Data Imputation Method Comparison in Ohio University Student Retention Database


A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Master of Science

Dyah A. Hening
November 2009

Dyah A. Hening. All Rights Reserved

This thesis titled
Missing Data Imputation Method Comparison in Ohio University Student Retention Database

by
DYAH A. HENING

has been approved for the Department of Industrial and Systems Engineering and the Russ College of Engineering and Technology by

David A. Koonce
Associate Professor of Industrial and Systems Engineering

Dennis Irwin
Dean, Russ College of Engineering and Technology

ABSTRACT

HENING, DYAH A., M.S., November 2009, Industrial and Systems Engineering

Missing Data Imputation Method Comparison in Ohio University Student Retention Database (74 pp.)

Director of Thesis: David A. Koonce

Ohio University has been conducting research on first-year student retention to prevent dropouts (OU Office of Institutional Research, First-Year Students Retention, 2008). Yet the data sets have more than 20% missing values, which can lead to bias in prediction. Missing data affect the ability to generalize results to the target population. This study categorizes the missing data in each variable as one of three types: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). After the type of missing data is identified, the proper method of handling it is discussed. The proposed method is validated through developed and tested models. The goal of this work is to explore methods of imputing missing data and to apply them to the Ohio University student retention dataset.

Approved: David A. Koonce, Associate Professor of Industrial and Systems Engineering

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
    Problem Statement
CHAPTER 2: BACKGROUND
    Literature Review
    Missing Values Categories
    Methods of Handling Missing Data
        Data deletion
        Single imputation
        Multiple imputation
CHAPTER 3: METHODOLOGY
    Summary of the Original Dataset
    Data Cleaning
    Summary of Imputation Method Procedure
CHAPTER 4: IMPUTATION METHODS COMPARISON
    Random Number Generation
    Data Deletion
    Imputation Result
        SAT
        ACTC
        ACTM
        High School Size
        High School Rank
    Accuracy Evaluation
CHAPTER 5: PREDICTION MODEL
CHAPTER 6: CONCLUSION AND DISCUSSION
    Recommendations for Future Research
REFERENCES

LIST OF TABLES

Table 1: Rubin's Multiple Imputation Efficiency
Table 2: Retention Dataset Variables (Roth, 2008)
Table 3: Variables and Number of Students with Null Values
Table 4: Original Dataset Distribution Analysis
Table 5: ANOVA Test Results for Equal Variance
Table 6: Mean Imputation Results for SAT
Table 7: Standard Deviation Imputation Results for SAT
Table 8: Mean Imputation Results for ACTC
Table 9: Standard Deviation Imputation Results for ACTC
Table 10: Mean Imputation Results for ACTM
Table 11: Standard Deviation Imputation Results for ACTM
Table 12: Mean Imputation Results for High School Size
Table 13: Standard Deviation Imputation Results for High School Size
Table 14: Mean Imputation Results for High School Rank
Table 15: Standard Deviation Imputation Results for High School Rank
Table 16: RMSE for SAT
Table 17: RMSE for ACTC
Table 18: RMSE for ACTM
Table 19: RMSE for High School Size
Table 20: RMSE for High School Size
Table 21: Variables in predicting Fall Enrollment from Winter model
Table 22: Roth's Predicted Fall Enrollment from Winter Alternating Decision Tree vs. Actual Fall Enrollment
Table 23: Roth's Predicted Fall Enrollment from Winter Logistic Regression vs. Actual Fall Enrollment
Table 24: Roth's Predicted Fall Enrollment from Winter Linear Regression vs. Actual Fall Enrollment for Individual Cases
Table 25: Predicted Fall Enrollment from Winter Alternating Decision Tree vs. Actual Fall Enrollment
Table 26: Predicted Fall Enrollment from Winter Logistic Regression vs. Actual Fall Enrollment
Table 27: Predicted Fall Enrollment from Winter Linear Regression vs. Actual Fall Enrollment for Individual Cases

LIST OF FIGURES

Figure 1: OU Overall First-Year Student Retention
Figure 2: SAT Imputation Mean Comparison Results
Figure 3: SAT Imputation Standard Deviation Comparison Results
Figure 4: ACTC Imputation Mean Comparison Results
Figure 5: ACTC Imputation Standard Deviation Comparison Results
Figure 6: ACTM Imputation Mean Comparison Results
Figure 7: ACTM Imputation Standard Deviation Comparison Results
Figure 8: High School Size Imputation Mean Comparison Results
Figure 9: High School Size Imputation Standard Deviation Comparison Results
Figure 10: High School Rank Imputation Mean Comparison Results
Figure 11: High School Rank Imputation Standard Deviation Comparison Results
Figure 12: Winter Alternating Decision Tree Predicting Fall Enrollment

CHAPTER 1: INTRODUCTION

Quality is very important for many organizations as one of the parameters of success. Total Quality Management (TQM) is one of the quality control tools that initiate continuous improvement, and it has been implemented by both for-profit organizations and higher education institutions. Yet the implementation of TQM in non-profit organizations has had very little impact, especially on education quality in higher education institutions (Koch, 2003). This limited impact is caused by the special challenges and difficulties that higher education institutions face as non-profit, service-oriented organizations.

The most important phase of TQM implementation is customer identification. In higher education institutions, students are considered customers (Sirvanci, 2004). Customer identification leads to tailoring systems for customer satisfaction, which has been the focus of TQM implementation.

Aware of the importance of students as their customers, many universities are striving to provide quality education by putting serious effort into preventing student dropouts. One such institution is Ohio University (OU), which has been conducting research on first-year student retention in an effort to improve its quality of service to the student body. Still, the student retention rate at OU has been declining over the last six years, as can be seen in Figure 1 (OU Office, 2008).

Figure 1: OU Overall first-year student retention.

Several research studies related to predicting student retention have been conducted. This research has led to theoretical models of student retention and has highlighted the significant factors affecting it. Tinto (1975) synthesized his theory from social psychology and education economics to describe the interaction between individuals and college institutions, and his model led to the development of an attrition model. This model involves a cost versus benefit analysis and includes factors about students' family backgrounds, various individual characteristics, past educational experiences, and goal commitment levels.

Problem Statement

In previous research, Roth (2008) developed a model to predict student retention based on datasets of OU student behavior. However, large numbers of values are missing, especially for attributes relevant to the prediction model. Roth used simple mean value imputation to fill in missing values. It can be easily shown that mean value imputation changes the distribution of a given variable. It needs to be pointed

out here that most predictive analysis techniques do not include a specific method of handling missing data; hence, it is necessary to address this problem. Current literature contains many proposed methods for dealing with missing data; yet these techniques are not applicable to every situation and condition of missing values.

The purpose of this research is to develop a better understanding of how to handle missing data, especially when developing models to predict student attrition from OU student retention datasets. A comparison of missing data imputation methods will provide evidence of the most appropriate method for filling in data used to predict student attrition.
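The statement in the problem statement that mean value imputation changes a variable's distribution can be made concrete with a short derivation. This is an illustrative sketch rather than a result taken from Roth (2008); the symbols $n$, $n_{obs}$, $\bar{x}_{obs}$, and $s^2_{obs}$ are notation assumed here. Suppose $n_{obs}$ of $n$ values are observed, with sample mean $\bar{x}_{obs}$ and sample variance $s^2_{obs}$, and each of the $n - n_{obs}$ missing values is replaced by $\bar{x}_{obs}$. Then

$$
\bar{x}_{imp} = \frac{n_{obs}\,\bar{x}_{obs} + (n - n_{obs})\,\bar{x}_{obs}}{n} = \bar{x}_{obs},
\qquad
s^2_{imp} = \frac{\sum_{i \in obs} (x_i - \bar{x}_{obs})^2 + 0}{n - 1} = \frac{n_{obs} - 1}{n - 1}\, s^2_{obs}.
$$

The mean is preserved, but every imputed value contributes zero to the sum of squared deviations, so the variance shrinks by the factor $(n_{obs} - 1)/(n - 1)$. As a rough illustration, if 1847 of 4061 values were filled with the observed mean (the SAT Total counts reported in the methodology), the factor would be $2213/4060 \approx 0.545$, and the standard deviation would fall to roughly 74% of its observed value.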

CHAPTER 2: BACKGROUND

Literature Review

Missing data are a nuisance for statistical analysis. The main threat for institutional research is that missing data can undermine a study's internal validity. In addition, missing data may affect a study's external validity and limit its generalizability across a target population. Therefore, it is important to explore and identify ways to deal with missing data. According to Cohen et al. (2003), even when investigators employ conventionally appropriate strategies for coping with missing data, different approaches may lead to significantly different conclusions.

The following section briefly discusses the commonly used missing data imputation methods applied in this research. First, an overview of missing value categories is introduced. Next, the methods of handling missing values are discussed: data deletion, single imputation, and multiple imputation.

To address missing data appropriately, it is helpful to understand the types and characteristics of the missing data. The most common reason for missing data is non-response to items, which, according to Umbach (2005), can stem from a variety of causes. For instance, errors that emerge during coding or data entry, respondents' inability to answer the survey questions, and limitations of the study design can all lead to non-response (Umbach, 2005).

Missing Values Categories

Gelman and Hill (2007) posit several reasons data may be missing. They group missing data into four types: missing completely at random (MCAR), missing at random (MAR), missing that depends on unobserved predictors, and missing that depends on the missing value itself. Missing values that depend on unobserved predictors and missing values that depend on the missing value itself can also be considered missing not at random (MNAR). These categories are meant to identify the characteristics of the data that will be missing, not the missing value itself.

MCAR occurs when all values of a variable have the same probability of being missing. In other words, the values in the dataset are missing at random and there is no reason why a specific value is missing. An example would be a respondent who decides whether to answer a certain survey question by rolling a die and basing the decision on the number rolled. Non-response is quite common in sample surveys, and in Gelman and Hill's study the mechanism of non-response is assumed to be MCAR (for a more detailed explanation, see Rubin, 1987). When missing data are MCAR, no specific clue can be derived from the other responses as to what the missing value should be.

MAR, or missing at random, can be considered semi-MCAR. It occurs when the probability that a value is missing depends only on other observed data, so that units with the same observed values have the same probability of being missing. What distinguishes MAR from MCAR is that with MAR the missingness can be predicted from other available data. When data are MAR, omitting cases with missing data is acceptable only when the analysis controls for the variables that affect the probability of missingness; otherwise the inferences can be biased.

The last type of missing data is missing not at random. MNAR can be subcategorized into missing values that depend on unobserved predictors and missing values that depend on the missing value itself. In these cases the likelihood of a value being missing depends on some unobserved value. A good example comes from medical studies: when a particular treatment causes discomfort to a patient, the likelihood of that patient dropping out increases (Rubin, 1987).

Another consideration that Rubin (1987) has built into his classification is whether the missing data are ignorable. By ignorable he means that the whole variable can be omitted or disregarded in model building. In cases of MAR, the ignorable missing data mechanism occurs when variables are less important or less related to the model than other variables. This assumption has the same underlying philosophy as the causal framework, in which something can be ignored if sufficient evidence and information have already been gathered. So, in these cases, weakly correlated variables can be omitted. For example, suppose we want to predict someone's athletic capability or performance. The variable of favorite color would most likely not be related to the prediction model; therefore, excluding this variable will probably have little negative effect on the prediction model's accuracy.

The three main categories of missing data are MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random). Having examined the primary characteristics of missing data, methods for handling missing data will be discussed next.
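The three mechanisms described above can be sketched with a small simulation. This is a minimal illustration under assumed settings, not part of the thesis analysis: the function name `simulate_missingness`, the covariate `x`, the variable `y`, and the missingness probabilities (0.3, 0.6, 0.1) are all hypothetical choices made for the example.

```python
import random
import statistics

def simulate_missingness(mechanism, n=40000, seed=0):
    """Simulate one mechanism and return means of y for all vs. observed units."""
    rng = random.Random(seed)
    rows = []  # each row is (x, y, is_observed)
    for _ in range(n):
        x = rng.gauss(0, 1)        # covariate that is always observed
        y = x + rng.gauss(0, 1)    # variable that may go missing
        if mechanism == "MCAR":
            p_missing = 0.3                     # identical for every unit
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1   # depends only on observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1   # depends on the missing value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        rows.append((x, y, rng.random() >= p_missing))

    def mean_y(keep):
        return statistics.fmean([y for x, y, obs in rows if keep(x, obs)])

    return {
        "all": mean_y(lambda x, obs: True),
        "observed": mean_y(lambda x, obs: obs),
        "all_x_pos": mean_y(lambda x, obs: x > 0),
        "observed_x_pos": mean_y(lambda x, obs: obs and x > 0),
    }
```

Under MCAR the observed mean of `y` tracks the full-data mean. Under MAR the overall observed mean is biased, but within a stratum of the observed covariate (here `x > 0`) the observed and full-data means agree, which is why the missingness can be handled using other available data. Under MNAR the two disagree even within the stratum.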

Methods of Handling Missing Data

Data deletion

The simplest mechanism for handling a missing value is to discard the observation that contains it. However, with high-dimensional data sets (common in data mining), a significant portion of the observations may have missing values. In addition, the discarding-data approach can lead to biased estimation and to larger standard errors due to the reduced sample size. According to Gelman and Hill (2007), the discarding-data approach can be divided into three categories: complete-case analysis, available-case analysis, and non-response weighting.

Complete-case analysis refers to excluding any observation with missing values in either the input or the output data. This method can bias the analysis when the units with missing values differ systematically from the completely observed cases. Consider a case in which study participants are less likely to report their weight if they are obese. Deleting all observations with missing values would then bias the study towards non-obese participants, as they are more likely to provide that data.

To maintain the size of the data set, and possibly remove the bias from missing values, the missing data can be modeled and the missing values imputed. Yet there are still difficulties in identifying whether the missing data are really missing at random (MAR) or whether they depend on unobserved predictors or on the missing values themselves (MNAR). According to McKnight et al. (2007), the assumption regarding whether the missing data are MAR or MNAR is crucial to determining the best course of action. Since determining whether the data are MNAR is a subjective task, researchers

should check other studies, conduct follow-up surveys, interview participants, or remeasure the sample units. It is important to note that while checking other references is the approach of choice in solving this matter, not every study can provide good references.

Single imputation

A simple method for supplying missing values is single imputation. There are three types of single imputation, based on the types of values: constant, randomly selected, and non-randomly derived values (McKnight et al., 2007). Constant substitution refers to replacing the missing values with constant values, as in mean substitution (either the arithmetic mean or the estimated mean of the population), median substitution, or zero imputation. Random imputation, which uses random values, has two major variants: hot-deck and cold-deck imputation. Non-random imputations are values derived from regression, conditional imputation, or data previously recorded from a subject. Despite the different types of single-variable imputation, these methods all have something in common: they assume that the standard error of the estimate is low.

Constant substitution. Because of its ease and simplicity, the most used type of single imputation for supplying missing values is constant replacement (McKnight et al., 2007). One constant replacement method is mean imputation, which consists of simply filling in the missing values with the mean of the observed values. However, this method is less desirable because it tends to underrepresent extreme values, which biases the analysis by

yielding a variable with greater central tendency than should be expected. This invalidates the estimates of variance and covariance, affecting the internal validity of the work.

Another single imputation method is ML estimated mean substitution, which is based on the maximum likelihood (ML) algorithm. This method slightly improves on the traditional mean imputation method with regard to its sensitivity to outlier values. The method draws on the assumption that the data are normally distributed. Although ML substitution provides an estimated mean of the population (µ) instead of the sample mean, it is still considered a less desirable method because substantial deviations from the assumed normal distribution lead to poor estimation.

The third constant replacement method, median substitution, is used when the data are not normally distributed, in which case the curve can be skewed, flat, or peaked and cannot be represented well by the mean. Median imputation tends to produce larger standard errors, which is not optimal for avoiding type I error. However, compared to the two previous constant imputation methods, it is better at reducing type II error (McKnight et al., 2007).

The last common type of constant replacement method is replacing the missing values with a value of 0 based on logical rules. If the missing data happen to be in the outcome variable and the probability of the predictors fully depends on recorded variables, then the missing values can be modeled by adding another parameter with a value of 0 or 1. The added parameter has the value 1 for recorded data and 0 for missing data. For example, in the Ohio University student retention datasets, one data

element is the cumulative GPA. If the value of the current GPA is missing, the rule allows a substitution from the previous quarter's GPA. If no GPAs were recorded in the previous quarters, the cumulative GPA value is set to zero.

Random imputation. Random imputation of a single variable is needed when more than a small fraction of the data has missing values. Random imputation involves replacing the missing values with randomly generated values. These values can come from the available values in the current dataset, known as hot-deck imputation, or from similar datasets containing matching variables, known as cold-deck imputation. In random imputation, suitable values for replacing the missing values are generated based on the available data.

According to McKnight et al. (2007), there are different strategies for hot-deck imputation. The first strategy is simple random imputation, which imputes the missing value of any variable with randomized values based on the available data. If the missing data are MCAR, then there is no method for defining the missing value. Thus, if the observable values occur in the same proportion as in the sampled population, supplying missing values from this predicted population will not introduce bias to the variable. This approach is considered a good starting point for preliminary data analysis. The second strategy is hot-deck within adjustment cells, that is, blocking on the relevant covariates and imputing the missing data from randomly generated values of the available data within each block. Yet another approach uses the nearest neighbor's value to replace the data. This method imputes the missing value from the available case that most closely matches on the chosen criteria. For

example, if the ethnicity of a participant is missing from a group with a similar ethnicity, the missing value will be imputed with the particular ethnicity of that group.

Matching and hot-deck imputation replace each missing value (y) with a value from a unit with similar predictor values (x) in the observed data. Matching can become challenging when the matching vectors need to be built from a small amount of available data. To solve this problem, random imputation from the five closest resolved cases or from other available information can be used. One can also predict the missing values based on several other variables that are fully observed; the predicted data can then be matched and imputed into the datasets. The most common problem that arises from this method is that it underestimates the standard errors due to the decreased variability. This is caused by the missing data being imputed with values that already exist in the dataset.

According to Seastorm, Kaufman, & Lee (2002), hot-deck imputation preserves the distribution of the original data and increases the variance compared to mean imputation. Consequently, according to Mundform & Whitcomb (1998), the estimate of the prediction accuracy would be too dependent on the randomly selected value, because it varies from one selected value to another. In their research, Mundform & Whitcomb ran 1000 repetitions of hot-deck imputation and averaged the 1000 results for each of 99 entries to obtain the values used in their research.

Cold-deck imputation is similar to hot-deck imputation, but another data set or sample is used to impute the data. Although the purpose of this method is to solve the problem that occurs in hot-deck imputation, it may still increase the probability of type I error due to the small standard error (McKnight et al., 2007).
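The constant and random single imputation methods discussed above can be compared in a short sketch. It is a simplified illustration under stated assumptions, not the procedure used in the thesis: `impute` and `spread_after_imputation` are hypothetical helper names, the synthetic scores (normal with mean 1000 and standard deviation 150) and the 30% MCAR removal rate are invented for the example, and only the simple random hot-deck strategy is shown.

```python
import random
import statistics

def impute(values, method, rng=None):
    """Return a copy of `values` with None entries filled by one single-imputation method."""
    observed = [v for v in values if v is not None]
    if method == "hot_deck":
        # Simple random hot-deck: draw each replacement from the observed values.
        rng = rng or random.Random(0)
        return [rng.choice(observed) if v is None else v for v in values]
    if method == "mean":
        fill = statistics.fmean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    elif method == "zero":
        fill = 0.0
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]

def spread_after_imputation(n=2000, missing_rate=0.3, seed=42):
    """Compare the standard deviation of the full data with each imputed version."""
    rng = random.Random(seed)
    full = [rng.gauss(1000, 150) for _ in range(n)]  # hypothetical scores
    # Remove values completely at random (MCAR).
    holed = [None if rng.random() < missing_rate else v for v in full]
    result = {"original": statistics.stdev(full)}
    for method in ("mean", "median", "zero", "hot_deck"):
        result[method] = statistics.stdev(impute(holed, method, rng))
    return result
```

Calling `spread_after_imputation()` returns the standard deviation of the complete data next to that of each imputed version. Mean and median imputation shrink the spread, zero imputation distorts it, and hot-deck imputation stays close to the original, which matches the qualitative behavior described in the text.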

Nonrandom imputation. Nonrandom imputation can be divided into single-condition and multiple-condition methods. Single-condition methods consist of conditional mean, last value carried forward, and next value carried backward. Multiple-condition methods are used when more than one variable is needed to provide more information for each missing value.

Conditional mean imputation is based on a single condition and uses a classification variable to estimate the mean that substitutes for the missing values. It emphasizes the relationship between the classification variable and the missing data. If the relationship is weak, the mean imputation resembles the method used in hot-deck imputation.

Last value carried forward (LVCF) replaces the missing value with the previous available value from the same subject or research participant within a certain time period. This is based on the assumption that the most recent available observation is the best guess for subsequent missing values. To use this method, a prior observed value for the observation must be available. For example, if the academic record of a student is missing the GPA for a term, we would substitute the GPA of the most recent term. Next value carried backward (NVCB) uses a process similar to LVCF, in which missing values in early observations are filled in with the next available value. The use of these methods is limited to a subject's own data observed continuously over a certain time period.

Multiple-condition nonrandom value imputation uses regression and error. A better result can be produced if the value for the missing variable can be predicted with a

regression against the observed cases. Random regression imputation uses a regression model to predict the missing values. This strategy accounts for uncertainty by adding the prediction error into the regression.

Overall, although single imputation methods can be easily implemented, several of their weaknesses can distort the variable's distribution. This distortion can lead to underestimation of the standard deviation, which in turn results in underestimation of the standard errors and thus in increased type I errors.

Multiple imputation

Multiple imputation is a method of supplying multiple values for a missing value. Multiple values can be generated using Markov Chain Monte Carlo (MCMC) simulation (McKnight et al., 2007). MCMC uses computer simulation of Markov chains whose asymptotic distribution is the posterior distribution of the statistical inference problem (Muller, 2003). The imputed values can be analyzed for mean and variation. These statistics can then be used to derive expected values and associated confidence intervals.

Two common methods of multiple imputation using MCMC-derived Bayesian estimated values are routine multivariate imputation and iterative regression imputation. In routine multivariate imputation, a fitted multivariate model is built using all the variables containing missing values. The predictors (x) and the outcome (y) are considered vectors. This method has some difficulties, one of them being that much effort is required to set up a reasonable multivariate regression model. The t-distribution or multivariate normal distribution is commonly used for continuous outcomes, while the

multinomial distribution is used for discrete outcomes. According to Rubin (1978), the efficiency of an estimate (relative efficiency in %) based on m imputations is given by:

$$ RE = \left(1 + \frac{\gamma}{m}\right)^{-1} \tag{1} $$

where γ = rate of missing information for the estimated quantity. The multiple imputation efficiencies for various values of m and γ are shown in Table 1.

Table 1: Rubin's Multiple Imputation Efficiency (relative efficiency by rate of missing information γ and number of imputations m)

According to Rubin (1987), if the rate of missing information is not very high, there is little advantage in producing and analyzing more than a few imputed datasets. Because of the missing data, the variation in results across the imputed data sets reflects statistical uncertainty. Rubin estimated the rate of missing information in order to provide a diagnostic measure for the multiple imputation procedure that indicates how strongly the estimated quantity is influenced by missing data. The estimated rate of missing information (γ) is

$$ \gamma = \frac{r + 2/(\nu + 3)}{r + 1} \tag{2} $$

where

$$ r = \frac{(1 + m^{-1})\,B}{\bar{U}} \tag{3} $$

is the relative increase in variance due to nonresponse, $B$ is the between-imputation variance, $\bar{U}$ is the within-imputation variance (both defined below), and $\nu$ is the degrees of freedom of the multiple imputation inference. The rate of missing information (γ) and the number of imputations m determine the relative efficiency of the MI inference (Rubin, 1987).

Multiple imputation has three steps: imputation, routine analysis, and parameter estimation from the results. The first step, the imputation process, is similar to single imputation. What makes multiple imputation different from single imputation is that there is no necessary restriction on which single imputation procedure to use (McKnight, McKnight, Sidana, & Fiqueredo, 2007). The values may be imputed using random normal values, hot-deck values, or MCMC-derived Bayesian estimated values. However, it is recommended to use only one single imputation method within a multiple imputation, since each single imputation method yields different results.

The second step is to analyze the complete data sets produced by imputation. The literature shows no specific preference for any type of statistical analysis to be performed on the multiple imputation datasets. The analyses used in this research are means, standard deviations, and variances.

Following the statistical analysis, there are several steps in the parameter estimation to compute the overall standard errors. First, within-imputation variance must be computed, which is the mean of the standard errors related to all the parameters of

interest in the statistical model. Each parameter estimate is referred to as $\hat{Q}_i$. The within-imputation variance, referred to as $\bar{U}$, is the average of the standard errors or variances over the imputations. $\bar{U}$ represents the variability of the standard error or variance that is calculated within each of the imputations.

The next step is to compute the between-imputation variance, referred to as B in Rubin's (1987) nomenclature. The formula for calculating B is given by Equation 4.

$$ B = \frac{1}{m - 1} \sum_{i=1}^{m} \left(\hat{Q}_i - \bar{Q}\right)^2 \tag{4} $$

Between-imputation variance is the sum of the squared deviations of each estimate $\hat{Q}_i$ from the mean estimate $\bar{Q}$, divided by the number of imputed data sets minus 1. Next, the total variance is the sum of the within-imputation variance $\bar{U}$ and the between-imputation variance B. Yet, according to Rubin, the between-imputation variance needs to be weighted according to the number of imputations performed. Thus, the total variance (T) is calculated by the following equation.

$$ T = \bar{U} + \left(1 + \frac{1}{m}\right) B \tag{5} $$
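The parameter estimation steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the software used in the thesis: `pool_estimates` and `relative_efficiency` are hypothetical names, the within-imputation variance is taken as the mean of the per-imputation variances (squared standard errors), and the relative efficiency uses the standard form of Rubin's expression, 1 / (1 + γ/m).

```python
import statistics

def pool_estimates(estimates, variances):
    """Combine m per-imputation estimates and their variances into pooled results."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance, divides by m - 1
    total = within + (1 + 1 / m) * between     # total variance T
    return {"estimate": q_bar, "within": within, "between": between, "total": total}

def relative_efficiency(gamma, m):
    """Relative efficiency of m imputations when the rate of missing information is gamma."""
    return 1.0 / (1.0 + gamma / m)
```

For instance, `pool_estimates([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])` pools three imputed-data estimates that share a variance of 4; the between-imputation term then carries the extra uncertainty introduced by the missing values. The square root of the total variance gives the overall standard error.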

CHAPTER 3: METHODOLOGY

The purpose of this research is to develop a protocol for determining how to handle missing data, especially in the student retention dataset. The dataset used in this research is the 2006 freshman class, provided by the Ohio University Office of Institutional Research. The following sections discuss the contents of the dataset, the procedures to clean it by identifying the missing values, and a summary of the procedure for comparing imputation methods.

Summary of the Original Dataset

In order to conduct this research, a dataset containing admissions and involvement data from the 2006 freshman class was retrieved from the Ohio University Office of Institutional Research. This dataset was used by Roth (2008) to create a model predicting first-year Ohio University student attrition. The original data were retrieved from four sources: student applications to Ohio University, the Student Information System (SIS), the students' financial aid records, and the student involvement survey carried out by the Office of Institutional Research. After unification, the dataset contains 66 variables for 4061 students.

Roth has already created a list of the variables included in the original dataset, which can be seen in Table 2. The table includes each variable's description, the variable type, the source of the data, and when in the school year timeline the variable is available.

Table 2: Retention Dataset Variables (Roth, 2008)


Table 2: (continued)

Key: Bin = Binary, Dec = Decimal, Int = Integer, Nom = Nominal, A = Application, S = SIS, F = Financial Aid Record, I = Involvement Survey

As can be seen in Table 2, the original data were retrieved from four sources: student applications to Ohio University, the Student Information System (SIS), the students' financial aid records, and the student involvement survey carried out by the Office of Institutional Research. Student applications contain students' demographics,

high school information, and standardized test scores. SIS, the second source, is a software program that Ohio University uses to manage student information. It provides students' registration information and their academic information from the past to the present. The third source is the students' financial aid records. These variables were entered into university databases through each student's Free Application for Federal Student Aid, or FAFSA. The last source was a student involvement survey conducted by the Office of Institutional Research. This survey was conducted at the end of Winter quarter and provides information on students' attitudes and behaviors related to social and academic involvement in their first year at Ohio University.

Data Cleaning

The dataset from Institutional Research described above was received as one dataset that combined the information from all four sources. Yet the dataset contained a large amount of missing data. Roth (2008) took several steps to clean the data, keeping the valid data in preparation for the data modeling used to predict student attrition. Her second step after data cleaning was data imputation. Several simple imputation methods, such as mean and zero imputation, were utilized. Complete-case analysis, the elimination of any entry with a missing data point, was not utilized because of the biases that the reduced sample size can create (Gelman & Hill, 2007). In this research, the comparison of data imputation techniques is the main focus, with the aim of finding the best imputation technique for building the prediction model. Table 3 shows a summary of

the variables containing missing data points and the number of students missing information in each category.

Table 6: Variables and Number of Students with Null Values
Columns: Variable; # of Students; % of missing out of 4061 observations; Comments
HS Size: taken
HS Percentile Rank: taken
HS GPA: ignored
State: ignored
County Code: ignored
ACT Composite: taken
ACT Math: taken
SAT Total: taken
Expected Family Contribution: ignored
Involvement Survey Variables (28 Variables): ignored

After the variables with missing values were identified, the number of student entries was analyzed. Comparing student enrollment statuses in the winter, spring and sophomore fall quarters revealed stopout behavior, which reduced the number of student entries. A stopout student is one who demonstrates non-permanent attrition behavior, that is, who drops out for one quarter only to return in a following quarter (Roth, 2008). Since the purpose of the model is to predict permanent attrition, 22 stopout students were removed from the dataset, resulting in a dataset of 4,039 entries for the model.

Summary of Imputation Method Procedure

For the imputation methods comparison, five variables with the largest number of missing values from the original data were chosen. As can be seen in Table 3, Expected

Family Contribution has one of the highest rates of missing information, with a total of 1219 missing values. This variable is considered missing not at random because the information is missing when students did not fill out a FAFSA. This variable was therefore excluded from this research. Another variable with many missing values that was excluded from the imputation method comparison is the Involvement Survey, with 815 missing values. The involvement survey was conducted at the end of Winter Quarter, which can be considered too late for any typical retention intervention, since interventions usually take place at the time of fall quarter pre-registration. The rest of the variables with missing data were then chosen based on data available from the beginning of the fall quarter. The five variables with missing data chosen for the research are: SAT Total (1847 missing values), High School Percentile Rank (830 missing values), High School Size (829 missing values), ACT Composite (488 missing values) and ACT Math (488 missing values).

For each of these variables, a new dataset of 10 replications with similar distributional characteristics was randomly generated. For each of these sets, values were removed according to MCAR. For each variable, the missing values were imputed with each of five different methods: mean imputation, median imputation, zero value imputation, hot-deck imputation and multiple imputation. After imputing the missing values, each set was analyzed for accuracy in the imputed values. First, each imputed value was compared to the removed value. Then the mean and variance of each imputed variable set were compared to the original to determine whether the mean or variance had been affected.
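The single-value imputation methods named above can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function name `impute`, the use of `None` to mark a missing value, the sample scores, and the simple random-donor form of hot-deck imputation are all assumptions made for illustration. Multiple imputation is left out because it requires a model of the data rather than a single fill rule.

```python
import random
import statistics

def impute(values, method, rng):
    """Return a copy of `values` with each None replaced according to `method`."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        constant = statistics.mean(observed)
    elif method == "median":
        constant = statistics.median(observed)
    elif method == "zero":
        constant = 0.0
    elif method == "hot_deck":
        constant = None  # no single fill value; each gap gets its own donor
    else:
        raise ValueError(f"unknown method: {method}")

    filled = []
    for v in values:
        if v is not None:
            filled.append(v)
        elif method == "hot_deck":
            # Assumed variant: copy the value of a randomly chosen observed case.
            filled.append(rng.choice(observed))
        else:
            filled.append(constant)
    return filled

rng = random.Random(0)
scores = [21, None, 25, 30, None, 18, 27]  # hypothetical ACT-like scores
filled = {m: impute(scores, m, rng) for m in ("mean", "median", "zero", "hot_deck")}
```

Mean, median and zero imputation put the same value in every gap, which is why they compress the spread of the variable, whereas the hot-deck variant draws a different donor for each gap.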

Based on the comparison results, for each variable in the Ohio University student retention dataset, domain knowledge is used to classify the reasons for the missing values, and the best imputation method is implemented. Finally, a prediction model of student retention was built. This model is then compared with Roth's (2008) model for prediction accuracy.

CHAPTER 4: IMPUTATION METHODS COMPARISON

In this research, a new dataset with similar distributional characteristics was generated randomly with 10 replications. Before a dataset can be generated, the distributional characteristics of the five variables need to be determined. MINITAB was used to fit the distribution of each variable; the results can be seen in Table 4.

Table 7: Original Dataset Distribution Analysis
Variable: MINITAB distribution
SAT: Lognormal
ACTC: Lognormal
ACTM: Beta
HsSize: 3-parameter Lognormal
HSRank: 3-parameter Weibull

Random Number Generation

After fitting the distributions, random numbers for each variable were generated using MINITAB. This random number dataset was tested to see whether it was valid for the research. The first test, ANOVA, checked for statistical differences between the original dataset and the replicated random number generated dataset. The purpose of this test is to see whether the datasets differ in their means, which ANOVA assesses by comparing variances. ARENA was also considered for this research, but the dataset that ARENA generated did not pass the first test. Most of the ARENA-generated values compared to the original data had p-values less than 0.05, except for ACTM. The null hypothesis H0 of equal variance was rejected at the alpha level of 0.05.
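The fit-then-generate step described above can be sketched in Python for the two-parameter lognormal case. This is only an illustration under stated assumptions: the thesis used MINITAB, whose fitting procedure is not reproduced here, and the function names, the maximum-likelihood fit on log values, and the stand-in "original" data are all hypothetical.

```python
import math
import random
import statistics

def fit_lognormal(values):
    """Estimate lognormal parameters as the mean and SD of the log values."""
    logs = [math.log(v) for v in values]
    # The maximum-likelihood estimate uses the population SD of the logs.
    return statistics.mean(logs), statistics.pstdev(logs)

def generate_replications(mu, sigma, n, reps, seed=0):
    """Draw `reps` independent datasets of size `n` from the fitted lognormal."""
    rng = random.Random(seed)
    return [[rng.lognormvariate(mu, sigma) for _ in range(n)] for _ in range(reps)]

# Stand-in for an observed variable (assumption: not the thesis data).
source_rng = random.Random(42)
original = [source_rng.lognormvariate(7.0, 0.15) for _ in range(2000)]

mu_hat, sigma_hat = fit_lognormal(original)
replications = generate_replications(mu_hat, sigma_hat, n=len(original), reps=10)
```

Each replication can then be checked against the original data before it is accepted for the comparison.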

At the beginning of the research, ten replications were decided upon. The number of replications was then increased to 100, and the 10 with the highest Bartlett's test p-values for equal variance were chosen. This was done in MINITAB. The result of the ANOVA test for equal variance with the 10 highest p-values for each variable can be seen in Table 5.

Table 8: ANOVA Test Results for Equal Variance
Columns: Random number set; Bartlett test p-values for SAT, ACTC, ACTM, HsSize, HsRank
Rows: ten RNM sets [...]

Bartlett's test (Snedecor and Cochran, 1983) is used to test whether k samples have equal variances. Since the characteristic distributions of the variables are non-normal, Bartlett's test was used because of its sensitivity to departures from the normal distribution for the variables. The null hypothesis (H0) is that the original data have the same variance as the random number generated dataset. The confidence level used is α = [...]. The result in Table 5 shows that the null hypothesis is not rejected. This means that these random number datasets show no statistically significant difference from the original available data.
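For readers who want to see what the equal-variance check computes, the following is a minimal two-sample version of Bartlett's test in Python. It is a sketch, not the MINITAB procedure used in the thesis: the function name and the sample data are assumptions, and the p-value shortcut through `math.erfc` is valid only for two groups, where the test statistic is compared against a chi-square distribution with one degree of freedom.

```python
import math
import random
import statistics

def bartlett_two_sample(a, b):
    """Bartlett's test statistic and p-value for equal variances of two samples."""
    groups = [a, b]
    k = len(groups)
    sizes = [len(g) for g in groups]
    variances = [statistics.variance(g) for g in groups]
    total = sum(sizes)

    # Pooled variance weighted by each group's degrees of freedom.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total - k)
    numerator = (total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (total - k)) / (3 * (k - 1))
    statistic = max(numerator / correction, 0.0)  # guard against tiny negative rounding

    # Chi-square survival function with 1 degree of freedom (k = 2 only).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

rng = random.Random(7)
sample = [rng.gauss(0, 1) for _ in range(500)]
shifted = [x + 10 for x in sample]              # same spread, different mean
wider = [rng.gauss(0, 3) for _ in range(500)]   # three times the spread

stat_same, p_same = bartlett_two_sample(sample, shifted)
stat_diff, p_diff = bartlett_two_sample(sample, wider)
```

A large p-value, as for `sample` against `shifted`, means the hypothesis of equal variances is not rejected; a shift in the mean alone does not affect the test.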

Data Deletion

After the ten sets of random numbers were verified, the next step was data deletion according to the three characteristics of missing data: MCAR, MAR and MNAR. The characteristics were determined based on the reason why the data are missing. All the variables with missing data contain discrete data, except high school rank. The classification of each variable's characteristics is discussed next.

The missing data in the ACTC, ACTM and SAT scores are considered MAR because of the reason each value is missing. Because high school students can choose to take the SAT, the ACT, or both, the probability of missing values is quite high. Because many students likely take only one of the tests, the information for the test that was not taken is missing from the original dataset. The probabilities of SAT and ACT scores being missing do not depend on each other, and whether a test was taken does not depend on the score itself. The reason a student took only one test, or both, is unclear; the student's choice of test cannot be determined.

The missing values for high school rank and high school size have the characteristics of MCAR, or Missing Completely at Random. Because not all high school information was provided for the dataset, and the missing values in high school rank and high school size are not related to any variables in the dataset, they are considered MCAR.

After identifying the characteristic of each variable, the data were randomly deleted from each newly generated dataset according to the number of missing values. For SAT, the total deleted from each newly generated dataset was [...]. The total deleted

values for other variables were: 830 values for High School Percentile Rank, 829 values for High School Size, and 488 values for both ACT Composite and ACT Math. In this research, the values were removed from the complete generated dataset using the MCAR method. MCAR was used because the variables are treated as independent, given the lack of information on whether their missingness is MAR or MNAR. Since the missingness of the values was independent for each variable, the MCAR method was considered appropriate for the deletion.

Imputation Result

After the values were deleted, the imputation methods were applied. The tables show a summary of each imputation result for each random number set. The two statistics compared to the complete dataset are the mean and the standard deviation. The tables for each variable report the percentage differences between the mean and standard deviation of the complete dataset and those of the imputed dataset. Because of the random factor, the numbers can vary unexpectedly from one dataset to another. To avoid bias in the analysis, the total average difference was calculated across the random number generated datasets. A small percentage difference means that the imputation method can be accepted as appropriate. It can be seen in the tables that zero imputation has a very large percentage difference for all variables. Thus, zero imputation is excluded from the graphical summary to illustrate the results effectively and to avoid misinterpretation.
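The deletion and comparison steps above can be sketched as follows. The sketch assumes a normally distributed stand-in variable of 4,061 values rather than the fitted distributions used in the thesis, and the helper names `delete_mcar` and `pct_diff` are hypothetical. Under MCAR every position has the same chance of being deleted, regardless of its value.

```python
import random
import statistics

def delete_mcar(values, n_missing, rng):
    """Replace `n_missing` uniformly chosen positions with None (MCAR deletion)."""
    missing = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing else v for i, v in enumerate(values)]

def pct_diff(complete_stat, imputed_stat):
    """Absolute percentage difference of a statistic relative to the complete data."""
    return 100 * abs(imputed_stat - complete_stat) / abs(complete_stat)

rng = random.Random(3)
complete = [rng.gauss(300, 120) for _ in range(4061)]  # stand-in data (assumption)
with_gaps = delete_mcar(complete, 829, rng)            # 829 as for High School Size

# Mean imputation: every gap receives the mean of the remaining values.
observed_mean = statistics.mean(v for v in with_gaps if v is not None)
imputed = [observed_mean if v is None else v for v in with_gaps]

mean_delta = pct_diff(statistics.mean(complete), statistics.mean(imputed))
sd_delta = pct_diff(statistics.stdev(complete), statistics.stdev(imputed))
```

In this setup the mean barely moves while the standard deviation drops by roughly a tenth, which is the pattern the tables report for mean imputation.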

SAT

Tables 6 and 7 show a summary of the SAT imputation results for the 10 sets. In the mean comparison, as expected, mean imputation has the lowest total percentage difference from the initial values. The second lowest is multiple imputation (0.205%), followed by hot-deck imputation (0.265%). Although mean imputation has the lowest total average mean difference, it has a large standard deviation percentage difference. Multiple imputation and hot-deck imputation are superior to mean and median imputation at preserving the standard deviation. Multiple imputation has the lowest percentage difference for standard deviation, with an average difference of 1.61%.

Table 9: Mean Imputation Results for SAT
Columns: Dataset (initial value); mean and delta % for each of mean imputation, median imputation, zero imputation, hot-deck imputation, and MI
Rows: RN1 to RN10 [...]
Average delta %: mean imputation 0.140%; median imputation 0.306%; zero imputation [...]; hot-deck imputation 0.265%; MI 0.205%

Table 10: Standard Deviation Imputation Results for SAT
Columns: Dataset (initial value); StDev and delta % for each of mean imputation, median imputation, zero imputation, hot-deck imputation, and MI
Rows: RN1 to RN10 [...]
Average delta %: mean imputation 25.43%; median imputation 25.33%; zero imputation [...]; hot-deck imputation 2.28%; MI 1.61%

In Figure 2, the mean difference for each imputation method varies because of the randomness of the generated datasets. With mean imputation, small differences between the imputed mean and the complete dataset mean are expected. This is indicated by the low values that mean imputation has throughout the generated datasets, except in random number generated dataset 3 (RN3), which has a high mean difference compared to the other datasets. Yet the overall performance of mean imputation, measured against the initial dataset mean, is superior to the other imputation methods.
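The large standard deviation difference reported for mean imputation can be anticipated from the number of deleted values alone. Every imputed value equals the observed mean and so adds nothing to the sum of squared deviations, while the divisor grows from n_obs - 1 to n - 1; the sample standard deviation therefore shrinks by the factor sqrt((n_obs - 1) / (n - 1)) relative to the observed values. The sketch below evaluates this factor, assuming each generated dataset holds 4,061 values (the number of students in Table 6; the thesis does not state the generated dataset size explicitly). The function name is hypothetical.

```python
import math

def mean_imputation_sd_ratio(n_total, n_missing):
    """SD after mean imputation divided by the SD of the observed values."""
    n_observed = n_total - n_missing
    return math.sqrt((n_observed - 1) / (n_total - 1))

N_TOTAL = 4061  # assumed size of each generated dataset

# Numbers of deleted values as stated in the text.
missing_counts = {
    "SAT Total": 1847,
    "HS Percentile Rank": 830,
    "HS Size": 829,
    "ACT Composite": 488,
    "ACT Math": 488,
}

# Expected percentage reduction in standard deviation for each variable.
shrink_pct = {
    name: 100 * (1 - mean_imputation_sd_ratio(N_TOTAL, m))
    for name, m in missing_counts.items()
}
```

Under these assumptions the expected reductions are about 26% for SAT, about 11% for the two high school variables, and about 6% for the two ACT variables. These are close to the reported averages of 25.43%, 10.95%, 10.76%, 6.32% and 6.378%, which are measured against the complete dataset rather than the observed values.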

Figure 2: SAT Imputation Mean Comparison Results (series: Mean, Median, Hot Deck, MI; datasets RN1 to RN10)
Figure 3: SAT Imputation Standard Deviation Comparison Results (series: Mean, Median, Hot Deck, MI; datasets RN1 to RN10)

Mean imputation and median imputation behave similarly in the standard deviation comparison. This can be seen in Figure 3, where mean and median imputation perform poorly compared to hot-deck imputation and multiple imputation.

ACTC

Tables 8 and 9 show a summary of the ACTC imputation results for the 10 random number generated datasets. In Table 9, it can be seen that multiple imputation has the lowest average difference in standard deviation compared to the other imputation methods. In the mean imputation result, as expected, the mean

imputation method has the lowest percentage difference. Yet, the difference between the total mean average for mean imputation and multiple imputation is only 0.011%. Multiple imputation has the second lowest total average mean difference, 0.084%. The third lowest is hot-deck imputation, with a total difference of 0.093%. Multiple imputation still outperformed the other imputation methods in standard deviation difference, with the lowest total standard deviation difference (0.41%).

Table 11: Mean Imputation Results for ACTC
Columns: Dataset (initial value); mean and delta % for each of mean imputation, median imputation, hot-deck imputation, and MI
Rows: RN1 to RN10 [...]
Average delta %: mean imputation 0.073%; median imputation 0.279%; hot-deck imputation 0.093%; MI 0.084%

Table 12: Standard Deviation Imputation Results for ACTC
Columns: Dataset (initial value); StDev and delta % for each of mean imputation, median imputation, hot-deck imputation, and MI
Rows: RN1 to RN10 [...]
Average delta %: mean imputation 6.32%; median imputation 6.23%; hot-deck imputation 0.54%; MI 0.41%

Figure 4: ACTC Imputation Mean Comparison Results (series: Mean, Median, Hot Deck, MI; datasets RN1 to RN10)

As can be seen in Figure 4, the mean difference for each imputation method varies because of the randomness of the generated datasets. It can be seen that median imputation has the highest mean difference compared to the other methods.

Figure 5: ACTC Imputation Standard Deviation Comparison Results (series: Mean, Median, Hot Deck, MI; datasets RN1 to RN10)

As can be seen in Figure 5, the standard deviation differences for all imputations tend to show similar behavior. Multiple imputation and hot-deck imputation have very low standard deviation differences compared to mean and median imputation. Figure 5 shows that in seven out of ten datasets, multiple imputation has the smallest standard deviation difference compared to hot-deck, mean, and median imputation.

ACTM

Tables 10 and 11 show summaries of the ACTM imputation results for the ten random number generated datasets. As with the results for ACTC, it can be seen that multiple imputation has the lowest average difference in standard deviation compared to the other imputation methods. Again, mean imputation, as expected, has the lowest percentage difference in the mean comparison results. Yet, the hot-deck imputation method has the same low percentage difference as the mean imputation method, which is [...]%. Multiple imputation has the third lowest total average mean difference, 0.082%. Multiple imputation still outperformed the other

imputation methods in standard deviation difference, with the lowest total standard deviation difference (0.367%).

Table 13: Mean Imputation Results for ACTM
Columns: Dataset (initial value); mean and delta % for each of mean imputation, median imputation, hot-deck imputation, and MI
Rows: RN1 to RN10 [...]
Average delta %: mean imputation [...]; median imputation 0.106%; hot-deck imputation [...]; MI 0.082%

Table 14: Standard Deviation Imputation Results for ACTM
Columns: Dataset (initial value); StDev and delta % for each of mean imputation, median imputation, hot-deck imputation, and MI
Rows: RN1 to RN10 [...]
Average delta %: mean imputation 6.378%; median imputation 6.355%; hot-deck imputation 0.613%; MI 0.367%

As can be seen in Figure 6, as with the results for SAT and ACTC, the mean difference for each imputation method varies because of the randomness of the random number generated

dataset. It can be seen that the peak value of median imputation occurred in set 5 and is a higher mean difference than that of the other imputation methods. It can also be seen that multiple imputation, mean imputation and hot-deck imputation perform comparably. Yet, multiple imputation has three values higher than those of mean imputation and hot-deck imputation, occurring in RN5, RN7 and RN[...].

Figure 6: ACTM Imputation Mean Comparison Results (series: Mean, Median, Hot Deck, MI; datasets RN1 to RN10)
Figure 7: ACTM Imputation Standard Deviation Comparison Results (series: Mean, Median, Hot Deck, MI; datasets RN1 to RN10)

Multiple imputation and hot-deck imputation have very small variance and standard deviation differences compared to mean and median imputation. Multiple imputation, again, shows a better performance than the other imputation methods regarding the

variance difference. In seven out of ten datasets, multiple imputation has the smallest standard deviation difference compared to hot-deck, mean, and median imputation.

High School Size

Tables 12 and 13 show summaries of the High School Size imputation results for the ten random number generated datasets. It can be seen that multiple imputation has the lowest average difference in both mean and standard deviation compared to the other imputation methods. Multiple imputation outperformed the other imputation methods in standard deviation difference, with the lowest total standard deviation difference (0.41%). The total average mean difference of multiple imputation is 0.24%, which is also the lowest among the imputation methods. Unlike the results for the previous variables, for High School Size the total average mean difference of multiple imputation differs by more than 10% from those of the hot-deck imputation and mean imputation methods.

Table 15: Mean Imputation Results for High School Size
Columns: Dataset (initial value); mean and delta % for each of mean imputation, median imputation, hot-deck imputation, and MI
Rows: RN1 to RN10 [...]
Average delta %: mean imputation 0.39%; median imputation 1.40%; hot-deck imputation 0.39%; MI 0.24%

Table 16: Standard Deviation Imputation Results for High School Size
Columns: Dataset (initial value); StDev and delta % for each of mean imputation, median imputation, hot-deck imputation, and MI
Rows: RN1 to RN10 [...]
Average delta %: mean imputation 10.76%; median imputation 10.62%; hot-deck imputation 0.85%; MI 0.41%

As can be seen in Figure 8, median imputation has the highest mean difference among the imputation methods in every dataset except one, RN3. Multiple imputation performed better than the other imputation methods, with a lower mean difference. Again, in seven out of ten sets, the mean of the

multiple imputation method has lower difference values than do hot-deck imputation, mean imputation and median imputation.

Figure 8: High School Size Imputation Mean Comparison Results (series: Mean, Median, Hot Deck, MI; datasets RN1 to RN10)
Figure 9: High School Size Imputation Standard Deviation Comparison Results (series: Mean, Median, Hot Deck, MI; datasets RN1 to RN10)

As can be seen in Figure 9, the standard deviation difference for each imputation method tends to show similar behavior across datasets. Multiple imputation and hot-deck imputation have very small variance and standard deviation differences compared to mean and median imputation. In Figure 9, it can be seen that multiple imputation and hot-deck imputation differ only slightly, except in RN6 and RN7, where hot-deck imputation has a higher variance difference than multiple imputation.

High School Rank

Tables 14 and 15 show summaries of the High School Rank imputation results for the 10 random number generated datasets. In Table 14, it can be seen that multiple imputation has the lowest average difference in mean compared to the other imputation methods. Surprisingly, unlike the previous variable comparison results, multiple imputation outperformed the other imputation methods only in mean difference; the lowest total standard deviation difference (0.57%) was achieved by hot-deck imputation.

Table 17: Mean Imputation Results for High School Rank
Columns: Dataset (initial value); mean and delta % for each of mean imputation, median imputation, hot-deck imputation, and MI
Rows: RN1 to RN10 [...]
Average delta %: mean imputation 0.210%; median imputation 0.359%; hot-deck imputation 0.207%; MI 0.152%

Table 18: Standard Deviation Imputation Results for High School Rank
Columns: Dataset (initial value); StDev and delta % for each of mean imputation, median imputation, hot-deck imputation, and MI
Rows: RN1 to RN10 [...]
Average delta %: mean imputation 10.95%; median imputation 10.93%; hot-deck imputation 0.57%; MI 0.73%

As can be seen in Figure 10, the mean difference for each imputation method varies because of the randomness of the generated datasets. For High School Rank, the trend is inconsistent. Yet, median imputation still has the highest average mean difference among the imputation methods. In six out of ten datasets, median imputation has the highest mean difference. Hot-deck imputation has its three highest mean differences in RN6, RN7 and RN9, which raises its total average mean difference. Yet, the difference in total average percentage between multiple imputation and hot-deck imputation is only 0.055%.

Figure 10: High School Rank Imputation Mean Comparison Results (series: Mean, Median, Hot Deck, MI; datasets RN1 to RN10)
Figure 11: High School Rank Imputation Standard Deviation Comparison Results (series: Mean, Median, Hot Deck, MI; datasets RN1 to RN10)

As can be seen in Figure 11, the standard deviation difference for each imputation method tends to show similar behavior across datasets. Multiple imputation and hot-deck imputation have small variance and standard deviation differences compared to mean and median imputation. For High School Rank, hot-deck imputation has a lower difference value than the multiple imputation method.

Accuracy Evaluation

Another factor considered in determining the better imputation method is accuracy. This is evaluated by calculating the root mean square error, or RMSE. RMSE indicates how close the observed data points are to the model's predicted values. RMSE can also be interpreted as the standard deviation of the unexplained variance. A better fit is indicated by a lower RMSE value. The RMSE results for each variable indicate that mean imputation and median imputation have the lowest RMSE values. Meanwhile, because of its high standard deviation, zero imputation has, as expected, the highest RMSE value for all five imputed variables.

Table 19: RMSE for SAT
Columns: Dataset; RMSE for mean, median, zero, hot deck, and MI
Rows: RN1 to RN10 and average [...]

For SAT, as can be seen in Table 16, mean imputation has the lowest average RMSE, which is [...], followed by median imputation with an RMSE of [...]. The RMSE values of SAT for mean and median imputation have only a slight difference, of [...]

values of difference. On the other hand, hot-deck imputation and multiple imputation also differ only slightly, by 1.0. Yet the RMSE values of mean and median imputation differ from those of hot-deck and multiple imputation by more than 30%.

Table 20: RMSE for ACTC
Columns: Dataset; RMSE for mean, median, zero, hot deck, and MI
Rows: RN1 to RN10 and average [...]

For ACTC, mean imputation has the lowest average RMSE, which is 3.528, followed by median imputation with an RMSE of [...]. The RMSE values of ACTC for mean and median imputation differ only slightly, by [...]. On the other hand, hot-deck imputation and multiple imputation also differ only slightly, by [...]. Unlike SAT, the RMSE results for ACTC show differences among mean, median, hot-deck and multiple imputation of less than 10%. This is caused by the small variance and range of the imputed ACTC values, unlike SAT, which has a wider range of values for imputation.

Table 21: RMSE for ACTM
Columns: Dataset; RMSE for mean, median, zero, hot deck, and MI
Rows: RN1 to RN10 and average [...]

For ACTM, mean imputation still has the lowest average RMSE, which is 3.958, followed by median imputation with an RMSE of [...]. The RMSE values of ACTM for mean and median imputation differ only slightly, by [...]. On the other hand, hot-deck and multiple imputation also differ only slightly, by [...]. As with ACTC, the RMSE results for ACTM show only a slight difference among mean, median, hot-deck and multiple imputation, which is less than 10%. Similar, again, to ACTC, this is caused by the small variance and range of the imputed ACTM values.

Table 22: RMSE for High School Size
Columns: Dataset; RMSE for mean, median, zero, hot deck, and MI
Rows: RN1 to RN10 and average [...]

For High School Size, as can be seen in Table 19, mean imputation still has the lowest average RMSE, which is [...], followed by median imputation with an RMSE of [...]. The RMSE values of High School Size for mean and median imputation differ only slightly, by [...]. On the other hand, hot-deck imputation and multiple imputation also differ only slightly, by 0.5, with the lower value for multiple imputation.

As for High School Rank, mean imputation still has the lowest average RMSE, which is [...], followed by median imputation with an RMSE of [...]. The RMSE values of High School Rank for mean and median imputation differ only slightly, by 0.05. On the other hand, hot-deck imputation and multiple imputation differ by only 0.08, with the lower value for multiple imputation.

Table 23: RMSE for High School Rank
Columns: Dataset; RMSE for mean, median, zero, hot deck, and MI
Rows: RN1 to RN10 and average [...]

The accuracy results show that zero imputation is the poorest method. According to the RMSE results, mean imputation and median imputation performed with better accuracy than hot-deck and multiple imputation. Although mean and median imputation tend to center the distribution and decrease the variance and standard deviation, they still perform better for accuracy. Hot-deck and multiple imputation preserve the variance better, yet the RMSE results show that they are less accurate than mean and median imputation.
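The RMSE comparison, and one plausible reason mean imputation scores better on it, can be sketched in Python. The sketch rests on assumptions that are not in the thesis: a normally distributed stand-in variable, the simple random-donor form of hot-deck imputation, and hypothetical names throughout. When a removed value and its imputed replacement are independent draws from the same distribution, the expected squared error is about twice the variance, whereas filling with the mean gives an expected squared error of about one variance, so the RMSE ratio should be near the square root of 2.

```python
import math
import random
import statistics

def rmse(true_values, imputed_values):
    """Root mean square error between removed values and their imputed replacements."""
    squared = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))

rng = random.Random(1)
complete = [rng.gauss(1100, 150) for _ in range(4000)]  # stand-in data (assumption)

# MCAR deletion of 1,000 values; keep the removed values for scoring.
removed_idx = rng.sample(range(len(complete)), 1000)
removed_set = set(removed_idx)
observed = [v for i, v in enumerate(complete) if i not in removed_set]
removed = [complete[i] for i in removed_idx]

mean_fill = [statistics.mean(observed)] * len(removed)
hot_deck_fill = [rng.choice(observed) for _ in removed]  # assumed random-donor variant

rmse_mean = rmse(removed, mean_fill)
rmse_hot_deck = rmse(removed, hot_deck_fill)
ratio = rmse_hot_deck / rmse_mean
```

A ratio near 1.4 is in line with the gap of more than 30% reported for SAT, and it shows why RMSE on individual values and preservation of the standard deviation can rank the methods in opposite order.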

CHAPTER 5: PREDICTION MODEL

Roth (2008) used mean imputation and zero imputation to fill in missing values for her prediction model. For this research, the prediction model for freshman student retention, based on winter data, was built using hot-deck and multiple imputation to fill in the missing values. Based on the imputation results, each variable's missing values were imputed with the method that gave low differences in both mean and standard deviation. All the variables were imputed using the multiple imputation approach except high school rank, because the standard deviation comparison results showed that hot-deck imputation has a lower standard deviation difference and a low mean difference as well. The model was built with linear regression, logistic regression and AD Tree, using WEKA, MINITAB and Microsoft EXCEL software. To replicate Roth's approach, 31 variables were used to predict fall sophomore enrollment. The variables are listed in Table 21.

Table 24: Variables in predicting Fall Enrollment from Winter model
1 RACE CODE
2 SEX
3 HS SIZE
4 HSSIZE filled
5 HS PERCENTILE RANK
6 HS PERCENTILE RANK filled
7 HS GPA
8 HS GPA filled
9 STATE filled
10 COUNTY CODE
11 COUNTY CODE filled
12 ACTC
13 ACTC filled
14 ACTM
15 ACTM filled
16 SATTOTAL
17 SATTOTAL filled
18 FALL GPA
19 FALL COLLEGE
20 FALL MAJOR PROGRAM
21 FALL 2006 MAJOR CODE
22 FALL UNDECIDED
23 WINTER COLLEGE
24 WINTER MAJOR PROGRAM
25 WINTER 2007 MAJOR CODE
26 WINTER UNDECIDED
27 MAJOR CHANGE W
28 CHANGE OUT OF UNDECIDED
29 EXPECTED FAMILY CONTRIBUTION
30 FINANCIAL AID DATA HERE
31 GATEWAY

In order for MINITAB to run a linear regression model, some of the variables needed to be transformed into regressible variables. In a similar model, Khajuria (2007)

transformed nominal variables into a sparse array of binary variables. The models were developed using an initial 3,818 student entries and 31 variables. After the nominal variables were transformed into regressible variables, 940 input variables were entered into MINITAB and considered for inclusion in the retention prediction model. Tables 22, 23, and 24 show Roth's prediction results, which used only mean imputation and zero imputation.

Table 25: Roth's Predicted Fall Enrollment from Winter Alternating Decision Tree vs. Actual Fall Enrollment
Cross-tabulation of ACTUAL against PREDICTION (Retention, Attrition, Total) [...]
Retention Accuracy: 83.41%
Attrition Accuracy: 53.23%
Overall Accuracy: 82.92%

Table 26: Roth's Predicted Fall Enrollment from Winter Logistic Regression vs. Actual Fall Enrollment
Cross-tabulation of ACTUAL against PREDICTION (Retention, Attrition, Total) [...]
Retention Accuracy: 83.40%
Attrition Accuracy: 27.92%
Overall Accuracy: 80.54%
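The transformation of a nominal variable into binary indicator variables, which is how 31 variables can grow into several hundred regression inputs, can be sketched as follows. The function name, the `is_` column prefix, and the college codes are hypothetical; the thesis does not describe the exact encoding used by Khajuria (2007) or Roth (2008).

```python
def one_hot(values):
    """Expand a nominal variable into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {
        f"is_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

# Hypothetical college codes for five students.
fall_college = ["ENG", "BUS", "ENG", "A&S", "BUS"]
indicators = one_hot(fall_college)
```

A variable with many distinct codes, such as a major code or county code, contributes one column per code, and most entries in those columns are zero, hence the description as a sparse array.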

Table 27: Roth's Predicted Fall Enrollment from Winter Linear Regression vs. Actual Fall Enrollment for Individual Cases
Cross-tabulation of ACTUAL against PREDICTION (Retention, Attrition, Total) [...]
Retention Accuracy: 82.86%
Attrition Accuracy: 33.33%
Overall Accuracy: 82.74%

However, since multiple imputation was used to fill in the missing values, the number of student entries was multiplied by five, because the multiple imputation was replicated five times. The student entries expanded from 3,818 to 19,090. The forward selection regression model identified a total of 824 expanded variables as significant indicators of retention. The prediction result of the alternating decision tree against the actual fall enrollment is shown in Table 22.

Table 28: Predicted Fall Enrollment from Winter Alternating Decision Tree vs. Actual Fall Enrollment
Cross-tabulation of ACTUAL against PREDICTION (Retention, Attrition, Total) [...]
Retention Accuracy: 82.86%
Attrition Accuracy: 66.67%
Overall Accuracy: 82.84%

From Table 25, it can be seen that the alternating decision tree using the winter 2007 dataset was able to predict a student's retention status in fall of 2007 with 82.84% overall accuracy. The overall accuracy decreased from the previous model by 0.08%.

There were 19,075 retention predictions made, and 82.86% of them were accurate. Attrition was predicted for just 15 student entries, and those predictions were accurate 66.6% of the time. Yet, this prediction result cannot be considered useful, since it predicts attrition for only 15 out of a total of 19,090 entries. The decision tree created using the winter 2007 data to predict fall 2007 enrollment is shown in Figure 12.
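The three accuracy figures can be recomputed from prediction counts with a short Python sketch. The totals of 19,075 retention predictions and 15 attrition predictions come from the text; the numbers of correct predictions (15,805 and 10) are back-calculated from the reported percentages and are therefore assumptions, as is the function name.

```python
def prediction_accuracies(retention_correct, retention_predicted,
                          attrition_correct, attrition_predicted):
    """Percent of correct predictions per predicted class and overall."""
    retention_acc = 100 * retention_correct / retention_predicted
    attrition_acc = 100 * attrition_correct / attrition_predicted
    overall_acc = 100 * (retention_correct + attrition_correct) / (
        retention_predicted + attrition_predicted
    )
    return retention_acc, attrition_acc, overall_acc

# Totals from the text; correct counts are back-calculated assumptions.
retention_acc, attrition_acc, overall_acc = prediction_accuracies(
    retention_correct=15805, retention_predicted=19075,
    attrition_correct=10, attrition_predicted=15,
)
```

Because only 15 of 19,090 entries are predicted as attrition, the overall accuracy is almost entirely determined by the retention predictions, which is why a high overall figure says little about the model's ability to flag students at risk.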

Figure 12: Winter Alternating Decision Tree Predicting Fall Enrollment


More information

Flexible Imputation of Missing Data

Flexible Imputation of Missing Data Chapman & Hall/CRC Interdisciplinary Statistics Series Flexible Imputation of Missing Data Stef van Buuren TNO Leiden, The Netherlands University of Utrecht The Netherlands crc pness Taylor &l Francis

More information

Online Appendix to. Are Two heads Better Than One: Team versus Individual Play in Signaling Games. David C. Cooper and John H.

Online Appendix to. Are Two heads Better Than One: Team versus Individual Play in Signaling Games. David C. Cooper and John H. Online Appendix to Are Two heads Better Than One: Team versus Individual Play in Signaling Games David C. Cooper and John H. Kagel This appendix contains a discussion of the robustness of the regression

More information

Imputation of multivariate continuous data with non-ignorable missingness

Imputation of multivariate continuous data with non-ignorable missingness Imputation of multivariate continuous data with non-ignorable missingness Thais Paiva Jerry Reiter Department of Statistical Science Duke University NCRN Meeting Spring 2014 May 23, 2014 Thais Paiva, Jerry

More information

IT 403 Project Beer Advocate Analysis

IT 403 Project Beer Advocate Analysis 1. Exploratory Data Analysis (EDA) IT 403 Project Beer Advocate Analysis Beer Advocate is a membership-based reviews website where members rank different beers based on a wide number of categories. The

More information

Emerging Local Food Systems in the Caribbean and Southern USA July 6, 2014

Emerging Local Food Systems in the Caribbean and Southern USA July 6, 2014 Consumers attitudes toward consumption of two different types of juice beverages based on country of origin (local vs. imported) Presented at Emerging Local Food Systems in the Caribbean and Southern USA

More information

Relation between Grape Wine Quality and Related Physicochemical Indexes

Relation between Grape Wine Quality and Related Physicochemical Indexes Research Journal of Applied Sciences, Engineering and Technology 5(4): 557-5577, 013 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 013 Submitted: October 1, 01 Accepted: December 03,

More information

IMSI Annual Business Meeting Amherst, Massachusetts October 26, 2008

IMSI Annual Business Meeting Amherst, Massachusetts October 26, 2008 Consumer Research to Support a Standardized Grading System for Pure Maple Syrup Presented to: IMSI Annual Business Meeting Amherst, Massachusetts October 26, 2008 Objectives The objectives for the study

More information

MBA 503 Final Project Guidelines and Rubric

MBA 503 Final Project Guidelines and Rubric MBA 503 Final Project Guidelines and Rubric Overview There are two summative assessments for this course. For your first assessment, you will be objectively assessed by your completion of a series of MyAccountingLab

More information

Buying Filberts On a Sample Basis

Buying Filberts On a Sample Basis E 55 m ^7q Buying Filberts On a Sample Basis Special Report 279 September 1969 Cooperative Extension Service c, 789/0 ite IP") 0, i mi 1910 S R e, `g,,ttsoliktill:torvti EARs srin ITQ, E,6

More information

Labor Supply of Married Couples in the Formal and Informal Sectors in Thailand

Labor Supply of Married Couples in the Formal and Informal Sectors in Thailand Southeast Asian Journal of Economics 2(2), December 2014: 77-102 Labor Supply of Married Couples in the Formal and Informal Sectors in Thailand Chairat Aemkulwat 1 Faculty of Economics, Chulalongkorn University

More information

EFFECT OF TOMATO GENETIC VARIATION ON LYE PEELING EFFICACY TOMATO SOLUTIONS JIM AND ADAM DICK SUMMARY

EFFECT OF TOMATO GENETIC VARIATION ON LYE PEELING EFFICACY TOMATO SOLUTIONS JIM AND ADAM DICK SUMMARY EFFECT OF TOMATO GENETIC VARIATION ON LYE PEELING EFFICACY TOMATO SOLUTIONS JIM AND ADAM DICK 2013 SUMMARY Several breeding lines and hybrids were peeled in an 18% lye solution using an exposure time of

More information

wine 1 wine 2 wine 3 person person person person person

wine 1 wine 2 wine 3 person person person person person 1. A trendy wine bar set up an experiment to evaluate the quality of 3 different wines. Five fine connoisseurs of wine were asked to taste each of the wine and give it a rating between 0 and 10. The order

More information

STA Module 6 The Normal Distribution

STA Module 6 The Normal Distribution STA 2023 Module 6 The Normal Distribution Learning Objectives 1. Explain what it means for a variable to be normally distributed or approximately normally distributed. 2. Explain the meaning of the parameters

More information

STA Module 6 The Normal Distribution. Learning Objectives. Examples of Normal Curves

STA Module 6 The Normal Distribution. Learning Objectives. Examples of Normal Curves STA 2023 Module 6 The Normal Distribution Learning Objectives 1. Explain what it means for a variable to be normally distributed or approximately normally distributed. 2. Explain the meaning of the parameters

More information

Relationships Among Wine Prices, Ratings, Advertising, and Production: Examining a Giffen Good

Relationships Among Wine Prices, Ratings, Advertising, and Production: Examining a Giffen Good Relationships Among Wine Prices, Ratings, Advertising, and Production: Examining a Giffen Good Carol Miu Massachusetts Institute of Technology Abstract It has become increasingly popular for statistics

More information

Flexible Working Arrangements, Collaboration, ICT and Innovation

Flexible Working Arrangements, Collaboration, ICT and Innovation Flexible Working Arrangements, Collaboration, ICT and Innovation A Panel Data Analysis Cristian Rotaru and Franklin Soriano Analytical Services Unit Economic Measurement Group (EMG) Workshop, Sydney 28-29

More information

Method for the imputation of the earnings variable in the Belgian LFS

Method for the imputation of the earnings variable in the Belgian LFS Method for the imputation of the earnings variable in the Belgian LFS Workshop on LFS methodology, Madrid 2012, May 10-11 Astrid Depickere, Anja Termote, Pieter Vermeulen Outline 1. Introduction 2. Imputation

More information

DETERMINANTS OF DINER RESPONSE TO ORIENTAL CUISINE IN SPECIALITY RESTAURANTS AND SELECTED CLASSIFIED HOTELS IN NAIROBI COUNTY, KENYA

DETERMINANTS OF DINER RESPONSE TO ORIENTAL CUISINE IN SPECIALITY RESTAURANTS AND SELECTED CLASSIFIED HOTELS IN NAIROBI COUNTY, KENYA DETERMINANTS OF DINER RESPONSE TO ORIENTAL CUISINE IN SPECIALITY RESTAURANTS AND SELECTED CLASSIFIED HOTELS IN NAIROBI COUNTY, KENYA NYAKIRA NORAH EILEEN (B.ED ARTS) T 129/12132/2009 A RESEACH PROPOSAL

More information

Napa County Planning Commission Board Agenda Letter

Napa County Planning Commission Board Agenda Letter Agenda Date: 7/1/2015 Agenda Placement: 10A Continued From: May 20, 2015 Napa County Planning Commission Board Agenda Letter TO: FROM: Napa County Planning Commission John McDowell for David Morrison -

More information

The Market Potential for Exporting Bottled Wine to Mainland China (PRC)

The Market Potential for Exporting Bottled Wine to Mainland China (PRC) The Market Potential for Exporting Bottled Wine to Mainland China (PRC) The Machine Learning Element Data Reimagined SCOPE OF THE ANALYSIS This analysis was undertaken on behalf of a California company

More information

Appendix A. Table A.1: Logit Estimates for Elasticities

Appendix A. Table A.1: Logit Estimates for Elasticities Estimates from historical sales data Appendix A Table A.1. reports the estimates from the discrete choice model for the historical sales data. Table A.1: Logit Estimates for Elasticities Dependent Variable:

More information

An application of cumulative prospect theory to travel time variability

An application of cumulative prospect theory to travel time variability Katrine Hjorth (DTU) Stefan Flügel, Farideh Ramjerdi (TØI) An application of cumulative prospect theory to travel time variability Sixth workshop on discrete choice models at EPFL August 19-21, 2010 Page

More information

Missing data in political science

Missing data in political science SOC 597A Seminar in survey research Final paper Missing data in political science Claudiu Tufis December 10, 2003 Abstract In this paper I analyze a series of techniques designed for replacing missing

More information

FACTORS DETERMINING UNITED STATES IMPORTS OF COFFEE

FACTORS DETERMINING UNITED STATES IMPORTS OF COFFEE 12 November 1953 FACTORS DETERMINING UNITED STATES IMPORTS OF COFFEE The present paper is the first in a series which will offer analyses of the factors that account for the imports into the United States

More information

Activity 10. Coffee Break. Introduction. Equipment Required. Collecting the Data

Activity 10. Coffee Break. Introduction. Equipment Required. Collecting the Data . Activity 10 Coffee Break Economists often use math to analyze growth trends for a company. Based on past performance, a mathematical equation or formula can sometimes be developed to help make predictions

More information

Regression Models for Saffron Yields in Iran

Regression Models for Saffron Yields in Iran Regression Models for Saffron ields in Iran Sanaeinejad, S.H., Hosseini, S.N 1 Faculty of Agriculture, Ferdowsi University of Mashhad, Iran sanaei_h@yahoo.co.uk, nasir_nbm@yahoo.com, Abstract: Saffron

More information

Can You Tell the Difference? A Study on the Preference of Bottled Water. [Anonymous Name 1], [Anonymous Name 2]

Can You Tell the Difference? A Study on the Preference of Bottled Water. [Anonymous Name 1], [Anonymous Name 2] Can You Tell the Difference? A Study on the Preference of Bottled Water [Anonymous Name 1], [Anonymous Name 2] Abstract Our study aims to discover if people will rate the taste of bottled water differently

More information

Gasoline Empirical Analysis: Competition Bureau March 2005

Gasoline Empirical Analysis: Competition Bureau March 2005 Gasoline Empirical Analysis: Update of Four Elements of the January 2001 Conference Board study: "The Final Fifteen Feet of Hose: The Canadian Gasoline Industry in the Year 2000" Competition Bureau March

More information

Decision making with incomplete information Some new developments. Rudolf Vetschera University of Vienna. Tamkang University May 15, 2017

Decision making with incomplete information Some new developments. Rudolf Vetschera University of Vienna. Tamkang University May 15, 2017 Decision making with incomplete information Some new developments Rudolf Vetschera University of Vienna Tamkang University May 15, 2017 Agenda Problem description Overview of methods Single parameter approaches

More information

Modeling Wine Quality Using Classification and Regression. Mario Wijaya MGT 8803 November 28, 2017

Modeling Wine Quality Using Classification and Regression. Mario Wijaya MGT 8803 November 28, 2017 Modeling Wine Quality Using Classification and Mario Wijaya MGT 8803 November 28, 2017 Motivation 1 Quality How to assess it? What makes a good quality wine? Good or Bad Wine? Subjective? Wine taster Who

More information

COMPARISON OF CORE AND PEEL SAMPLING METHODS FOR DRY MATTER MEASUREMENT IN HASS AVOCADO FRUIT

COMPARISON OF CORE AND PEEL SAMPLING METHODS FOR DRY MATTER MEASUREMENT IN HASS AVOCADO FRUIT New Zealand Avocado Growers' Association Annual Research Report 2004. 4:36 46. COMPARISON OF CORE AND PEEL SAMPLING METHODS FOR DRY MATTER MEASUREMENT IN HASS AVOCADO FRUIT J. MANDEMAKER H. A. PAK T. A.

More information

Structures of Life. Investigation 1: Origin of Seeds. Big Question: 3 rd Science Notebook. Name:

Structures of Life. Investigation 1: Origin of Seeds. Big Question: 3 rd Science Notebook. Name: 3 rd Science Notebook Structures of Life Investigation 1: Origin of Seeds Name: Big Question: What are the properties of seeds and how does water affect them? 1 Alignment with New York State Science Standards

More information

STUDY REGARDING THE RATIONALE OF COFFEE CONSUMPTION ACCORDING TO GENDER AND AGE GROUPS

STUDY REGARDING THE RATIONALE OF COFFEE CONSUMPTION ACCORDING TO GENDER AND AGE GROUPS STUDY REGARDING THE RATIONALE OF COFFEE CONSUMPTION ACCORDING TO GENDER AND AGE GROUPS CRISTINA SANDU * University of Bucharest - Faculty of Psychology and Educational Sciences, Romania Abstract This research

More information

AJAE Appendix: Testing Household-Specific Explanations for the Inverse Productivity Relationship

AJAE Appendix: Testing Household-Specific Explanations for the Inverse Productivity Relationship AJAE Appendix: Testing Household-Specific Explanations for the Inverse Productivity Relationship Juliano Assunção Department of Economics PUC-Rio Luis H. B. Braido Graduate School of Economics Getulio

More information

Gail E. Potter, Timo Smieszek, and Kerstin Sailer. April 24, 2015

Gail E. Potter, Timo Smieszek, and Kerstin Sailer. April 24, 2015 Supplementary Material to Modelling workplace contact networks: the effects of organizational structure, architecture, and reporting errors on epidemic predictions, published in Network Science Gail E.

More information

Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and

Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere

More information

1. Continuing the development and validation of mobile sensors. 3. Identifying and establishing variable rate management field trials

1. Continuing the development and validation of mobile sensors. 3. Identifying and establishing variable rate management field trials Project Overview The overall goal of this project is to deliver the tools, techniques, and information for spatial data driven variable rate management in commercial vineyards. Identified 2016 Needs: 1.

More information

Is Fair Trade Fair? ARKANSAS C3 TEACHERS HUB. 9-12th Grade Economics Inquiry. Supporting Questions

Is Fair Trade Fair? ARKANSAS C3 TEACHERS HUB. 9-12th Grade Economics Inquiry. Supporting Questions 9-12th Grade Economics Inquiry Is Fair Trade Fair? Public Domain Image Supporting Questions 1. What is fair trade? 2. If fair trade is so unique, what is free trade? 3. What are the costs and benefits

More information

A Comparison of Approximate Bayesian Bootstrap and Weighted Sequential Hot Deck for Multiple Imputation

A Comparison of Approximate Bayesian Bootstrap and Weighted Sequential Hot Deck for Multiple Imputation A Comparison of Approximate Bayesian Bootstrap and Weighted Sequential Hot Deck for Multiple Imputation Darryl V. Creel RTI International 1 RTI International is a trade name of Research Triangle Institute.

More information

A Comparison of X, Y, and Boomer Generation Wine Consumers in California

A Comparison of X, Y, and Boomer Generation Wine Consumers in California A Comparison of,, and Boomer Generation Wine Consumers in California Marianne McGarry Wolf, Scott Carpenter, and Eivis Qenani-Petrela This research shows that the wine market in the California is segmented

More information

Evaluating Population Forecast Accuracy: A Regression Approach Using County Data

Evaluating Population Forecast Accuracy: A Regression Approach Using County Data Evaluating Population Forecast Accuracy: A Regression Approach Using County Data Jeff Tayman, UC San Diego Stanley K. Smith, University of Florida Stefan Rayer, University of Florida Final formatted version

More information

PARENTAL SCHOOL CHOICE AND ECONOMIC GROWTH IN NORTH CAROLINA

PARENTAL SCHOOL CHOICE AND ECONOMIC GROWTH IN NORTH CAROLINA PARENTAL SCHOOL CHOICE AND ECONOMIC GROWTH IN NORTH CAROLINA DR. NATHAN GRAY ASSISTANT PROFESSOR BUSINESS AND PUBLIC POLICY YOUNG HARRIS COLLEGE YOUNG HARRIS, GEORGIA Common claims. What is missing? What

More information

International Journal of Business and Commerce Vol. 3, No.8: Apr 2014[01-10] (ISSN: )

International Journal of Business and Commerce Vol. 3, No.8: Apr 2014[01-10] (ISSN: ) The Comparative Influences of Relationship Marketing, National Cultural values, and Consumer values on Consumer Satisfaction between Local and Global Coffee Shop Brands Yi Hsu Corresponding author: Associate

More information

OF THE VARIOUS DECIDUOUS and

OF THE VARIOUS DECIDUOUS and (9) PLAXICO, JAMES S. 1955. PROBLEMS OF FACTOR-PRODUCT AGGRE- GATION IN COBB-DOUGLAS VALUE PRODUCTIVITY ANALYSIS. JOUR. FARM ECON. 37: 644-675, ILLUS. (10) SCHICKELE, RAINER. 1941. EFFECT OF TENURE SYSTEMS

More information

7 th Annual Conference AAWE, Stellenbosch, Jun 2013

7 th Annual Conference AAWE, Stellenbosch, Jun 2013 The Impact of the Legal System and Incomplete Contracts on Grape Sourcing Strategies: A Comparative Analysis of the South African and New Zealand Wine Industries * Corresponding Author Monnane, M. Monnane,

More information

Which of your fingernails comes closest to 1 cm in width? What is the length between your thumb tip and extended index finger tip? If no, why not?

Which of your fingernails comes closest to 1 cm in width? What is the length between your thumb tip and extended index finger tip? If no, why not? wrong 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 right 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 score 100 98.5 97.0 95.5 93.9 92.4 90.9 89.4 87.9 86.4 84.8 83.3 81.8 80.3 78.8 77.3 75.8 74.2

More information

Power and Priorities: Gender, Caste, and Household Bargaining in India

Power and Priorities: Gender, Caste, and Household Bargaining in India Power and Priorities: Gender, Caste, and Household Bargaining in India Nancy Luke Associate Professor Department of Sociology and Population Studies and Training Center Brown University Nancy_Luke@brown.edu

More information

You know what you like, but what about everyone else? A Case study on Incomplete Block Segmentation of white-bread consumers.

You know what you like, but what about everyone else? A Case study on Incomplete Block Segmentation of white-bread consumers. You know what you like, but what about everyone else? A Case study on Incomplete Block Segmentation of white-bread consumers. Abstract One man s meat is another man s poison. There will always be a wide

More information

Biologist at Work! Experiment: Width across knuckles of: left hand. cm... right hand. cm. Analysis: Decision: /13 cm. Name

Biologist at Work! Experiment: Width across knuckles of: left hand. cm... right hand. cm. Analysis: Decision: /13 cm. Name wrong 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 right 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 score 100 98.6 97.2 95.8 94.4 93.1 91.7 90.3 88.9 87.5 86.1 84.7 83.3 81.9

More information

Problem. Background & Significance 6/29/ _3_88B 1 CHD KNOWLEDGE & RISK FACTORS AMONG FILIPINO-AMERICANS CONNECTED TO PRIMARY CARE SERVICES

Problem. Background & Significance 6/29/ _3_88B 1 CHD KNOWLEDGE & RISK FACTORS AMONG FILIPINO-AMERICANS CONNECTED TO PRIMARY CARE SERVICES CHD KNOWLEDGE & RISK FACTORS AMONG FILIPINO-AMERICANS CONNECTED TO PRIMARY CARE SERVICES Background & Significance Who are the Filipino- Americans? Alona D. Angosta, PhD, APN, FNP, NP-C Assistant Professor

More information

What makes a good muffin? Ivan Ivanov. CS229 Final Project

What makes a good muffin? Ivan Ivanov. CS229 Final Project What makes a good muffin? Ivan Ivanov CS229 Final Project Introduction Today most cooking projects start off by consulting the Internet for recipes. A quick search for chocolate chip muffins returns a

More information

2016 China Dry Bean Historical production And Estimated planting intentions Analysis

2016 China Dry Bean Historical production And Estimated planting intentions Analysis 2016 China Dry Bean Historical production And Estimated planting intentions Analysis Performed by Fairman International Business Consulting 1 of 10 P a g e I. EXECUTIVE SUMMARY A. Overall Bean Planting

More information

RESEARCH UPDATE from Texas Wine Marketing Research Institute by Natalia Kolyesnikova, PhD Tim Dodd, PhD THANK YOU SPONSORS

RESEARCH UPDATE from Texas Wine Marketing Research Institute by Natalia Kolyesnikova, PhD Tim Dodd, PhD THANK YOU SPONSORS RESEARCH UPDATE from by Natalia Kolyesnikova, PhD Tim Dodd, PhD THANK YOU SPONSORS STUDY 1 Identifying the Characteristics & Behavior of Consumer Segments in Texas Introduction Some wine industries depend

More information

Bt Corn IRM Compliance in Canada

Bt Corn IRM Compliance in Canada Bt Corn IRM Compliance in Canada Canadian Corn Pest Coalition Report Author: Greg Dunlop (BSc. Agr, MBA, CMRP), ifusion Research Ltd. 15 CONTENTS CONTENTS... 2 EXECUTIVE SUMMARY... 4 BT CORN MARKET OVERVIEW...

More information

Statistics: Final Project Report Chipotle Water Cup: Water or Soda?

Statistics: Final Project Report Chipotle Water Cup: Water or Soda? Statistics: Final Project Report Chipotle Water Cup: Water or Soda? Introduction: For our experiment, we wanted to find out how many customers at Chipotle actually get water when they order a water cup.

More information

A Hedonic Analysis of Retail Italian Vinegars. Summary. The Model. Vinegar. Methodology. Survey. Results. Concluding remarks.

A Hedonic Analysis of Retail Italian Vinegars. Summary. The Model. Vinegar. Methodology. Survey. Results. Concluding remarks. Vineyard Data Quantification Society "Economists at the service of Wine & Vine" Enometrics XX A Hedonic Analysis of Retail Italian Vinegars Luigi Galletto, Luca Rossetto Research Center for Viticulture

More information

Valuation in the Life Settlements Market

Valuation in the Life Settlements Market Valuation in the Life Settlements Market New Empirical Evidence Jiahua (Java) Xu 1 1 Institute of Insurance Economics University of St.Gallen Western Risk and Insurance Association 2018 Annual Meeting

More information

1) What proportion of the districts has written policies regarding vending or a la carte foods?

1) What proportion of the districts has written policies regarding vending or a la carte foods? Rhode Island School Nutrition Environment Evaluation: Vending and a La Carte Food Policies Rhode Island Department of Education ETR Associates - Education Training Research Executive Summary Since 2001,

More information

Archdiocese of New York Practice Items

Archdiocese of New York Practice Items Archdiocese of New York Practice Items Mathematics Grade 8 Teacher Sample Packet Unit 1 NY MATH_TE_G8_U1.indd 1 NY MATH_TE_G8_U1.indd 2 1. Which choice is equivalent to 52 5 4? A 1 5 4 B 25 1 C 2 1 D 25

More information

Napa County Planning Commission Board Agenda Letter

Napa County Planning Commission Board Agenda Letter Agenda Date: 3/4/2015 Agenda Placement: 10A Napa County Planning Commission Board Agenda Letter TO: FROM: Napa County Planning Commission David Morrison - Director Planning, Building and Environmental

More information

The University of Georgia

The University of Georgia The University of Georgia Center for Agribusiness and Economic Development College of Agricultural and Environmental Sciences A Survey of Pecan Sheller s Interest in Storage Technology Prepared by: Kent

More information

Using Growing Degree Hours Accumulated Thirty Days after Bloom to Help Growers Predict Difficult Fruit Sizing Years

Using Growing Degree Hours Accumulated Thirty Days after Bloom to Help Growers Predict Difficult Fruit Sizing Years Using Growing Degree Hours Accumulated Thirty Days after Bloom to Help Growers Predict Difficult Fruit Sizing Years G. Lopez 1 and T. DeJong 2 1 Àrea de Tecnologia del Reg, IRTA, Lleida, Spain 2 Department

More information

De La Salle University Dasmariñas

De La Salle University Dasmariñas A COMPARATIVE STUDY OF THE LEVEL OF CUSTOMER SATISFACTION OF J.CO DONUTS IN SM DASMARIÑAS & KRISPY KREME THE DISTRICT IMUS An Undergraduate Thesis Presented to The Faculty of Hospitality Management De

More information

Summary Report Survey on Community Perceptions of Wine Businesses

Summary Report Survey on Community Perceptions of Wine Businesses Summary Report Survey on Community Perceptions of Wine Businesses Updated August 10, 2018 Conducted by Professors David McCuan and Richard Hertz for the Wine Business Institute School of Business and Economics

More information

CHAPTER I BACKGROUND

CHAPTER I BACKGROUND CHAPTER I BACKGROUND 1.1. Problem Definition Indonesia is one of the developing countries that already officially open its economy market into global. This could be seen as a challenge for Indonesian local

More information

A Note on a Test for the Sum of Ranksums*

A Note on a Test for the Sum of Ranksums* Journal of Wine Economics, Volume 2, Number 1, Spring 2007, Pages 98 102 A Note on a Test for the Sum of Ranksums* Richard E. Quandt a I. Introduction In wine tastings, in which several tasters (judges)

More information

Temperature effect on pollen germination/tube growth in apple pistils

Temperature effect on pollen germination/tube growth in apple pistils FINAL PROJECT REPORT Project Title: Temperature effect on pollen germination/tube growth in apple pistils PI: Dr. Keith Yoder Co-PI(): Dr. Rongcai Yuan Organization: Va. Tech Organization: Va. Tech Telephone/email:

More information

Using Standardized Recipes in Child Care

Using Standardized Recipes in Child Care Using Standardized Recipes in Child Care Standardized recipes are essential tools for implementing the Child and Adult Care Food Program meal patterns. A standardized recipe identifies the exact amount

More information

Volume 30, Issue 1. Gender and firm-size: Evidence from Africa

Volume 30, Issue 1. Gender and firm-size: Evidence from Africa Volume 30, Issue 1 Gender and firm-size: Evidence from Africa Mohammad Amin World Bank Abstract A number of studies show that relative to male owned businesses, female owned businesses are smaller in size.

More information

Imputation Procedures for Missing Data in Clinical Research

Imputation Procedures for Missing Data in Clinical Research Imputation Procedures for Missing Data in Clinical Research Appendix B Overview The MATRICS Consensus Cognitive Battery (MCCB), building on the foundation of the Measurement and Treatment Research to Improve

More information

Dietary Diversity in Urban and Rural China: An Endogenous Variety Approach

Dietary Diversity in Urban and Rural China: An Endogenous Variety Approach Dietary Diversity in Urban and Rural China: An Endogenous Variety Approach Jing Liu September 6, 2011 Road Map What is endogenous variety? Why is it? A structural framework illustrating this idea An application

More information

To make wine, to sell the grapes or to deliver them to a cooperative: determinants of the allocation of the grapes

To make wine, to sell the grapes or to deliver them to a cooperative: determinants of the allocation of the grapes American Association of Wine Economists (AAWE) 10 th Annual Conference Bordeaux June 21-25, 2016 To make wine, to sell the grapes or to deliver them to a cooperative: determinants of the allocation of

More information

F&N 453 Project Written Report. TITLE: Effect of wheat germ substituted for 10%, 20%, and 30% of all purpose flour by

F&N 453 Project Written Report. TITLE: Effect of wheat germ substituted for 10%, 20%, and 30% of all purpose flour by F&N 453 Project Written Report Katharine Howe TITLE: Effect of wheat substituted for 10%, 20%, and 30% of all purpose flour by volume in a basic yellow cake. ABSTRACT Wheat is a component of wheat whole

More information

Instruction (Manual) Document

Instruction (Manual) Document Instruction (Manual) Document This part should be filled by author before your submission. 1. Information about Author Your Surname Your First Name Your Country Your Email Address Your ID on our website

More information

Academic Year 2014/2015 Assessment Report. Bachelor of Science in Viticulture, Department of Viticulture and Enology

Academic Year 2014/2015 Assessment Report. Bachelor of Science in Viticulture, Department of Viticulture and Enology Academic Year 2014/2015 Assessment Report Bachelor of Science in Viticulture, Department of Viticulture and Enology Due to changes in faculty assignments, there was no SOAP coordinator for the Department

More information

Fairfield Public Schools Family Consumer Sciences Curriculum Food Service 30

Fairfield Public Schools Family Consumer Sciences Curriculum Food Service 30 Fairfield Public Schools Family Consumer Sciences Curriculum Food Service 30 Food Service 30 BOE Approved 05/09/2017 1 Food Service 30 Food Service 30 Students will continue to participate in the school

More information

HW 5 SOLUTIONS Inference for Two Population Means

HW 5 SOLUTIONS Inference for Two Population Means HW 5 SOLUTIONS Inference for Two Population Means 1. The Type II Error rate, β = P{failing to reject H 0 H 0 is false}, for a hypothesis test was calculated to be β = 0.07. What is the power = P{rejecting

More information

Veganuary Month Survey Results

Veganuary Month Survey Results Veganuary 2016 6-Month Survey Results Project Background Veganuary is a global campaign that encourages people to try eating a vegan diet for the month of January. Following Veganuary 2016, Faunalytics

More information

Atis (Annona Squamosa) Tea

Atis (Annona Squamosa) Tea Vol. 1 January 2012 International Peer Reviewed Journal IAMURE: International Journal of Mathematics, International Engineering Peer Reviewed & Technology Journal Atis (Annona Squamosa) Tea PAULETTE MARCIA

More information
