
Vignette to Package impute.r

Yvonne M. Badke
Department of Animal Science
Michigan State University
East Lansing, MI, USA
email: badkeyvo@msu.edu

Juan P. Steibel
Departments of Animal Science, Fisheries and Wildlife
Michigan State University
East Lansing, MI, USA

October 25, 2012
Version 1.0

1 Introduction

impute.r is an R [7] package developed to reproduce the imputation accuracy calculations presented in Badke et al. [1]. The package is built as an extension to the R package synbreed [8]. We expanded the functionality of synbreed to include genotype imputation and phasing using a reference panel of haplotypes, and the use of subsets of tagsnp to impute all non-typed SNP. impute.r includes the functions necessary to prepare input for and read output from the BEAGLE software [2, 3], and to perform the imputation using various options of the BEAGLE phasing algorithm. In addition, impute.r contains three functions that generate input for the FESTA program [6] from a gpData (package synbreed) object containing phased haplotypes. Functions for FESTA formatting are detailed in a separate vignette: Conversion of data into FESTA format. Finally, the impute.r package includes functions to compute the accuracy measures reported in Badke et al. [1]. This guide provides step-by-step instructions to obtain the graphical output included in the publication [1].

2 Input formats, recoding, and quality editing

2.1 Input formats

Genotypes used in Badke et al. [1] were obtained from DNA samples of 889 Yorkshire sires and 96 animals in sire/dam/offspring trios, genotyped for all SNP (M=62,163) on the Illumina PorcineSNP60 Genotyping BeadChip (Illumina Inc.) at a commercial laboratory (GeneSeek, a Neogen Company, Lincoln, NE). These genotypes were available in a long table format with one row per SNP/sample combination and two columns containing the observed alleles in character coding (A/G/T/C). We reformatted the original long-format data into a gpData object (york_gpdata) that is provided as part of the impute.r package. The following data objects are released as part of the impute.r package:

1. york_gpdata is a gpData object containing 889 unrelated Yorkshire sires and 96 animals in sire/dam/offspring trios genotyped on 62,163 SNP. Besides the raw genotypes in a data frame with SNP in columns and individuals in rows (geno), the object contains a map, a phenotype file, and a pedigree.

2. acc_table is a data frame containing accuracies for three methods of tagsnp selection; it is used in section 4.

3. ref_size is a data frame containing SNP-wise imputation accuracies for all SNP on SSC14 for increasing reference panel size, as well as the minor allele frequency of each SNP. It is used in section 4 to illustrate how to derive two figures from Badke et al. [1].

Before installing impute.r, the packages epicalc, coda, and synbreed must be installed. This can be achieved using the following code:

    install.packages("epicalc")
    install.packages("synbreed")
    install.packages("coda")

The york_gpdata object contains the following components (for further detail on how to create a gpData object please refer to [8]):

1. geno is a data frame with samples organized in rows (identified by row names) and SNP organized in columns (identified by column names). The genotype entries can be either in numerical format, as counts of the minor allele, or in character format.

2. pheno is a data frame with samples organized in rows and traits organized in columns.
   To use the impute function, pheno must contain at least one column (sample) with entries trio or random. The entry identifies whether the sample is part of a trio and should be phased as such, or is not part of a trio and should be phased as unrelated to all other samples.

3. map is a data frame with one row for each marker and two columns (named chr and pos). The first column identifies the chromosome (numeric or character, but not factor) and the second column gives the position on the chromosome in centimorgans, or the physical distance relative to the reference sequence in base pairs. Unique row names indicate the marker names, which should match the marker names in geno.

4. pedigree is an object of class pedigree in synbreed, which can be obtained using the function create.pedigree. create.pedigree requires a vector of sample IDs, a vector identifying the first parent, a vector identifying the second parent, and a vector containing the sex of each animal. The user can further specify whether create.pedigree should infer the generation of each animal or use a supplied vector identifying the generation of each sample, and whether create.pedigree should add ancestors to the pedigree that did not occur in the sample vector.

2.2 Quality editing

Animals for this study were previously edited so that the input only contains animals with genotypes available for more than 90% of the SNP, leaving 889 sires and 96 trio animals. The york_gpdata object was created using create.gpData from synbreed [8]. Please refer to the manual and vignette of synbreed for more detail on how to use this function. The provided york_gpdata contains uncleaned genotypes, with only those SNP removed that were not called in any study sample. To further process the data we use codeGeno to clean the data and recode it into numeric format:

    library(impute.r)
    library(epicalc)
    library(synbreed)
    library(coda)
    data(york_gpdata)
    # TRUE when the characters at positions 1 and 3 of a genotype string differ
    which.heter <- function(x) { substr(x, 1, 1) != substr(x, 3, 3) }
    # applying codeGeno for recoding / quality editing
    york_cleaned <- codeGeno(york_gpdata, impute=FALSE, replace.value=NULL,
                             maf=0.05, nmiss=0.10, label.heter=which.heter,
                             keep.identical=TRUE, verbose=TRUE, print.report=FALSE)

The gpData object york_cleaned now contains only SNP with genotypes available for more than 90% of samples and with minor allele frequency (MAF) larger than 5%, and alleles have been recoded into numeric counts of the B allele (0, 1, 2). synbreed identifies the B allele based on MAF. If data is assembled from several sources it is therefore not advisable to create separate gpData objects, because small differences in MAF between the data sources could lead to opposite recoding. Using this input we can apply the function impute, as explained below, to perform all phasing and genotype imputation necessary to reproduce the results presented in Badke et al. [1].
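The two quality filters passed to codeGeno (nmiss and maf) and the which.heter helper can be illustrated on a toy genotype matrix in base R. This is only a sketch under stated assumptions: the data are invented, genotypes are assumed to be coded as two alleles separated by one character (for example "A/G"), maf_one is a hypothetical helper, and the exact boundary handling inside codeGeno is not reproduced.

```r
# Toy genotypes (hypothetical data): rows are samples, columns are SNP.
geno <- matrix(c("A/A", "A/G", "G/G", "A/A", "A/G",   # snp1: complete, polymorphic
                 "C/C", "C/C", "C/C", "C/C", "C/C",   # snp2: monomorphic
                 "T/T", "T/C", NA,    NA,    "T/T"),  # snp3: 40% missing
               nrow = 5,
               dimnames = list(paste0("id", 1:5), c("snp1", "snp2", "snp3")))

# Same helper as in the vignette: compares positions 1 and 3 of each string.
which.heter <- function(x) substr(x, 1, 1) != substr(x, 3, 3)
het <- which.heter(geno)   # logical matrix, NA where the genotype is missing

# Fraction of missing calls per SNP (to be compared with nmiss = 0.10).
miss_rate <- colMeans(is.na(geno))

# Minor allele frequency of one SNP from character genotypes
# (hypothetical helper, not part of synbreed or impute.r).
maf_one <- function(g) {
  alleles <- unlist(strsplit(g[!is.na(g)], "/", fixed = TRUE))
  freq <- table(alleles) / length(alleles)
  if (length(freq) < 2) 0 else min(freq)
}
maf <- apply(geno, 2, maf_one)

# SNP that would survive both filters in this toy example.
keep <- miss_rate < 0.10 & maf > 0.05
print(keep)
```

In this toy data only snp1 passes: snp2 fails the MAF filter because it is monomorphic, and snp3 fails the missingness filter.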

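The warning in section 2.2 about opposite recoding can be made concrete with a small base R sketch. The helper recode_minor is hypothetical and only mimics the idea of counting copies of the minor allele; it does not reproduce codeGeno itself, and the two data sources are invented.

```r
# Count copies of the minor allele (0, 1, 2) for one SNP coded as "A/G".
# Hypothetical helper for illustration only.
recode_minor <- function(g) {
  alleles <- strsplit(g, "/", fixed = TRUE)
  freq <- table(unlist(alleles))
  minor <- names(freq)[which.min(freq)]
  vapply(alleles, function(a) sum(a == minor), numeric(1))
}

# The same SNP observed in two data sources with slightly different frequencies.
source1 <- c("A/A", "A/A", "A/G", "G/G", "A/G")   # A: 6, G: 4, so G is minor
source2 <- c("G/G", "G/G", "G/G", "A/G", "A/A")   # A: 3, G: 7, so A is minor

sep1 <- recode_minor(source1)                 # counts copies of G
sep2 <- recode_minor(source2)                 # counts copies of A
pooled <- recode_minor(c(source1, source2))   # one minor allele (A) for all samples

# The first animal ("A/A") is coded 0 when source1 is recoded alone,
# but 2 when both sources are recoded together.
print(rbind(separate = c(sep1, sep2), pooled = pooled))
```

Recoding the sources separately therefore yields dosages that refer to different alleles, which is why the data should be assembled into a single gpData object before recoding.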
3 Haplotype phasing and genotype imputation using impute

Four different phasing and genotype imputation scenarios were used in Badke et al. [1]. In this section we detail how the gpData object developed in section 2 can be used with the function impute to obtain the desired output for all four scenarios. First we introduce the function impute and its arguments, and then we show examples for all four scenarios.

3.1 The impute function

The impute function is structurally similar to the codeGeno function provided by synbreed [8]. However, codeGeno does not estimate phase from sire/dam/offspring trios with the BEAGLE trio input option, and it does not support the use of a reference panel of haplotypes. As a result, codeGeno can impute randomly missing genotypes for a set of samples/SNP (subsets can be obtained using discard.markers or discard.individuals), but it does not support the imputation of high density genotypes from a set of tagsnp. While keeping much of the original structure of codeGeno, we added these options to impute. Usage of impute and specification of its arguments:

    impute(gpdata, all_animals=TRUE, animals=c(), all_snp=TRUE, snp=c(),
           beagle_method=c("trio", "unrelated", "pairs"),
           reference=FALSE, ref_panel=NULL, showbeagleoutput=FALSE,
           nsamples=4, niterations=10, mem=6000)

1. gpdata is a gpData object as detailed above, containing the components geno, pheno, map, and pedigree. All of these are assembled as specified by synbreed, with one column in pheno identifying each sample as either random or trio.

2. all_animals logical, should all samples in geno be imputed; default is TRUE.

3. animals a vector containing the IDs of animals (as found in the row names of pheno and geno) that should be imputed if all_animals=FALSE. If all_animals=TRUE any input to this vector is ignored.

4. all_snp logical, should all SNP in geno be imputed; default is TRUE.

5. snp a vector containing the IDs of SNP (as found in the column names of geno) that should be used for imputation if all_snp=FALSE. If all_snp=TRUE any input to this vector is ignored.

6. beagle_method a character string indicating the BEAGLE method that should be used. impute accepts "trio", indicating that the trio procedure in BEAGLE should be used; "unrelated", indicating that no pedigree relation between animals should be assumed for imputation; and "pairs", indicating that the BEAGLE procedure for parent-offspring data should be used.

7. reference logical, should a reference panel be used to impute SNP/samples in geno; default is FALSE.

8. ref_panel if reference=TRUE, a data frame of reference haplotypes. SNP are in the columns, identified by column names, and haplotypes are in the rows. This data frame is expected to contain the characters A and B to identify the alleles. Row names can be used to identify the individual the haplotype is sampled from, but they are not required.

9. showbeagleoutput logical, should the BEAGLE output during the imputation be printed on the screen; default is FALSE.

10. nsamples numeric, the number of haplotype pairs to sample for each individual during each iteration of the BEAGLE phasing algorithm. The default is nsamples=4 as specified in [2].

11. niterations positive even integer giving the number of iterations of the phasing algorithm. If an odd integer is specified, the next even integer is used. The default is niterations=10 as specified in [2].

12. mem numeric, the number of megabytes of memory available. The default is mem=6000, allowing BEAGLE to use a maximum of 6 GB of RAM.

3.2 Phasing of a reference panel of haplotypes from a trio design

BEAGLE [3] has a special option allowing the user to provide genotypes from sire/dam/offspring trios for phasing.
The resulting sire/dam haplotypes are suitable as a reference panel of haplotypes for imputation based on low density SNP panels. The provided family file can be used to identify the animals that are part of a sire/dam/offspring trio and to build a vector of their IDs as input for the animals argument of the impute function:

    trio <- rownames(york_cleaned$geno)[as.data.frame(york_cleaned$pheno)$sample == "trio"]
    scen1 <- impute(york_cleaned, all_animals=FALSE, animals=trio, all_snp=TRUE,
                    beagle_method="trio", reference=FALSE, showbeagleoutput=TRUE)

The output of this application of the impute function is a list containing: 1) scen1$gpimputed, a gpData object including imputed allelic dosages of all SNP/sample combinations in the geno data frame, and 2) scen1$ref, a data frame with SNP in the columns and haplotypes of the sires and dams in the input data in the rows. The data frame of haplotypes has two rows per sample. The second object (scen1$ref) can be used as a reference panel for future imputations by passing ref_panel=scen1$ref.

3.3 Imputation of randomly missing genotypes and phasing of unrelated individuals

When there is no previous reference panel and samples are not presented in trios, BEAGLE can still estimate phase and impute missing data [3]. The following code applies the impute function to such a case for all animals labeled as randomly sampled from the sire population:

    sires <- rownames(york_cleaned$geno)[as.data.frame(york_cleaned$pheno)$sample == "random"]
    scen2 <- impute(york_cleaned, all_animals=FALSE, animals=sires, all_snp=TRUE,
                    beagle_method="unrelated", reference=FALSE, showbeagleoutput=TRUE)

The output of this imputation has the same structure as the output described in section 3.2, except that the haplotypes were obtained from unrelated individuals.
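The ID selection used in sections 3.2 and 3.3 indexes the row names of geno with a logical vector derived from pheno, so it implicitly assumes that both components list the samples in the same order. A minimal base R sketch with invented sample names (not the real york_gpdata) shows the pattern and a cheap check of that assumption:

```r
# Toy stand-ins for the pheno and geno components (hypothetical IDs).
pheno <- data.frame(sample = c("trio", "trio", "trio", "random", "random"),
                    row.names = c("sire1", "dam1", "off1", "York_Sid_1", "York_Sid_2"),
                    stringsAsFactors = FALSE)
geno <- matrix(0, nrow = 5, ncol = 2,
               dimnames = list(rownames(pheno), c("snp1", "snp2")))

# The logical index is built from pheno but applied to geno row names,
# so both must be in the same sample order.
stopifnot(identical(rownames(geno), rownames(pheno)))

trio  <- rownames(geno)[pheno$sample == "trio"]
sires <- rownames(geno)[pheno$sample == "random"]

print(trio)
print(sires)
```

Every sample should end up in exactly one of the two vectors; a sample with any other label in the sample column would be silently left out of both.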
3.4 Imputation of randomly missing genotypes and phasing of unrelated individuals using a reference panel of haplotypes

The result of this application of impute is similar to the one in section 3.3, except that the haplotypes from the first phasing run (section 3.2) are used as the reference panel for imputation:

    sires <- rownames(york_cleaned$geno)[as.data.frame(york_cleaned$pheno)$sample == "random"]
    scen3 <- impute(york_cleaned, all_animals=FALSE, animals=sires, all_snp=TRUE,
                    beagle_method="unrelated", reference=TRUE, ref_panel=scen1$ref,
                    showbeagleoutput=TRUE)
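Sections 3.1 and 3.2 describe the layout expected of ref_panel: SNP in columns, haplotypes in rows, alleles coded as the characters A and B, and two rows per sample. Before passing a panel to impute it may be worth checking these properties. The function check_ref_panel below is a hypothetical helper written for this sketch, not part of impute.r, and the panel is invented.

```r
# Hypothetical sanity check for a reference panel data frame.
check_ref_panel <- function(ref, snp_names) {
  c(alleles_ok = all(vapply(ref, function(col) all(col %in% c("A", "B")),
                            logical(1))),           # only A/B allele codes
    snp_match  = all(snp_names %in% colnames(ref)), # every requested SNP is present
    even_rows  = nrow(ref) %% 2 == 0)               # two haplotypes per sample
}

# Toy panel: two animals, two haplotypes each (row names are optional).
ref <- data.frame(snp1 = c("A", "B", "A", "A"),
                  snp2 = c("B", "B", "A", "B"),
                  row.names = c("sire1_h1", "sire1_h2", "dam1_h1", "dam1_h2"),
                  stringsAsFactors = FALSE)

print(check_ref_panel(ref, c("snp1", "snp2")))
```

With a real run, the same check could be applied to scen1$ref together with the column names of the geno component that is to be imputed.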

3.5 Imputation of unrelated individuals genotyped for a subset of SNP (tagsnp) using a reference panel of high density haplotypes

In this case impute uses a list of tagsnp and a reference panel of high density haplotypes derived from high density genotypes (scen1):

    data(tagsnp)
    scen4 <- impute(york_cleaned, all_animals=FALSE, animals=paste(sires),
                    all_snp=FALSE, snp=tagsnp, beagle_method="unrelated",
                    reference=TRUE, ref_panel=scen1[[2]], showbeagleoutput=TRUE)

The resulting gpData object contains the data frame geno with the imputed allelic dosage, which can be used to estimate the accuracy of imputation through comparison with the input data.

3.6 Estimation of accuracy of imputed genotypes

Accuracy of imputation can be measured as 1) the proportion of correctly imputed alleles, 2) the correlation between observed and imputed allelic dosage, or 3) the proportion of correctly imputed alleles adjusted for MAF. The proportion of correctly imputed alleles can be obtained either from the difference between the observed allelic dosage and the inferred allelic dosage, or from the difference between the observed allelic dosage and the posterior expectation of the allelic dosage:

    $$ IA = 1 - \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_i} \left| g_{ij} - \hat{g}_{ij} \right|}{2 \sum_{i=1}^{M} N_i} \qquad (1) $$

where $g_{ij}$ is the observed allelic dosage of the $i$-th SNP in the $j$-th individual, $\hat{g}_{ij}$ is the corresponding posterior expected/inferred allelic dosage obtained from BEAGLE output, $M$ is the total number of imputed SNP, and $N_i$ is the number of individuals with called genotypes for the $i$-th SNP. However, recent research has pointed out that imputation accuracy quantified as the proportion of correctly imputed alleles is biased by the MAF of the imputed SNP [4, 5]. To obtain a measure of imputation accuracy that is not biased by MAF we used the correlation between observed and imputed allelic dosage [1].
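Equation (1) and the correlation measure can be worked through on a toy example in base R. This is a sketch only: the observed and imputed dosages are invented, and the code does not reproduce the internals of accuracy_summary.

```r
# Observed dosages (0, 1, 2; NA = not called) and imputed posterior dosages.
# Rows are individuals, columns are SNP (hypothetical values).
obs <- matrix(c(0, 1, 2,   1,
                2, 2, 1,   NA,
                0, 0, 1,   0),
              nrow = 4, dimnames = list(paste0("id", 1:4), paste0("snp", 1:3)))
imp <- matrix(c(0, 1,   1.8, 1.2,
                2, 1.5, 1,   2,
                0, 0,   0,   0.4),
              nrow = 4, dimnames = dimnames(obs))

called <- !is.na(obs)            # genotypes that enter N_i
abs_diff <- abs(obs - imp)       # |g_ij - ghat_ij|, NA where not called

# Equation (1): overall proportion of correctly imputed alleles.
ia <- 1 - sum(abs_diff[called]) / (2 * sum(called))

# The same quantity per SNP, and the SNP-wise correlation of dosages.
snp_ia <- 1 - colSums(abs_diff, na.rm = TRUE) / (2 * colSums(called))
snp_r <- sapply(seq_len(ncol(obs)),
                function(i) cor(obs[, i], imp[, i], use = "complete.obs"))

print(round(c(IA = ia), 4))
print(round(rbind(IA = snp_ia, r = snp_r), 4))
```

In this toy data snp3 has a low minor allele frequency: imputing it mostly as the common homozygote still gives more than 80% of alleles correct, while the correlation between observed and imputed dosage is negative. This is the kind of MAF dependence that motivates using the correlation as the accuracy measure.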
To estimate imputation accuracy in the following example we used the cleaned gpData object york_cleaned (section 2.2), restricted to SSC18, together with a previously devised list of tagsnp:

    # discard markers in gpdata that are not on chr 18
    york_gpdata <- discard.markers(york_cleaned,
        which=rownames(york_cleaned$map)[york_cleaned$map$chr != "18"])
    # keep only tagsnp that are present on chr 18
    idx <- tagsnp %in% colnames(york_gpdata$geno)
    tagsnp <- tagsnp[idx]
    # making a reference panel of trios
    trios <- impute(york_gpdata, all_animals=FALSE, animals=trio, all_snp=TRUE,
                    beagle_method="trio", reference=FALSE, showbeagleoutput=TRUE)
    # imputing from the tagsnp for all sires
    imp <- impute(york_gpdata, all_animals=FALSE, animals=sires,
                  all_snp=FALSE, snp=tagsnp, beagle_method="unrelated",
                  reference=TRUE, ref_panel=trios$ref, showbeagleoutput=TRUE)
    # discarding the observed trio individuals prior to estimating accuracy
    obs_sires <- discard.individuals(york_gpdata, which=trio)
    # applying the accuracy estimation function using the observed genotypes
    # and the imputed genotypes
    acc_out <- accuracy_summary(gpobserved=obs_sires, gpimputed=imp$gpimputed,
                                tagsnp=tagsnp, HPD=0.95)

The function accuracy_summary returns average accuracy, SNP-specific accuracy, and individual-specific accuracy, as well as several summary measures of imputation accuracy. The first object returned by accuracy_summary is summary_acc_ia, which is a data frame of summary measures of imputation accuracy estimated as the proportion of correctly imputed alleles.

    acc_out$summary_acc_ia
    #                    total    Sample       SNP
    #Min             0.0000000 0.7787460 0.7077652
    #0.25            0.9908000 0.9488456 0.9428389
    #0.5             0.9994500 0.9623211 0.9658024
    #0.75            0.9999500 0.9731083 0.9829770
    #Max             1.0000000 0.9977089 1.0000000
    #mean            0.9584529 0.9584570 0.9584332
    #HPD-lowerbound  0.6270000 0.9161534 0.8912189
    #HPD-upperbound  1.0000000 0.9921971 1.0000000

The second object returned by accuracy_summary is summary_acc_r2, which is a data frame of summary measures of imputation accuracy estimated as the correlation between observed and imputed allelic dosage.

    dim(acc_out$summary_acc_r2)
    # [1] 8 2
    acc_out$summary_acc_r2
    #                   Sample        SNP
    #Min             0.2658073 0.00349742
    #0.25            0.8259178 0.77691586
    #0.5             0.8837018 0.86064119
    #0.75            0.9214254 0.93795831
    #Max             0.9990741 1.00000000
    #mean            0.8668054 0.83593746
    #HPD-lowerbound  0.7215666 0.60426503
    #HPD-upperbound  0.9892984 1.00000000

The third object returned by accuracy_summary is individual_acc, which is a data frame of individual imputation accuracy measured as both the proportion of correctly imputed alleles and the correlation between observed and imputed allelic dosage.

    dim(acc_out$individual_acc)
    # [1] 889 3
    head(acc_out$individual_acc)
    #                 SampleID        IA        R2
    #York_Sid_1     York_Sid_1 0.9399080 0.7925426
    #York_Sid_10   York_Sid_10 0.9335986 0.8045120
    #York_Sid_100 York_Sid_100 0.9703537 0.8996986
    #York_Sid_101 York_Sid_101 0.9896789 0.9681626
    #York_Sid_102 York_Sid_102 0.9440060 0.8031022
    #York_Sid_103 York_Sid_103 0.8922152 0.7056243
    summary(acc_out$individual_acc[,2:3])
    #       IA               R2
    # Min.   :0.7787   Min.   :0.2658
    # 1st Qu.:0.9488   1st Qu.:0.8259
    # Median :0.9623   Median :0.8837
    # Mean   :0.9585   Mean   :0.8668
    # 3rd Qu.:0.9731   3rd Qu.:0.9214
    # Max.   :0.9977   Max.   :0.9991

The fourth object returned by accuracy_summary is snp_acc, which is a data frame of SNP imputation accuracy measured as both the proportion of correctly imputed alleles and the correlation between observed and imputed allelic dosage.

    dim(acc_out$snp_acc)
    # [1] 745 3
    head(acc_out$snp_acc)
    #                    SNP        IA        R2
    #MARC0041179 MARC0041179 1.0000000 1.0000000
    #H3GA0051271 H3GA0051271 0.8969368 0.4513149
    #MARC0036973 MARC0036973 0.8930017 0.6192299
    summary(acc_out$snp_acc[,2:3])
    #       IA               R2
    # Min.   :0.7078   Min.   :0.003497
    # 1st Qu.:0.9428   1st Qu.:0.776916
    # Median :0.9658   Median :0.860641
    # Mean   :0.9584   Mean   :0.835938
    # 3rd Qu.:0.9830   3rd Qu.:0.937958
    # Max.   :1.0000   Max.   :1.000000

The fifth object returned by accuracy_summary is snp_measures, which is a data frame of SNP MAF estimated from the observed allele frequencies and the scaled chromosomal location of each SNP.

    dim(acc_out$snp_measures)
    # [1] 745 3
    head(acc_out$snp_measures)
    #                    SNP       MAF scaled_position
    #MARC0041179 MARC0041179 0.1702489     0.008251552
    #H3GA0051271 H3GA0051271 0.2007874     0.009869225
    #MARC0036973 MARC0036973 0.3301462     0.011538081
    #CASI0008570 CASI0008570 0.2148571     0.021077860
    #H3GA0054356 H3GA0054356 0.4604072     0.030238134
    #ASGA0106244 ASGA0106244 0.2725200     0.030979106
    summary(acc_out$snp_measures[,2])
    #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #0.04556 0.16970 0.28630 0.28210 0.39860 0.49890

The sixth object returned by accuracy_summary is acc_mat, which is a data frame of the proportion of correctly imputed alleles for each genotype. Rows correspond to individuals and columns to SNP.

    dim(acc_out$acc_mat)
    # [1] 889 745
    acc_out$acc_mat[1:5,1:5]
    #             MARC0041179 H3GA0051271 MARC0036973 CASI0008570 H3GA0054356
    #York_Sid_1             1      0.9997     0.99970     0.99635     0.51375
    #York_Sid_10            1      0.9572     0.99920     0.99665     0.99785
    #York_Sid_100           1      0.9993     0.99935     0.99645     0.99260
    #York_Sid_101           1      0.9917     0.99990     0.99900     0.99995
    #York_Sid_102           1      0.9994     0.99935     0.99680     0.99765

4 Visualization of imputation accuracy

In this section we provide code to obtain the figures published in Badke et al. [1]:

1. Accuracy of imputation by the scaled chromosomal location of imputed SNP

To investigate whether SNP-wise imputation accuracy differs as a function of chromosomal location, we plot the estimated accuracy against the scaled location of each SNP. This plot can be built from the output of the accuracy_summary function applied in section 3.6; the object acc_out contains the results of accuracy_summary. In addition, we added a locally weighted mean accuracy (loess smoother) and the overall average accuracy to the plot. The graphical output can be seen in Figure 1.

    # opening the pdf to which the plot will be written
    pdf("accuracy_by_density.pdf")
    # obtaining the loess smoother to plot the weighted mean average
    pred <- loess(acc_out$snp_acc[,2] ~ acc_out$snp_measures[,3])
    # open a plot window of the right dimensions
    # the accuracy/scaled location are taken from the acc_out object created in 3.6
    plot(acc_out$snp_acc[,2] ~ acc_out$snp_measures[,3],
         main="Accuracy of Imputation by the scaled chromosomal location of imputed SNP",
         xlab="Scaled chromosome position", ylab="Imputation accuracy", ylim=c(0,1))
    # insert a horizontal line representing the average accuracy
    abline(h=mean(acc_out$snp_acc[,2]), col="green")
    # inserting the weighted mean average estimated using a loess smoother
    points(pred$x, pred$fitted, col="red", pch=18)
    dev.off()

Figure 1: Accuracy of Imputation by the scaled chromosomal location of imputed SNP (x-axis: scaled chromosome position; y-axis: imputation accuracy).

2. Accuracy of imputation by MAF of the SNP

Figure 2 contains a plot of SNP-wise imputation accuracy, estimated as the squared correlation between observed and imputed allelic dosage, as a function of MAF. Estimates of accuracy and MAF used in this plot can be obtained from the objects available in acc_out, created in section 3.6. In addition, we added the locally weighted mean accuracy (loess smoother) to the plot to assess whether there is an obvious pattern of accuracy across minor allele frequencies.

    # obtain color coding by density - darker color = more data density in that area
    colors <- densCols(acc_out$snp_acc[,3] ~ acc_out$snp_measures[,2])
    pdf("accuracy_by_maf.pdf")
    # obtaining the loess smoother to plot the weighted mean average
    pred <- loess(acc_out$snp_acc[,3] ~ acc_out$snp_measures[,2])
    # open a plot window of the right dimensions
    # the accuracy/MAF are taken from the acc_out object created in 3.6
    plot(acc_out$snp_acc[,3] ~ acc_out$snp_measures[,2],
         main="Accuracy of Imputation by SNP MAF",
         xlab="MAF", ylab=expression("Accuracy " * R^2),
         ylim=c(0,1), pch=20, col=colors)
    # inserting the weighted mean average estimated using a loess smoother
    points(pred$x, pred$fitted, col="red", pch=18)
    dev.off()

Figure 2: Accuracy of Imputation by SNP MAF (x-axis: MAF; y-axis: accuracy R^2).

3. Accuracy of imputation by tagsnp density and selection method

In the paper accompanying this package several methods for tagsnp selection were compared across a variety of tagsnp densities [1]. Since only one density of tagsnp was explored in the example above, we have provided a small table with imputation accuracy for several densities of tagsnp, estimated for all three methods of tagsnp selection. The corresponding graphical output is shown in Figure 3:

    # Accuracy of imputation by tagsnp density
    data("acc_table")
    head(acc_table)
    #  Number.of.SNP r2.threshold FESTA BEAGLE n_even evenly.spaced
    #1            87          0.1 0.870  0.888     87         0.875
    #2           165          0.2 0.936  0.930    161         0.920
    #3           235          0.3 0.956  0.945    217         0.939
    #4           317          0.4 0.964     NA    274         0.948
    #5           399          0.5 0.970     NA    328         0.957
    #6           478          0.6 0.976     NA    370         0.957
    pdf("accuracy_by_tagsnpdensity.pdf")
    tab <- acc_table
    # open an empty plot with the correct dimensions
    plot(0, type="n", main="Accuracy by density", xlab="number of tagsnp",
         ylab="Accuracy of Imputation", ylim=c(0,1), xlim=c(0,max(tab[,1])+50))
    # add points for the results of statistical tagsnp selection (FESTA)
    points(tab[,3] ~ tab[,1], type="p", col="black", pch=19)
    # add points for the results of evenly spaced tagsnp
    points(tab[,6] ~ tab[,5], type="p", col="red", pch=15)
    # add points for the results of predictive tagsnp selection (BEAGLE)
    points(tab[,4] ~ tab[,1], type="p", col="darkgreen", pch=17)
    # add a legend
    legend(x="bottomright", pch=c(19, 15, 17), bty="n",
           legend=c("statistical selection", "evenly spaced", "predictive selection"),
           col=c("black", "red", "darkgreen"))
    dev.off()

Figure 3: Accuracy of Imputation by tagsnp density and selection method (x-axis: number of tagsnp; y-axis: accuracy of imputation; legend: statistical selection, evenly spaced, predictive selection).

This code can be adjusted to obtain figures similar to Figures 1 and 2 in Badke et al. [1].

4. Accuracy of imputation by reference panel size

Badke et al. [1] also investigated the effect of increasing the number of reference haplotypes.
To obtain a larger reference panel, the available 889 Yorkshire sires were split into 200 validation animals and 689 animals that were added stepwise to increase the number of reference haplotypes. Since the example detailed in section 3.6 only included one imputation, we provide a file containing imputation accuracy for a random sample of 2000 SNP for a variety of reference panel sizes (accuracy_by_ref_size.txt).

    data(ref_size)
    # extracting the number of reference animals from the column names;
    # each animal contributes two haplotypes
    n_ref <- as.numeric(sub("x", "", colnames(ref_size[,2:7]))) * 2
    # average accuracy for reference panel sizes - each column corresponds to a panel size
    avg_acc <- colMeans(ref_size[,2:7])
    pdf("accuracy_by_refsize.pdf")
    # plot accuracy by number of reference haplotypes
    plot(avg_acc ~ n_ref, type="p", main="Accuracy by Reference panel size",
         ylab=expression("Accuracy " * R^2), xlab="Number of reference haplotypes",
         xlim=c(0,max(n_ref)), ylim=c(0,1), pch=20)
    dev.off()

Figure 4: Accuracy of Imputation by reference panel size (x-axis: number of reference haplotypes; y-axis: accuracy R^2).

5. Supplementary Figures 1 & 2

The supplementary Figures 1 and 2 provided by Badke et al. [1] show the locally weighted mean accuracy as a function of MAF and of chromosomal location for a variety of reference panel sizes. They illustrate how increasing the number of reference haplotypes affects overall SNP accuracy, and especially the imputation accuracy of SNP with previously below average accuracy.
Example code to obtain a figure of this type is given below; the graphical output can be seen in Figure 5:

    pdf("accuracy_by_refsize_weighted.pdf")
    # specifying colors for all sizes of the reference panel
    cols <- c("black", "red", "blue", "orange", "magenta", "darkgreen")
    # open an empty plot of the correct dimensions
    plot(0, type="n", main="Accuracy of Imputation by increasing reference panel size",
         ylab=expression("Accuracy " * R^2), xlab="scaled chromosomal location",
         ylim=c(0,1), xlim=c(0,1))
    # adding all 6 reference panel sizes as points to the plot
    for (i in 1:length(n_ref)) {
      # estimating the weighted mean average using a loess smoother
      pred <- loess(ref_size[,i+1] ~ ref_size[,8])
      # adding the points
      points(pred$x, pred$fitted, col=cols[i], pch=18, cex=0.25)
    }
    # including a legend to the plot
    legend(x="bottomright", pch=18, bty="n",
           legend=paste(n_ref, " reference haplotypes", sep=""), col=cols)
    dev.off()

Figure 5: Accuracy of Imputation by increasing reference panel size (x-axis: scaled chromosomal location; y-axis: accuracy R^2; legend: 128, 256, 512, 1024, 1200, and 1378 reference haplotypes).

References

[1] Yvonne M. Badke, Ronald O. Bates, Catherine W. Ernst, Clint Schwab, Justin Fix, and Juan P. Steibel. TagSNP selection and imputation accuracy using a reduced size haplotype panel in swine. Submitted, 2012.

[2] Brian L. Browning. Documentation of BEAGLE 3.3.1, 2011.

[3] Brian L. Browning and Sharon R. Browning. A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals. Am J Hum Genet, 84(2):210-223, January 2009.

[4] B. J. Hayes, P. J. Bowman, H. D. Daetwyler, J. W. Kijas, and J. H. J. van der Werf. Accuracy of genotype imputation in sheep breeds. Anim Genet, 43(1):72-80, February 2012.

[5] John M. Hickey, Jose Crossa, Raman Babu, and Gustavo de los Campos. Factors Affecting the Accuracy of Genotype Imputation in Populations from Several Maize Breeding Programs. Crop Science, 52(2):654, 2012.

[6] Z. S. Qin, S. Gopalakrishnan, and G. R. Abecasis. An efficient comprehensive search algorithm for tagSNP selection using linkage disequilibrium criteria. Bioinformatics, 22(2):220-225, January 2006.

[7] The R Development Core Team. R: A language and environment for statistical computing. Vienna, Austria, 2011.

[8] Valentin Wimmer, Theresa Albrecht, Hans-Jürgen Auinger, and Chris-Carolin Schön. synbreed: A framework for the analysis of genomic prediction data using R. Bioinformatics, 28(15):2086-2087, 2012.