Detecting Conserved Sequences
Eukaryotic Comparative Genomics
June 2018 GEP Alumni Workshop

Charles Darwin
Motoo Kimura
Barak Cohen

Evolution of Neutral DNA
Evolution of Non-Neutral DNA
[Figure: example aligned sequences for each case, with asterisk-marked positions]

Multi-Species Alignment
    ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA
    ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA
    GTACATCGATAGCTTAGAATGCTGGACGATCTC
    GTACGTCGATAGCATAGAATGCTGGACGATCTC
     *    *     *   *   ***********

How to do Comparative Genomics
1. Choose species to analyze
2. Align sequences
3. Identify stretches of highly conserved nucleotides
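Step 3 of the recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's own tooling: it takes the four rows of the Multi-Species Alignment slide, marks columns where every row carries the same base, and reports runs of marked columns. The function names and the `min_len=5` threshold are assumptions chosen for the example.

```python
def conserved_mask(rows):
    """Return a string with '*' under every column where all rows agree.

    Columns containing a gap character ('-') are never marked.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return "".join(
        "*" if len(set(col)) == 1 and "-" not in col else " "
        for col in zip(*rows)
    )


def conserved_runs(mask, min_len=5):
    """Return (start, end) pairs (0-based, end exclusive) for runs of '*'
    that are at least min_len columns long. min_len is a hypothetical
    threshold, not a value taken from the slides."""
    runs, start = [], None
    for i, ch in enumerate(mask + " "):  # trailing sentinel closes a final run
        if ch == "*" and start is None:
            start = i
        elif ch != "*" and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# The four aligned rows shown on the Multi-Species Alignment slide.
alignment = [
    "ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA",
    "ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA",
    "GTACATCGATAGCTTAGAATGCTGGACGATCTC",
    "GTACGTCGATAGCATAGAATGCTGGACGATCTC",
]

mask = conserved_mask(alignment)
for row in alignment:
    print(row)
print(mask)
for start, end in conserved_runs(mask):
    print(start, end, alignment[0][start:end])
```

On this alignment the only run of five or more conserved columns is the 11-column block GCTGGACGATC. Real pipelines usually score windows statistically rather than requiring perfect identity, so treat the exact-match rule here as a simplification.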
Choose species
[Figure labels: closely related species; distantly related species]

Closely Related Species
-align well
-not many changes

Distantly Related Species
-hard to align
-lots of changes

Case Study: Coding vs. non-coding

Non-Coding DNA
-regulatory functions
-short (5-15 bp)
-degenerate
-variable spacing

Coding DNA
-codes for protein
-triplet code
-open reading frame (ORF)
-tend to be long (50-500 bp)
-highly constrained

[Diagram labels: ORF, TAA]

CASE 1: Non-Coding
[Diagram label: TAA]

Closely-related sequences are uninformative

paradoxus   TCTTCTGAGACAGCATCACTTCTTCTTNTTTTTTACATAACTTATTCTTCTATAATTTTC
cerevisiae  TCCTTTGAGACAGCATTCGCCCAGTATTTTTTTTATTCTACA-AACCTTCTATAATTT-C
            ** * ***********     *    * *******    **  *  ************ *

paradoxus   AACGTATTTACATAGTTCTGTATCAGTTTAATCACCATAATATTGTTTTCCCTCAACTAA
cerevisiae  AAAGTATTTACATAATTCTGTATCAGTTTAATCACCATAATATCGTTTTCT-----TTGT
            ** *********** **************************** ******       *

paradoxus   TGAATGCAATTAGATTTTCTTATTGTTCCCTCGCGGCTTTTTTTTGTTTTATAATCTATT
cerevisiae  TTAGTGCAATTAATTTTTCCTATTGTTACTTCG-GGCCTTTTTCTGTTTTATGAGCTATT
            * * ********  ***** ******* * *** *** ***** ******** * *****

paradoxus   TTTTCCGTCATTTCTTCCCCAGATTTCCAACTTCATCTCCAGATTGTGTCTATGTAATGC
cerevisiae  TTTTCCGTCATC-CTTCCCCAGATTTTCAGCTTCATCTCCAGATTGTGTCTACGTAATGC
            ***********  ************* ** ********************** *******

paradoxus   ATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGCTACTGTCT
cerevisiae  ACGCCATCATTTTAAGAGAGGACAGAGAAGCAAGCCTCCTGAAAGATGAAGCTACTGTCT
            * ** ***** ** *** * ** ****** *** ********** ***************
Distantly-related sequences do not align
Noncoding (Promoter)

cerevisiae  ACTTACCAT-CAAC-CATAGATGGGTAAAC---GGTTAGTAACTAGGAACACGAT
castellii   AGA-GTCAAACTTTTCGT ATA--TATATATAATATGTCTGATTGCTGGTT---T
            *     **  *    * *         *       *   * * *          *

Multiple sequence alignments reveal conserved elements

cerevisiae    TGAGACAGCAT-CACTTCTT-CTTNTTTTTTACATAACTTATTCTTCTATAATTTTCAAC
mikatae       TGAGACAGCATTCACTTCTTTCTTTTTTTTTACATATCTTATTCTTCTATAATTTTCAAC
bayanus       TGAGACAGCATTCGCCCAGT--ATTTTTTTTAT-TCTACAAACCTTCTATAATTT-CAAA
kudriavzevii  TGAGACTGCACTCCC--------TCTTCCTTTC------------TCCATAACTT---AC
              ****** ***  * *        * **  **              ** **** **   *
[Annotated elements: UAS1, UAS2]

paradoxus   GTATTTACATAGTTCTGTATCAGTTTAATCACCATAAT------ATTGTTTTCCCTCAAC
kluyveri    GTATTTACATAGTTCTGTATCAGTTTAATCACCATAAT------ATTGTTTTCCCTCAAC
cerevisiae  GTATTTACATAATTCTGTATCAGTTTAATCACCATAAT------ATCGTTTTCTTTGT--
bayanus     TTATTTACATAGTTTTGTATCAGTTTAATCACCATAATCGTAACACCGTTTTACCTCACC
             ********** ** ***********************      *  *****   *

paradoxus   TAATGAATGCAATTAGATTTTC-TTATTGTTCCC-TCGCGGCTTTTTTTTGTTTTATAAT
kluyveri    TAATGAATGCAATTAGATTTTCCTTATTGTTCCCCTCGCGGCTTTTTTTTGTTTTATAAT
cerevisiae  ---TTAGTGCAATTAATTTTTC-CTATTGTTACT-TCG-GGCCTTTTTCTGTTTTATGAG
bayanus     TGATGCGGG--A---ATCCTTC-AGACCGTTCTC-TCGCGC-------------------
               *    *  *       ***   *  ***    *** *
[Annotated elements: UES, MIG1, MIG1]

paradoxus   -CTATTTTTTCCGTCATTTCTTCCCC-AGATTTCCAACTTCAT-CTCCAGATTGTGTCTA
kluyveri    ACTATTTTTTCCGTCATTTCTTCCCCCAGATTTCCAACTTCATACTCCAGATTGTGTCTA
cerevisiae  -CTATTTTTTCCGTCATC-CTTCCCC-AGATTTTCAGCTTCAT-CTCCAGATTGTGTCTA
bayanus     -CTTTTTTTTTCGTCATTTCTTCCCC-AGATCTACAACTTTAA-CTCCAGACGGTGTATA
             ** ****** ******  ******* **** * ** *** *  *******  **** **

paradoxus   TGTAATGCATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGC
kluyveri    TGTAATGCATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGC
cerevisiae  CGTAATGCACGCCATCATTTTAAGAGAGGACAGAGAAGCAAGCCTCCTGAAAGATGAAGC
bayanus     GGCAGTACAAGCAGTGCTTTTGGGAAGAGGCAAAGCTGCAGACCTCGAGAACAATGAAGC
             * * * ** **  *  * **  **   *  * **   **  ****  ***  *******

CASE 2: Coding
[Diagram labels: CLN3, TAA]
Closely-related sequences are uninformative

Less distantly related species not informative either

Distantly related species reveal functional protein domains

Identification of Multi-Species Conserved Regions (MCS)

Human  cccattcttttccaagtgtctccg--cctgcagcgattaggttagaaagcatttctctct
Chimp  cccattcttttccaagtgtctccg--cctgcagcgattaggttagaaagcatttctctct
Mouse  ttcagtcgtttcccagtgtctctga-cattcagagactactttagtaagcattt-tctct
Rat    tcagtccttccctggcatctccag-cactcaa-gactactttagtaagcattt-tctctg
Dog    tcaatgactttcccagtctcttctactgggaagagattaggttgcaaatcatttttctct
[Conservation marker row: 8 asterisks; column positions not recoverable]

How can we decide if this region is conserved?

Margulies et al. (2003) Genome Res. 13:2507-18
It's like flipping coins (really)

Binomial-Based Method for Detecting Conserved Sequences

H:  AATGG
Mouse:  AATCG
Status: CCCDC

p = probability that a site is the same between human and mouse by chance alone (Kimura); q = 1 - p

For an alignment N base pairs long with n identities, calculate the cumulative binomial probability as:

$$P(X \ge n) = \sum_{i=n}^{N} \binom{N}{i}\, p^{i} q^{N-i}$$

Margulies et al. (2003) Genome Res. 13:2507-18

Large sequencing projects are underway

Tree Topology Influences Power
[Figure: star phylogeny vs. actual phylogeny for species A, B, C, D, E, F]
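The cumulative binomial probability from the Binomial-Based Method slide can be computed directly. The sketch below uses the five-base Human/Mouse example from that slide (N = 5, n = 4 identities). The value `p_identical = 0.5` is a placeholder chosen only to make the arithmetic easy to follow; the slides attribute p to a neutral model (Kimura) and do not give a number, and the function names are hypothetical.

```python
from math import comb


def count_identities(a, b):
    """Number of aligned positions where the two sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b))


def binomial_tail(N, n, p):
    """P(X >= n) for X ~ Binomial(N, p): the chance of seeing at least
    n identities in N aligned sites if each site matches with probability p."""
    q = 1.0 - p
    return sum(comb(N, i) * p**i * q ** (N - i) for i in range(n, N + 1))


human = "AATGG"
mouse = "AATCG"

N = len(human)
n = count_identities(human, mouse)

# Placeholder value, NOT taken from the slides or from Kimura's model.
p_identical = 0.5

print(f"N={N}, identities n={n}, P(X >= n) = {binomial_tail(N, n, p_identical):.4f}")
```

With the placeholder p = 0.5 the two terms are 5/32 and 1/32, so the tail probability is 6/32 = 0.1875. A small value of P(X >= n) under a realistic neutral p is what would flag a window as more conserved than expected by chance.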
Challenges in larger genomes
1) Deciding on the neutral rate of substitution
2) Local differences in neutral rate of substitutions
3) Multiple hypothesis testing
4) Repeat sequences and uneven base composition

PhastCons and the UCSC Browser
[Figure labels: Olig2; 100 kb upstream of Olig2]

Motif Searching Across Several Multiple Alignments
[Figure labels: Species 1, Species 2, Species 3; Gene 1, Gene 2, Gene 3, Gene N]

Information Content
EcoR1
Random
    GCCTAC
    ACATTC
    TCATTC
    CGACTC
    ATATCG
    GAAATG
Rap1
    TGTATGGGTG
    TGTTCGGATT
    TGCATGGGTG
    TGTACAGGTG
    TGTATGGATG
    TGTTCGGGTT
    TGTATGGGTG

Weight Matrix Model of TATA Box
Sequence: A C T A T A A T G T
Score = -24
A:   -8   10   -1    2    1   -8
C:  -10   -9   -3   -2   -1  -12
G:   -7   -9   -1   -1   -4   -9
T:   10   -6    9    0   -1   11
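Scoring a sequence against the weight matrix amounts to sliding a six-base window along it and summing one matrix entry per position. The sketch below uses the TATA box matrix and the sequence ACTATAATGT from the Weight Matrix slides; the function names are hypothetical, and reading each matrix column as one motif position is an assumption about the slide layout (it is consistent with the two scores the slides report).

```python
# Weight matrix from the slides: one row per base, one column per motif position.
TATA_MATRIX = {
    "A": [-8, 10, -1, 2, 1, -8],
    "C": [-10, -9, -3, -2, -1, -12],
    "G": [-7, -9, -1, -1, -4, -9],
    "T": [10, -6, 9, 0, -1, 11],
}


def score_window(matrix, window):
    """Sum the matrix entry for the observed base at each motif position."""
    return sum(matrix[base][i] for i, base in enumerate(window))


def scan(matrix, sequence):
    """Score every window of motif width; returns (start, window, score) tuples."""
    width = len(next(iter(matrix.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((start, window, score_window(matrix, window)))
    return hits


sequence = "ACTATAATGT"
hits = scan(TATA_MATRIX, sequence)
for start, window, score in hits:
    print(start, window, score)

best = max(hits, key=lambda hit: hit[2])
print("best window:", best)
```

The window CTATAA scores -24 and the window TATAAT scores 43, which matches the two scores shown on the Weight Matrix slides; TATAAT is the best-scoring window in this sequence.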
Weight Matrix Model of TATA Box
Sequence: A C T A T A A T G T
Score = 43
A:   -8   10   -1    2    1   -8
C:  -10   -9   -3   -2   -1  -12
G:   -7   -9   -1   -1   -4   -9
T:   10   -6    9    0   -1   11

Weight Matrix Model of TATA Box
N(b,i)
F(b,i)
S(b,i) = log[f(b,i)/p(b)]

Now we can compare motifs to each other

    A    C    G    T
    4   -3    5   -6
   -2   -5    2   -1
   -2   11   -1   -1
  -10    8    2   -4
    2   -3   -3    2
    1    2   -3   15

    A    C    G    T
    3   -2    2    1
    3    1    3   -1
   -2    7   -2   -1
   -8    6    3   -2
    2   -2   -1    1
    1    4   -3    9

MAGMA unaligned motif finding in multispecies conserved regions*
[Figure labels: Species 1, Species 2, Species 3; Gene 1, Gene 2, Gene 3, Gene N]

*Ihuegbu, Stormo, & Buhler, JCB 19:139, 2012
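The formula S(b,i) = log[f(b,i)/p(b)] on the Weight Matrix slide can be turned into a small worked example using the seven Rap1 sites from the Information Content slide. Several choices below are assumptions made for illustration and do not come from the slides: N(b,i) is read as a count matrix and F(b,i) as the corresponding frequencies, the background p(b) is uniform at 0.25, the logarithm is base 2, and a pseudocount of 1 is added so that bases never observed at a position do not produce log(0).

```python
from math import log2

BASES = "ACGT"

# The seven Rap1 sites listed on the Information Content slide.
RAP1_SITES = [
    "TGTATGGGTG",
    "TGTTCGGATT",
    "TGCATGGGTG",
    "TGTACAGGTG",
    "TGTATGGATG",
    "TGTTCGGGTT",
    "TGTATGGGTG",
]


def count_matrix(sites):
    """N(b,i): how often base b occurs at position i across the sites."""
    width = len(sites[0])
    counts = {b: [0] * width for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            counts[base][i] += 1
    return counts


def log_odds_matrix(counts, background, pseudocount=1.0):
    """S(b,i) = log2(f(b,i) / p(b)), with f estimated from pseudocounted counts.

    The pseudocount and the base-2 logarithm are assumptions of this sketch.
    """
    width = len(counts[BASES[0]])
    scores = {b: [] for b in BASES}
    for i in range(width):
        total = sum(counts[b][i] for b in BASES) + pseudocount * len(BASES)
        for b in BASES:
            f = (counts[b][i] + pseudocount) / total
            scores[b].append(log2(f / background[b]))
    return scores


# Uniform background: an assumption, real genomes have uneven base composition.
background = {b: 0.25 for b in BASES}

counts = count_matrix(RAP1_SITES)
scores = log_odds_matrix(counts, background)

for b in BASES:
    print(b, " ".join(f"{s:6.2f}" for s in scores[b]))
```

Positions where all seven sites agree (for example the leading T and G) get the largest positive scores for the observed base and negative scores for the other three. Matrices built this way share a common scale, which is what makes motifs comparable to each other.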