Eukaryotic Comparative Genomics
Detecting Conserved Sequences Charles Darwin Motoo Kimura
Evolution of Neutral DNA
[Figure: pairwise alignment of two sequences; asterisks mark positions that are identical in both]
Evolution of Non-Neutral DNA
[Figure: pairwise alignment of two sequences; asterisks mark positions that are identical in both]
Multi-Species Alignment
ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA
ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA
GTACATCGATAGCTTAGAATGCTGGACGATCTC
GTACGTCGATAGCATAGAATGCTGGACGATCTC
 *    *     *   *   ***********
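The asterisk row under a multi-species alignment can be derived mechanically: a column gets a star when every species shows the same base. The sketch below is a minimal illustration of that convention; the function name `conservation_line` is hypothetical, and treating gap columns as non-conserved is an assumption.

```python
def conservation_line(rows):
    """Return a string with '*' under every column where all rows agree."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        # Assumption: a column made only of gap characters is not conserved.
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


if __name__ == "__main__":
    alignment = [
        "ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA",
        "ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA",
        "GTACATCGATAGCTTAGAATGCTGGACGATCTC",
        "GTACGTCGATAGCATAGAATGCTGGACGATCTC",
    ]
    for row in alignment:
        print(row)
    print(conservation_line(alignment))
```

Run on the four sequences of this slide, the function marks four isolated columns and one run of eleven conserved columns.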
How to do Comparative Genomics
1. Choose species to analyze
2. Align sequences
3. Identify stretches of highly conserved nucleotides
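Step 3 can be sketched as a sliding-window scan over an existing alignment. This is only one possible way to find conserved stretches; the function name `conserved_windows`, the window size, and the identity threshold are all illustrative assumptions, not values from the slides.

```python
def conserved_windows(rows, window=10, min_fraction=0.9):
    """Return (start, fraction) for each window whose share of fully
    identical columns is at least min_fraction. Positions are 0-based."""
    # One boolean per alignment column: True when every row has the same base.
    flags = [len(set(col)) == 1 and col[0] != "-" for col in zip(*rows)]
    hits = []
    for start in range(len(flags) - window + 1):
        fraction = sum(flags[start:start + window]) / window
        if fraction >= min_fraction:
            hits.append((start, fraction))
    return hits


if __name__ == "__main__":
    alignment = [
        "ATGTGGCGCAGCCTGTGCCAGCTGGACGATCGA",
        "ATGTAGCCTAGCCAGTGCCAGCTGGACGATCGA",
        "GTACATCGATAGCTTAGAATGCTGGACGATCTC",
        "GTACGTCGATAGCATAGAATGCTGGACGATCTC",
    ]
    for start, fraction in conserved_windows(alignment):
        print(start, fraction)
```

Overlapping hits would normally be merged into a single region; that step is left out to keep the sketch short.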
Choose species
Closely Related Species
-align well
-not many changes
Distantly Related Species
-hard to align
-lots of changes
[Figure: phylogenetic tree of yeast species with divergence-time markers (~10 Mya, ~20 Mya, ~150 Mya, >350 Mya). Species shown: S. cerevisiae, S. cariocanus, S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus, S. pastorianus, S. servazzii, S. unisporus, S. exiguus, S. dairenensis, S. castellii, S. kluyveri, Kluyveromyces lactis, Schizosaccharomyces pombe]
Case Study: Coding vs. Non-Coding
[Gene diagram: ATG, ORF, TAA]
Non-Coding DNA
-regulatory functions
-short (5-15 bp)
-degenerate
-variable spacing
Coding DNA
-codes for protein
-triplet code
-open reading frame (ORF)
-tend to be long (50-500 bp)
-highly constrained
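The triplet code and the ATG ... stop structure of an open reading frame can be made concrete with a small scanner. This is a simplified sketch, not a gene finder: it reads only the forward strand, the function name `find_orfs` and the `min_codons` cutoff are hypothetical, and the stop-codon set assumes the standard genetic code.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs, 0-based and end-exclusive, for forward-strand
    ORFs that begin with ATG and end with the first in-frame stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                # Read codon by codon until the first in-frame stop.
                for j in range(i, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j  # resume scanning after this stop codon
                        break
                else:
                    break  # no stop codon left in this frame
            i += 3
    return orfs


if __name__ == "__main__":
    example = "CCATGAAATTTTAAGG"
    for start, end in find_orfs(example):
        print(start, end, example[start:end])
```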
CASE 1: Non-Coding ATG GAL4 TAA
Closely-related sequences are uninformative
[Gene diagram: ATG, GAL4]
paradoxus  TCTTCTGAGACAGCATCACTTCTTCTTNTTTTTTACATAACTTATTCTTCTATAATTTTC
cerevisiae TCCTTTGAGACAGCATTCGCCCAGTATTTTTTTTATTCTACA-AACCTTCTATAATTT-C
           ** * ***********     *    * *******    **  *  ************ *

paradoxus  AACGTATTTACATAGTTCTGTATCAGTTTAATCACCATAATATTGTTTTCCCTCAACTAA
cerevisiae AAAGTATTTACATAATTCTGTATCAGTTTAATCACCATAATATCGTTTTCT-----TTGT
           ** *********** **************************** ******       *

paradoxus  TGAATGCAATTAGATTTTCTTATTGTTCCCTCGCGGCTTTTTTTTGTTTTATAATCTATT
cerevisiae TTAGTGCAATTAATTTTTCCTATTGTTACTTCG-GGCCTTTTTCTGTTTTATGAGCTATT
           * * ********  ***** ******* * *** *** ***** ******** * *****

paradoxus  TTTTCCGTCATTTCTTCCCCAGATTTCCAACTTCATCTCCAGATTGTGTCTATGTAATGC
cerevisiae TTTTCCGTCATC-CTTCCCCAGATTTTCAGCTTCATCTCCAGATTGTGTCTACGTAATGC
           ***********  ************* ** ********************** *******

paradoxus  ATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGCTACTGTCT
cerevisiae ACGCCATCATTTTAAGAGAGGACAGAGAAGCAAGCCTCCTGAAAGATGAAGCTACTGTCT
           * ** ***** ** *** * ** ****** *** ********** ***************
Distantly-related sequences do not align
[Gene diagram: ATG, GAL4; region shown: noncoding (promoter)]
cerevisiae ACTTACCAT-CAAC-CATAGATGGGTAAAC---GGTTAGTAACTAGGAACACGAT
castellii  AGA-GTCAAACTTTTCGT ATA--TATATATAATATGTCTGATTGCTGGTT---T
           *     **  *    * *         *       *   * * *          *
Multiple sequence alignments reveal conserved elements
[Gene diagram and annotated elements: UAS1, UAS2, UES, MIG1, MIG1, ATG, GAL4]
cerevisiae   TGAGACAGCAT-CACTTCTT-CTTNTTTTTTACATAACTTATTCTTCTATAATTTTCAAC
mikatae      TGAGACAGCATTCACTTCTTTCTTTTTTTTTACATATCTTATTCTTCTATAATTTTCAAC
bayanus      TGAGACAGCATTCGCCCAGT--ATTTTTTTTAT-TCTACAAACCTTCTATAATTT-CAAA
kudriavzevii TGAGACTGCACTCCC--------TCTTCCTTTC------------TCCATAACTT---AC
             ****** ***  * *        * **  **              ** **** **   *

paradoxus    GTATTTACATAGTTCTGTATCAGTTTAATCACCATAAT------ATTGTTTTCCCTCAAC
kluyveri     GTATTTACATAGTTCTGTATCAGTTTAATCACCATAAT------ATTGTTTTCCCTCAAC
cerevisiae   GTATTTACATAATTCTGTATCAGTTTAATCACCATAAT------ATCGTTTTCTTTGT--
bayanus      TTATTTACATAGTTTTGTATCAGTTTAATCACCATAATCGTAACACCGTTTTACCTCACC
              ********** ** ***********************      *  *****   *

paradoxus    TAATGAATGCAATTAGATTTTC-TTATTGTTCCC-TCGCGGCTTTTTTTTGTTTTATAAT
kluyveri     TAATGAATGCAATTAGATTTTCCTTATTGTTCCCCTCGCGGCTTTTTTTTGTTTTATAAT
cerevisiae   ---TTAGTGCAATTAATTTTTC-CTATTGTTACT-TCG-GGCCTTTTTCTGTTTTATGAG
bayanus      TGATGCGGG--A---ATCCTTC-AGACCGTTCTC-TCGCGC-------------------
                *    *  *       ***   *  ***    *** *

paradoxus    -CTATTTTTTCCGTCATTTCTTCCCC-AGATTTCCAACTTCAT-CTCCAGATTGTGTCTA
kluyveri     ACTATTTTTTCCGTCATTTCTTCCCCCAGATTTCCAACTTCATACTCCAGATTGTGTCTA
cerevisiae   -CTATTTTTTCCGTCATC-CTTCCCC-AGATTTTCAGCTTCAT-CTCCAGATTGTGTCTA
bayanus      -CTTTTTTTTTCGTCATTTCTTCCCC-AGATCTACAACTTTAA-CTCCAGACGGTGTATA
              ** ****** ******  ******* **** * ** *** *  *******  **** **

paradoxus    TGTAATGCATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGC
kluyveri     TGTAATGCATGCTATCATATTGAGAAAAGATAGAGAAACAACCCTCCTGAAAAATGAAGC
cerevisiae   CGTAATGCACGCCATCATTTTAAGAGAGGACAGAGAAGCAAGCCTCCTGAAAGATGAAGC
bayanus      GGCAGTACAAGCAGTGCTTTTGGGAAGAGGCAAAGCTGCAGACCTCGAGAACAATGAAGC
              * * * ** **  *  * **  **   *  * **   **  ****  ***  *******
CASE 2: Coding ATG CLN3 TAA
Closely-related sequences are uninformative
Less distantly related species are not informative either
Distantly related species reveal functional protein domains
Identification of Multi-Species Conserved Regions (MCS)
Human cccattcttttccaagtgtctccg--cctgcagcgattaggttagaaagcatttctctct
Chimp cccattcttttccaagtgtctccg--cctgcagcgattaggttagaaagcatttctctct
Mouse ttcagtcgtttcccagtgtctctga-cattcagagactactttagtaagcattt-tctct
Rat   tcagtccttccctggcatctccag-cactcaa-gactactttagtaagcattt-tctctg
Dog   tcaatgactttcccagtctcttctactgggaagagattaggttgcaaatcatttttctct
      * * * * * * **
How can we decide if this region is conserved?
Margulies et al. (2003) Genome Res. 13:2507-18
Binomial-Based Method for Detection of MCS
Human:  AATGG
Mouse:  AATCG
Status: CCCDC
p = chance that a site is the same between human and mouse, q = 1 - p
For an alignment N base pairs long with n identities, calculate the cumulative binomial probability as:
$P(X \ge n) = \sum_{i=n}^{N} \binom{N}{i} \, p^{i} \, q^{N-i}$
Margulies et al. (2003) Genome Res. 13:2507-18
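The cumulative binomial probability on this slide is easy to evaluate directly. The sketch below sums the upper tail from n to N; the function name `binomial_tail` is hypothetical, and the value of p used in the example is a placeholder chosen for illustration, not an estimate from the paper.

```python
from math import comb


def binomial_tail(n_identical, length, p):
    """P(X >= n_identical) for X ~ Binomial(length, p), where p is the
    per-site chance of identity and q = 1 - p."""
    q = 1.0 - p
    return sum(
        comb(length, i) * p**i * q**(length - i)
        for i in range(n_identical, length + 1)
    )


if __name__ == "__main__":
    # Slide example: N = 5 aligned sites, n = 4 identities (status CCCDC).
    # p = 0.5 is an arbitrary illustrative value.
    print(binomial_tail(4, 5, 0.5))
```

A small tail probability means that many identities are unlikely under the background identity rate p, which is the sense in which a window is called conserved.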
How to score human-mouse conservation?
$\text{score} = \dfrac{M - \mu}{\sigma}$
1) Look at 50 bp windows that align
2) M is the number of identical bases in a particular 50 bp alignment
3) μ is the average number of identical residues in 50 bp alignments of local ancient, syntenic repeats (neutral)
4) σ is the standard deviation of μ
Nature (2002) 420: 520-62
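The score is a z-score of each window against the neutral background. The sketch below follows the four steps on the slide under some assumptions: windows are non-overlapping, the neutral mean and standard deviation are estimated from a supplied list of aligned neutral window pairs, and the population standard deviation is used. All function names are hypothetical.

```python
from statistics import mean, pstdev


def count_identities(a, b):
    """Number of aligned positions with the same base (gaps never count)."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def neutral_parameters(neutral_pairs):
    """Estimate mu and sigma from aligned neutral windows (e.g. ancient repeats)."""
    counts = [count_identities(a, b) for a, b in neutral_pairs]
    return mean(counts), pstdev(counts)


def conservation_score(m, mu, sigma):
    """score = (M - mu) / sigma, as on the slide."""
    return (m - mu) / sigma


def score_windows(human, mouse, mu, sigma, window=50):
    """Score consecutive non-overlapping windows of an aligned sequence pair."""
    scores = []
    for start in range(0, len(human) - window + 1, window):
        m = count_identities(human[start:start + window],
                             mouse[start:start + window])
        scores.append(conservation_score(m, mu, sigma))
    return scores


if __name__ == "__main__":
    # Toy input: the first window is fully identical, the second fully different.
    human = "A" * 100
    mouse = "A" * 50 + "C" * 50
    print(score_windows(human, mouse, mu=30, sigma=5))
```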
5% Conserved between Human-Mouse
[Figure legend: red = neutral; blue = observed genomic; gray = estimated selection]
(20% of windows under selection) × (25% of bp in alignments) = 5%
Nature (2002) 420: 520-62
What does 5% conservation mean?
Only 1.5% of the genome is coding sequence
5' UTRs, 3' UTRs, promoters, and introns do not make up the difference
Problem with resolution
Answer: Sequence more genomes (maybe)!
Eddy 2005: Binomial model for power calculations
Tree Topology Influences Power
[Figure: star phylogeny compared with the actual phylogeny for species A-F]
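One way to see why more genomes can improve resolution is to ask how often a neutral column looks conserved purely by chance. The sketch below is an illustration only and is not the model from Eddy 2005: it assumes a star phylogeny in which every species sits at the same branch length t (expected substitutions per site) from the common ancestor, and it assumes Jukes-Cantor substitution. All function names are hypothetical.

```python
import math


def jc69_same_base(t):
    """Jukes-Cantor probability that a leaf shows the ancestral base."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)


def jc69_given_other_base(t):
    """Jukes-Cantor probability that a leaf shows one specific non-ancestral base."""
    return 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)


def neutral_column_identical(n_species, t):
    """Chance that all leaves of a star phylogeny share one base at a neutral site:
    either all kept the ancestral base, or all changed to the same other base."""
    a = jc69_same_base(t)
    b = jc69_given_other_base(t)
    return a**n_species + 3 * b**n_species


def chance_run_probability(n_species, t, length):
    """Chance that `length` independent neutral columns are all identical."""
    return neutral_column_identical(n_species, t) ** length


if __name__ == "__main__":
    for n in (2, 4, 6):
        print(n, chance_run_probability(n, t=0.3, length=8))
```

Under these assumptions the chance of a spurious 8 bp conserved block falls as species are added. A real phylogeny has shared internal branches, so its leaves are less independent than in the star case.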
Ultraconserved Sequences
481 sequences longer than 200 bp are 100% identical between orthologous regions of human, mouse, and rat
Most are conserved at 99% in chicken and dog too
5000 sequences longer than 100 bp are 100% identical in these species
Bejerano et al. (2004) Science 304: 1321-1325
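The definition used here (runs of 100% identity above a length cutoff) translates into a simple scan over an alignment. The sketch below is a minimal illustration and not the pipeline used by Bejerano et al.; the function name `identical_runs` is hypothetical and the input is assumed to be an existing multiple alignment of orthologous regions.

```python
def identical_runs(rows, min_length=200):
    """Return (start, end) pairs, 0-based and end-exclusive, for maximal runs
    of columns that are identical in every row and at least min_length long."""
    runs = []
    run_start = None
    for i, column in enumerate(zip(*rows)):
        identical = len(set(column)) == 1 and column[0] != "-"
        if identical and run_start is None:
            run_start = i  # a new run begins here
        elif not identical and run_start is not None:
            if i - run_start >= min_length:
                runs.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the alignment.
    if run_start is not None and len(rows[0]) - run_start >= min_length:
        runs.append((run_start, len(rows[0])))
    return runs


if __name__ == "__main__":
    toy = ["ACGTACGT", "ACGAACGT", "ACGTACGT"]
    print(identical_runs(toy, min_length=3))
```

For the slide's criteria, `min_length` would be 200 for the 481 elements and 100 for the larger set.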
Olig2 100 Kb upstream of Olig2
So what do they do?