Why PAM Works. An In-Depth Look at Scoring Matrices and Algorithms. Michael Darling Nazareth College. The Origin: Sequence Alignment

Scoring is used in an evolutionary sense
Compare protein sequences to find interspecies relationships
Also find protein relationships within an organism
Helps to explain life as we know it

Introduction: Scoring Sequences
Find the relationship between sequences
Compare matching and mismatching amino acids
Simple example: sequences ACTCCA and GCACTA
First, align the sequences one above the other:
    ACTCCA
    GCACTA
Use the scoring system Match = 1, Mismatch = -1
The first column is a mismatch, the second is a match, and so on
There are 3 matches and 3 mismatches, giving a score of -1+1-1+1-1+1 = 0
With no mismatch penalty, the score is 3

Gap Penalties
Inserting gaps can give a better alignment
Gaps account for insertions or deletions
Penalties subtract from the score for the start of a gap (origin penalty) and for its length (length penalty)
Consider the same example, but add gaps:
    --ACTCCA
    GCACT--A
The two gaps create a fourth match, but whether the score improves depends on the scale of the penalty
If the origin penalty is -1 and the length penalty is 0, we have 4 matches minus 2 gaps, giving a score of 2
If the origin penalty is -1 and the length penalty is length/2, we have 4 matches minus 2 gaps of length 2 (minus one more each), bringing the score back to zero
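The scoring arithmetic above can be checked with a short sketch. The function name `score_alignment` and its parameters are illustrative, not from the slides. The gapped alignment is assumed to be `--ACTCCA` over `GCACT--A`, the placement that yields the four matches and two length-2 gaps described. The length penalty is modeled as a per-position cost, so "length/2" becomes -0.5 per gap position.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-1, gap_extend=0):
    """Score two already-aligned strings of equal length ('-' marks a gap).

    Each run of gaps costs gap_open once (the "origin" penalty) plus
    gap_extend for every position in the run (the "length" penalty).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap_row = None  # row (0 or 1) holding the gap in the previous column
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if x == "-" or y == "-":
            row = 0 if x == "-" else 1
            if row != prev_gap_row:
                score += gap_open  # a new gap starts here
            score += gap_extend
            prev_gap_row = row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score


# Ungapped example from the slide
print(score_alignment("ACTCCA", "GCACTA"))
print(score_alignment("ACTCCA", "GCACTA", mismatch=0))

# Gapped example (assumed gap placement, see the note above)
print(score_alignment("--ACTCCA", "GCACT--A", gap_open=-1, gap_extend=0))
print(score_alignment("--ACTCCA", "GCACT--A", gap_open=-1, gap_extend=-0.5))
```

Under these assumptions the four calls reproduce the scores worked out above: 0, 3, 2, and back to zero.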

Creation of a Scoring Matrix
Start by deriving a PAM (Accepted Point Mutation) matrix
$\text{Ratio} = M_{ab} / p_b$, where $M_{ab}$ is the probability that the substitution can happen in nature and $p_b$ is the frequency of occurrence of $b$
Use data to find the probability of amino acid substitutions in protein sequences
Divide the number of substitutions of a specific amino acid by its relative mutability: $P = \text{Substitutions} / \text{Rel. Mutability}$
Relative mutability is the total number of times the amino acid is substituted by ANY other amino acid
Normalize this value with the probability of occurrence of the amino acids
The resulting value is the probability that a column amino acid is substituted by its corresponding row amino acid at 1% divergence
This 1% substitution rate corresponds to one PAM unit, thus giving us PAM-1

PAM-1
   A      R      N      D      C      Q      E      G      H      I      L      K      M      F      P      S      T      W      Y      V
A  0.9867 0.0002 0.0009 0.0010 0.0003 0.0008 0.0017 0.0021 0.0002 0.0006 0.0004 0.0002 0.0006 0.0002 0.0022 0.0035 0.0032 0.0000 0.0002 0.0018
R  0.0001 0.9913 0.0001 0.0000 0.0001 0.0010 0.0000 0.0000 0.0010 0.0003 0.0001 0.0019 0.0004 0.0001 0.0004 0.0006 0.0001 0.0008 0.0000 0.0001
N  0.0004 0.0001 0.9822 0.0036 0.0000 0.0004 0.0006 0.0006 0.0021 0.0003 0.0001 0.0013 0.0000 0.0001 0.0002 0.0020 0.0009 0.0001 0.0004 0.0001
D  0.0006 0.0000 0.0042 0.9859 0.0000 0.0006 0.0053 0.0006 0.0004 0.0001 0.0000 0.0003 0.0000 0.0000 0.0001 0.0005 0.0003 0.0000 0.0000 0.0001
C  0.0001 0.0001 0.0000 0.0000 0.9973 0.0000 0.0000 0.0000 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0001 0.0005 0.0001 0.0000 0.0003 0.0002
Q  0.0003 0.0009 0.0004 0.0005 0.0000 0.9876 0.0027 0.0001 0.0023 0.0001 0.0003 0.0006 0.0004 0.0000 0.0006 0.0002 0.0002 0.0000 0.0000 0.0001
E  0.0010 0.0000 0.0007 0.0056 0.0000 0.0035 0.9865 0.0004 0.0002 0.0003 0.0001 0.0004 0.0001 0.0000 0.0003 0.0004 0.0002 0.0000 0.0001 0.0002
G  0.0021 0.0001 0.0012 0.0011 0.0001 0.0003 0.0007 0.9935 0.0001 0.0000 0.0001 0.0002 0.0001 0.0001 0.0003 0.0021 0.0003 0.0000 0.0000 0.0005
H  0.0001 0.0008 0.0018 0.0003 0.0001 0.0020 0.0001 0.0000 0.9912 0.0000 0.0001 0.0001 0.0000 0.0002 0.0003 0.0001 0.0001 0.0001 0.0004 0.0001
I  0.0002 0.0002 0.0003 0.0001 0.0002 0.0001 0.0002 0.0000 0.0000 0.9872 0.0009 0.0002 0.0021 0.0007 0.0000 0.0001 0.0007 0.0000 0.0001 0.0033
L  0.0003 0.0001 0.0003 0.0000 0.0000 0.0006 0.0001 0.0001 0.0004 0.0022 0.9947 0.0002 0.0045 0.0013 0.0003 0.0001 0.0003 0.0004 0.0002 0.0015
K  0.0002 0.0037 0.0025 0.0006 0.0000 0.0012 0.0007 0.0002 0.0002 0.0004 0.0001 0.9926 0.0020 0.0000 0.0003 0.0008 0.0011 0.0000 0.0001 0.0001
M  0.0001 0.0001 0.0000 0.0000 0.0000 0.0002 0.0000 0.0000 0.0000 0.0005 0.0008 0.0004 0.9874 0.0001 0.0000 0.0001 0.0002 0.0000 0.0000 0.0004
F  0.0001 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0008 0.0006 0.0000 0.0004 0.9946 0.0000 0.0002 0.0001 0.0003 0.0028 0.0000
P  0.0013 0.0005 0.0002 0.0001 0.0001 0.0008 0.0003 0.0002 0.0005 0.0001 0.0002 0.0002 0.0001 0.0001 0.9926 0.0012 0.0004 0.0000 0.0000 0.0002
S  0.0028 0.0011 0.0034 0.0007 0.0011 0.0004 0.0006 0.0016 0.0002 0.0002 0.0001 0.0007 0.0004 0.0003 0.0017 0.9840 0.0038 0.0005 0.0002 0.0002
T  0.0022 0.0002 0.0013 0.0004 0.0001 0.0003 0.0002 0.0002 0.0001 0.0011 0.0002 0.0008 0.0006 0.0001 0.0005 0.0032 0.9871 0.0000 0.0002 0.0009
W  0.0000 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0000 0.0001 0.0000 0.9976 0.0001 0.0000
Y  0.0001 0.0000 0.0003 0.0000 0.0003 0.0000 0.0001 0.0000 0.0004 0.0001 0.0001 0.0000 0.0000 0.0021 0.0000 0.0001 0.0001 0.0002 0.9945 0.0001
V  0.0013 0.0002 0.0001 0.0001 0.0003 0.0002 0.0002 0.0003 0.0003 0.0057 0.0011 0.0001 0.0017 0.0001 0.0003 0.0002 0.0010 0.0000 0.0002 0.9901

This PAM-1 matrix was obtained from M.O. Dayhoff and colleagues, 1978
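One way to turn substitution counts into a PAM-1-style matrix is sketched below. It assumes the commonly described Dayhoff-style construction: off-diagonal entries are proportional to pair counts, and a single scale factor is chosen so that 1% of residues change overall. The slides do not spell out these steps, so the function name `pam1_from_counts`, the variable names, and the three-residue toy counts are all hypothetical and are not real data.

```python
def pam1_from_counts(counts, occurrences, target=0.01):
    """Build a mutation probability matrix from toy substitution counts.

    counts[i][j]   : observed substitutions between residues i and j
                     (symmetric, diagonal ignored)
    occurrences[j] : how often residue j appears in the data
    Returns (M, freq), where M[i][j] is the probability that column
    residue j is replaced by row residue i, and freq[j] is the
    frequency of residue j.
    """
    n = len(counts)
    total = sum(occurrences)
    freq = [c / total for c in occurrences]
    # Total substitutions involving each residue (assumed non-zero)
    changes = [sum(counts[i][j] for i in range(n) if i != j) for j in range(n)]
    # Relative mutability here: changes per occurrence of the residue
    mutability = [changes[j] / occurrences[j] for j in range(n)]
    # Scale so that, averaged over residues, `target` of positions change
    scale = target / sum(freq[j] * mutability[j] for j in range(n))
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i == j:
                M[i][j] = 1.0 - scale * mutability[j]
            else:
                M[i][j] = scale * mutability[j] * counts[i][j] / changes[j]
    return M, freq


# Hypothetical counts for a three-letter alphabet (illustration only)
toy_counts = [[0, 30, 10],
              [30, 0, 20],
              [10, 20, 0]]
toy_occurrences = [1000, 800, 1200]
M, freq = pam1_from_counts(toy_counts, toy_occurrences)
for row in M:
    print(["%.4f" % value for value in row])
```

As in the PAM-1 table above, each column of the result sums to 1, and the diagonal entries stay close to 1 because only about 1% of residues change.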

PAM Matrices on a Larger Scale
The number attached to a PAM matrix, its PAM unit, represents the number of mutations per 100 residues
How do we come up with PAM-2, PAM-3, PAM-100, PAM-250?
Just use matrix multiplication: $\text{PAM-}x = (\text{PAM-1})^x$
Questions to consider:
» For PAM-2, why do we use matrix multiplication and not just square each entry?
» By the laws of probability, shouldn't the probability of A remaining A after one substitution, squared, be the probability of A remaining A after two substitutions?

Why $\text{PAM-}x = (\text{PAM-1})^x$
PAM matrices consider probabilities of mutations
Matrix multiplication, simple example: Alanine (A) and Arginine (R)
» Assume P(A-A) = .95, P(A-R) = .04, P(R-A) = .05, P(R-R) = .87
$$
\begin{pmatrix} .95 & .05 \\ .04 & .87 \end{pmatrix}
\begin{pmatrix} .95 & .05 \\ .04 & .87 \end{pmatrix}
=
\begin{pmatrix}
(.95)(.95) + (.05)(.04) & (.95)(.05) + (.05)(.87) \\
(.04)(.95) + (.87)(.04) & (.04)(.05) + (.87)(.87)
\end{pmatrix}
$$
» From this we can see that PAM takes into consideration not only A remaining A through 2 substitutions, but adds to this the probability that A is substituted by R, which is in turn substituted by A, thus making the original A an A again after 2 mutations.
Note: Numbers used in the example are arbitrary
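The difference between multiplying matrices and squaring each entry can be seen numerically with the arbitrary A/R numbers from the example. This is a minimal sketch: the helper names `matmul` and `matrix_power` are illustrative, and the matrix is assumed to be laid out with the original residue in the column and the replacement in the row.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def matrix_power(M, x):
    """Return M multiplied by itself x times (x >= 0)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = matmul(result, M)
    return result


# Columns are the original residue (A, R); rows are the replacement (A, R)
pam1_toy = [[0.95, 0.05],
            [0.04, 0.87]]

pam2_toy = matrix_power(pam1_toy, 2)
entrywise = [[value ** 2 for value in row] for row in pam1_toy]

print("matrix product, A stays A:", round(pam2_toy[0][0], 4))
print("entry squared,  A stays A:", round(entrywise[0][0], 4))
```

The matrix product gives .9045 for A ending as A, while squaring the entry gives only .9025. The missing .002 is the path A to R and back to A, (.05)(.04).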

Log-Odds Score Matrix
Odds ratio: $R = \dfrac{M_{ab}}{p_b}$
» where $R$ is our desired ratio, $M_{ab}$ is the probability that the mutation is accepted by nature, and $p_b$ is the frequency of occurrence of $b$
Dayhoff took 10 times the log of this result to get a score for each mutation: $S(a, b) = 10 \log_{10}(R)$
» Logs are used so that scores can be summed
» Adding log scores is more efficient than multiplying ratios at each position

Quick Overview: Needleman & Wunsch
Set up a matrix to optimize the alignment
Want to minimize gaps but maximize matches
4 possibilities at each point in the matrix:
» Match, Mismatch, Gap in Seq. 1, Gap in Seq. 2
Scores are entered into the matrix, looking to maximize the score
Quick Model of Process: [figure: alignment grid with Seq. 1 and Seq. 2 along its axes]
Diagonal lines are alignments, vertical lines are gaps in sequence 1, horizontal lines are gaps in sequence 2
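Both ideas on this slide can be sketched in a few lines. `log_odds_score` applies the score formula with a base-10 log. `needleman_wunsch` fills the alignment matrix and traces back one optimal path. Several things here are assumptions for illustration and not taken from the slides: the function names, the simple match/mismatch scores in place of a PAM log-odds table, the single linear gap penalty (no separate origin penalty), and the grid orientation (sequence 1 down the rows, sequence 2 across the columns), which may differ from the slide's figure.

```python
import math


def log_odds_score(m_ab, p_b):
    """Dayhoff-style score: 10 * log10 of the odds ratio M_ab / p_b."""
    return 10 * math.log10(m_ab / p_b)


def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (best score, aligned s1, aligned s2).
    """
    n, m = len(s1), len(s2)
    # F[i][j] = best score for aligning s1[:i] with s2[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + pair,  # match or mismatch
                          F[i - 1][j] + gap,       # gap in s2
                          F[i][j - 1] + gap)       # gap in s1
    # Trace back from the bottom-right corner to recover one alignment
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair:
            a1.append(s1[i - 1])
            a2.append(s2[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = needleman_wunsch("ACTCCA", "GCACTA")
print(score)
print(top)
print(bottom)
```

A positive log-odds score means the pair occurs more often than chance would predict, and a negative score means less often. To score proteins with PAM, the `match`/`mismatch` lookup would be replaced by a table of `S(a, b)` values.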

Sources
Krane, Dan E. and Michael L. Raymer. Fundamental Concepts of Bioinformatics. San Francisco: Pearson Education Inc, 2003.
Pevsner, Jonathan. Bioinformatics and Functional Genomics. New Jersey: John Wiley & Sons, 2003.