Why PAM Works: An In-Depth Look at Scoring Matrices and Algorithms
Michael Darling, Nazareth College


The Origin: Sequence Alignment
Scoring is used in an evolutionary sense.
Compare protein sequences to find interspecies relationships.
Also find protein relationships within an organism.
Helps to explain life as we know it.

Introduction: Scoring Sequences
Find the relationship between sequences.
Compare matching and non-matching amino acids.
Simple example: sequences ACTCCA and GCACTA.
First, align the sequences side by side:
    ACTCCA
    GCACTA
Use the scoring system Match = 1, Mismatch = -1.
The first column is a mismatch, the second is a match, and so on.
There are 3 matches and 3 mismatches, giving a score of -1 + 1 - 1 + 1 - 1 + 1 = 0.
With no mismatch penalty, the score is 3.

Gap Penalties
Inserting gaps can give a better alignment.
Gaps account for insertions or deletions.
Penalties subtract from the score for the start of a gap, as well as for its length.
Consider the same example, but add gaps:
    --ACTCCA
    GCACT--A
Creating two gaps produces a fourth match, but the score may not change, depending on the scale of the penalty.
If the gap-opening penalty is -1 and the length penalty is 0, we have 4 matches minus 2 gaps, giving a score of 2.
If the gap-opening penalty is -1 and the length penalty is length/2, we have 4 matches minus 2 gaps of length 2 (minus one more each for length), bringing the score back to zero.
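The arithmetic on these two slides can be checked with a short Python sketch. The function names `score_columns` and `score_gapped` are hypothetical, and the gapped rows `--ACTCCA` / `GCACT--A` are an assumed reading of the slide's alignment, chosen because it yields the stated 4 matches and 2 gaps of length 2.

```python
def score_columns(seq1, seq2, match=1, mismatch=-1):
    """Score two equal-length, gap-free sequences column by column."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(seq1, seq2))


def score_gapped(row1, row2, match=1, mismatch=-1, gap_open=1, per_residue=0.0):
    """Score a gapped alignment in which '-' marks a gap position.

    Each run of '-' costs gap_open + per_residue * run_length,
    which is the penalty scheme described on the slide.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    score = 0.0
    # Score only the columns where both rows hold a residue.
    for a, b in zip(row1, row2):
        if a != "-" and b != "-":
            score += match if a == b else mismatch
    # Charge one penalty per gap run in each row.
    for row in (row1, row2):
        run = 0
        for ch in row + "x":  # sentinel character closes a trailing gap run
            if ch == "-":
                run += 1
            elif run:
                score -= gap_open + per_residue * run
                run = 0
    return score


ungapped = score_columns("ACTCCA", "GCACTA")                 # match=1, mismatch=-1
no_mismatch_penalty = score_columns("ACTCCA", "GCACTA", mismatch=0)
# Assumed gapped alignment (reconstruction, not confirmed by the slide text).
flat_gap = score_gapped("--ACTCCA", "GCACT--A")              # length penalty 0
length_gap = score_gapped("--ACTCCA", "GCACT--A", per_residue=0.5)  # length/2
```

Under these assumptions the four values come out as 0, 3, 2 and 0, matching the scores quoted on the slides.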

Creation of a Scoring Matrix
Start by deriving a PAM (Accepted Point Mutation) matrix:

$\text{Ratio} = \dfrac{M_{ab}}{p_b}$

where $M_{ab}$ is the probability that the substitution can happen in nature and $p_b$ is the frequency of occurrence of $b$.
Use data to find the probability of protein substitutions in sequences.
Divide the number of substitutions of a specific amino acid by its relative mutability:

$P = \dfrac{\text{Substitutions}}{\text{Relative mutability}}$

Relative mutability is the total number of times the amino acid is substituted by ANY other amino acid.
Normalize this value with the probability of occurrence of the amino acids.
The resulting value is the probability that a column amino acid is substituted by its corresponding row amino acid with 1% divergence.
This 1% substitution rate corresponds to one PAM unit, thus giving us PAM-1.

PAM-1 (columns: original amino acid; rows: replacement amino acid)

   A       R       N       D       C       Q       E       G       H       I       L       K       M       F       P       S       T       W       Y       V
A  0.9867  0.0002  0.0009  0.0010  0.0003  0.0008  0.0017  0.0021  0.0002  0.0006  0.0004  0.0002  0.0006  0.0002  0.0022  0.0035  0.0032  0.0000  0.0002  0.0018
R  0.0001  0.9913  0.0001  0.0000  0.0001  0.0010  0.0000  0.0000  0.0010  0.0003  0.0001  0.0019  0.0004  0.0001  0.0004  0.0006  0.0001  0.0008  0.0000  0.0001
N  0.0004  0.0001  0.9822  0.0036  0.0000  0.0004  0.0006  0.0006  0.0021  0.0003  0.0001  0.0013  0.0000  0.0001  0.0002  0.0020  0.0009  0.0001  0.0004  0.0001
D  0.0006  0.0000  0.0042  0.9859  0.0000  0.0006  0.0053  0.0006  0.0004  0.0001  0.0000  0.0003  0.0000  0.0000  0.0001  0.0005  0.0003  0.0000  0.0000  0.0001
C  0.0001  0.0001  0.0000  0.0000  0.9973  0.0000  0.0000  0.0000  0.0001  0.0001  0.0000  0.0000  0.0000  0.0000  0.0001  0.0005  0.0001  0.0000  0.0003  0.0002
Q  0.0003  0.0009  0.0004  0.0005  0.0000  0.9876  0.0027  0.0001  0.0023  0.0001  0.0003  0.0006  0.0004  0.0000  0.0006  0.0002  0.0002  0.0000  0.0000  0.0001
E  0.0010  0.0000  0.0007  0.0056  0.0000  0.0035  0.9865  0.0004  0.0002  0.0003  0.0001  0.0004  0.0001  0.0000  0.0003  0.0004  0.0002  0.0000  0.0001  0.0002
G  0.0021  0.0001  0.0012  0.0011  0.0001  0.0003  0.0007  0.9935  0.0001  0.0000  0.0001  0.0002  0.0001  0.0001  0.0003  0.0021  0.0003  0.0000  0.0000  0.0005
H  0.0001  0.0008  0.0018  0.0003  0.0001  0.0020  0.0001  0.0000  0.9912  0.0000  0.0001  0.0001  0.0000  0.0002  0.0003  0.0001  0.0001  0.0001  0.0004  0.0001
I  0.0002  0.0002  0.0003  0.0001  0.0002  0.0001  0.0002  0.0000  0.0000  0.9872  0.0009  0.0002  0.0021  0.0007  0.0000  0.0001  0.0007  0.0000  0.0001  0.0033
L  0.0003  0.0001  0.0003  0.0000  0.0000  0.0006  0.0001  0.0001  0.0004  0.0022  0.9947  0.0002  0.0045  0.0013  0.0003  0.0001  0.0003  0.0004  0.0002  0.0015
K  0.0002  0.0037  0.0025  0.0006  0.0000  0.0012  0.0007  0.0002  0.0002  0.0004  0.0001  0.9926  0.0020  0.0000  0.0003  0.0008  0.0011  0.0000  0.0001  0.0001
M  0.0001  0.0001  0.0000  0.0000  0.0000  0.0002  0.0000  0.0000  0.0000  0.0005  0.0008  0.0004  0.9874  0.0001  0.0000  0.0001  0.0002  0.0000  0.0000  0.0004
F  0.0001  0.0001  0.0001  0.0000  0.0000  0.0000  0.0000  0.0001  0.0002  0.0008  0.0006  0.0000  0.0004  0.9946  0.0000  0.0002  0.0001  0.0003  0.0028  0.0000
P  0.0013  0.0005  0.0002  0.0001  0.0001  0.0008  0.0003  0.0002  0.0005  0.0001  0.0002  0.0002  0.0001  0.0001  0.9926  0.0012  0.0004  0.0000  0.0000  0.0002
S  0.0028  0.0011  0.0034  0.0007  0.0011  0.0004  0.0006  0.0016  0.0002  0.0002  0.0001  0.0007  0.0004  0.0003  0.0017  0.9840  0.0038  0.0005  0.0002  0.0002
T  0.0022  0.0002  0.0013  0.0004  0.0001  0.0003  0.0002  0.0002  0.0001  0.0011  0.0002  0.0008  0.0006  0.0001  0.0005  0.0032  0.9871  0.0000  0.0002  0.0009
W  0.0000  0.0002  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0001  0.0000  0.0001  0.0000  0.9976  0.0001  0.0000
Y  0.0001  0.0000  0.0003  0.0000  0.0003  0.0000  0.0001  0.0000  0.0004  0.0001  0.0001  0.0000  0.0000  0.0021  0.0000  0.0001  0.0001  0.0002  0.9945  0.0001
V  0.0013  0.0002  0.0001  0.0001  0.0003  0.0002  0.0002  0.0003  0.0003  0.0057  0.0011  0.0001  0.0017  0.0001  0.0003  0.0002  0.0010  0.0000  0.0002  0.9901

This PAM-1 matrix was obtained from M.O. Dayhoff and colleagues, 1978.
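One way to sanity-check the "column amino acid is substituted by its row amino acid" reading is to confirm that a column behaves like a probability distribution. The sketch below copies column A from the PAM-1 table above; the names `COLUMN_A`, `column_total` and `prob_substituted` are illustrative only.

```python
# Row order of the PAM-1 table, top to bottom.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Column A of the PAM-1 table above, read top to bottom (rows A, R, N, ..., V).
COLUMN_A = [
    0.9867, 0.0001, 0.0004, 0.0006, 0.0001,  # A R N D C
    0.0003, 0.0010, 0.0021, 0.0001, 0.0002,  # Q E G H I
    0.0003, 0.0002, 0.0001, 0.0001, 0.0013,  # L K M F P
    0.0028, 0.0022, 0.0000, 0.0001, 0.0013,  # S T W Y V
]


def column_total(column):
    """Total probability over every possible outcome for one original residue."""
    return sum(column)


def prob_substituted(column, original):
    """Probability that the original residue is replaced by any other residue."""
    own_row = AMINO_ACIDS.index(original)
    return sum(p for row, p in enumerate(column) if row != own_row)


total_a = column_total(COLUMN_A)             # should be very close to 1
changed_a = prob_substituted(COLUMN_A, "A")  # off-diagonal mass for alanine
```

With the values as transcribed, column A sums to 1 (up to floating-point rounding) and about 1.33% of alanines are expected to change, in line with the roughly 1% divergence that defines one PAM unit.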

PAM Matrices on a Larger Scale
The corresponding number, or PAM unit, of a PAM matrix represents the number of mutations per 100 residues.
How do we come up with PAM-2, PAM-3, PAM-100, PAM-250?
Just use matrix multiplication: $\text{PAM-}x = (\text{PAM-1})^x$
Question to consider: for PAM-2, why do we use matrix multiplication and not just square each entry? The probability of A remaining A after one substitution, squared, should be the probability of A remaining A with two substitutions by the laws of probability, right?

Why $\text{PAM-}x = (\text{PAM-1})^x$
PAM matrices consider probabilities of mutations.
Matrix multiplication, simple example: Alanine (A) and Arginine (R).
Assume P(A-A) = .95, P(A-R) = .04, P(R-A) = .05, P(R-R) = .87.

$$
\begin{pmatrix} .95 & .05 \\ .04 & .87 \end{pmatrix}
\begin{pmatrix} .95 & .05 \\ .04 & .87 \end{pmatrix}
=
\begin{pmatrix}
.95 \cdot .95 + .05 \cdot .04 & .95 \cdot .05 + .05 \cdot .87 \\
.04 \cdot .95 + .87 \cdot .04 & .04 \cdot .05 + .87 \cdot .87
\end{pmatrix}
$$

From this we can see that PAM takes into consideration not only A remaining A with 2 substitutions, but adds to this the probability that A is substituted by R, which is then in turn substituted by A, thus making the original A an A after 2 mutations.
Note: the numbers used in this example are arbitrary.
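The difference between multiplying matrices and squaring each entry can be seen numerically with the slide's arbitrary Alanine/Arginine values. This is a minimal pure-Python sketch; `mat_mul`, `mat_pow` and `TOY_PAM1` are hypothetical names, and the 2x2 matrix is the toy example rather than real PAM data.

```python
def mat_mul(x, y):
    """Multiply two matrices given as lists of rows."""
    inner = len(y)
    return [
        [sum(x[i][k] * y[k][j] for k in range(inner)) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def mat_pow(matrix, power):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    if power < 1:
        raise ValueError("power must be a positive integer")
    result = matrix
    for _ in range(power - 1):
        result = mat_mul(result, matrix)
    return result


# Toy matrix from the slide: columns are the original residue (A, R),
# rows are the replacement residue (A, R). The values are arbitrary.
TOY_PAM1 = [
    [0.95, 0.05],  # P(A-A), P(R-A)
    [0.04, 0.87],  # P(A-R), P(R-R)
]

toy_pam2 = mat_pow(TOY_PAM1, 2)
a_stays_a_matrix = toy_pam2[0][0]           # .95*.95 + .05*.04
a_stays_a_entrywise = TOY_PAM1[0][0] ** 2   # .95*.95 only
```

The matrix product gives about 0.9045 for A ending as A after two steps, while squaring the single entry gives 0.9025. The extra 0.002 is the A to R to A path that entrywise squaring ignores.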

Log-Odds Score Matrix
Odds ratio:

$R = \dfrac{M_{ab}}{p_b}$

where $R$ is our desired ratio, $M_{ab}$ is the probability that the mutation is accepted by nature, and $p_b$ is the frequency of occurrence of $b$.
Dayhoff took 10 times the log of this result to get a score for each mutation:

$S(a, b) = 10 \log(R)$

Logs are used for counting purposes.
Adding logs is more efficient than multiplying at each position.

Quick Overview: Needleman & Wunsch
Set up a matrix for optimization of the alignment.
Want to minimize gaps but maximize matches.
4 possibilities at each point in the matrix: Match, Mismatch, Gap in Seq. 1, Gap in Seq. 2.
Scores are input into the matrix, looking to maximize the score.
Quick model of the process: [Figure: alignment matrix with Seq. 1 and Seq. 2 along its axes.]
Diagonal lines are alignments, vertical lines are gaps in sequence 1, horizontal lines are gaps in sequence 2.
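The two ideas above can be sketched together in Python. Several details are assumptions rather than anything stated on the slides: the logarithm is taken as base 10, the gap penalty is a flat cost per gap position, and the names `log_odds_score` and `needleman_wunsch` are illustrative. Sequence 1 runs across the columns so that a vertical move corresponds to a gap in sequence 1, as in the slide's figure description.

```python
import math


def log_odds_score(m_ab, p_b):
    """S(a, b) = 10 * log(M_ab / p_b); base 10 is assumed here."""
    return 10 * math.log10(m_ab / p_b)


def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score by dynamic programming.

    seq1 indexes the columns and seq2 the rows, so a vertical move
    consumes a residue of seq2 against a gap in seq1, and a horizontal
    move consumes a residue of seq1 against a gap in seq2.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    table = [[0] * cols for _ in range(rows)]
    # First row and column: one sequence aligned entirely against gaps.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq2[i - 1] == seq1[j - 1]
            diagonal = table[i - 1][j - 1] + (match if same else mismatch)
            vertical = table[i - 1][j] + gap    # gap in seq1
            horizontal = table[i][j - 1] + gap  # gap in seq2
            table[i][j] = max(diagonal, vertical, horizontal)
    return table[-1][-1]


favoured = log_odds_score(0.2, 0.1)    # substitution seen more often than chance
neutral = log_odds_score(0.1, 0.1)     # substitution seen exactly as often as chance
best = needleman_wunsch("ACTCCA", "GCACTA")
```

A positive log-odds score marks a substitution that occurs more often than chance, zero marks a chance-level substitution, and negative values mark disfavoured ones. In a fuller implementation the fixed `match` and `mismatch` values would be replaced by lookups into a log-odds matrix such as one derived from PAM-250.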
