Monday, August 11, 2008

PAM: Projecting in time

In the last post, we looked at the way Dayhoff et al. generated the PAM1 matrix. That matrix contains transition frequencies for changing one amino acid to another, based on observed changes in closely related proteins. The form of the matrix presented has all values multiplied by 10000 to make it easier to scan by eye. The total of each column sums to 10000, for a total probability of 1 that the amino acid will do something: either change or stay the same. According to the text:

The probability of observing a change in a site containing alanine (the sum of all the elements MXA except MAA) is proportional to the mutability of alanine. The same proportionality constant, lambda, holds for all columns. The individual nondiagonal terms within each column bear the same ratio to each other as do the observed mutations in the matrix of Figure 80.


The quantity 100 X the sum of fi Mii gives the number of amino acids that will remain unchanged when a protein 100 links long, of average composition, is exposed to the evolutionary change represented by this matrix. This apparent evolutionary change depends upon the choice of lambda, in this case, chosen so that this change is 1 mutation.


The paper goes on to discuss the use of PAM1 for simulating mutations.

The transition matrix PAM1 has the amino acid frequencies built into it. (Go back and see how it was computed). We can also see this by applying the matrix to a sequence (any sequence) many times. When you do that, the sequence approaches asymptotically the observed amino acid composition (frequencies).

We obtain the PAM matrix for longer times by matrix multiplication. To see the rationale for this, consider a period of two PAMs (twice the unit time to give 1 % mutation). If we observe D => E during this whole period, we can calculate the probabilities by observing that the position holding D at time-zero was something at PAM1, and then something turned into E by PAM2. For each possible something, X, we need to multiply the probability that D => X by the probability that X => E, and sum this over all possible X. We obtain the first probability by going to the column for D and scanning down to the row for X; we obtain the second probability by going to the row for E and scanning out to the column for X. The sum over all X is the entry for column D row E for PAM2.

This is exactly what matrix multiplication does. When we do matrix A x matrix B (or A %*% B in R), we take a particular row of matrix A corresponding to the amino acid E (the final result), and a particular column of B (correspnonding to the starting amino D) and take each element pairwise of the two and multiply them together (for each intermediate amino acid), then sum the result. To go from PAMn to PAMn+1 we multiply PAMn %*% PAM1. (Even though matrix multiplication is not commutative, it seems not to matter whether we do PAMn %*% PAM1 or PAM1 %*% PAMn repeatedly. I'm not sure why this is so, but I've tried the R code both ways and get the same result).

Substitution matrices are commonly used in log odds form. To get there we have to do two things: normalization (get the odds) and then take the log. The normalization consists of dividing by the amino acid frequency. That is, for each column of the PAM matrix, we have a set of frequencies with which the starting amino acid will turn into something else. We normalize each of those probabilities by dividing by the frequency of the new amino acid. (The probability that A => R is normalized to the frequency of R). And since the PAM1 entries are observed occurrences divided by fj, the odds matrix is just observed occurrences divided by fi * fj.

The base of the logarithm varies with the application, but in this paper they use base 10.

Here is a screenshot of my R code with a corrected form of PAM1, where I've proofread all the entries to be consistent with what's in the paper. There are still a few minor differences in the result, but it tracks pretty closely. As a final twist, the amino acids have been reordered to group those with like properties.