Python for Bioinformatics: Neighbor-joining

Tuesday, November 23, 2010

Neighbor-joining

Neighbor-joining is a standard way of constructing phylogenetic trees that is fast and does a reasonable job of recovering the correct tree in the case of unequal branch lengths. Consider this tree with vastly unequal amounts of change on the branch to B. I found this example (and the figure above) on the internet, but now I've forgotten where (sorry).

Higgs and Attwood have a description of the method that explains a bit about why the calculation is the way it is, and what guarantees it offers.

Here's the distance data:

     A    B    C    D    E
B    5
C    4    7
D    7   10    7
E    6    9    6    5
F    8   11    8    8    8

Step 1 is to calculate the net divergence for each OTU:

r(A) = 30 = 5 + 4 + 7 + 6 + 8
r(B) = 42  etc.
r(C) = 32
r(D) = 37
r(E) = 34
r(F) = 43

Step 2 is to modify the distance matrix by subtracting the value of: (r_i + r_j)/(N-2) where N is the number of OTUs. For example, the new value for entry 'AB' is: 5 - ((30 + 42)/4) = -13, and that is the smallest value in the whole matrix.

We choose the next two otus to cluster based on the smallest value in the modified distance matrix. In fact, we don't use this modified matrix again, so I just tested each value in turn against the low value observed so far, and kept the relevant indexes. We make a "star" tree, and then group the 2 OTUs we chose:

Step 3. Calculate the distances of A and B from U using the data from the original distance matrix and the divergence values, as follows:

S(AU) = d(AB)/2 + [r(A) - r(B)/(2(N-2))]
      = 5/2 + (-12 / 8) = 1
S(BU) = d(AB) - S(AU) = 4

Step 4. Now calculate the distances from the new node U to the other nodes:

d(UC) = ( d(AC) + d(BC) - d(AB) ) / 2 = 3
d(UD) = ( d(AD) + d(BD) - d(AB) ) / 2 = 6
d(UE) = ( d(AE) + d(BE) - d(AB) ) / 2 = 5
d(UF) = ( d(AF) + d(BF) - d(AB) ) / 2 = 7

The result is a new distance matrix:

     U    C    D    E
C    3    
D    6    7
E    5    6    5
F    7    8    8    8

Notice that the result from this first step already looks a little more like the real tree. We repeat each of the steps until we've collapsed all the nodes. The result is an unrooted tree.

R code:

m = c(0,5,4,7,6,8,
   5,0,7,10,9,11,
   4,7,0,7,6,8,
   7,10,7,0,5,9,
   6,9,6,5,0,8,
   8,11,8,8,8,0)
dim(m)=c(6,6)
colnames(m)=c(
  'A','B','C','D','E','F')
rownames(m)=colnames(m)
d = as.dist(m)

library(ape)
tr=nj(d)
plot(tr,cex=2,
  tip.color=rainbow(6))
axisPhylo()

R chose a root to draw the tree that is different than in the other graphics. I forgot to fix this (with ape.root).

[UPDATE: I'm writing a Python version of neighbor-joining, and it turned up some errors in my arithmetic. So I updated the table here.