Friday, December 3, 2010

The Phylogenetic Bootstrap



image here
The image of the bootstrap is, or ought to be, a familiar one to every computer user. It is occurs in the phrase "boot the computer," which draws on James Joyce's phrase:
"Ladies who like distinctive underclothing should, and every welltailored man must, trying to make the gap wider between them by innuendo and give more of a genuine filip to acts of impropriety between the two, she unbuttoned his and then he untied her, mind the pin, whereas savages in the cannibal islands, say, at ninety degrees in the shade not caring a continental. However, reverting to the original, there were on the other hand others who had forced their way to the top from the lowest rung by the aid of their bootstraps. Sheer force of natural genius, that. With brains, sir."

The long quote here courtesy of Project Gutenburg (my post here).

Wikipedia has a different provenance (here).

The bootstrap is a relatively new statistical technique. It was applied by Felsentstein to phylogenies in 1985 (Evolution 39(4):783-791); he cites B. Efron (1979 Ann. Statist. 7:1-26). Neither paper is in PubMed, though Efron discusses Felsenstein's approach here. (A disclaimer: I have yet to actually read these, though I certainly plan to.)

In the phylogenetics context, the bootstrap simply samples columns from an alignment, with replacement, as many times as there are columns. A tree is constructed using the sampled data and evaluated for a node or clade of interest. The process is repeated many times (100 or 1000) and the number of times the majority rule clade is recovered is noted.

Until recently, I had wrongly conceived of the bootstrap as a sort of data sufficiency test, a measure of where we might be on a hypothetical saturation curve for data acquisition. But that's not it. What the bootstrap really is about is an assessment of what fraction of the data points support the hypothesis. As Felsenstein says:
"Majority-rule consensus trees can be used to construct a phylogeny showing all of the inferred monophyletic groups that occurred in a majority of the bootstrap samples. If a group shows up 95% of the time or more, the evidence for it is taken to be statistically significant."

I found the value quite surprising. I hadn't conceived of the bootstrap value as being like a p-value. Often I've looked at numbers for nodes/clades in phylogenies and thought, "70%, that's not so bad.."

It doesn't really matter how muddled my thinking is, what you might find useful here is the result of a simulation of the bootstrap in Python, including my testing it using tools PyCogent.

[UPDATE: A little more detail: I study the simplest possible case of a tree (A,B),(C,D). Data supporting the correct tree is in the form: 'AAGG'---that is otu A has 'A', otu B has 'A', otu C has 'G' and otu D has 'G'. Data not supporting the correct tree is in the form: 'TCTC'. We specify the fraction of columns supporting ('pro') and not supporting ('con') to be contained in the alignment. ]

The code to draw the bootstrap samples is trivial.

For the analysis, I ended up using the fact that the number of "connecting edges" for nodes A and B is 3 for the correct tree (themselves, plus their parent). I couldn't get the function sameTopology() to evaluate to True in comparisons with the prototype: ((A,B),(C,D)). A most surprising result of the simulation was how little effect substantial contradictory evidence has, and how steep is the curve. Here, the fraction of times the correct tree is recovered is almost never less than 1.0 until about 40% of the values in the sample are from contradictory alignments. That's a whole lot of contradictory data!

A secondary observation was that "uninformative" positions in the alignment don't seem to alter the value obtained at all.



code listing:

import random, sys
from cogent import LoadSeqs, DNA, LoadTree
from cogent.phylo import distance, nj
from cogent.evolve.models import F81, JC69
import numpy as np
import matplotlib.pyplot as plt
debug = False

# true tree is: ((A,B),(C,D))
# alignment that is consistent: AAGG
# alignment that is not: AGAG
# uninformative alignment: AAAA
tr = LoadTree(treestring='((A,B),(C,D));')

def makeAlignment(pro,con,neutral):
seqs = list()
for otu in 'ABCD':
temp = ['A'*neutral]
if otu in 'AB': temp.append('A'*pro)
else: temp.append('G'*pro)
if otu in 'AC': temp.append('T'*con)
else: temp.append('C'*con)
seqs.append(''.join(temp))
return seqs

def sample(seqs):
N = len(seqs[0])
iL = [random.choice(range(N)) for j in range(N)]
rL = list()
for seq in seqs:
temp = [seq[i] for i in iL]
rL.append(''.join(temp))
return rL

def evaluate_tree(aln):
d = distance.EstimateDistances(aln, submodel=JC69())
d.run(show_progress=False)
njtree = nj.nj(d.getPairwiseDistances())
if debug:
print d
print njtree.asciiArt()
print njtree.sameTopology(tr)
for otu in 'BCD':
print njtree.getConnectingEdges('A',otu)
L = njtree.getConnectingEdges('A','B')
return len(L) == 3

R = [35,40,42,44] + range(44,57) + [58,60,65]
C = []
for con in R:
pro = 100 - con
seqs = makeAlignment(pro,con,0)
if debug:
print '\n'.join(seqs)
count = 0
reps = 100

for i in range(reps):
#print i
s = sample(seqs)
#print '\n'.join(s)
dna = dict(zip('ABCD',s))
aln = LoadSeqs(data=dna, moltype=DNA)
if evaluate_tree(aln): count += 1
C.append(count*1.0/reps)
print con, count, reps

plt.scatter(R,np.array(C),s=250)
ax = plt.axes()
ax.set_xlim(30,70)
ax.set_ylim(-0.05,1.05)
plt.savefig('example.png')