Tuesday, November 23, 2010

UPGMA in Python 3: Sarich data


In Joe Felsenstein's book, Inferring Phylogenies, he uses some data from Vincent Sarich to demonstrate UPGMA. Although I was tempted to "declare victory and go home", I decided to test my version of UPGMA with that data. And that turned up bugs in both the main UPGMA code and the plotting code! So I guess the moral is: test, test, and test again.

The top figure is a scan from the book. Here is what my program plotted:



It looks pretty good to me.

The data are immunological, from eight species in this order: dog, bear, raccoon, weasel, seal, sea lion, cat, monkey. Here is the distance matrix:

  0   32   48   51   50   48   98  148
 32    0   26   34   29   33   84  136
 48   26    0   42   44   44   92  152
 51   34   42    0   44   38   86  142
 50   29   44   44    0   24   89  142
 48   33   44   38   24    0   90  142
 98   84   92   86   89   90    0  148
148  136  152  142  142  142  148    0
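
For anyone who wants to check the matrix by hand, here is a minimal sketch of UPGMA in Python 3 run on these numbers. It is not the code from the zipped files; the names upgma, names and matrix are just ones I picked for illustration. It repeatedly joins the two closest clusters, puts the new node at half their distance, and averages the distances to everything else, weighted by cluster size.

```python
from itertools import combinations

names = ["dog", "bear", "raccoon", "weasel",
         "seal", "sea lion", "cat", "monkey"]

matrix = [
    [  0,  32,  48,  51,  50,  48,  98, 148],
    [ 32,   0,  26,  34,  29,  33,  84, 136],
    [ 48,  26,   0,  42,  44,  44,  92, 152],
    [ 51,  34,  42,   0,  44,  38,  86, 142],
    [ 50,  29,  44,  44,   0,  24,  89, 142],
    [ 48,  33,  44,  38,  24,   0,  90, 142],
    [ 98,  84,  92,  86,  89,  90,   0, 148],
    [148, 136, 152, 142, 142, 142, 148,   0],
]


def upgma(names, matrix):
    """Return the merges as (left leaves, right leaves, node height)."""
    # cluster id -> tuple of the leaf names it contains
    clusters = {i: (name,) for i, name in enumerate(names)}
    # unordered pair of cluster ids -> distance between them
    dist = {frozenset((i, j)): float(matrix[i][j])
            for i, j in combinations(range(len(names)), 2)}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        # pick the closest pair of clusters
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        height = dist.pop(pair) / 2
        na, nb = len(clusters[a]), len(clusters[b])
        # distance from the new cluster to every other cluster:
        # average of the two old distances, weighted by cluster size
        for c in clusters:
            if c in (a, b):
                continue
            dac = dist.pop(frozenset((a, c)))
            dbc = dist.pop(frozenset((b, c)))
            dist[frozenset((next_id, c))] = (na * dac + nb * dbc) / (na + nb)
        left, right = clusters.pop(a), clusters.pop(b)
        merges.append((left, right, height))
        clusters[next_id] = left + right
        next_id += 1
    return merges


if __name__ == "__main__":
    for left, right, height in upgma(names, matrix):
        print(left, right, round(height, 4))
```

The first join is seal with sea lion at height 12, since 24 is the smallest entry in the matrix. A branch length is then the height of a node minus the height of the node below it (zero for a leaf).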


If you grab the zipped files (here) and run them, you'll see a lot of diagnostic output when debug == True, as well as all the branch lengths.