Monday, December 22, 2008

Python for Bioinformatics

I just noticed (and promptly bought) a new book with the same title as this blog, written by Jason Kinser. It looks really interesting, with chapters on Dynamic Programming, HMM, Clustering, Self-organizing Maps, PCA, Fourier Transforms, and more.

My reaction to the first half-dozen chapters is restrained, but I am still hopeful that the heart of the text will be interesting. His knowledge of Python is pretty good, though he doesn't seem to use list comprehension. Not that it matters, but he also doesn't seem to be very up-to-date about Macs (still hung up on the big-endian business which went out with the switch to Intel chips), and it looks like he wants to use Tkinter for a User Interface (Yeccch...). Here I take a look at the code for scoring a sequence alignment (Code 7-1).

He uses ord to convert a list of characters to decimal, then uses a vector operation from numpy to count the number of equal values (giving a vector of Booleans), then converts those to ints and finally sums them. Then, he does the whole thing again, to count mismatches. I thought, that's got to be slow. I just go through a zipped list of the sequences and compare each pair of characters. My code gives a different result for the case of two opposing '-' gap characters (he notes this issue in the text). A timing test shows his code is about 2.5x slower, which is a lot faster than I had guessed it would be. Of course, if we cared, we'd be using C. But, I think that if you use a complicated method to explain something this simple, you ought to have a very good reason for it.

from numpy import equal, not_equal
import random,time

def simpleScore(s1,s2):
a1 = map(ord, list(s1))
a2 = map(ord, list(s2))
# count matches
S = (equal(
a1,a2)).astype(int).sum()
# count mismatches
S = S - (not_equal(
a1,a2)).astype(int).sum()
# gaps
ngaps = s1.count('-') + s2.count('-')
S = S - ngaps
return S

seq1 = 'AGTCGATCGATT'
seq2 = 'AGTCGATCGAAT'
seq3 = 'AGTCGATCGA-T'
seq4 = 'AGTCGTACGA-T'

def simpleScore2(s1,s2):
S = 0
for c1,c2 in zip(s1,s2):
if '-' == c1 or '-' == c2:
S -= 2
elif c1 == c2: S += 1
else: S -= 1
return S

def test1():
L = [seq1,seq2,seq3,seq4]
for s1 in L:
for s2 in L:
print s1,'\n',s2
print str(simpleScore(s1,s2)),
print str(simpleScore2(s1,s2))
test1()

seq = list('ACGT')*250
random.shuffle(seq)
seq = ''.join(seq)
#print seq

# make another seq, a little bit different
s1 = seq[:]
s2 = list(seq)
D = {'A':'G','C':'T',
'G':'A','T':'C','-':'-'}
for i in range(50):
if random.random() > 0.1:
j = random.choice(
range(len(s2)))
s2[j] = D[s2[j]]
else:
s2[j] = '-'
s2 = ''.join(s2)
#print s2
print simpleScore(s1,s2)
print simpleScore2(s1,s2)

def test2(N=1000):
t1 = time.time()
for i in range(N):
simpleScore(s1,s2)
t2 = time.time()
for i in range(N):
simpleScore2(s1,s2)
t3 = time.time()
print 'S1', round(t2-t1,2),
print 'S2', round(t3-t2,2)

test2()