Wednesday, June 24, 2009

Motif Discovery 8

As promised, I've implemented a Gibbs Sampler motif search program. It's actually quite simple. If we have a discrete distribution of scores over all positions, considering an individual sequence that is sliding along a motif, we weight each position according to its score, and then choose a new position randomly based on the weights. Here's that part of the code:

def gibbsMove(L):
x = abs(min(L)) + 0.1
L = [e+x for e in L]
S = sum(L)
c = 0
r = random.random()
for i,s in enumerate(L):
c += s
f = c*1.0/S
if f > r: return ((s,i))
raise ValueError('f < r')


(I am still worried that the scoring function I'm using, as mentioned here, is incorrect and may be screwing up the position distributions).

The take home message from playing with the sampler is that it takes a lot of iterations for the distribution to settle down where it should be. For example, here is output from a run on a toy problem:

       numSeq =       5
motif length = 4
seq length = 13
cycles = 1000000
# runs = 1
freq distr = 1.0

4 ATAC ATCTATACAGTCA
3 ATAC CCAATACACATCC
8 ATAC TGCCTCGCATACG
8 ATAC AACTTATTATACA
1 ATAC AATACATTATCAA

target:
04 03 08 08 01

top positions: count score
04 03 08 08 01 629 18.4
04 05 08 08 01 501 15.7
04 09 08 08 01 450 15.7
03 02 07 07 00 414 13.7
05 04 09 09 02 401 15.7
04 03 08 06 01 336 13.0
04 03 08 05 01 336 13.0
04 03 08 08 08 335 14.2
00 03 08 08 01 332 13.0
05 04 09 07 02 318 13.0


It is tough to get separation, though admittedly most of these look like off-by-one solutions. In this problem, the number of positions to check for each sequence is SZ - M + 1 = 10. So the total number of positions is 105. With a million cycles (and discarding the first 20%), we have 8 samples per position on the average. So the sampler does seem to be finding our motif fairly efficiently (629 times here).

But it seems clear that the efficiency has to get a lot better with increasing length of the sequences. With a sequence length of 1000 and 10 sequences, the number of positions is roughly 100010 = 1030.

I am going to have to redo parts of the code in C or C++ in order to test larger sequence sets. So far, real problems do not yield to my sampler even with 5 x 106 cycles.