Saturday, June 20, 2009

Motif Discovery 6

I've been exploring results obtained using the code that I talked about in recent posts with this title, a simple greedy algorithm evaluating possible motifs within a set of sequences. This is a basic exploration of the problem without using the sophistication of the Gibbs Sampler, (which I do still promise to discuss). The parameters controlling output are:


N    the number of sequences given
M the length of the motif embedded in the sequences
SZ the length of each sequence
R the number of cycles of the basic greedy algorithm
T the number of independent runs
fL a list of frequencies used in deriving the motif
e.g. [0.9,0.9,0.8,0.8,0.7]


The scoring method is coded and described here, and comes from Lawrence et al (PMID 8211139).

The bottom line that I've come to in these simulations is that the simple version of the algorithm works pretty well. Of course, the results depend on the values chosen for the parameters above. Decreasing M and increasing SZ decreases the signal to noise ratio. I haven't played with N much, but I think more sequences should be better. Early trials led to setting R = 100---most cycles reached a plateau by 60 or so (these were tested with small values for M and SZ). I've changed the setting for T in relation to M and SZ in order to obtain positive results. It is probable that lower values of T might also show success.

Here is a table of my results. M, SZ and R are introduced above. FD refers to the frequency distribution for the motif, shown here. ratio refers to the fraction of the top 15 hits that are related to the motif. What it means to be related will be discussed later. A final point to note is that these results are "anecdotal" because they were all obtained with a single seed for the random number generator. Some tests were done with other seeds, but not a lot. So you can view this as a proof of principle.

M       SZ       T     FD      ratio
15 2000 1000 A 8 / 15
15 1000 1000 A 15 / 15
12 1000 1000 A 15 / 15
9 2000 2000 B 15 / 15
9 1000 1000 B 15 / 15
9 250 100 B 14 / 15

FD: A = 0.9 0.9 0.8 0.8 0.7
B = 0.9 0.9 0.8


Probably the most stringent test is for a motif of 9 nt embedded in sequences of SZ = 2000. Following is the output for that test. The first part is the list of the top 15 results, the index list of the best motif, and its score. The second part shows the motifs in more detail. There is a discrepancy in the scoring, it is due to the fact that in the first section, scores were calculated by excluding the sequence currently in slide mode, whereas in the second part I calculated the score by averaging the results obtained by treating every sequence in the motif as in slide mode. The test runs pretty slowly, but this is Python, after all.


1802  585  132 1480 1619  830 1785  682  678 1422     111.94
1802 585 132 1480 1619 830 1785 682 678 1422 111.94
1802 585 132 1480 1619 830 1785 682 678 1422 111.94
1802 585 132 1480 1619 830 1785 682 678 1422 111.94
1802 585 132 1480 1619 830 1785 682 678 1422 111.94
1801 584 131 1479 1618 829 1873 681 677 1421 103.14
1801 584 131 1479 1618 829 1873 681 677 1421 103.14
1801 584 131 1479 1618 829 1873 681 677 1421 103.14
1802 585 132 1480 1619 830 1785 682 678 1422 101.04
1802 585 132 1480 1619 830 1785 682 678 1422 101.04
1803 264 133 1481 1620 831 1786 683 679 1423 100.59
1803 264 133 1481 1620 831 1786 683 679 1423 100.59
1803 264 133 1481 1620 831 1786 683 679 1423 100.59
1803 264 133 1481 1620 831 1786 683 679 1423 100.59
1803 264 133 1481 1620 831 1786 683 679 1423 100.59
------------------------------
rank = # 1
target found
ATGAGCAGG ATGAGCAGG
ATGAGCAGT ATGAGCAGT
ATGAGCAGT ATGAGCAGT
TTTATCAGT GTGAGCAGA
ATGAGCAGT ATGAGCAGT
ATGAGCAGT ATGAGCAGT
ATGAGCAAT ATGAGCAAT
ATGAGCAGT ATGAGCAGT
ATGAGCAGT ATGAGCAGT
ATGAGCAGT ATGAGCAGT
103.0 107.7
------------------------------
3012.6 sec