Sunday, January 2, 2011

DNA binding sites 6

A few weeks ago I had some posts about finding binding sites in DNA using a simple PSSM (position specific scoring matrix; last post here). The time required to evaluate all the potential sites in a bacterial genome of ≈ 5 million bp is prohibitively long (> 1 min on my machines), so I wanted to explore ways to do it faster. I started looking at Cython (here), but then realized that I need to brush up my C skills first (here).

I also had an idea that I thought at first would make the code faster, though probably it doesn't. Then I thought it would make the code clearer, though I realized after actual implementation that it doesn't do that either. What it does do is make our pass through the sequence simpler in logical terms, and it's the basis of my C-program to evaluate sites or motifs in a DNA sequence. We construct a circularly linked list, as illustrated schematically in the graphic. Each item (node) in the list holds the current value for an accumulating score that will become the score for an individual site in the sequence.



Probably it's best to illustrate with an example. Let's say we have a scoring system used to evaluate potential sites. For example, a site with a in position 1 (we'll use 1-based indexing here) contributes score a1, c in position 2 adds c2, etc.

Suppose we're looking for sites of length N = 4, and we have this sequence: acgtg. We construct a list like the following (shown here in rows to make the layout simple). After the initial priming, we end up with this:

-> 1
2 g1
3 c1 g2
4 a1 c2 g3

I set up a toy example with scores of
0.01 .. 0.04 for a1 .. a4; 0.11 to 0.14 for c1 .. c4, etc.
Output from the priming phase looks like this:

nt = a
item = 4 before: 0.00, add: 0.01 after: 0.01
nt = c
item = 3 before: 0.00, add: 0.11 after: 0.11
item = 4 before: 0.01, add: 0.12 after: 0.13
nt = g
item = 2 before: 0.00, add: 0.21 after: 0.21
item = 3 before: 0.11, add: 0.22 after: 0.33
item = 4 before: 0.13, add: 0.23 after: 0.36
1 0.000000
2 0.210000
3 0.330000
4 0.360000
end priming

Now we consider the next nucleotide in the sequence: t. We add the appropriate scores for t to each item in the list, then read the score for the item that contains a total of N scores, and finally, zero that item. The score we read is the score for the site that has sequence acgt. Schematically:

t

1 t1
2 g1 t2
3 c1 g2 t3
-> 4 a1 c2 g3 t4

1 t1
2 g1 t2
3 c1 g2 t3
-> 4

Output looks like this:

t
item = 1 before: 0.00 add: 0.31 after: 0.31
item = 2 before: 0.21 add: 0.32 after: 0.53
item = 3 before: 0.33 add: 0.33 after: 0.66
item = 4 before: 0.36 add: 0.34 after: 0.70
zeroing item = 4 final score = 0.70

The next nucleotide is g:

g

1 t1 g2
2 g1 t2 g3
-> 3 c1 g2 t3 g4
4 g1

1 t1 g2
2 g1 t2 g3
-> 3
4 g1



g
item = 4 before: 0.00 add: 0.21 after: 0.21
item = 1 before: 0.31 add: 0.22 after: 0.53
item = 2 before: 0.53 add: 0.23 after: 0.76
item = 3 before: 0.66 add: 0.24 after: 0.90
zeroing item = 3 final score = 0.90


The site acgt has score:
a1 + c2 + g3 + t4 = 0.01 + 0.12 + 0.23 + 0.34 = 0.70


The site cgtg has score:
c1 + g2 + t3 + g4 = 0.11 + 0.22 + 0.33 + 0.24 = 0.90

Zipped files on Dropbox (here). These will change in the next few days as I debug and test the project more...

The C code for the linked list looks like this:

item *make_circular_linked_list(int N) {
int i;
item *current, *first, *previous;
for (i=0; i<N; i++) {
current = (item *) malloc(sizeof(item));
current->num = i+1;
current->score = 0;
if (i==0) { first = current; }
else { previous->next = current; }
previous = current;
}
current->next = first;
return first;
}