Saturday, November 13, 2010

Silent sites in EMBOSS

EMBOSS includes a program called silent that determines positions within a coding sequence which can be mutated to give a convenient restriction site without changing the encoded protein.

$ silent hemA.txt
Find restriction sites to insert (mutate) with no translation change
Comma separated enzyme list [all]:
Output report [hema.silent]: emboss.results.txt

The column headings and a sample line are as shown, when run on the sequence of the Salmonella typhimurium hemA gene:

  Start     End  Strand EnzymeName RS-Pattern     Base-Posn    AAs Silent Mutation
327 332 + SpeI ACTAGT 330 L.L Yes G->A

In an early post I blogged about this and posted some code.

It seemed like a good idea to compare results from the two programs. One problem is the redundancy of restriction enzymes with respect to cleavage sites. Every position reported by silent in EMBOSS has a number of enzymes, some as many two dozen or so (each one listed as a separate hit).

If you've ever worked with these enzymes you know that they have personalities, some are highly active, some produce overhangs, some are stable, some not so good. Filtering out the redundancy in a smart way is a bit of a pain. That's why one of the optional inputs to the program is a list of enzymes that you like. I ran silent with the default setting (lots of hits), and then for this analysis I made a short list of "good_enzymes."

To repeat what we did previously I copied out Restriction_Dictinary.py from the biopython-1.55 source. Then I went back to the old post and got three files (REnzymes, GeneticCode2, extrasites).

I ran REnzymes.py and it looks like it works even though the format of Restriction_Dictinary.py has changed (and is really unrecognizable to me). I modified the old extrasites.py slightly to make the sequence uppercase. Rather than mess with that script any more, I write the results to disk as my.results.txt.

$ python extrasites.py > my.results.txt

So now the problem is to load to the two different textfiles with results and compare the data. A classic everyday bioinformatics problem. My output looks like this:

codon 123 GCG => GCT
AAAAAA GCG TTTGCG
AAAAAA GCT TTTGCG
HindIII AAGCTT

The original sequence is on the second line and the mutated sequence on the third. The affected codon is set off from the surrounding sequence by a space on each side.

The code to analyze the differences is pretty ugly. I listed it below but I hope you don't look at it unless you're stuck. The output, shown next, reveals that for the most part the two sets of results are congruent. EMBOSS results are output on a single line starting with 'E', while my results are output on three lines, the first starting with 'M'. The results were sorted by order of position in the sequence. The EMBOSS Base-Posn has been converted to a codon for consistency.

A few significant differences can be seen. Mostly they involve a single enzyme, KasI. I haven't sorted that out yet. Overall, I think the agreement is very good.

output:

$ python analyze.py 
E (40, 13, 'GGCGCC', 'KasI')
M --- 13 KasI
AAAACG GCA CCTGTA
AAAACG GCG CCTGTA
E (145, 48, 'GTCGAC', 'SalI')
M --- 48 SalI
GTGCTG TCA ACCTGT
GTGCTG TCG ACCTGT
E (199, 66, 'CTGCAG', 'PstI')
M --- 66 PstI
AACCTG CAA GAAGCG
AACCTG CAG GAAGCG
E (322, 107, 'TCTAGA', 'XbaI')
M --- 107 XbaI
AGCGGT CTG GATTCA
AGCGGT CTA GATTCA
E (331, 110, 'ACTAGT', 'SpeI')
M --- 110 SpeI
GATTCA CTG GTGCTG
GATTCA CTA GTGCTG
E (373, 124, 'AAGCTT', 'HindIII')
M --- 124 HindIII
AAAAAA GCG TTTGCG
AAAAAA GCT TTTGCG
E (481, 160, 'GGCGCC', 'KasI')
M --- 160 KasI
ATCGGC GCT AGCGCC
ATCGGC GCC AGCGCC
E (526, 175, 'AGATCT', 'BglII')
M --- 175 BglII
GCCCGC CAA ATCTTT
GCCCGC CAG ATCTTT
E (541, 180, 'GTCGAC', 'SalI')
M --- 180 SalI
GAATCG CTC TCGACG
GAATCG CTG TCGACG
M --- 180 SalI
GAATCG CTC TCGACG
GAATCG TTG TCGACG
E (571, 190, 'GGCGCC', 'KasI')
E (589, 196, 'ACTAGT', 'SpeI')
M --- 196 SpeI
ATTGAA CTG GTGGCG
ATTGAA CTA GTGGCG
E (685, 228, 'GGCGCC', 'KasI')
E (719, 239, 'CTGCAG', 'PstI')
M --- 240 PstI
GCCCGT TTG CAGGAT
GCCCGT CTG CAGGAT
M --- 248 SalI
ATTATC AGT TCGACC
ATTATC TCG TCGACC
M --- 295 MluI
GCGAAC GCT TATCTT
GCGAAC GCG TATCTT
E (904, 301, 'GTCGAC', 'SalI')
M --- 301 SalI
AGCGTC GAT GATTTA
AGCGTC GAC GATTTA
M --- 303 PstI
GATGAT TTA CAGAGC
GATGAT CTG CAGAGC
E (952, 317, 'CTGCAG', 'PstI')
E (952, 317, 'CTGCAG', 'PstI')
M --- 317 PstI
CAGGCT GCG GCAGTA
CAGGCT GCA GCAGTA
M --- 317 PstI
CAGGCT GCG GCAGTA
CAGGCT GCT GCAGTA
E (1021, 340, 'GGCGCC', 'KasI')
M --- 340 KasI
GCCCAG GGG GCCAGC
GCCCAG GGC GCCAGC
E (1130, 376, 'CTGCAG', 'PstI')
M --- 377 PstI
GCCATC TTG CAGGAT
GCCATC CTG CAGGAT
M --- 377 PstI
GCCATC TTG CAGGAT
GCCATC CTG CAGGAT
E (1135, 378, 'AGATCT')
M --- 378 BglII
ATCTTG CAG GATCTG
ATCTTG CAA GATCTG

code listing:

import REnzymes
RE = REnzymes.REnzymes()
good_enzymes = ['PstI', 'XbaI', 'SpeI','HindIII',
'SalI', 'KasI', 'SpeI', 'PstI',
'BglII', 'MluI','Bcl']

def load_data(fn):
FH = open(fn,'r')
data = FH.read()
FH.close()
return data.strip()
#===================================
# part 1: EMBOSS results
data = load_data('emboss.results.txt')
data = data.split('\n\n')[2]
assert data[:7] == ' Start'
eL = list()
for e in data.strip().split('\n')[1:]:
e = e.split()
x, j, flag, enz, seq, i = e[:6]
if not enz in good_enzymes: continue
if not '+' in flag: continue
codon = int(i)/3
i = int(i)+1
eL.append((i,codon, seq, enz))
#===================================
# part 2: my results
data = load_data('my.results.txt')
data = data.split('\n\n')[:-1]
mL = list()
for entry in data:
lines = entry.split('\n')
codon = lines[0].split()[1]
# bug in original
codon = int(codon) + 1
enz,seq = lines[3].split()
if not enz in good_enzymes:
continue
L = ['---', str(codon), enz]
L += [lines[1].strip()]
L += [lines[2].strip()]
mL.append(L)
#===================================
# part 3: show
e = eL.pop(0)
m = mL.pop(0)
while eL or mL:
if int(m[1]) < int(e[1]):
print 'M', ' '.join(m[:3])
print ' ', '\n '.join(m[3:])
m = mL.pop(0)
else:
print 'E', e
e = eL.pop(0)
if not mL:
print 'E', e[:3]
if not eL:
print 'M', ' '.join(m[:3])
print ' ', '\n '.join(m[3:])