Python for Bioinformatics: Silent sites in EMBOSS

EMBOSS includes a program called silent that determines positions within a coding sequence which can be mutated to give a convenient restriction site without changing the encoded protein.

$ silent hemA.txt
Find restriction sites to insert (mutate) with no translation change
Comma separated enzyme list [all]: 
Output report [hema.silent]: emboss.results.txt

The column headings and a sample line are as shown, when run on the sequence of the Salmonella typhimurium hemA gene:

  Start     End  Strand EnzymeName RS-Pattern     Base-Posn    AAs Silent Mutation
    327     332       + SpeI       ACTAGT               330    L.L    Yes     G->A

In an early post I blogged about this and posted some code.

It seemed like a good idea to compare results from the two programs. One problem is the redundancy of restriction enzymes with respect to cleavage sites. Every position reported by silent in EMBOSS has a number of enzymes, some as many two dozen or so (each one listed as a separate hit).

If you've ever worked with these enzymes you know that they have personalities, some are highly active, some produce overhangs, some are stable, some not so good. Filtering out the redundancy in a smart way is a bit of a pain. That's why one of the optional inputs to the program is a list of enzymes that you like. I ran silent with the default setting (lots of hits), and then for this analysis I made a short list of "good_enzymes."

To repeat what we did previously I copied out Restriction_Dictinary.py from the biopython-1.55 source. Then I went back to the old post and got three files (REnzymes, GeneticCode2, extrasites).

I ran REnzymes.py and it looks like it works even though the format of Restriction_Dictinary.py has changed (and is really unrecognizable to me). I modified the old extrasites.py slightly to make the sequence uppercase. Rather than mess with that script any more, I write the results to disk as my.results.txt.

$ python extrasites.py > my.results.txt

So now the problem is to load to the two different textfiles with results and compare the data. A classic everyday bioinformatics problem. My output looks like this:

codon 123 GCG => GCT
AAAAAA GCG TTTGCG
AAAAAA GCT TTTGCG
HindIII AAGCTT

The original sequence is on the second line and the mutated sequence on the third. The affected codon is set off from the surrounding sequence by a space on each side.

The code to analyze the differences is pretty ugly. I listed it below but I hope you don't look at it unless you're stuck. The output, shown next, reveals that for the most part the two sets of results are congruent. EMBOSS results are output on a single line starting with 'E', while my results are output on three lines, the first starting with 'M'. The results were sorted by order of position in the sequence. The EMBOSS Base-Posn has been converted to a codon for consistency.

A few significant differences can be seen. Mostly they involve a single enzyme, KasI. I haven't sorted that out yet. Overall, I think the agreement is very good.

output:

$ python analyze.py 
E (40, 13, 'GGCGCC', 'KasI')
M ---  13  KasI
  AAAACG GCA CCTGTA
  AAAACG GCG CCTGTA
E (145, 48, 'GTCGAC', 'SalI')
M ---  48  SalI
  GTGCTG TCA ACCTGT
  GTGCTG TCG ACCTGT
E (199, 66, 'CTGCAG', 'PstI')
M ---  66  PstI
  AACCTG CAA GAAGCG
  AACCTG CAG GAAGCG
E (322, 107, 'TCTAGA', 'XbaI')
M ---  107  XbaI
  AGCGGT CTG GATTCA
  AGCGGT CTA GATTCA
E (331, 110, 'ACTAGT', 'SpeI')
M ---  110  SpeI
  GATTCA CTG GTGCTG
  GATTCA CTA GTGCTG
E (373, 124, 'AAGCTT', 'HindIII')
M ---  124  HindIII
  AAAAAA GCG TTTGCG
  AAAAAA GCT TTTGCG
E (481, 160, 'GGCGCC', 'KasI')
M ---  160  KasI
  ATCGGC GCT AGCGCC
  ATCGGC GCC AGCGCC
E (526, 175, 'AGATCT', 'BglII')
M ---  175  BglII
  GCCCGC CAA ATCTTT
  GCCCGC CAG ATCTTT
E (541, 180, 'GTCGAC', 'SalI')
M ---  180  SalI
  GAATCG CTC TCGACG
  GAATCG CTG TCGACG
M ---  180  SalI
  GAATCG CTC TCGACG
  GAATCG TTG TCGACG
E (571, 190, 'GGCGCC', 'KasI')
E (589, 196, 'ACTAGT', 'SpeI')
M ---  196  SpeI
  ATTGAA CTG GTGGCG
  ATTGAA CTA GTGGCG
E (685, 228, 'GGCGCC', 'KasI')
E (719, 239, 'CTGCAG', 'PstI')
M ---  240  PstI
  GCCCGT TTG CAGGAT
  GCCCGT CTG CAGGAT
M ---  248  SalI
  ATTATC AGT TCGACC
  ATTATC TCG TCGACC
M ---  295  MluI
  GCGAAC GCT TATCTT
  GCGAAC GCG TATCTT
E (904, 301, 'GTCGAC', 'SalI')
M ---  301  SalI
  AGCGTC GAT GATTTA
  AGCGTC GAC GATTTA
M ---  303  PstI
  GATGAT TTA CAGAGC
  GATGAT CTG CAGAGC
E (952, 317, 'CTGCAG', 'PstI')
E (952, 317, 'CTGCAG', 'PstI')
M ---  317  PstI
  CAGGCT GCG GCAGTA
  CAGGCT GCA GCAGTA
M ---  317  PstI
  CAGGCT GCG GCAGTA
  CAGGCT GCT GCAGTA
E (1021, 340, 'GGCGCC', 'KasI')
M ---  340  KasI
  GCCCAG GGG GCCAGC
  GCCCAG GGC GCCAGC
E (1130, 376, 'CTGCAG', 'PstI')
M ---  377  PstI
  GCCATC TTG CAGGAT
  GCCATC CTG CAGGAT
M ---  377  PstI
  GCCATC TTG CAGGAT
  GCCATC CTG CAGGAT
E (1135, 378, 'AGATCT')
M ---  378  BglII
  ATCTTG CAG GATCTG
  ATCTTG CAA GATCTG

code listing:

import REnzymes
RE = REnzymes.REnzymes()
good_enzymes = ['PstI', 'XbaI', 'SpeI','HindIII',
                'SalI', 'KasI', 'SpeI', 'PstI',
                 'BglII', 'MluI','Bcl']

def load_data(fn):
    FH = open(fn,'r')
    data = FH.read()
    FH.close()
    return data.strip()
#===================================
# part 1:  EMBOSS results
data = load_data('emboss.results.txt')
data = data.split('\n\n')[2]
assert data[:7] == '  Start'
eL = list()
for e in data.strip().split('\n')[1:]:
    e = e.split()
    x, j, flag, enz, seq, i = e[:6]
    if not enz in good_enzymes:  continue
    if not '+' in flag:  continue
    codon = int(i)/3
    i = int(i)+1
    eL.append((i,codon, seq, enz))
#===================================
# part 2:  my results
data = load_data('my.results.txt')
data = data.split('\n\n')[:-1]
mL = list()
for entry in data:
    lines = entry.split('\n')
    codon = lines[0].split()[1]
    # bug in original
    codon = int(codon) + 1
    enz,seq = lines[3].split()
    if not enz in good_enzymes:
        continue
    L = ['---', str(codon), enz]
    L += [lines[1].strip()]
    L += [lines[2].strip()]
    mL.append(L)
#===================================
# part 3:  show
e = eL.pop(0)
m = mL.pop(0)
while eL or mL:
    if int(m[1]) < int(e[1]):
        print 'M', '  '.join(m[:3])
        print ' ', '\n  '.join(m[3:])
        m = mL.pop(0)
    else:
        print 'E', e
        e = eL.pop(0)
    if not mL:
        print 'E', e[:3]
    if not eL:
        print 'M', '  '.join(m[:3])
        print ' ', '\n  '.join(m[3:])

Saturday, November 13, 2010

Silent sites in EMBOSS