Friday, November 12, 2010

EMBOSS: first explorations

This is a quick update on my exploration of EMBOSS. It's not a set of detailed notes to follow or anything like that. I'm just trying to get a feeling for how to use it, and recording my impressions.

As mentioned last time, I can launch a window with the JEMBOSS GUI like so:

$ /usr/local/share/EMBOSS/jemboss/runJemboss.sh


On Mac OS X, it installed into /usr/local/share... I found a tutorial here:

/usr/local/share/EMBOSS/doc/tutorials/emboss_tut.tar.gz

Unpack and grab emboss_tutorial.pdf. It's a nice short tutorial. We can get info about a program, say wossname, or capture its output, with one of these:

wossname -opt
wossname -help
wossname -outfile ~/Desktop/file.txt
wossname > help.txt

Even I can figure out that wossname searches for programs related to a specific topic, say "restriction":

$ wossname
Finds programs by keywords in their short description
Text to search for, or blank to list all programs: restriction
SEARCH FOR 'RESTRICTION'
rebaseextract Process the REBASE database for use by restriction enzyme applications
recoder Find restriction sites to remove (mutate) with no translation change
redata Retrieve information from REBASE restriction enzyme database
remap Display restriction enzyme binding sites in a nucleotide sequence
restover Find restriction enzymes producing a specific overhang
restrict Report restriction enzyme cleavage sites in a nucleotide sequence
showseq Displays sequences with features in pretty format
silent Find restriction sites to insert (mutate) with no translation change

I put a FASTA-formatted sequence file on my desktop and do:

$ remap example.fasta
Display restriction enzyme binding sites in a nucleotide sequence
Comma separated enzyme list [all]:
Minimum recognition site length [4]:
Output file [da247.remap]:

EMBOSS An error in remap.c at line 244:
Cannot locate enzyme file. Run REBASEEXTRACT


$ REBASEEXTRACT
Process the REBASE database for use by restriction enzyme applications
REBASE database withrefm file:
Error: Input file is required


Google led me to REBASE. I was confused because the example input files shown on a help page here don't match the three files I downloaded. But then I noticed that (some of) the example output files match. REBASE/embossre.enz looks like link_emboss_e.txt.

Actually reading the docs: the input file must be the "withrefm" file of a REBASE distribution. For example, the withrefm file for REBASE version 005 is at: ftp://ftp.neb.com/pub/rebase/withrefm.005. But the link is dead.

Google: site:rebase.neb.com withrefm

No matches. So, I can't find the required input files on the REBASE site.

[UPDATE: the files are there, see the end of the post]

I found what look like equivalents for two of the four output files including REBASE/embossre.enz. The reference file format looks like it's changed. And the fourth file REBASE/embossre.equ isn't there.

Check out the files in

$ ls /usr/local/share/EMBOSS/data/REBASE/
dummyfile embossre.enz embossre.equ embossre.ref embossre.sup

Some look good, like embossre.equ, but some are dummy files, e.g. embossre.enz.
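
A quick way to see which of these files are placeholders is to list each one's size and first line. The following is only a sketch: the helper name summarize_data_files is made up for illustration, and it makes no assumption about what a dummy file actually contains; it just prints enough to judge by eye.

```python
from pathlib import Path


def summarize_data_files(directory):
    """Return (name, size_in_bytes, first_line) for each regular file, sorted by name."""
    rows = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps odd bytes from aborting the listing
        with path.open(errors="replace") as handle:
            first_line = handle.readline().rstrip("\n")
        rows.append((path.name, path.stat().st_size, first_line))
    return rows


if __name__ == "__main__":
    # Install location taken from this post; adjust for other systems.
    rebase_dir = Path("/usr/local/share/EMBOSS/data/REBASE")
    if rebase_dir.is_dir():
        for name, size, first_line in summarize_data_files(rebase_dir):
            print(f"{name:<16}{size:>10}  {first_line[:50]}")
```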

Try this:

cd /usr/local/share/EMBOSS/data/REBASE/
sudo mv embossre.enz embossre.enz.old
sudo cp ~/Desktop/link_emboss_e.txt /usr/local/share/EMBOSS/data/REBASE/embossre.enz

$ remap -sequence example.fasta
Display restriction enzyme binding sites in a nucleotide sequence
Comma separated enzyme list [all]:
Minimum recognition site length [4]: 6
Output file [da247.remap]: results.txt

The result looks like things are working:


DA247


NheI
| Cac8I Cac8I
| | BmtI | StuI NspI XmnI
\ \ \ \ \ \ \
GATGAACGCTAGCGGCAGGCCTAACACATGCAAGTCGAGGGAGAAGCCCTTCGGGGCGGA
10 20 30 40 50 60
----:----|----:----|----:----|----:----|----:----|----:----|
CTACTTGCGATCGCCGTCCGGATTGTGTACGTTCAGCTCCCTCTTCGGGAAGCCCCGCCT
/ / / / / / /
| | NheI | StuI NspI XmnI
| Cac8I Cac8I
BmtI

D E R * R Q A * H M Q V E G E A L R G G
M N A S G R P N T C K S R E K P F G A E
* T L A A G L T H A S R G R S P S G R K
----:----|----:----|----:----|----:----|----:----|----:----|
S S R * R C A * C M C T S P S A R R P P
X H V S A A P R V C A L R P L L G E P R
I F A L P L G L V H L D L S F G K P A S


I should probably look harder for withrefm, but at least it seems to work. I'm not crazy about the output format, though. They should do something like:

         .         .         .         .         .            60
GATGAACGCTAGCGGCAGGCCTAACACATGCAAGTCGAGGGAGAAGCCCTTCGGGGCGGA
CTACTTGCGATCGCCGTCCGGATTGTGTACGTTCAGCTCCCTCTTCGGGAAGCCCCGCCT
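
A minimal Python sketch of a layout along those lines, for anyone who wants to try it. Nothing here comes from EMBOSS: the names ruler and format_duplex are hypothetical, only plain A/C/G/T are complemented, and the position number is placed so that it ends at the last column of each line, which is one guess at the intended layout.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def ruler(start, width):
    """Dots at every 10th column; the position of the last column is written at the end."""
    cells = [" "] * width
    for col in range(10, width, 10):      # columns 10, 20, ... below width
        cells[col - 1] = "."
    label = str(start + width - 1)        # 1-based position of the last base on this line
    cells[max(0, width - len(label)):] = list(label)
    return "".join(cells)


def format_duplex(seq, width=60):
    """Return the sequence in blocks of ruler, top strand, complementary strand."""
    blocks = []
    for i in range(0, len(seq), width):
        top = seq[i:i + width]
        bottom = top.translate(COMPLEMENT)  # complement only, not reversed
        blocks.append("\n".join([ruler(i + 1, len(top)), top, bottom]))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    example = "GATGAACGCTAGCGGCAGGCCTAACACATGCAAGTCGAGGGAGAAGCCCTTCGGGGCGGA"
    print(format_duplex(example))
```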


[UPDATE: Should've looked harder. The files I need are right there on the REBASE downloads page (#5 and #31).]


$ sudo rebaseextract
Password:
Process the REBASE database for use by restriction enzyme applications
REBASE database withrefm file: ~/Desktop/link_withrefm.txt
REBASE database proto file: ~/Desktop/link_proto.txt
$ cat /usr/local/share/EMBOSS/data/REBASE/embossre.enz
..

Repeat remap and it looks good. It would be nice if the table of enzymes and their site counts also listed the position of each cut.

# Enzymes that cut  Frequency  Isoschizomers
Acc65I              1          Asp718I
AflIII              2
ApoI                1          AcsI,XapI
BbvII*              1          BpiI,BpuAI,BstV2I,BbsI
BmtI                1          BspOI
BsaAI               1          BstBAI,Ppu21I
..
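
Listing the position of each cut is easy enough to sketch in Python for simple cases. This is not how remap does it. The two recognition sites below (NheI G^CTAGC and StuI AGG^CCT) are typed in by hand rather than read from the REBASE files, the function name cut_positions is made up, and only exact, unambiguous sites on the top strand are handled. Each position is the 1-based index of the base just before the cut, a convention chosen for this example.

```python
# Hand-entered sites: name -> (recognition sequence, bases before the cut)
SITES = {
    "NheI": ("GCTAGC", 1),   # G^CTAGC
    "StuI": ("AGGCCT", 3),   # AGG^CCT
}


def cut_positions(seq, site, offset):
    """Return the 1-based position of the base just 5' of each top-strand cut."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        # 0-based site start + offset == 1-based index of the last base before the cut
        positions.append(start + offset)
        start = seq.find(site, start + 1)   # step by one so overlapping sites are found
    return positions


if __name__ == "__main__":
    example = "GATGAACGCTAGCGGCAGGCCTAACACATGCAAGTCGAGGGAGAAGCCCTTCGGGGCGGA"
    for name, (site, offset) in sorted(SITES.items()):
        cuts = cut_positions(example, site, offset)
        print(f"{name:<8}{len(cuts):<4}{','.join(map(str, cuts))}")
```

Degenerate sites such as Cac8I (GCNNGC) would need the ambiguity codes expanded, for instance into a regular expression, before a search like this could find them.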