Monday, May 12, 2008

Using formatdb

There are several reasons to run BLAST locally rather than over the internet. One is speed. However, the size of the non-redundant (nr) database is enormous (zipped files total ≈10 GB). For 1E3 or even 1E4 runs a day, it makes a lot of sense to let them maintain the database and do BLAST over the web. Particularly when one can bundle queries (I don't know the limits for this, but 6 rRNA sequences works fine). It says somewhere in the docs that you can submit a new job once you receive the 'RID.' Hey, just run the job overnight.

A more common use is to search a custom-made database. A database is simply a text file of FASTA formatted sequences. In order to use one, it must first be processed by the formatdb program provided with the BLAST programs. After formatdb runs, it creates three new files: *.nhr, *.nin and *.nsq. The original database file is no longer needed afterward.

The documentation that comes with formatdb looks forbidding. The program is sophisticated, but its use doesn't have to be complicated. I do not use a configuration (.formatdbrc) file. The following program employs a hard-coded path to the binary. One subtle point about file paths: the shell understands '~' to mean the user's home directory, but formatdb does not. You have to use the full path. Here I obtain that using a Foundation function called NSHomeDirectory(). If the program does not run for this reason, the logfile contains no hint about why. It's interesting that the blast programs can handle '~' correctly.

Minimal flags passed to formatdb include:

-i  path to database file
-p  protein? F
-o  parse SeqId and do fancy stuff? F
-l  path to write logfile

Here is a very nice tutorial that covers the whole process of installing BLAST locally.

import os, sys, subprocess
from Foundation import NSHomeDirectory

def run():
    prog = '~/bin/blast'
    prog += '/programs/blast-2.2.18/bin/formatdb'

    home = NSHomeDirectory()

    directory = home + '/Desktop/analysis.seq/db.all/'
    cmd = prog + ' -i' + directory + 'db.txt'
    cmd += ' -p F -o F -l '
    cmd += directory + 'logfile.txt'

    p = subprocess.Popen(cmd, shell=True)
    sts = os.waitpid(p.pid, 0)

if __name__ == "__main__":
run()

Sunday, May 11, 2008

Parsing Pubmed with ElementTree

Continuing with the theme of using stock Python to parse XML from NCBI, I wrote a script that parses Pubmed entries. It's pretty simple although it did take me a while to get started. Here is the first part of the parsing function:

def parseArticle(article):
    D = dict()
    D['pmid'] = article.findtext(
        'MedlineCitation/PMID')

    journalinfo = article.find(
        'MedlineCitation/MedlineJournalInfo')
    D['journal'] = journalinfo.findtext('MedlineTA')

    a = article.find(
        'MedlineCitation/Article')
    D['volume'] = a.findtext(
        'Journal/JournalIssue/Volume')

There is one thing about it that's hackish. If I have a file with a single article, or with multiple articles that I've copied and pasted from Genbank, there is no proper root element and ElementTree chokes when I call ElementTree.parse(filename). What I do is to catch the exception and handle it by pasting the data into a temporary file with the added root element. Then I feed the temporary file to ElementTree.

I did a Pubmed search for my graduate advisor (E.P. Geiduschek) and my post-doctoral mentor (J.R. Roth) and pasted all 388 records to the Clipboard, then from the clipboard I sent the XML to file. The script handles this input. The first part of the printout is:

title Dissection of the Bacteriophage T4 Late Promoter Complex.
authorList Nechaev S, Geiduschek EP
journal J Mol Biol
year 2008
volume None
pages
pmid 18455735
abstract Activated transcription of the bacteriophage T4 la

Saturday, May 10, 2008

Standalone BLAST

From the top of the main BLAST page click Help. This is the link. Download the tarred files labeled macosx-universal. It's a big download (44.7 MB, 98.1 MB after it's unpacked). I put the directory in /bin in my home directory. There is documentation for all the programs, and more information on the NCBI web site. For example, here is info on bl2seq, to compare two sequences using BLAST.

To show how this works, I wrote a script that loads my favorite DNA sequence from my desktop and mutagenizes it. This is the function mutagenize():
import os, random, subprocess
R = random.Random()

def mutagenize(seq, percent=10):
    L = list(seq)
    D = { 'A':'CGT', 'C':'AGT', 'G':'ACT', 'T':'ACG' }
    N = int(percent / 100.0 * len(seq))
    X = len(seq)
    for i in range(N):
        j = R.choice(range(X))
        nt = L[j]
        L[j] = R.choice(D[nt])
    return ''.join(L)

This is the code to call the bl2seq binary:
def blast2(infile1,infile2):
    program_directory = '~/bin/blast/programs/'
    exe = program_directory + 'blast-2.2.18/bin/bl2seq'
    cmd = exe
    cmd += ' -i ' + infile1
    cmd += ' -j ' + infile2
    cmd += ' -p ' + 'blastn'
    cmd += ' -o ' + '~/Desktop/blastresults.txt'

    p = subprocess.Popen(cmd, shell=True)
    sts = os.waitpid(p.pid, 0)
    return sts[1] # exit status

Of course, one could use the Biopython module to help with this. This is part of the file:
 Score = 1572 bits (793), Expect = 0.0
Identities = 1138/1253 (90%)
Strand = Plus / Plus


Query: 1 atgacccttttagcgctcggtattaaccataaaacggcacctgtatcgctgcgagaacgc 60
||||||||||||||||| ||| ||||||||||||||||||||| ||||||||||||||||
Sbjct: 1 atgacccttttagcgctaggtgttaaccataaaacggcacctgaatcgctgcgagaacgc 60


Query: 61 gtaacgttttcgccggacacgcttgatcaggcgctggacagcctgcttgcgcagccaatg 120
|||| |||||||| ||||| |||| | || | ||||||| |||||||||||||||||||
Sbjct: 61 gtaatgttttcgcgggacaggcttaaacaagtgctggacggcctgcttgcgcagccaatt 120

The results are saved to a file on my desktop. It is text, apparently there is no option as yet to save the results as XML.

Fun with the Genetic Code

One of the first projects for a beginning Python Bioinformatics coder is to construct a dictionary holding the genetic code and use it to translate genes. The other day I posted code for a module defining a class that constructs a Python dictionary holding the Genetic Code using list comprehension. The same class makes a "reverse" codon dictionary where the keys are amino acids and the values are lists of synonomous codons. It also makes a third dictionary where the keys are codons and the values are lists of synonomous codons, but without the codon that is the key.

Let's exercise the reverse dictionary:

import random, GeneticCode2
GC = GeneticCode2.GeneticCode()

D = GC.D
rD = GC.rD

def singleChanges(codon):
    L = list()
    nts = 'ACGT'
    for i,wt in enumerate(codon):
        for mut in nts:
            if wt == mut: continue
            L.append(codon[:i] + mut + codon[i+1:])
    return L

R = random.Random(137)
for i in range(5):
    codon = R.choice(D.keys())
    print codon
    for syn in singleChanges(codon):
        print syn,
    print

ATC
CTC GTC TTC AAC ACC AGC ATA ATG ATT
TAT
AAT CAT GAT TCT TGT TTT TAA TAC TAG
CGC
AGC GGC TGC CAC CCC CTC CGA CGG CGT
AAT
CAT GAT TAT ACT AGT ATT AAA AAC AAG
AGT
CGT GGT TGT AAT ACT ATT AGA AGC AGG

aaL = set(D.values())
#aaL.remove('*')
aminoacids = ''.join(aaL)
print aminoacids

for aa in aminoacids[:2]:
    print aa
    codons = rD[aa]
    for codon in codons:
        print codon,
        for c in singleChanges(codon):
            print D[c],
        print
print '*'
print 'TAG',
for c in singleChanges('TAG'):
    print D[c],
print

ACEDGFIHKMLNQPSRTWVY
A
GCA T P S E G V A A A
GCC T P S D G V A A A
GCG T P S E G V A A A
GCT T P S D G V A A A
C
TGT S R G Y S F * C W
TGC S R G Y S F * W C
*
TAG K Q E S W L * Y Y

By mutations altering only a single nucleotide, alanine can be replaced by only a limited set of amino acids: T, P, S, E, D , G or V. That's not so surprising, since there are only nine related codons in total. There are seven amino acids which can change to an amber codon by a single base substitution. It might be interesting to explore further and ask whether the code is structured so that amino acids whose codons are related through single base changes are more similar than randomly chosen pairs.

Friday, May 9, 2008

Biopython for BLAST

I think Biopython is very nice. My only complaints are small ones (and since it's open source, developed and maintained by people working on their own time, complaints are not really fair or appreciated). Still, I sometimes compare the situation with R. R has a beautiful GUI with a "Package Manager" which will automatically download and install not only a package (similar to Biopython) but all of its dependencies as well. Instead, with Biopython I have to go out and load the dependencies for myself (and make sure I get the correct version, which is not always the latest one). In my half-dozen or so installations of Biopython, sometimes this has been easy, and sometimes not. Still, I am *very* grateful on the whole.

I've used Biopython to BLAST a lot, importing NCBIWWW from Bio.Blast. It looks well-maintained and says that it supports all of the extensive API advertised by NCBI for BLAST.

I wanted to understand how it works, so I decided to dissect it. The first thing I did was to make a copy of NCBIWWW.py on my Desktop and put in some print statements to look at various values. After playing with that for a while, I copied the key function and started pruning until I had only the code required for it to work. My minimal example is here.

Actually, it is not completely minimal because it still uses urllib.urlencode(), and urllib2.Request() rather than just calling urllib2.open(url) as we did for fetching FASTA sequences from Genbank. The explanation of how these work promises to be here but I haven't looked at that much yet.

In any case, the Biopython code follows the sequence designed by NCBI where we first send them our request called 'PUT', then they send us back an 'RID'. In the second phase, we poll them periodically with a 'GET' request with the 'RID' value. If the search has not been completed, the result we get is actually an html page and 'Status=waiting' is present in the page source. If we have specified FORMAT_TYPE=XML and the search has finished, then the result is actually the final xml page, and 'Status=' is not found. If we have not specified XML, then we get html. If saved to disk, it won't have the images (why not?), but it will display in a browser.

The BLAST interface used by Biopython is apparently the standard Blast.cgi you would contact with a browser, and the results are *not* XML by default, as I had thought, they are HTML. I got confused somehow a long time ago; as I remember, I thought that you are supposed to set the -m 7 flag for the standalone BLAST program to get XML output, which was explained as ALIGNMENT_VIEW. But an ALIGNMENT_VIEW is different. And looking at the docs for standalone BLAST, -m is not even listed as valid input. This is likely to remain a mystery---another senior moment.

The minimal requirements are pretty minimal. For the first part we need the database, program, and sequence. For the second part we need the RID, and it helps to tell them we want XML. That's it.

Thursday, May 8, 2008

Bioinformatics textbooks

I just did a search on Amazon in books for Bioinformatics and sorted by Bestselling. Among the first page of hits are only two that are remotely relevant (and the first of these is Michael Behe's Darwin's Black Box!). Sorting by Relevance seems to work better.

#1 is Bioinformatics For Dummies
#2 is Xiong, Essential Bioinformatics
#3 is a book by Pevzner
#4 and #5 are the Tisdall books on Perl, etc.
#6 is Mount
#9 is Statistical Methods In Bioinformatics
#11 is Gentleman et al., Bioconductor

So, there are a number of books I recognize and six that I own (Dummies, Tisdall 1 & 2, Mount, Ewen and Gentleman) but not one I would recommend as a Bioinformatics textbook. Xiong looks possible, but quite simple, and I haven't bought it yet. Mount covers the right topics, and was there very early, but can't stop from using 12 words when 3 will do. I studied him as (I hear) one studies the Talmud. There is no way I could recommend it.

The books I have and like very much are:

Higgs & Attwood
Deonier, Computational Genome Analysis
Introduction to Computational Genomics
Zhang
and (not really Bioinformatics):
Inferring Pylogenies
Hall, Phylogenetic Trees Made Easy

Scripting NCBI using eUtils and Python

Using the urllib2 module in the Python standard library, we could send a request directly to the Entrez server that handles normal interaction. But that would not be nice. Instead, NCBI asks that we use eUtils. In addition, we should run requests after hours (for > 100 requests) and make no more than one request every 3 seconds. Finally, they would like to have (but do not require) an email address and other information to help them help people who run really big jobs. This isn't necessary for us. We just have to figure out how eUtils works, and limit our requests.

For limiting the requests to NCBI, I use a class from Biopython called RequestLimiter. The first example is almost the same as one I posted about the other day, but it includes the RequestLimiter. It uses EFetch to grab some FASTA-formatted sequences from Genbank. The previous post showed how to use the Biopython Fasta parser with the resulting data. Here we do a bit of random testing to see how robust my code is. I pick ids at random and ask for them. It seems to work fine. Occasionally we see this:

# Error: 106820 is confidential: access denied

The second example extends this to use WebEnv. It has two parts: we use ESearch to send a request to NCBI, but we ask them to save the results. They send us back a kind of cookie called WebEnv that allows us to access the results later. I use the xml.etree module to do the parsing of XML returned from ESearch. I mentioned this briefly before. One important note is that etree is only installed by default with Python as of version 2.5 (which ships with Leopard). Otherwise you'll have to install it. It should be easy. Once we have the value for the cookie we use EFetch to get the data. Much more complicated scenarios are possible and described in various places, but I haven't tried them yet. This is just to get you going.

Python for simulations (2)

The Central Limit Theorem is supposed to be very important for Bioinformatics. The Deonier book, for example, discusses it starting on p. 79 and even gives a proof. My contention is that it is unnecessary for a biologist working in bioinformatics to understand this proof. It would be nice but it is an unnecessary distraction. We are like engineers and not mathematicians. We should know that one can prove, but not necessarily know how to prove that the maxima and minima of a curve occur when the slope is zero, we just want to know how to calculate these values. It's like Treasure of the Sierra Madre, when the bandit says: 'lemmas, we don't need no stinking lemmas.' Or something. But what is nice, is to show a demonstration that the Central Limit Theorem predicts results that are borne out by experiment. In short, we need more simulations.

Here is a Python script that generates a list or vector of numbers from five distributions: uniform (0 to 1), normal (mean = 0, stdev = 2), and three beta distributions. One of these has alpha, beta = 5, 1; the others are 2, 5 and 0.5, 0.5. We generate 2E5 numbers from each distribution and save them to disk (14.3 MB file). We then take groups of 50 from each distribution and sum them, and save those results to disk in a separate file. We use R for plotting. I didn't time them but Python and R each take about 10 seconds on my machine.

It looks like the mathematicians may be on to something:

Wednesday, May 7, 2008

Thinking aloud

I noticed recently that some words I have been saying silently to myself for years (and never heard aloud), I mispronounce. For example, I got sucked into Project Euler for awhile. Come to find out from Wikipedia, it is not 'eu' like in 'eubacteria' but 'oi' like in 'oiler.' Or FASTA, which is 'FAST-Aye' according to the same source. The silliest one is 'GUI', which is apparently pronounced 'gooey' by the cognoscenti. You'll never hear me saying that...it sounds really stupid.

Another issue is grammatical. You probably know that messenger RNA is not an apt name, messengers don't get carried or translated, messages do. And the Genetic Code has a quite specific meaning. However, editors at the New York Times and elsewhere apparently like to use it as a synonym for DNA sequence, as in 'Scientists have deciphered the genetic code of the honeybee.' Just google 'site:nytimes.com genetic code' if you don't believe me. I wrote to the author of one of these articles to inform her of her nomenclatural error, and she said that the editors of the New York Times simply don't care what we scientists think. They decided that people feel more comfortable with their usage, and they are not bothered that it's incorrect! It seems a pretty cavalier attitude from the top U.S. newspaper of record.

Python for BLAST XML

I think it was probably the Biopython tutorial that first made me aware of NCBI's policy: use XML (or ASN.1) if you want us not to break your code. They change HTML and plain text output at will in order to improve the experience for a human reading it, and there is no guarantee when things might change again. (I couldn't find a page at NCBI to cite in a quick search but I remember seeing this disclaimer). Scripted interaction with NCBI is supposed to use eutils, many of the eutils (like ESearch) return only XML, while others (like EFetch) respond to rettype=fasta or rettype=text. [This is the executive summary, using WebEnv one can gain more control over the output, but that is another post.] NCBI makes extensive use of another format called ASN.1 but have surrendered to the popularity of XML for data exchange over the web.

XML can be a complicated beast, however NCBI XML documents seem pretty regular. If we have saved BLAST results in XML format, how can we parse the result? One way is to use Biopython. There is a good description of how to do this in the Biopython tutorial (see the link above). This is what I have been doing. However, there is a really lightweight and simple alternative. As of version 2.5, stock Python comes with at least part of the ElementTree library. Additional explanation may be found here, here, and here.

I stripped out most of a BLAST results XML file to give a simplified overview. It's a hierarchy of open and close tags. The part of the hierarchy we are interested in has data like this:


There is a separate Iteration for each FASTA formatted sequence in the input--i.e. each separate BLAST query. There is a Hit for each sequence in the database that shows a match. And any individual Hit may have more than one section that matches, these are called Hsp for high-scoring segment pair. We navigate down the tree until we find what we're looking for. We can jump levels: e.g. given a hit we do

for hsp in hit.findall('Hit_hsps'):
    identities = hsp.findtext('Hsp/Hsp_identity')


Here is a script to parse an example file (remember to change the file extension if you want to run it). And this is what it prints:

blastn
gi|74145426|gb|DQ174269.1|
Achromobacter xylosoxidans 16S ribosomal RNA gene, partial sequence
DQ174269
id: 476 len: 476 %id: 100.0
ATTGAACGCTAGCGGGATGCCTTACACATGCAAGTCGAACGGCAGCACGG
||||||||||||||||||||||||||||||||||||||||||||||||||
ATTGAACGCTAGCGGGATGCCTTACACATGCAAGTCGAACGGCAGCACGG

gi|15384334|gb|AF411020.1|
Achromobacter xylosoxidans strain AU1011 16S ribosomal RNA gene, partial sequence
AF411020
id: 476 len: 476 %id: 100.0
ATTGAACGCTAGCGGGATGCCTTACACATGCAAGTCGAACGGCAGCACGG
||||||||||||||||||||||||||||||||||||||||||||||||||
ATTGAACGCTAGCGGGATGCCTTACACATGCAAGTCGAACGGCAGCACGG

I'm sure you can take it from there. Here is a page I wrote a while back about parsing Medline using ElementTree. It has added eye candy.

Python for engineering restriction sites

Sometimes one needs to mutate a segment of DNA to introduce a new restriction site, with the requirement that the change be silent. I'll show here a program I wrote a few years ago to do that. At the time, when I searched, I could not find such a program available on the web.

The first thing we need is a list of restriction enzymes and their sites. You can get lists from New England Biolabs and others in various formats (which typically have to be massaged a bit to be useful). I used a module from Biopython called Restriction_Dictionary. I wrote a module to filter and reformat this information the way I wanted it. I also used a second module of my own called GeneticCode2. Since these modules need to be imported by the main script they are required to have the '.py' file extension. So if you decide to try out this example, be sure to change the filenames appropriately. As usual, I have posted '.txt' files so they'll simply show up in your browser.

The GeneticCode file is interesting in its own right. It makes the dictionary to hold the Genetic Code in just a few lines.



However, because it's done by code I needed to proofread it carefully, and to do that, I needed to get it to print out as the standard code is displayed, and to do that, I had to write my own comparison function. It's worth a look.

The main script is called extrasites. The way it works is to go through the gene codon by codon, changing each codon to all of its synonymous codons, then evaluating whether there is a new restriction site. It's restricted to 6-cutters because they are the most useful. Therefore, we evaluate the potential mutants in the context of two flanking codons both upstream and downstream to cover the full 6 bp.

Here is part of a run for the Salmonella typhimurium hemA gene:

codon 8 AAC => AAT
GGTATT AAC CATAAA
GGTATT AAT CATAAA
VspI ATTAAT

codon 12 GCA => GCG
AAAACG GCA CCTGTA
AAAACG GCG CCTGTA
KasI GGCGCC

[snip]

codon 416 CTG => CTT
CTCGGG CTG GAGTAG
CTCGGG CTT GAGTAG
BpuEI CTTGAG

codon 416 CTG => CTC
CTCGGG CTG GAGTAG
CTCGGG CTC GAGTAG
PaeR7I CTCGAG

time elapsed = 0.065 seconds

Slides for Introduction to Python

As I mentioned, I'm planning to teach an introductory course on Bioinformatics next spring, something like "Bioinformatics for Biologists." The target audience is graduate students in Microbiology and other fields of molecular biology. My thought is that we have too few biologists who are doing Bioinformatics compared to the hordes of CS, math and statistics types. We don't necessarily need more tools, just train people to use simple ones well. I'm planning to use Python a lot in the course for simulations and demos of algorithms.

So I'm working on an introductory lecture about Python. The problem is, I don't have time to teach the students to really use Python. I'm guessing I can probably spend 6 hours on both Python and basic math. I hope I can cover enough in these few hours to allow students to understand my examples. Anyway, I have posted the slides for part of this to .Mac, but be warned that it is a 4.4 MB pdf. I hope that most of it will be comprehensible without the narration.

I have learned an enormous amount from other people's slides on the web. Sadly, there seems to be a trend to requiring a University login to see them. On the other hand, I would gladly post all of my lecture slides but I can't without careful attention, because I shamelessly steal images from the web to illustrate them. I call that Fair Use, but it wouldn't be fair anymore if I distributed them to the web.

Python for simulations


I'd like to show an example of how Python is useful for simulations. I saw it in a terrific book called Probability by Grinstead & Snell. You can get a free pdf of the book online, but after reading that for a while I went out and bought my own copy. They illustrate use of the Monte Carlo method for integrating a function. I always thought it was Johnny von Neumann who coined the name, but the Wikipedia article attributes it to Stanley Ulam and his gambling uncle who was partial to a casino at Monte Carlo.

In any event, it works just like Battleship (png).

We throw darts at a section of the 2D plane, let's say a square containing points 0 < x <= 1 and 0 < y <= 1, by getting two random numbers, x and y. We then evaluate the function f(x) and if y <= f(x), we call that a "hit." We calculate the fraction of darts that hit the target.


Below is a Python program that integrates y = x2. We can see at a glance that the shaded area is somewhat less than 1/2, but how much less? If you remember your calculus, you'll know our function integrates to 1/3 x3, evaluated between 0 and 1, which equals 1/3. Python gives us 3 significant digits from a million trials. Pretty slick. It can also be quite inefficient. I should be able to use my recent study of elementary statistics to calculate the expected accuracy for a given number of trials, but it eludes me right now. I do know that you'll use up a lot of darts trying to get an approximation for pi by using a circle as the target.



Here is the R code for the plot:
f <- function(x) { return (x*x) }
plot(f,0,1,cex=2)
xvals = seq(0,1,length=100)
yvals = f(xvals)
x = c(xvals,rev(xvals))
y = c(rep(0,100),rev(yvals))
polygon(x,y,col='gray')
text(0.2,0.7,expression(y == x^2),cex=2)
lines(c(0,1),c(0,1),lty=2)

Simple FASTA manipulation

As you know, the FASTA format is a text format for sequences consisting of a single-line description which starts with '>' and is then followed by the sequence. Multiple entries are allowed in a single file. The sequence part of the entry may contain newlines or not. There may be a single newline between FASTA entries or one or more blank lines. For DNA, the commonly used symbols are A, C, G, T and N, but there are other legal characters as well. Object oriented approaches (like Biopython) will turn a DNA sequence into an object, with methods and so on, but I prefer to deal with sequences as simple strings.

When I retrieve a set of sequences from Genbank, entries are separated by a single space. The sequence is upper case with 70 characters per line. This is useful because if we read the data as text, we can get the individual entries by splitting on two consecutive newlines. The following code uses the urllib2 modeule from the standard library, and obtains two sequences from Genbank:



Usually I will then remove all the newlines (with split()) and reformat at the desired line length. I wrote a simple function to do that:



There are certainly more sophisticated ways to handle FASTA format. Biopython has a parser that works like this (after doing from Bio import Fasta):



This morning I came across a library that I didn't know about previously, which handles interaction with NCBI via eutils. I plan to post on my simple-minded code for scripting with eutils, and then later when I have time I'll take a look at the new library and report back. I also need to work on my picture formatting skills, so the code snapshots will all be the same size. :)

File formats

Let's talk about two aspects of file names and formatting relevant to what I'm doing here. The first is the file extension. The files I use for programming are plain text files. For typical use, I put a '.py' extension on the end of program file names, so that Smultron or BBEdit will do the syntax coloring for Python. It doesn't matter to Python. I can do 'python program.txt' in Terminal and Python will run the script just fine. I could even define '.txt' in Smultron's Preferences so that it would be colored for Python syntax, but that is distracting if a file isn't Python code.

Blogger won't let me upload text files, as far as I can tell, so I am putting my example code files on the .Mac server and linking to them. If I link to a '.py' file, when I click on the link the file is automatically downloaded to my computer. I (and I bet you also) find that behavior disturbing, unless the link advertises itself (say as 'pdf'). My solution is to change the file extension to '.txt'. That makes the linked file appear in the browser window. It's not perfect---the files appear to have 8 spaces rather than 4 for the indents. But it works well enough and the script should run for you without modification.

Previously, I showed code in html files, either with commentary or not. I adopted a strategy I saw Peter Norvig use, where the code is in a colored block, like in my very first post. It looks nice, but it's extra work, and you need to copy and paste before you can run it.

The second issue is new lines. As you probably know, there are three conventions in the world: Windows, Mac (old style), and Unix (including OS X). I use Unix NL exclusively (technically, it's called LF, line feed, 0x0A or '\n'). Here is a program to show that:



And, as you've just seen, when I want to emphasize a particular section of code, I'll paste a screenshot as a '.png' image. One last peculiarity, I've gotten in the habit of using single quotes, simply because I don't have to shift---I get fewer typing errors. Python lets you use either single or double quotes around strings, as long as they match.

Tuesday, May 6, 2008

Fun with Sudoku

In a previous life, I was interested in Sudoku for a while. I never got into doing the puzzles, but I wanted to write a Sudoku solver in Python. It seemed like a fun problem, and it is not too hard, until you get to puzzles which have no two-way decision points where one decision leads to an invalid puzzle, forcing the other path. Then I stumbled across this. Naturally, Peter Norvig is way smarter than me, that's one reason Google pays him so much. If you don't believe me, watch this. Norvig says: "as computer security expert Ben Laurie has stated, Sudoku is "a denial of service attack on human intellect." That sounds about right to me.

Anyway, while I was thinking about Sudoku, I read the Wikipedia entry. It says: "The numerals...are used for convenience...any set of distinct symbols will do, such as letters, shapes or colours." And that got me thinking... I ended up writing a Cocoa application for Color Sudoku. At the time (late 2005), nobody else had implemented one that I could find on the web. However, a quick Google search shows that the situation has changed.

Still, I don't think anybody has done it like I did. It is so long ago that I really don't feel like reading through the code. Like many projects, it grew without apparent direction. But the app still works under Leopard and it is fun to play. It is in my public folder at .Mac. There are other goodies there as well. Here is a screenshot:



Each number is represented by a different color. If a given position in the 9 x 9 grid is completely determined, it is filled in with a solid color. Otherwise, the possible values for the square are given as smaller squares. You click on squares to make them go away, when you deduce that they are not possible values (or command-click to select one). I find it very easy and intuitive to play, and it beats the heck out of crossing out numbers.

I notice that the version in the folder does not bring the window back with command-N. I only learned how to do that later.

One more point. There is a lot of Perl used for Bioinformatics. As far as I can tell, there is no reason to write any new project in Perl. I know it is partly a matter of taste. Still, this guy is very proud that his Perl Sudoku solver is only four lines of completely obscure code (actually, I see he has made it three lines now). Not all Perl is like this, but I found that after a week or two, I simply could not understand how my programs worked without a lot of effort. It's not like that in Python. Here are a couple of citations on point:
Programs must be written for people to read, and only incidentally for machines to execute.

- Abelson & Sussman, SICP

How do we convince people that in programming simplicity and clarity—in short: what mathematicians call "elegance"— are not a dispensable luxury, but a crucial matter that decides between success and failure?

- E. W. Dijkstra

Python for remote control

Python comes with an extensive standard library. One of the modules in the library is called 'subprocess.' This link provides a fairly detailed view of how it works. I will just give a simple example.

Suppose I have a script running in Python and I want to plot some data in R. Perhaps I am doing this repetitively, so naturally I'd like to automate the R interaction. After constructing a list holding the text for the R commands, I just do something like this:


import os, subprocess

FH = open('.temp.Rcode.txt','w')
FH.write('\n'.join(L))
FH.close()

cmd = 'R CMD BATCH ' + os.getcwd()
cmd += '/' + '.temp.Rcode.txt'
obj = subprocess.call(cmd, shell=True)

os.remove('.temp.Rcode.txt')
os.remove('.temp.Rcode.txt.Rout')


I build a text file containing a set of R commands and then I pass it to the R interpreter using subprocess. The file is invisible in the directory because it starts with '.' A nice addition to this example would be to monitor the return code for the process (success or failure) or to wait for the subprocess to finish, in case I need to use the results for something else. The variable obj in the above example contains the return code. To wait, the object returned by subprocess.Popen(cmd, shell=True) has a method wait() which pauses execution until the process is terminated.

Python for text processing

I'm planning to teach a course on Bioinformatics next spring. We're going to use this book by Higgs and Attwood. It has great coverage of the important topics, teaches some of the math, and is very clearly written. I'm working up a bunch of examples (maybe too many) showing the power of Python for modeling and simulations. To simplify the graphics, I usually save text files containing data which I then open and plot using R.

Here is an example for text processing. We start with the most famous novel in English. Here is a scan of the first page of chapter 1 from my bound copy:



I love the layout. I got the text from Project Gutenberg (previous link) and modified it by stripping out some header text. We use python to open it and count the combinations of two letters. All 26 x 26 possibilities are saved in a dictionary, like this:



After writing the results to disk, we open it with R and plot. The combined code weighs in at a hefty 70 lines or so. It looks pretty impressive. The size of the circle is proportional to the logarithm of the count.



Update: I realize now that this might not be exactly what you want. One might be interested in the differences in frequency of the second letter, given the first letter. For that, we need to normalize for the frequency of the first letter.

Home-grown plotter

For various reasons, which probably are not sufficient to justify the time I've invested, I decided to try to plot phylogenetic trees myself. Perhaps one reason is that I could not find a program with a GUI for OS X. Another reason is that programming small examples is the best way I know to learn Bioinformatics.

The Newick format is commonly used for trees. For example: (A:0.1,B:0.2,(C:0.3,D:0.4):0.5);

Some examples have the internal nodes named, but the default output I started with from R's 'ape' package does not. So, keeping things simple to begin with, my code only handles trees with unlabeled internal nodes. (We assign those nodes labels of our own internally). It does allow trees which are not strictly bifurcating, like the above sample.

I'll use as an example of some rRNA sequences for species in the genus Campylobacter. The tree as written to a text file looks like this:

(ATCCBAA-381:0.03654844585,(((((L04322.1:0.0007711714468,L06977.1:0.001632679337):0.004503354114,
DQ087191.1:0.005977482141):0.009591658445,L04313.1:0.01628156841):0.003053993324,RM1221:
0.05317691099):0.00527415189,(L06975.1:0.0007282093916,L06974.1:7.179077908e-05):0.02100563578):
0.006031674847,(((DQ174168.1:0,L04320.1:0):0.001540638157,AM420221.1:
0.0008632126266):0.01623349904,AY005038.1:0.01750604341):0.002187441243);

Here is the output from R for a plot, where I have substituted standard species names for the Genbank IDs:



I'll try to show in future posts how my code works. (You might download it and try it on this sample tree). In outline we:
• capture the titles and substitute simple labels
• parse the structure of individual nodes
• walk the tree and measure distances from each node to the next and the root
• save the information to a text file with fields:
   - label
   - distance to root
   - distance from bottom of plot
   - distance to next node
   - y coordinate of first subnode (for internal nodes)
   - y coordinate of last subnode (ditto)



Above is the image produced by my Cocoa application that does the plotting (though one could certainly write R code to do it using the information we saved). This basic set of programs works. One thing I need to do now is much more extensive testing (boring). Also, it would be nice to have some shiny GUI elements. Imagine a joystick to move labels around, or being able to select subtrees for manipulation (like coloring).

The tree is the same as the first one. You would be able to see it more easily if we were using the species names, but I haven't done that part yet. Some nodes have been rotated, but that's legal. The only other difference is a new OTU (operational taxonomic unit) or species (labeled L) that I added to my database since I made the first figure a few weeks ago.

Plotting phylogenies

I've been doing a bit of phylogenetic analysis recently on sequences of 16S rRNA genes found in environmental samples (actually, dental plaque). To illustrate this, let's grab a few sequences and generate a phylogenetic tree. Genbank and EMBL are the best known sequence repositories. Another source is the Ribosomal Database Project. Use the hierarchy browser to go to Firmicutes/Lactobacillales/Streptococcacaeae/Streptococcus and select some sequences. If I mouse over an identifier I get a popup window where I can select FASTA.  I copy the sequence from the resulting window and paste it to a text file.

The FASTA format is defined here. There are various problems one can run into with long titles, including the presence of spaces, character limits, and a requirement for unique names within those limits. For now, I'll just remove most of the title, leaving '>' plus the Genbank id from the end of the line. (If the identifiers are already known, I have a Python script that will retrieve the sequences from Genbank automatically). Now I have 5 sequences saved in 'strep.txt':

X58303 Streptococcus mutans
AF003928 Streptococcus sanguinis
AF003930 Streptococcus pneumoniae
AB002521 Streptococcus pyogenes
AF003933 Streptococcus parasanguinis

Next I run clustal to get a sequence alignment. I got clustalx and clustalw from here. (Click on Download Software). Load the sequences into ClustalX (2.0.5). Under Alignment => Output Format Options select 'FASTA format' and then select 'Do Complete Alignment.' It looks like this:



Now, to plot the tree. ClustalX has given us an output file named 'strep.fasta.' I start up R (get it here). I have previously used the Package Manager to install a phylogenetics package called 'ape'. For more documentation you can search the R web site for 'ape'. This is my interaction with the R interpreter:



And here is the plot of the tree:



Getting started with Python

There are a lot of possibilities for writing and executing Python code. On the Mac it's easy because Python is pre-installed (PyObjC too, if you install the Developer Tools along with Leopard). Despite what python.org says, I haven't felt the need yet to upgrade from what came with the OS installation. For my text editor, I use a free, open-source, syntax-coloring text editor called Smultron. I like it so much I really need to make a donation. Posted below is a snapshot of a simple program in the editor. I named the file 'factorial.py'. The extension '.py' lets Smultron know how to color the code.

Once the file is saved to disk, I just open Terminal, cd to the directory where the file is, and type 'python factorial.py'. Actually, I can type 'python fac' and press tab for command-completion. One nice thing about this setup is I just click to switch between the open file window and the Terminal window. Terminal has command history (up arrow) as well, so it's easy to rerun a script.  I usually use command-K in Terminal to clean up the window if the output is extensive.

As for finding bugs, I use print statements to check the values of things as I go along. Norm Matloff would not approve, but it works for me, even for complex programs. When I'm sure of a segment, I just comment out the print statements. I pasted the output into the code file. It looks like this:

Monday, May 5, 2008

def __init__(self):


Never thought I'd see the day when I start a blog (today). What I hope to do is provide hints and code that will help people using Python, particularly for exploring Bioinformatics. Since I am an unabashed Mac 'fanboy' a substantial part of what I put here will be Mac and Cocoa-specific, but certainly not all.

Here is a link to my .Mac pages. I show a bunch of simple examples using Cocoa and PyObjC. Something about .Mac, they don't allow Google to index the site, and these pages were never really visible to the web. It has been a while since I wrote much of that stuff, so parts may be broken.

PyObjC is wonderful, but I found it confusing to start with because the documentation is limited and much of it outdated. Also, simple examples tend not to be simple enough, in my experience. Yesterday I started playing with PyObjC again, and learned that things are a bit different now with XCode 3.0 and IB 3.0. In the old days, you would add an outlet to your classes in IB and then just call an appropriate method for that outlet in your code. Now, there is an extra step needed, a call to objc.IBOutlet() as shown in the code below. Actions setup in IB are also supposed to be decorated, although my first test showed that this was not actually required.

from Foundation import *
from AppKit import *
import MyView, MyDataLoader

class PyTreeAppDelegate(NSObject):
    myView = objc.IBOutlet()
    myDataLoader = objc.IBOutlet()

    def applicationDidFinishLaunching_(self, sender):
        NSLog("Application did finish launching.")

    def awakeFromNib(self):
        self.myDataLoader.loadExample()
        self.myView.setNeedsDisplay_(True)

    @objc.IBAction()
    def buttonPushed_(self,sender):
        doStuff()


Link to the code.

Update: I forgot to mention, you only get PyObjC (and of course, XCode and Interface Builder) if you install the Developer tools from the Leopard install disk.