Tuesday, December 28, 2010

simple C example 2

Continuing from last time (here), we put all the previous code (minus the function main) into a separate file print_bits.c and then put the declarations of our two functions in a header file: print_bits.h:

void print_byte(unsigned char);
void print_word(const char *);

We modify main, adding code to get information passed in from the operating system, and put that in a separate file bits.c:

#include <stdio.h>
#include "print_bits.h"
#include "cast.h"

int main(int argc, const char* argv[]) {
    const char *greeting = "Hello world!\0";
    printf("greeting: %s\n\n", greeting);
    char c;
    c = 'a';
    printf("%c\n", c);
    print_word(greeting);
    int i;
    printf("%s", "\n");
    if (argc > 1) {
        for (i=1; i < argc; i++) {
            print_word(argv[i]);
        }
        printf("%s", "\n");
    }
    cast_int();
    return 0;
}

We need to #include <stdio.h> to use printf, but we don't need the other C library headers here; they are placed in print_bits.c. There is a second set of new files (cast.c and cast.h) that I'll explain below.

Our project isn't very complicated, but it can still benefit from using the make tool. We put a file called Makefile in the directory we're working in. Ours has this code:


bits: bits.o print_bits.o cast.o
	gcc -o bits bits.o print_bits.o cast.o

bits.o: bits.c
	gcc -c bits.c

print_bits.o: print_bits.c
	gcc -c print_bits.c

cast.o: cast.c
	gcc -c cast.c

clean:
	rm -f bits.o print_bits.o cast.o bits


As explained in the manual for make and Norm Matloff's introduction (here), these instructions consist of

target ... : prerequisites ...
	recipe
	...


When invoked by itself, make will try to build an executable named bits by linking the three object files; these in turn depend on the listed source files. The -c compiler flag says to compile only, without linking, at that stage. And clean is self-explanatory. One odd requirement is that each recipe line must begin with a tab character; spaces won't do.

The point is that make will call the compiler only if the timestamp of last modification to a .c file shows it's necessary. So in a complex project, only those files that have been changed get re-compiled in each debug cycle.

[UPDATE: As pointed out in comments, there was a problem with an earlier version of this post, because I forgot the proper syntax for our own header files, which is #include "print_bits.h" (with quotes) rather than the angle-bracket form used for system headers like #include <stdio.h>. The search path is determined by the format of the #include. See here. My makeshift solution :) was to tell the compiler where to look for our files using -I., which says to search for them in the current directory. ]

The last part of this little project explores casting. We have two int variables j and k, and a long (int) variable m, declared and assigned in turn (in cast.c). In order to examine what these look like in memory, we do this:

char *c = (char *) &j;

which assigns the address of j to a char pointer c. We tell the compiler that yes, we really want to do this, using a cast (char *). We examine the surrounding bytes as follows:

int i;
int j = 10;
int k = 660;
long m = 21;
char *c = (char *) &j;
printf("address of c: %p%s", c, "\n\n");
for (i=-16; i < 4; i++) {
    print_byte(c[i]);
    printf(" %p%s", &c[i], "\n");
}
printf(" int: %ld%s", sizeof(int), "\n");
printf(" long: %ld%s", sizeof(long), "\n");
printf("size_t: %ld%s", sizeof(size_t), "\n");



$ make
gcc -c bits.c -I.
cc bits.o print_bits.o cast.o -o bits
$ ./bits

address of c: 0x7fff5fbff9d8

00010101 21 0x7fff5fbff9c8
00000000 0 0x7fff5fbff9c9
00000000 0 0x7fff5fbff9ca
00000000 0 0x7fff5fbff9cb
00000000 0 0x7fff5fbff9cc
00000000 0 0x7fff5fbff9cd
00000000 0 0x7fff5fbff9ce
00000000 0 0x7fff5fbff9cf
00000000 0 0x7fff5fbff9d0
00000000 0 0x7fff5fbff9d1
00000000 0 0x7fff5fbff9d2
00000000 0 0x7fff5fbff9d3
10010100 148 0x7fff5fbff9d4
00000010 2 0x7fff5fbff9d5
00000000 0 0x7fff5fbff9d6
00000000 0 0x7fff5fbff9d7
00001010 10 0x7fff5fbff9d8
00000000 0 0x7fff5fbff9d9
00000000 0 0x7fff5fbff9da
00000000 0 0x7fff5fbff9db
int: 4
long: 8
size_t: 8

We can see a number of things from this example. The pointer to char c (assigned the address of j using &j), when dereferenced, gives decimal 10 in the first byte, followed by three zero bytes. The size of an int on our system is 4 bytes, even though this is a 64-bit executable:

$ file bits
bits: Mach-O 64-bit executable x86_64

The Intel Core 2 Duo chip addresses memory in "little-endian" fashion---it's the first (lowest-addressed) byte that holds the least significant part of the value, 00001010. And four bytes before this is the first byte corresponding to the int k, even though k was assigned after j (k = 148 + 2*256 = 660). One other thing is that the long int m, whose size is 8 bytes, actually starts 12 bytes before k. I'm not sure why this position was chosen, but presumably the compiler is aligning the 8-byte long on an 8-byte "boundary". The fundamental unit of size:

sizeof(size_t) 

is 8 bytes.
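To make the byte ordering concrete, here is a small Python sketch (mine, not part of the C project) using the struct module; '<i' and '<q' are explicit little-endian 4-byte and 8-byte integer formats, and the values just mirror j, k and m from cast.c:

import struct

j, k, m = 10, 660, 21
for name, fmt, value in [('j','<i',j), ('k','<i',k), ('m','<q',m)]:
    packed = struct.pack(fmt, value)
    print name, [ord(b) for b in packed]

# j [10, 0, 0, 0]
# k [148, 2, 0, 0]              148 + 2*256 = 660
# m [21, 0, 0, 0, 0, 0, 0, 0]

The least significant byte comes first in each case, matching the memory dump above.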

And finally, if we force the compiler to produce a 32-bit executable by using the -m32 flag:

gcc bits.c print_bits.c cast.c -I. -m32 -o bits

the last part will print

   int:  4
long: 4
size_t: 4

Now we get 4 byte (32-bit) longs and size_t is also 4 bytes.

Zipped files on Dropbox (here).

simple C example 1

This post is in the category of notes to myself. Unless you're a complete newcomer to programming, it probably won't help you. But I hope it will help me to remember the essentials.

The smallest fundamental data type in C is the char. A char, whether signed or unsigned, occupies one byte (8 bits). We can assign to a char variable using either a single-quoted text character like 'a' or a numeric constant, whether decimal, octal or hexadecimal:

char c;
c = 'a';
c = 97;
c = 0141;
c = 0x61;
printf("%c\n", c);

The call to printf (declared in stdio.h) will print 'a' for each of the 4 assignments.

For the example I'm showing here, the #include statements we need are:

#include <stdio.h>
#include <string.h>
#include <math.h>

These give access to the printf, strlen and pow functions, which are declared in these .h files and defined elsewhere.

Our approach to examine the individual bits of a char is to use a combination of the bitwise AND operator (&) and a bit-shift operation, as shown in the following code fragment:

void print_byte(unsigned char x) {
    int i, value = 0;
    char A[8];
    // read lowest to highest
    for (i = 0; i < 8; i++) {
        if (x & 01) { A[i] = '1'; }
        else { A[i] = '0'; }
        x >>= 1;
    }

We examine the least significant bit of x by doing a bitwise AND using x and the octal constant 01 (00000001 in binary). We get each bit in turn by doing a rightward bit-shift operation, and assign the result directly to x. Since this approach reads the bits from right to left, I decided to just save the characters '0' and '1' to an array and then print them in reverse order.

Another approach would be to test the high value bit and print it directly. In order to do that, you would use this test: if (x & 0x80), and do a leftward bit-shift x <<= 1.


    // continuing print_byte: value was declared and zeroed above
    // print highest to lowest
    for (i = 7; i > -1; i--) {
        printf("%c", A[i]);
        if (A[i] == '1') {
            value += pow(2,i);
        }
    }
    printf(" %3d ", value);
}

We use the pow (power or exponentiation) function from the math library to convert the value of the byte to decimal.
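As an aside, here is a quick Python sketch of the other approach mentioned above (test the high bit, then shift left). It's only an illustration, not part of the C files for this post:

def print_byte_high_first(x):
    bits = []
    for i in range(8):
        if x & 0x80:                 # test the high bit
            bits.append('1')
        else:
            bits.append('0')
        x = (x << 1) & 0xff          # shift left, keep 8 bits
    print ''.join(bits)

print_byte_high_first(ord('a'))      # 01100001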

A second function applies the first one successively to the chars of a C string variable, defined as shown below, in main.

The variable that we'll print is a pointer to char or char *, that is, a variable which when de-referenced gives us back a char. So the string is actually an array of chars. It is distinguished by the double quotes and a termination marker. In C you normally need to know how many values are present in an array in order not to do something stupid, but in the case of a string the convention is to terminate with the \0 character. The code for print_word is:

void print_word(const char *s) {
    int j;
    char z;
    printf("word: %s\n", s);
    for (j = 0; j < strlen(s); j++) {
        z = s[j];
        if (z != '\n') {
            printf("%c ", z);
            print_byte(z);
            printf(" %p \n", &s[j]);
        }
    }
}

int main() {
    const char *greeting = "Hello world!\0";
    printf("greeting: %s\n", greeting);
    print_word(greeting);
    return 0;
}

We combine these four sections of code in a file test.c and do:

$ gcc test.c -o test
$ ./test
greeting: Hello world!
word: Hello world!
H 01001000 72 0x100000e96
e 01100101 101 0x100000e97
l 01101100 108 0x100000e98
l 01101100 108 0x100000e99
o 01101111 111 0x100000e9a
00100000 32 0x100000e9b
w 01110111 119 0x100000e9c
o 01101111 111 0x100000e9d
r 01110010 114 0x100000e9e
l 01101100 108 0x100000e9f
d 01100100 100 0x100000ea0
! 00100001 33 0x100000ea1

For each character of the string we print the binary representation, the decimal equivalent, and the position of that char in memory. That last comes from here:

printf(" %p  \n", &s[j]);


So we can see that the individual chars of the string are laid out in successive bytes in memory.

This is all quite fundamental. The parts I have trouble remembering when I come back to C after a long absence are:

- the distinction between single and double quotes
- one can access the individual chars of a string and assign them to chars
- the formatting codes (% plus c, s, i, f, p etc.)
- remembering to #include
- remembering to add parentheses around this test: if (A[i] == '1')
- remembering to add braces around the entire group of conditional statements in:
if (x) { a; b; c; }

My usage of braces is non-standard, but I think it makes sense. I always open them at the end of a line, and close them at a position indented to match the first character of the line where they were opened (except when the expression is really short).

You'll notice const and unsigned modifiers in the code above, but they're not essential.

I think that's it. In the next post, I'll develop this into a multi-file project and introduce the make tool.

Thursday, December 23, 2010

Cython 3: my own .c source file

Yet more on elementary Cython. I have a working example with my own C code, thanks to help from the Cython users' list (here)---see the UPDATE below for one small change:

hello2.c:

#include <stdio.h>
void f();

void f() {
    printf("%s", "Hello world!\n");
}

hello2_caller.pyx:

cdef extern from "hello2.c":
void f()

cpdef myf():
f()

setup.hello2_caller.py:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

sourcefiles = ['hello2_caller.pyx']
ext_modules = [Extension("hello2_caller",
                         sourcefiles
                         )]

setup(
    name = 'Hello2 app',
    cmdclass = {'build_ext': build_ext},
    ext_modules = ext_modules
)

build it:

python setup.hello2_caller.py build_ext --inplace

import and use it:

>>> import hello2_caller
>>> hello2_caller.myf()
Hello world!

The compiler complains:

In file included from hello2_caller.c:219:
hello2.c:3: warning: function declaration isn’t a prototype
hello2.c:5: warning: function declaration isn’t a prototype

But it works.

UPDATE: the problem is that these two declarations are not the same:

void f();
void f(void);

and we need to have the second one: void f(void) is a true prototype saying the function takes no arguments, while void f() leaves the arguments unspecified.


$ python setup.hello2_caller.py build_ext --inplace
running build_ext
cythoning hello2_caller.pyx to hello2_caller.c
building 'hello2_caller' extension
creating build
creating build/temp.macosx-10.6-universal-2.6
gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch ppc -arch x86_64 -pipe -I/System/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c hello2_caller.c -o build/temp.macosx-10.6-universal-2.6/hello2_caller.o
gcc-4.2 -Wl,-F. -bundle -undefined dynamic_lookup -arch i386 -arch ppc -arch x86_64 build/temp.macosx-10.6-universal-2.6/hello2_caller.o -o hello2_caller.so



$ python
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hello2_caller
>>> hello2_caller.myf()
Hello world!


Cython 2

I've been puzzling over the Cython docs and tutorial for quite a few hours now, trying to write a simple C function (in a .c file) and compile and call it from Cython. Like the example from before (here), except using my own function rather than a C math library call. I can't figure it out, so I posted questions on Stack Overflow and the Cython users list. While I'm taking a break I decided to play a bit with Fibonacci numbers.

The setup is simple: in cy.pyx (see below) we have two alternative implementations that compute the Nth element of the Fibonacci series. One version has Cython declarations that the variables are ints (and one for the function itself). The second version is plain Python. In a simple Python script cypy.py we do

import pyximport
pyximport.install()
import cy

def f(N): cy.cyfib(N)
def g(N): cy.pyfib(N)

This translates the Python to C, builds it, and makes it accessible from within Python. We have to "wrap" the functions in Python calls in order to use timeit (at least, that's how I did it). Then we go to the command line and compare:

$ python -m timeit -s 'import cypy' 'cypy.f(1000)'
1000000 loops, best of 3: 1.37 usec per loop
$ python -m timeit -s 'import cypy' 'cypy.g(1000)'
1000 loops, best of 3: 232 usec per loop

The Cython version is quite a bit faster! Curiously cdef'ing the temp variable slows the timing down by a factor of 10. Hmm..

[UPDATE: There are issues!


>>> cy.cyfib(10)
55
>>> cy.cyfib(100)
-980107325
>>> cy.pyfib(100)
354224848179261915075L

I edit to use double instead of int:

>>> cy.cyfib(100)
3.54224848179262e+20
>>> cy.pyfib(100)
354224848179261915075L

It's rounded, but at least it's not negative!


]

import numpy

cpdef cyfib(int N):
    cdef int a
    cdef int b
    cdef int counter
    a = 1
    b = 1
    counter = 2
    #cdef temp
    while counter < N:
        temp = a + b
        a = b
        b = temp
        counter += 1
    return b

def pyfib(N):
    a,b = 1,1
    counter = 2
    while counter < N:
        temp = a + b
        a = b
        b = temp
        counter += 1
    return b

Sunday, December 19, 2010

Cycle fun


157 156 155 154 153 152 151 150 149 148 147 146 145
158 111 110 109 108 107 106 105 104 103 102 101 144
159 112  73  72  71  70  69  68  67  66  65 100 143
160 113  74  43  42  41  40  39  38  37  64  99 142
161 114  75  44  21  20  19  18  17  36  63  98 141
162 115  76  45  22   7   6   5  16  35  62  97 140
163 116  77  46  23   8   1   4  15  34  61  96 139
164 117  78  47  24   9   2   3  14  33  60  95 138
165 118  79  48  25  10  11  12  13  32  59  94 137
166 119  80  49  26  27  28  29  30  31  58  93 136
167 120  81  50  51  52  53  54  55  56  57  92 135
168 121  82  83  84  85  86  87  88  89  90  91 134
169 122 123 124 125 126 127 128 129 130 131 132 133


I've forgotten where I saw this---perhaps it was a Project Euler problem. In any case, you can see the pattern. It's a nice (fairly simple) exercise in Python to construct a list of lists and the functions to modify it. The great thing is that once you have the inner 3 x 3 circle correct, the rest is automatic. The code for this is in cycle.py.
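Here is a minimal sketch of one way to build such a spiral (my illustration; the real code is in cycle.py). Starting from 1 at the center, it moves down, right, up, left with run lengths 1, 1, 2, 2, 3, 3, and so on:

def spiral(n):
    # fill a (2n+1) x (2n+1) grid with a square spiral starting
    # from 1 at the center
    size = 2*n + 1
    grid = [[0]*size for i in range(size)]
    r = c = n
    grid[r][c] = 1
    value = 2
    moves = [(1,0), (0,1), (-1,0), (0,-1)]   # down, right, up, left
    d = 0
    step = 1
    while value <= size*size:
        for repeat in range(2):
            dr, dc = moves[d % 4]
            for i in range(step):
                if value > size*size:
                    break
                r += dr
                c += dc
                grid[r][c] = value
                value += 1
            d += 1
        step += 1
    return grid

for row in spiral(6):
    print ' '.join(str(v).rjust(3) for v in row)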

The real reason I'm posting about it is that it makes a fun plotting project for matplotlib, especially in combination with the prime number generator that we worked up the other day (in primes.py).

The graphic at the top of the post was obtained by modifying the array to hold n % 10 for each value, and using the matplotlib color map called "winter." I think for that one I had n = 50, so that's a 101 x 101 grid.

The prime example :) has n = 40 and uses the 'autumn' colormap. That's the one that script.py is set up to generate at the moment. Is there a pattern or is it just random?

[UPDATE: I did a bit of testing. In tracing the results for the prime plot using a small n, I discovered that matplotlib has shown the matrix in a different way than I printed it at the top of the post. Their "heatmap" goes vertically for the first step and cycles clockwise; it is both a mirror image and rotated 180 degrees from the way I did it. Sorry. But, now that I'm done, I think I can just leave it that way... ]

Zipped project files on Dropbox (here).

DNA binding sites 5


Here is an example that shows how difficult the problem of searching for sites is, at least by the method we've used. I include two scripts in the zipped files: search.EC.py and plot.py. The site we want to search for is hard-coded in the file as crp.

As its name implies, the first script searches the E. coli sequence. You'll need to get the sequence first (and save it in the right place---see the script) before this will run. We slide a window over the length of the sequence, shifting one base at a time, and score what's showing in the window under the site's scoring scheme. Scores were multiplied by 10 (to convert to ints) and then saved to disk (22 MB or so).
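The scan itself is just a loop like the following sketch (this is not the actual search.EC.py; score_matrix here is an assumed list with one dict of per-base scores for each position of the site, built from the known sites as in the earlier posts):

def scan(genome, score_matrix):
    # slide a window of the site's width along the genome,
    # scoring each window and saving the scores as ints (x 10)
    width = len(score_matrix)
    scores = list()
    for i in range(len(genome) - width + 1):
        window = genome[i:i+width]
        total = sum(score_matrix[j][nt] for j, nt in enumerate(window))
        scores.append(int(round(total * 10)))
    return scores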

In the second phase, the search.EC.py script also randomizes the sequence and re-runs the same search. The plot.py script filters the results for those above a threshold and plots a histogram of the results (see the above graphic). The red bars are the results obtained with the authentic sequence and the yellow bars are with the randomized sequence.

The point is that although the extreme high values are clearly higher in the real sequence, for a scoring range like 110 - 120 (that is 11.0 - 12.0 in the original scheme), the ratio of the likelihoods for the two models (real E. coli v. random) is not even greater than 2. So, upon observing a site with a value of 11.5, say, its significance isn't clear.

One idea that would improve the significance is to note that the genome is subject to selection. For the Crp system to work properly, there has likely been selection against randomly placed sites, so the random model is not really the appropriate one to test against.

Again, zipped files here.

DNA binding sites 4



I grabbed the data for crp and purR sites in E. coli from George Church's server (here). This is the first part of the crp data set:

>aldB -18->4
attcgtgatagctgtcgtaaag
>ansB 103->125
ttttgttacctgcctctaactt
>araB1 109->131
aagtgtgacgccgtgcaaataa
>araB2 147->169
tgccgtgattatagacactttt


The graphics above are representations of the information analysis for crp and purR produced by site_score.py. Notice the different patterns. The important bases for crp are two short pentamers separated by one turn of the helix. Not so for purR, which suggests a different mode of binding.

According to (Schumacher 1994 PMID 7973627):

The DNA-binding domain contains a helix-turn-helix motif that makes base-specific contacts in the major groove of the DNA. Base contacts are also made by residues of symmetry-related alpha helices, the "hinge" helices, which bind deeply in the minor groove. Critical to hinge helix-minor groove binding is the intercalation of the side chains of Leu54 and its symmetry-related mate, Leu54', into the central CpG-base pair step. These residues thereby act as "leucine levers" to pry open the minor groove and kink the purF operator by 45 degrees.


It's that minor groove interaction that is giving the strong signal in the "middle" of the site.

Here is what we calculate for site scores for crp. Notice that the lac site is a relatively poor one:

$ python site_utils.py crp
tnaL 20.2
nupG2 18.7
lac 17.6
cdd 17.3
deoP2 16.7
malT 16.5
..
cya 12.8
..
crp 12.0
..
lac 8.8
..

avg for 100000 random seqs: -15.64
13.16
12.18
11.87
11.51
10.78
10.66
10.58
10.56
10.2
9.87
158 sites in random seq for cutoff = 5

DNA binding sites 3

Continuing with binding site analysis, first two posts here and here.

Tom Schneider also invented Sequence Logos, which display the information for binding sites in an intuitive, graphical way (Schneider 1990 PMID 2172928). In that paper, (following Shannon) they define an uncertainty measure for each position in an alignment:



H(l) = - sum_b f(b,l) * log2 f(b,l)

where H(l) equals minus the sum over the four nucleotides of the frequency f(b,l) of each base b at position l times the log2 of that frequency. Then, the information is:

R(l) = 2 - [ H(l) + e(n) ]



where e(n) is a small sample correction factor. Thus, uncertainty plus information is constant, and approximately equal to 2 (bits). It's no coincidence that lacking any information about which nucleotide is present at some position in a sequence, you need to ask me two yes-no questions to obtain the identity. For example: is it a purine? Yes. Then, is it adenine? Yes. Two questions, two bits.
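In code, the calculation for one column of the alignment is only a few lines. This is my own sketch, assuming the column is passed in as a string of (lower-case) bases, with the small-sample correction e(n) left as an argument:

from math import log

def column_info(column, e_n=0.0):
    # column: the bases observed at one position, e.g. 'aaacaaat...'
    N = float(len(column))
    H = 0.0
    for b in 'acgt':
        f = column.count(b) / N
        if f > 0:
            H -= f * log(f, 2)
    return 2.0 - (H + e_n)    # information = 2 - (uncertainty + correction)

print column_info('aaaaaaaaaaaaaaaaaacg')   # high information, about 1.43 bits
print column_info('acgtacgtacgtacgtacgt')   # no information, 0 bits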

The script site_score.py does this calculation for the fis sites example, and we plot our home-grown version of the logo as the graphic below.


You can compare that output to what is in the paper:



The colors are switched for the central position because the values for A and T are exactly equal, and we sorted to plot T on top, while Schneider did the reverse.

There is also a site on the web for making logos. To use that, we need to strip the names out of the sequence file.


FH = open('fis.sites.txt','r')
data = FH.read()
FH.close()
for line in data.strip().split('\n')[1:]:
    print line.split()[1]


The only significant difference is at the middle position (11). We didn't use the reversed sequences, so we see mainly A at that position. This is an artifact of the web site's approach.



Zipped project files on Dropbox (here).

DNA binding sites 2

The first few sequences in fis.sites.txt are:

#format Schneider
1 tttgccgattatttacgcaaa
2 agtgactaaaatttacactca
3 gtggtgcgataattactcata
4 attgcatttaaaatgagcgtg
5 attggtcaaagtttggccttt

The first few lines of output from site_utils.py:

ttatgtacaaatagtaagaaatgtctgaga..
[45, 10, 26, 39]
0.567 -1.603 -0.2245 0.3605
ttggtatatatactatacacctatatttga..
[28, 22, 21, 49]
-0.1175 -0.4655 -0.5326 0.6898

For Schneider's approach each site is present in the alignment in both forward and reverse complement orientation. In the output we are looking at each column of the alignment. The first base 't' of the top line of output is from sequence #1, the second base 't' is from the reverse of sequence #1, the third base 'a' is from sequence #2, and so on. The counts are given next, and sum to N = 120 = 60 * 2. The scores are calculated as given previously:

2 + log2(freq) - 0.018

The scores for the first 11 sequences are:

1          12.2
2 11.8
3 9.0
4 6.5
5 12.2
6 8.5
7 8.4
8 4.6
9 12.0
10 5.3
11 10.4

If you look at the graphic from last time (or the paper) you'll see that we match. So I think we're doing things correctly.

The last thing we do when running this basic site_utils.py file as __main__ is to look at some random sequences. For starters, we calculate

avg for 100000 random seqs: -10.85

which is roughly -0.5 for each position in the 21 nucleotide alignment. According to my understanding the purpose of the correction term was to make this be zero, and I'm not sure why it isn't.

The top 10 sites from the random sequences had scores:

11.72
10.73
10.73
10.52
10.48
10.28
10.08
10.08
10.07
9.86

There were a total of 9 sequences with scores > 10, while only 14 of the 60 authentic sites scored that high. So depending on where our cutoff is for an authentic site we'll get a lot of false positives. For this example, with a cutoff of 5, I found 267 of them.

DNA binding sites 1

I haven't written much about classification of DNA binding sites or motifs on the blog. (I did a series on Gibbs sampling to find new motifs starting here).

But a couple of years ago, before the blog started, I did some work on this and there is a modest writeup on my pages at mac.com (here). At that time, I put a lot of effort into providing a Quartz GUI for the basic functionality, but I haven't tested it recently and I'd be pretty surprised if it still works. Still, that discussion (and see here) is reasonable. Let's see if we can improve upon it.

The reason the subject came up is in considering Cython (here). One use case that's definitely appealing is to predict new members of the family in a bacterial genome (given a set of known binding sites), where there are (just about) as many sites to be tested as base pairs. That can take a substantial time using Python (a few minutes).

To start working on this problem, we need some data. A repository of alignments for various transcription factors of E. coli is available from George Church's lab (here), but I'm going to use Tom Schneider's method (website) for constructing the scoring system, so I'll use the example from his paper about fis (Hengen 1997 PMID 9396807). Here is a graphic from that paper showing a few of the 60 sites they analyzed and the bit scores for each site. We're going to try to recreate this. I've got a file with the 60 sequences; you can get them and the code we'll be using from Dropbox (here).



As usual for these problems, we consider each position in the alignment to be independent, and add up the scores for all the columns to obtain a total score. The score for a single column of the alignment depends on the count of nucleotides at that position. Suppose we have N = 20 sequences and this distribution:

A:5  C:5  G:5  T:5

We calculate the frequency as 5/20 = 0.25 for each base, and then the score is 2 + log2(freq) - correction. The correction is for the small sample size of known sites (see here to begin) and it's about 0.018 in this particular case, but we'll neglect it for this brief discussion. The method essentially computes a log odds score for the competing hypotheses of a real binding site versus random sequence.

The score for each base in this alignment, which seems to be random, is calculated as 2 + log2(0.25) = 0. On the other hand, suppose we have:

A:18  C:1  G:1  T:0

Then the scores are:

A = 2 + log2(0.9) = 2 - 0.152 = 1.848
C = G = 2 + log2(0.05) = 2 - 4.322 = -2.322

We assign a pseudocount of 1 for T, even though no site had T, so the score for T is the same as for C and G. In this way we end up with an m x n matrix of floats, where m = 4 (ACGT) and n = the length of the alignment.

To score a candidate site, we observe each nucleotide in turn, and retrieve the corresponding score for that nucleotide and position in the alignment. For example, in the non-random case if the site contains A at that position we add 1.848 to the score, while if it has C we subtract 2.322 from the score. Here we see the great strength of this approach: we don't just reward sequences for being close to the "consensus", we penalize them for having a nucleotide that is rarely observed at the corresponding position in authentic sites.
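Here is a small sketch of the whole idea (my illustration, not the code in the Dropbox files): build the matrix of scores from a list of aligned sites, then score a candidate by summing the entries it selects. The pseudocount handling and the small-sample correction are simplified.

from math import log

def make_matrix(sites):
    # sites: equal-length, lower-case aligned sequences
    N = float(len(sites))
    matrix = list()
    for j in range(len(sites[0])):
        column = [s[j] for s in sites]
        D = dict()
        for b in 'acgt':
            count = max(column.count(b), 1)    # crude pseudocount
            D[b] = 2 + log(count/N, 2)         # correction term ignored
        matrix.append(D)
    return matrix

def score(site, matrix):
    return sum(matrix[j][b] for j, b in enumerate(site))

sites = ['ttatcc', 'ttgtca', 'ttatca', 'ctatca']
m = make_matrix(sites)
print round(score('ttatca', m), 2)   # close to the aligned sites: high score
print round(score('ggcgtg', m), 2)   # rarely observed bases: low score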

That's the simple, basic idea. As I say, the files will be on Dropbox (here). Next time we'll show some results.

Saturday, December 18, 2010

Exploring Cython, baby steps

I'm looking at Cython. From the docs:
Cython’s two major use cases [are]: extending the CPython interpreter with fast binary modules, and interfacing Python code with external C libraries.

The idea is that

The source code gets translated into optimized C/C++ code and compiled as Python extension modules

The "Hello world" example works fine. I install Cython with easy_install. Following the tutorial, in hello.pyx I have:

def say_hello_to(name):
    print("Hello %s!" % name)

I write another file setup.hello.py:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("hello", ["hello.pyx"])]

setup(
    name = 'Hello world app',
    cmdclass = {'build_ext': build_ext},
    ext_modules = ext_modules
)

And I do this:

$ python setup.hello.py build_ext --inplace
running build_ext
cythoning hello.pyx to hello.c
building 'hello' extension
gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch ppc -arch x86_64 -pipe -I/System/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c hello.c -o build/temp.macosx-10.6-universal-2.6/hello.o
gcc-4.2 -Wl,-F. -bundle -undefined dynamic_lookup -arch i386 -arch ppc -arch x86_64 build/temp.macosx-10.6-universal-2.6/hello.o -o hello.so

From the interpreter we can import functions from hello.so:

>>> from hello import say_hello_to
>>> say_hello_to('Tom')
Hello Tom!

Now, to do something a bit more interesting. Suppose I want to use the sqrt function from the C library declared in math.h. I write a file cy_script.pyx:

cdef extern from "math.h":
double sqrt(double)

cdef double f(int i):
return sqrt(i)

print f(2)

The print statement is there for testing, as you'll see. We write another file setup.py with this:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules=[
    Extension("cy_script",
              ["cy_script.pyx"],
              libraries=["m"]) # Unix-like specific
]

setup(
    name = "Cy_script",
    cmdclass = {"build_ext": build_ext},
    ext_modules = ext_modules
)

We build it as before:

$ python setup.py build_ext --inplace
running build_ext
cythoning cy_script.pyx to cy_script.c
building 'cy_script' extension
gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch ppc -arch x86_64 -pipe -I/System/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c cy_script.c -o build/temp.macosx-10.6-universal-2.6/cy_script.o
gcc-4.2 -Wl,-F. -bundle -undefined dynamic_lookup -arch i386 -arch ppc -arch x86_64 build/temp.macosx-10.6-universal-2.6/cy_script.o -lm -o cy_script.so

In the interpreter:

>>> import cy_script
1.41421356237
>>> cy_script.f(3)
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'module' object has no attribute 'f'
>>> dir()
['__builtins__', '__doc__', '__name__', '__package__', 'cy_script']
>>> from cy_script import *
>>> dir()
['__builtins__', '__doc__', '__name__', '__package__', 'cy_script']

Notice that following the first statement we get execution of print f(2); that is, cy_script was actually imported, but the names defined in cy_script are not available to us.

I have to look into more to see how to do this. What's the difference between the first example and the second one?
[UPDATE: Answer from Stack Overflow, as always!
Substitute cpdef double f(int i):, and it'll work. ]

[UPDATE2: There's an even cooler way to do this:

>>> import pyximport
>>> pyximport.install()
>>> import hello
>>> hello.say_hello_to('Tom')
Hello Tom!
>>> import cy_script
1.41421356237
>>> cy_script.f(3)
1.7320508075688772

Details here. ]

Friday, December 17, 2010

More about primes

I took a quick look at our question from the primes.py script the other day (here) about the patterns within each decade: how many decades have primes ending in 1,3,7 or 7,9 and so on?

I modified the script to keep track of this information in a global variable L. I still kept pL because we need to use the primes in pL to test new candidates. I grabbed the first one million primes and looked at the patterns. The results are amazing, at least I think so:



(2, 3, 5, 7) 1
(1, 3, 7, 9) 1228
(1, 7, 9) 4799
(3, 7, 9) 4839
(1, 3, 7) 4843
(1, 3, 9) 4913
(1, 9) 17638
(3, 7) 17664
(7, 9) 17844
(1, 3) 17865
(3, 9) 45941
(1, 7) 46184
(1,) 152464
(7,) 152613
(9,) 152738
(3,) 152816
() 754197
1548587
1000001

We look through 1548587 decades to get the first 10^6 primes (actually one more than that, in order to count the pattern in the last decade correctly). Slightly less than half of these contain no primes; of the rest, about 80% have a single prime. (Also of interest, the overall density of primes is reduced by about 1/6 between 10^5 and 10^6---not shown).

For the pairs and triples we observe an amazingly consistent distribution of primes considering only their last digit. Each of the four is generally equivalent except that the pairs 1,7 and 3,9 are about 2.6 times more likely than the other pairs. What property of integers can account for the amazing constancies and also this systematic difference?

Before getting too excited, I should probably head over to the prime pages and check out their lists. But that just seems incredible to me. Something deep is going on here (or not.. maybe it's something to do with the way a "decade" is defined. Have to think about that).

BTW, this algorithm isn't scaling well. 10^5 is really easy, 10^6 is agonizingly slow. Use something better for big problems.
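For comparison, here is a minimal sieve of Eratosthenes (my sketch, not from the post's files); generating a large batch of primes all at once like this scales much better than trial division:

def sieve(limit):
    # sieve of Eratosthenes: all primes below limit
    is_prime = [True] * limit
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p*p, limit, p):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]

print len(sieve(100))    # 25 primes below 100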


def primes(L,N):
    pL = [2,3,5,7]
    count = 4
    L.append(pL[:])
    base = 0
    while True:
        temp = list()
        base += 10
        for i in [1,3,7,9]:
            n = base + i
            for p in pL:
                prod = p*p
                if prod == n or not n % p:
                    break
                if prod > n:
                    count += 1
                    pL.append(n)
                    temp.append(n)
                    break
        L.append(temp[:])
        if count > N: return pL,count

L = list()
pL,count = primes(L,int(1e6))
D = dict()
for sL in L:
    t = tuple([n % 10 for n in sL])
    if not t in D:
        D[t] = 1
    else:
        D[t] += 1

for k in sorted(D.keys(),key=lambda k:D[k]):
    print str(k).rjust(14), str(D[k]).rjust(6)

print sum(D.values())
print count


[UPDATE: It turns out the patterns depend on the definition of a decade. For example, to have the decades go from 15 to 25, 25 to 35, and so on, I changed these four lines:

 2    pL = [2,3,5,7,11,13]
 3    count = 6
 5    base = 5
10    for i in [2,4,6,8]:

and now the pattern is different.


(7, 9, 3) 6020
(7, 1, 3) 6232
(7, 1) 22485
(9, 3) 22521
(1, 3) 22616
(7, 9) 22690
(9, 1) 28466
(7, 3) 44681
(7,) 147906
(3,) 148038
(1,) 170134
(9,) 170243
() 736554

I'm not saying I understand it, but the fact that it changes argues pretty strongly that it's an artifact. ]

NSTask

I investigated the use of NSTask (docs). According to the reference:
Using the NSTask class, your program can run another program as a subprocess and can monitor that program’s execution. An NSTask object creates a separate executable entity; it differs from NSThread in that it does not share memory space with the process that creates it.

I think it is possible to get info from the running process (beyond just the fact that it's still running) through stdout, but I haven't tried that yet. Here we just launch a task in three different ways, and use Objective-C for a change of pace. The question is more about permissions, and paths and such.

First, pass /usr/bin/python as the launchPath and an array with two arguments: script0.py (on my Desktop), and the string "to me". Notably, the script file is not executable or even readable by anyone but the user (mode is -rwx------). We set up the task and launch it in one line.

script0.py

import sys
try:
    print 'Hello world ' + sys.argv[1] + '!'
except IndexError:
    print 'Hello world, everybody!'

task0.m

#import <Foundation/Foundation.h>

int main (int argc, const char* argv[]) {
    NSString *path = @"/usr/bin/python";
    NSString *home = NSHomeDirectory();
    NSString *script = @"/Desktop/script0.py";
    NSString *arg1 = [NSString stringWithFormat:@"%@%@", home, script];

    NSArray *args = [NSArray arrayWithObjects:arg1,@"to me",nil];
    NSTask *t = [NSTask launchedTaskWithLaunchPath:path arguments:args];
    NSLog(@"%@",[t description]);
    [t waitUntilExit];
    return 0;
}


$ gcc -o test task0.m -framework Foundation -fobjc-gc-only
$ ./test
2010-12-17 12:51:41.684 test[10281:903] <NSConcreteTask: 0x20000fa80>
Hello world to me!

Notice that the argument 'to me' came in as a single value. We didn't get this:

$ python script0.py to me
Hello world to!


In the second method we do the shebang or hashbang thing: script1.py is the same as script0.py except we add this as the first line: #! /usr/bin/env python, and we make it executable first:

$ chmod u+x script1.py

The code for task1.m

#import <Foundation/Foundation.h>

int main (int argc, const char* argv[]) {
    NSString *home = NSHomeDirectory();
    NSString *script = @"/Desktop/script1.py";
    NSString *path = [NSString stringWithFormat:@"%@%@", home, script];

    NSArray *args = [NSArray arrayWithObject:@"to me"];
    NSTask *t = [NSTask launchedTaskWithLaunchPath:path arguments:args];
    NSLog(@"%@",[t description]);
    [t waitUntilExit];
    return 0;
}


$ gcc -o test task1.m -framework Foundation -fobjc-gc-only
$ ./test
2010-12-17 12:59:29.594 test[10447:903] <NSConcreteTask: 0x20000cb60>
Hello world to me!


The third method is similar to the first, except we set up the task before launching. This could be useful to set the environment, for example, or change StandardError, etc. We grab the user's environment (post here) and use that. You could change the working directory with setCurrentDirectoryPath, but I didn't try it yet. We re-use script0.py

#import <Foundation/Foundation.h>

int main (int argc, const char* argv[]) {
    NSDictionary *eD = [[NSProcessInfo processInfo] environment];
    NSTask *t = [[NSTask alloc] init];
    [t setLaunchPath:@"/usr/bin/python"];
    NSArray *args = [NSArray arrayWithObjects:@"script0.py",@"to me",nil];
    [t setArguments:args];
    [t setEnvironment:eD];
    [t launch];
    NSLog(@"%@",[t description]);
    [t waitUntilExit];
    return 0;
}


$ gcc -o test task2.m -framework Foundation -fobjc-gc-only
$ ./test
2010-12-17 13:06:00.543 test[10592:903] <NSConcreteTask: 0x20000f1e0>
Hello world to me!

I see I posted about this before (here). It's so long ago I forgot! (Thanks, Google). In that case, we were in PyObjC and called a C program, so it's a little different. I haven't checked to see that the old code still runs with the current PyObjC. I'll try to do that.

Look ma, no server



About a month ago I had a series of posts about setting up a simple web server (first one here).

I can see a lot of potential for a form-based system which would let command-line-phobic users set the preferences for scripts through what looks like a web page, but doesn't actually go through the server. Instead the request would come to us (or rather our PyObjC-based application), and the app would execute the script, display (and save) the results as desired.

I have a small example of form-processing here. It uses the WebKit, which looks a little formidable at first, but can be used in a pretty simple fashion.

There is only one class that does anything: MyWebViewController.py. It has myWebView as an outlet. In Interface Builder I just dragged a Web View onto the window, and an NSObject cube onto the .. (whatever it's called that holds the cubes). The class for the object is set to be MyWebViewController, and since the outlet had already been specified in the file, IB let me drag from the cube to the view and set the outlet.

We have a standard web form as shown in the screenshot, form.html looks like this:

<Content-type: text/html>

<form name="input" action="nuthin" method="get">

First name: <input type="text" name="firstname" /><br />
Last name: <input type="text" name="lastname" /><br /><br />
<input type="radio" name="sex" value="male" /> Male<br />
<input type="radio" name="sex" value="female" /> Female<br /><br />
<input type="checkbox" name="vehicle" value="Bike" /> I have a bike<br />
<input type="checkbox" name="vehicle" value="Car" /> I have a car<br /><br />

<input type="file" name="datafile" size="40">
<input type="submit" value="Submit" />
</form>
</html>

The form was added to the project under Resources. In the code below, the controller class sets itself as each of three different delegates for the Web View, and then loads the form. We implement a single method for each delegate. The UIDelegate method allows us to run an openPanel and set the file name the user chooses (subject to it being the right type).

In the methods we do some print statements that show we have access to all the form data.

The sequence of events was that I filled in the form and chose the file, and then hit submit. I marked the end of the file dialog process with some dashes.

If you look closely you can see that we don't actually need to be the resourceLoadDelegate. The output marked with 'provisionalLoad..' which comes to us as the frameLoadDelegate, is all that's necessary. If you scroll out on the second of those, you'll see we have the data including the file name.

I think it's pretty slick, and it took all of an hour, since I had a working Web View from another project. :)

What I haven't got working yet is a POST request with a bigger data form: textarea. If you know how, please speak up.

provisionalLoad..
file:///Users/telliott_admin/Desktop/FormBrowser/build/Debug/FormBrowser.app/Contents/Resources/myform.html
None
identifier..
<NSURLRequest /Users/telliott_admin/Desktop/FormBrowser/build/Debug/FormBrowser.app/Contents/Resources/myform.html> <WebDataSource: 0x3629c20>
2010-12-17 10:42:45.555 FormBrowser[7308:a0f] Application did finish launching.
runOpenPanel..
------------------------------
provisionalLoad..
file:///Users/telliott_admin/Desktop/FormBrowser/build/Debug/FormBrowser.app/Contents/Resources/nuthin?firstname=Tom&lastname=Elliott&sex=male&vehicle=Bike&vehicle=Car&datafile=x.txt
None
identifier..
<NSMutableURLRequest file:///Users/telliott_admin/Desktop/FormBrowser/build/Debug/FormBrowser.app/Contents/Resources/nuthin?firstname=Tom&lastname=Elliott&sex=male&vehicle=Bike&vehicle=Car&datafile=x.txt> <WebDataSource: 0x30dfa10>



from Foundation import *
from AppKit import *
from WebKit import *
from objc import *

class MyWebViewController(NSObject):

    myWebView = objc.IBOutlet()

    def applicationDidFinishLaunching_(self, sender):
        NSLog("WVC Application did finish launching.")

    def awakeFromNib(self):
        self.myWebView.setFrameLoadDelegate_(self)
        self.myWebView.setResourceLoadDelegate_(self)
        self.myWebView.setUIDelegate_(self)
        self.loadWebForm()

    def loadWebForm(self):
        path = NSBundle.mainBundle().pathForResource_ofType_(
            'myform','html')
        url = NSURL.URLWithString_(path)
        req = NSURLRequest.requestWithURL_(url)
        wf = self.myWebView.mainFrame()
        wf.loadRequest_(req)

    def webView_didStartProvisionalLoadForFrame_(self,wv,wf):
        print 'provisionalLoad..'
        print wv.mainFrameURL()
        print wv.mainFrame().provisionalDataSource().data()

    def webView_identifierForInitialRequest_fromDataSource_(
            self,wv,identifier,dataSource):
        print 'identifier..'
        print identifier, dataSource

    def webView_runOpenPanelForFileButtonWithResultListener_(
            self,wv,listener):
        print 'runOpenPanel..'
        panel = NSOpenPanel.openPanel()
        panel.setCanChooseDirectories_(True)
        panel.setCanChooseFiles_(True)
        fileTypes = ['txt']
        result = panel.runModalForDirectory_file_types_(
            NSHomeDirectory() + '/Desktop',None,fileTypes)
        if not result == NSOKButton:
            return
        path = panel.filename()
        if not path: return
        print '-'*30
        listener.chooseFilename_(path)

Color Sudoku 0.1


I wrote a new version of Color Sudoku (version 0.1) using PyObjC and XCode. To remove a small square, click it. To choose a small square as the correct one, use CTRL-click.

It should run on Snow Leopard. It's not that well tested, but works OK for me. The window size is locked.

It can load a puzzle from a (.txt) file in the format:
.1.....8.
etc

or
010000080
etc

The data can be on multiple lines. It just picks the first 81 characters in the set { '0'..'9' + '.' }. If there aren't 81, it bails. There is a function that "cleans up" the puzzle upon first loading. It finds all the small squares that are in conflict with a square that is determined in the puzzle, and removes them (a sketch of the idea follows below). I haven't yet provided a way to turn this off, but I doubt you would want to.
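A rough sketch of that cleanup idea, outside of the Cocoa code: given a 9 x 9 list of determined values (0 for empty) and a dict of candidate marks per cell, drop any candidate that conflicts with a fixed value in the same row, column, or 3 x 3 box. The names puzzle and candidates are just for illustration; the app's actual data structures differ.

def conflicts(puzzle, r, c, v):
    # True if value v at row r, col c clashes with a determined square
    for j in range(9):
        if puzzle[r][j] == v or puzzle[j][c] == v:
            return True
    R, C = 3*(r//3), 3*(c//3)
    for i in range(R, R+3):
        for j in range(C, C+3):
            if puzzle[i][j] == v:
                return True
    return False

def clean_up(puzzle, candidates):
    # candidates: dict mapping (r,c) -> set of small-square values
    for (r, c), values in candidates.items():
        candidates[(r, c)] = set(v for v in values
                                 if not conflicts(puzzle, r, c, v))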

There is a 'Go back' button (and CMD-Z works as well).

There are many stored puzzles. Have fun! Let me know if you break it (and how). Zipped XCode project files and build version of the app are on Dropbox (here).

Thursday, December 16, 2010

Python generators

If you want to know about generators, I don't find the docs all that helpful, but you could check the PEPs, like this one.

Below is possibly the world's simplest Python generator: it's a function that includes the yield statement in a loop. It takes a number as the single argument and increments it once for each pass through the loop. The loop is endless: while True, but that's not a problem because we ask for one value at a time. Eventually we tire of the game and kill the process.


def g(n):
    while True:
        n += 1
        yield n

f = g(13)
print f
x = f.next()
print x

while x <= 17:
    x = f.next()
    print x



<generator object g at 0x1004c7f50>
14
15
16
17
18

Notice that if we really wanted the test to work properly, we should reverse the order of the statements in the second while loop.
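That is, something like this (my sketch, using the g defined above and a fresh generator) prints 14 through 17 and stops without ever printing 18:

f = g(13)
x = f.next()
while x <= 17:
    print x          # print first ...
    x = f.next()     # ... then advance, so 18 is never printed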

We can extend this idea to the problem of finding prime numbers---one of the oldest computational problems. In the code below we cache and then serve up the first four values. After that, the algorithm is that we test each prime number p in turn against the candidate n:

first, construct p*p
if p*p == n, n is not prime, it's the square of a prime
elif p divides n evenly, n is not prime
elif p*p > n, n is prime
else: test the next p

If n passes all the tests, it joins the club. Since all the primes beyond 5 end in 1, 3, 7, or 9, we test no other candidates.

I did three runs, a short one with v (verbose) = True to help test our logic, and longer ones with N = 10000 and N = 100000.
We grab N primes and look at the distribution.

testing 11 : 2 3 5 prime
testing 13 : 2 3 5 prime
testing 17 : 2 3 5 prime
testing 19 : 2 3 5 prime
testing 21 : 2 3 is a factor
testing 23 : 2 3 5 prime
testing 27 : 2 3 is a factor
testing 29 : 2 3 5 7 prime
testing 31 : 2 3 5 7 prime
testing 33 : 2 3 is a factor
testing 37 : 2 3 5 7 prime
testing 39 : 2 3 is a factor
testing 41 : 2 3 5 7 prime
testing 43 : 2 3 5 7 prime
testing 47 : 2 3 5 7 prime
testing 49 : 2 3 5 7 square of prime
testing 51 : 2 3 is a factor
testing 53 : 2 3 5 7 11 prime
testing 57 : 2 3 is a factor
testing 59 : 2 3 5 7 11 prime
testing 61 : 2 3 5 7 11 prime
testing 63 : 2 3 is a factor
testing 67 : 2 3 5 7 11 prime
testing 69 : 2 3 is a factor
testing 71 : 2 3 5 7 11 prime
{1: 5, 2: 1, 3: 5, 5: 1, 7: 5, 9: 3}
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53]


{1: 2484, 2: 1, 3: 2515, 5: 1, 7: 2508, 9: 2491}


{1: 24967, 2: 1, 3: 25007, 5: 1, 7: 25015, 9: 25009}

Notice that the probability of ending in 1, 3, 7, or 9 is essentially identical. It would be interesting to look at the patterns in more detail, for example within a particular decade do 3 and 7 tend to cluster, etc.

v = False
def primes():
    pL = [2,3,5,7]
    for i in range(4):
        yield pL[i]
    base = 0
    while True:
        base += 10
        for i in [1,3,7,9]:
            n = base + i
            if v: print 'testing', n, ':',
            for p in pL:
                if v: print p,
                prod = p*p
                if prod > n:
                    if v: print 'prime'
                    pL.append(n)
                    yield n
                    break
                if prod == n:
                    if v: print 'square of prime'
                    break
                if not n % p:
                    if v: print 'is a factor'
                    break

def test(N):
    p = primes()
    L = [p.next() for i in range(N)]
    return L

L = test(100000)
D = dict(zip([1,2,3,5,7,9],[0]*6))
for n in L:
    x = n % 10
    D[x] += 1
print D
if v: print L[:16]

It seems likely that this is not efficient. The tests for 2 and 3 seem unnecessary. We should just test whether n is even, whether the sum of the digits is divisible by 3, and never test 5 at all. There are other divisibility rules as well (wikipedia).
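A quick sketch of those two shortcuts (my illustration):

def divisible_by_2(n):
    return n % 2 == 0

def divisible_by_3(n):
    # digit-sum rule: n is divisible by 3 iff its digit sum is
    return sum(int(d) for d in str(n)) % 3 == 0

print divisible_by_3(123456)   # True: 1+2+3+4+5+6 = 21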

If you want an efficient prime number generator in Python, go here. Although unutbu is awesome, you should look at Alex Martelli's answer, which is short and awfully fast.

[UPDATE: I found several errors in an earlier version of this post.]

Exploring mitochondrial genomes 2



Continuing with the problem from the last post (here), we want to construct a plot similar to what the ape book has on p. 60 (Fig 3.5), which shows the nucleotide composition of the mitochondrial genes from the species we're looking at. The code shown below is in the same script with the listing from last time, and depends on it.

In the first part, we organize the sequence data into a dict keyed by gene name, where the value is a list of all the sequences for that gene in the data. In the second part we go through that dict, pull out the sequences for each of our target genes and count all the A, C, G, T. The data structure is a nested one: a dict of dicts. The third part is somewhat more complicated. We want to plot not the total counts for each nt, but the fraction of each. The first line does the calculation:

values = [nD[nt]*1.0/S for nt in 'atcg']
L = [sum(values[:j]) for j in range(1,5)]
for z,c,f in zip([4,3,2,1],colors,L):
    b = plt.bar(i+1,f,width=1,color=c,zorder=z)

I chose the ATCG because then we can see at a glance what the GC content is for each gene. If we wanted to get slightly fancy we could sort on that. We're going to do a bar plot, so the value for each bar is not its fraction, but its fraction plus the sum of the fractions of all the nucleotides plotted previously to it. The plt.bar call looks a little convoluted: the purpose is to plot the bars in the correct z-order. In reality, we plot the bars one on top of the other, but we don't see what's underneath. We also tweak matplotlib into plotting the gene names as the tick labels.

The figure could use a key relating the colors to the nucleotides, but I'll leave that as an exercise for the reader.

Code listing:

D = dict()
targets = ['ATP6','ATP8','COX1','COX2','COX3',
           'CYTB','ND1','ND2','ND3','ND4','ND4L',
           'ND5','ND6']
for gene in targets: D[gene] = []

for item in rest.strip().split('\n\n'):
    if 'Sequence does not exist' in item:
        continue
    title,seq = item.strip().split('\n',1)
    i = title.find('(')
    j = title.find(')')
    gene = title[i+1:j]
    D[gene].append(seq)
#---------------------------------

rD = dict()
for k in targets:
    nD = dict(zip('acgt',[0]*4))
    for seq in D[k]:
        for nt in 'acgt':
            nD[nt] += seq.count(nt)
    rD[k] = nD
#---------------------------------

colors = ['0.2','0.4','0.6','0.8']
iL = list()
for i,k in enumerate(targets):
    iL.append(i+1)
    nD = rD[k]
    S = sum(nD.values())
    values = [nD[nt]*1.0/S for nt in 'atcg']
    L = [sum(values[:j]) for j in range(1,5)]
    for z,c,f in zip([4,3,2,1],colors,L):
        b = plt.bar(i+1,f,width=1,color=c,zorder=z)

ax = plt.axes()
ax.set_xlim(1,len(targets)+1 )
ax.xaxis.set_ticks([i+0.5 for i in iL])
ax.xaxis.set_ticklabels(targets)
plt.savefig('example.png')

Exploring mitochondrial genomes

Following the ape book (Phylogenetics in R; starting on p. 55), they use a data set in a file named mammal_mtGenome.fasta. It's not part of the datasets with the ape package (too big), but a Google search led me to versions of Ch. 3 on the web, so you can follow along even if you don't have the book.

The example uses a database called OGRe---Organellar Genome Retrieval system (here). It has 1244 organisms! What's nice is that they've organized the data by genes and they have the nucleotide sequences.

If we go to select species from taxonomy and expand the phylogeny: Chordata > Vertebrata > Mammalia, we can get a checkbox for all mammals, then at the bottom of the page do display sequences and from the table at the very bottom choose these genes: rRNAs plus protein encoding genes: ATP6 .. ND6. I saved the download in 'sequences.fasta' and it's 2.7 MB.

The data file contains a header, followed by a section with entries like:

#HOMSAPMIT : _Homo sapiens_ (human) : MITOCHONDRION

after that we find the sequences of the genes, grouped by gene (all the ATP6 genes first, then ATP8 all the way to ND6 and beyond). Those entries look like this:

>GALPYRMIT(ATP6)
atgaac..

or

>ORYCUNMIT(12SrRNA)Sequence does not exist

A double newline ('\n\n') is found between successive "does not exist" entries, but '\n\n\n' between other entries. There's a total of 233 species. Parsing the data file is a fairly trivial Python exercise. We use the fact that the organism code name is the first element, that the genus contains '_' as its first letter while the species contains '_' as its last letter, and that the common name is enclosed in parentheses. We sort and print the results. Next time, we'll do something with the data. Here we just print the first few lines:

Acinonyx jubatus               ACIJUBMIT    cheetah
Ailuropoda melanoleuca         AILMELMIT    giant panda
Ailurus styani                 FULSTYMIT    red panda
Ammotragus lervia              AMMLERMIT    aoudad
Anomalurus GP-2005             ANOSP1MIT    scaly-tailed squirrel
Arctocephalus forsteri         ARCFORMIT    New Zealand fur seal
Arctocephalus pusillus         ARCPUSMIT    cape fur seal
Arctocephalus townsendi        ARCTOWMIT    Guadalupe fur seal
Artibeus jamaicensis           ARTJAMMIT    Jamaican fruit-eating bat


Code listing:

import matplotlib.pyplot as plt

fn = 'sequences.fasta'
FH = open(fn,'r')
data = FH.read().strip()
FH.close()

results = list()
header, defs, rest = data.split('\n\n',2)
for line in defs.strip().split('\n'):
    L = line.split()
    code = L[0][1:]
    genus = [w for w in L if w[0] == '_'][0][1:]
    species = [w for w in L if w[-1] == '_'][0][:-1]
    i = line.find('(')
    j = line.find(')')
    name = line[i+1:j]
    results.append((genus,species,code,name))

def test():
    print len(results)
    results.sort()
    for g,s,c,n in results:
        print (g + ' ' + s).ljust(30), c, ' ', n

test()

Turn autocomplete="on"


I have another item of trivia (something more substantial is next, I promise). Some months ago, a website that I log into every day adopted the use of a form element for logins that is modified to look like this:

<input class="login" type="text" value="" size="20"  autocomplete="off">

The autocomplete attribute turns off autocompletion, which the browser would otherwise use (subject to User Prefs), to fill in the login name (and most important, the password) in the form.

An argument in favor of this strategy is that it protects your random idiot who does his online banking on a public PC at his hotel in Mexico and stores his login info without thinking about it.

One argument against is that it promotes the use of weak passwords. Since the site in question requires at least a modestly complex password, and I try to always use strong passwords, this is a real problem.

The solution, on Safari anyway, is a Safari Extension that I got from Jeff Johnson at Lapcat Software (here). You can read more about Safari Extensions (here) and get a lot of them too.

The extension's code has a format which looks funny at first: it has the file suffix safariextz and looks fairly weird if you drop it on a text editor. But the first few letters are xar, which is a big hint that you can look inside easily and see what's there (here):

xar -xf


(xar being the "extensible archiver"). The nice thing about that is I can readily tell how it works and that it's harmless. What you can't do is to rearchive it and then use it (at least so I'm told). That's because of Code Signing (here).

Read the original post on the extension (here).

More PyObjC


If you're interested in PyObjC, I have lots of good examples up on my iDisk Public folder (see the page here). I think most of that material is still relevant on Snow Leopard. Just for fun, I grabbed the animation code and reassembled it into a new XCode project template. A quick outline:

Do the files first (then IB can figure things out for you):
File > New File > User Templates: Python NSView subclass: MyView
File > New File > User Templates: Python NSObject subclass: MyAnimation

Paste in the code except use the new versions for inits and add @IBAction for button methods and do objc.IBOutlet() for outlets.

Put the following in the nib:

2 buttons (Start and Stop) hooked up to the App Delegate methods
a Custom View with its class set to MyView
this view hooked up as an outlet of the App Delegate
a Progress Indicator is also an outlet of the App Delegate

The class MyAnimation inherits from NSAnimation
It's instantiated from the AppDelegate (no nib representation)

In MyAnimation.init() we specify the
duration = 5 sec
FrameRate = 6 Hz
ProgressMarks = 300 (so 300/5 = 60 Hz)

in MyAnimation I added print calls to show that:
setCurrentProgress_ gets called every Frame (30 times total)
each time we tell the view to save an NSBezierPath
using the current position of the object
we also tell the view to update currentProgress

The AppDelegate is the delegate of the animation
animation_didReachProgressMark_
gets called at 60 Hz (300 calls in 5 sec)
we call setNeedsDisplay_(True) on the view each time

It's simple and beautiful. I put the zipped project files (and a built app) on Dropbox (here).

Wednesday, December 15, 2010

Time's up for "easy_install"

I wanted to solve the problem of installing PyObjC using easy_install for the oddball Pythons that XCode insists on trying to use to run my projects. If that sounds indirect, it is. It's a problem caused by a problem caused by a problem.

I've spent at least 5 hours trying to figure out how to use my existing System Python's easy_install to install PyObjC for an alternate Python (like in ~/Software/Python ..) using these instructions and specifically this method. It's not working for me. Maybe someone knows of a step-by-step example on the web.

The method given does not seem to work. I can't even figure out how to modify a pth file so easy_install is happy.

It makes more sense to me to do a clean install of OS X on the machine in question. That only takes an hour or so.

Likelihood in Seven Pages

I just re-read a "gentle introduction" to likelihood by Peter Foster. He is here, and I got the paper here. It's called

The Idiot's Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed

Now that I've got the idea about Q and P matrices (e.g. here) and a bit of practice with Maximum Likelihood for phylogenetic models (here), it makes perfect sense. Feels like I'm making progress.

Highly recommended.

More troubleshooting: XCode and PyObjC


I was going to tell you how I got PyObjC and XCode working together on my third (office) machine. The symptoms are shown in the above screenshot. The red Failure thingies are saying that none of the names defined in Python/Python.h are recognized, like it couldn't find the header file. But the "missing required architecture ppc" warning suggested to me (since I have an i386 iMac), that I should play with the Project settings.

And I did something to those settings that fixed (this part of) it, but I'm not positive now because it's been a couple of days and I can't recreate the failure. The settings I have that work are:



So we'll just mark that down as a likely trouble spot to look at, for next time. The second issue was the same kind of failure that I saw on the other machine (here), though not the same Python, naturally:

Traceback (most recent call last):
File "main.py", line 12, in
import objc
ImportError: No module named objc

where an inserted print sys.version statement gave:

2.6.4 (r264:75706, Mar  3 2010, 09:38:49) 

It's not this:

$ /usr/bin/python
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)

So where is 2.6.4 from Mar 3 2010, 09:38:49 ? After a long search,

cd /Library/Frameworks/Python.framework/Versions/2.6/bin
./python
Python 2.6.4 (r264:75706, Mar 3 2010, 09:38:49)

So.. we've got a Python in /Library and that's what is being used..
How did it get there? I don't remember.. Why did XCode choose it? No idea. Since it was in /Library rather than /System, and I was having difficulty with easy_install, I decided to try more radical surgery:

cd /Library/Frameworks
sudo rm -rf Python.framework
Don't try that at home!
The patient survived. And XCode switches to:

2.5.4 (r254:67916, Jun 24 2010, 21:47:25)

Old, but the XCode project runs. And I can't find that Python anywhere!