Sunday, August 10, 2008

Deonier Ch 2 Codon Bias - Codon Data

Continuing with the problem of analyzing codon bias: there is abundant evidence that the preferred codon usage pattern varies greatly, both between organisms and among genes within a genome. There are several ideas about why this might be (see e.g. Andersson and Kurland 1990). What I want to explore here is whether there is a relationship between expression level and codon usage in E. coli. Please note that this is not a scholarly examination of the question; I am really just trying to develop my scripting skills by reading and exploring Deonier.

In previous posts (1, 2) I discussed obtaining the data for all coding sequences (CDS) from the GenBank record for the genome of MG1655, and Affymetrix array expression data (for a rather distant relative, AB1157). Loading and sorting the data is straightforward (see my script). I obtained expression data for 4345 genes, but only one of the top 50 in expression encodes a protein (lpp). Filtering the expression data for CDSs leaves 2938 items. Here is a histogram of the values, scaled so that the rare bins at the high end of the distribution are visible.



I chose the 40 genes with the highest expression levels (values to the right of the red vertical bar, > 276) as representative of highly expressed genes, and genes with expression < 150 as representative of genes with average expression. Here are the top 12:

lpp   3343.85
rmf   1953.51
fimA  1112.27
cspC   853.39
yfiD   840.27
rpmJ   759.78
yiiU   635.30
cspG   597.44
hns    570.08
cspE   549.95
udp    537.76
dnaK   470.56
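
The selection step can be sketched roughly as follows. This is a minimal illustration, not the script linked above: the function name `split_by_expression`, its parameters, and the genes `rnaX`, `geneA` and `geneB` with their values are all made up for the example. Only the values for lpp, rmf and fimA come from the list above.

```python
def split_by_expression(expression, cds_names, n_top=40, avg_cutoff=150.0):
    """Return (high, average) lists of gene names.

    expression: dict mapping gene name -> expression value
    cds_names:  set of gene names that have a CDS in the GenBank record
    """
    # Keep only genes that encode a protein.
    coding = {g: v for g, v in expression.items() if g in cds_names}
    # Rank gene names from highest to lowest expression.
    ranked = sorted(coding, key=coding.get, reverse=True)
    high = ranked[:n_top]
    high_set = set(high)
    # "Average" genes fall below the cutoff and are not in the high group.
    average = [g for g in ranked
               if coding[g] < avg_cutoff and g not in high_set]
    return high, average


# lpp, rmf and fimA use values from the post; the rest are hypothetical.
expression = {"lpp": 3343.85, "rmf": 1953.51, "fimA": 1112.27,
              "rnaX": 5000.0, "geneA": 120.0, "geneB": 35.5}
cds_names = {"lpp", "rmf", "fimA", "geneA", "geneB"}

# n_top=2 only because the toy data set is tiny; the post uses 40.
high, average = split_by_expression(expression, cds_names, n_top=2)
print(high)
print(average)
```

In this toy run the hypothetical non-coding `rnaX` is dropped by the CDS filter, and fimA lands in neither group because it is below the top 2 but above the cutoff.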

I also went back and looked at one of the original references (Sharp and Li, 1986, behind a paywall), which gives lists of very highly and highly expressed genes. Interestingly, only 5 of the genes in their very highest category are among my top 40.

From this point, it is simply a matter of counting all the codons used in each gene for each group and saving the counts in a dict. To analyze the results, for each amino acid we compute the ratio of the count for a given codon to the total count of its synonymous codons, and finally the ratio of ratios (high expression to average). Here are examples for two amino acids, serine and tyrosine, as well as stop codons:

AA  Codon  High n  High f   Avg n   Avg f  Ratio
S   TCG        22   0.059    8457   0.156  0.382
S   TCA        21   0.057    6449   0.119  0.478
S   AGT        37   0.100    7923   0.146  0.685
S   AGC        82   0.222   15247   0.281  0.789
S   TCC        91   0.246    8282   0.153  1.612
S   TCT       117   0.316    7909   0.146  2.170

Y   TAT        75   0.362   15155   0.561  0.646
Y   TAC       132   0.638   11852   0.439  1.453

*   TGA         8   0.200     796   0.282  0.710
*   TAG         2   0.050     198   0.070  0.714
*   TAA        30   0.750    1832   0.648  1.157

The third and fourth columns are the codon count and frequency for highly expressed genes, the fifth and sixth columns are the same for average genes, and the last column is the ratio of ratios. It's clear that some codons are disfavored in highly expressed genes (a ratio well below 1), while others are favored (a ratio well above 1). For the stop codons, there are not enough examples to say anything with confidence.
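
For readers who want to reproduce the arithmetic, here is a rough sketch of the counting and the ratio of ratios. It is not the original script: the function names are my own, and the codon table is built from the standard genetic code in the usual TCAG order. The check at the end uses the tyrosine counts from the table above.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(sequences):
    """Count codons over an iterable of in-frame coding sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Step through complete codons only; skip anything ambiguous.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1
    return counts


def synonymous_fractions(counts):
    """Fraction of each codon among the codons for the same amino acid."""
    totals = Counter()
    for codon, aa in CODON_TABLE.items():
        totals[aa] += counts.get(codon, 0)
    return {codon: counts.get(codon, 0) / totals[aa]
            for codon, aa in CODON_TABLE.items() if totals[aa] > 0}


def ratio_of_ratios(high_counts, avg_counts):
    """Codon fraction in the high group divided by that in the average group."""
    high = synonymous_fractions(high_counts)
    avg = synonymous_fractions(avg_counts)
    return {c: high[c] / avg[c] for c in high if avg.get(c, 0) > 0}


# Tyrosine counts taken from the table above.
high_counts = Counter({"TAT": 75, "TAC": 132})
avg_counts = Counter({"TAT": 15155, "TAC": 11852})
ratios = ratio_of_ratios(high_counts, avg_counts)
print(round(ratios["TAT"], 3), round(ratios["TAC"], 3))  # 0.646 1.453
```

Codons whose amino acid never appears in a group are simply left out of the result, which avoids dividing by zero on small samples.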

Here are some more, where we compare the most and least favored codon for a particular amino acid:

AA  Codon  High n  High f   Avg n   Avg f  Ratio
A   GCT       250   0.337   14131   0.158  2.138
A   GCC       116   0.156   24186   0.270  0.580

C   TGC        37   0.627    6080   0.564  1.111
C   TGT        22   0.373    4693   0.436  0.856

D   GAC       234   0.505   18400   0.378  1.338
D   GAT       229   0.495   30307   0.622  0.795

E   GAA       408   0.730   37809   0.691  1.056
E   GAG       151   0.270   16878   0.309  0.875

F   TTC       163   0.639   16091   0.435  1.471
F   TTT        92   0.361   20942   0.565  0.638

G   GGT       300   0.503   23714   0.339  1.483
G   GGA        23   0.039    7113   0.102  0.379

H   CAC        80   0.552    9403   0.438  1.260
H   CAT        65   0.448   12073   0.562  0.797

I   ATC       275   0.608   24210   0.426  1.428
I   ATA        10   0.022    3708   0.065  0.339

K   AAA       384   0.814   31585   0.768  1.059
K   AAG        88   0.186    9525   0.232  0.805

L   CTG       437   0.714   51779   0.509  1.402
L   CTA        10   0.016    3588   0.035  0.463

M   ATG       184   1.000   26359   1.000  1.000

N   AAC       258   0.789   20608   0.560  1.409
N   AAT        69   0.211   16185   0.440  0.480

P   CCG       164   0.672   22815   0.542  1.240
P   CCC         7   0.029    4951   0.118  0.244

Q   CAG       272   0.791   27460   0.658  1.201
Q   CAA        72   0.209   14248   0.342  0.613

R   CGT       234   0.578   20306   0.390  1.480
R   CGG        11   0.027    4727   0.091  0.299

S   TCT       117   0.316    7909   0.146  2.170
S   TCG        22   0.059    8457   0.156  0.382

T   ACT       152   0.328    8277   0.164  2.001
T   ACG        57   0.123   13497   0.267  0.460

V   GTT       261   0.441   17052   0.254  1.735
V   GTC        66   0.111   14442   0.215  0.518
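
Picking the most and least favored codon for each amino acid is a small grouping step. The sketch below assumes the results are available as (amino acid, codon, ratio) tuples; that layout and the function name `favored_extremes` are my own choices for illustration. The sample rows are the serine and tyrosine ratios from the first table.

```python
def favored_extremes(rows):
    """Map each amino acid to its (most favored, least favored) codon."""
    by_aa = {}
    for aa, codon, ratio in rows:
        # Store the ratio first so max() and min() compare on it.
        by_aa.setdefault(aa, []).append((ratio, codon))
    extremes = {}
    for aa, pairs in by_aa.items():
        most = max(pairs)
        least = min(pairs)
        extremes[aa] = (most[1], least[1])
    return extremes


# Ratios for serine and tyrosine, copied from the first table.
rows = [
    ("S", "TCG", 0.382), ("S", "TCA", 0.478), ("S", "AGT", 0.685),
    ("S", "AGC", 0.789), ("S", "TCC", 1.612), ("S", "TCT", 2.170),
    ("Y", "TAT", 0.646), ("Y", "TAC", 1.453),
]
extremes = favored_extremes(rows)
for aa in sorted(extremes):
    print(aa, *extremes[aa])
```

For an amino acid with a single codon, such as methionine, the most and least favored codon are the same.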