Thursday, August 7, 2008

Python for simulations (3) - population structure

In an environmental sequencing project one isolates and sequences SSU (16S in bacteria) ribosomal RNA genes to obtain information about a microbial population. As you'd expect, the amount of detail you can see depends on the structure of the population and how hard you look. The goal is to estimate the total number of species (or phylotypes, for uncultivatable organisms), as well as their relative numbers. The problem is that the number of observations may be limited, although that is changing rapidly. Sophisticated methods (like DOTUR) exist to tackle this challenge, but I wanted to explore it in a simple way by using a simulation written in Python.

In the simulation, a population contains three frequency classes of species: high, medium and low. We construct a population with varying frequencies specified for the high and medium classes. For example, starting with a unit population size of 1000, we might specify a frequency for high of 0.50 (500 individuals, arranged in 10 species), a frequency for medium of 0.30 (300 individuals, arranged in 50 species), and the balance consisting of unique individuals (200 species).

Then, we sample from the population at random, with replacement. We make 1000 observations, and at each step we record the number of different species seen so far. To make this more reliable, we do five replicates and average the results. The code is simple enough that I'm not going to post it. If you would like a copy, shoot me an email. Here are the results, plotted using R:


In the figure, each curve shows the number of species seen so far, as a function of the number of observations. The curves are labeled with the values for the high and medium frequency classes, as well as the total number of species actually present. As expected, the curves level off with increasing number of observations. But the rate at which this occurs depends greatly on the fraction of the population in the low frequency category. Also, it takes a lot of observations to get a reliable estimate of the total counts, I'd say about 10 times as many observations as there are species in all. Typical observation sets have 100-200 sequences where the estimated species count is between 100-1000.