Friday, August 8, 2008

Deonier Ch 2 example 1,2

In example 2.1, Deonier want to simulate a DNA sequence by generating a random sequence of letters with some probability. Here is their code for a demo of the sample function in R:
x = c(1,0)
pi = c(0.25,0.75)
v <- sample(x,10,replace=T,pi)
v
# [1] 1 0 0 0 1 0 0 0 1 0


In Python I just make a list or string with the elements inserted according to the desired probability, then use random.choice.
import random, math
R = random.Random(31)
x = [1,0,0,0]
for i in range(20):
print R.choice(x),
print
# 1 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0 1 1 0 0


To simulate a DNA sequence, they use the integers 1 to 4, because R doesn't deal with strings or characters very gracefully.
set.seed(13)
pi = rep(0.25,4)
x = 1:4
seq1 <- sample(x,10000,replace=T,pi)
seq1[1:15]
# [1] 4 2 3 2 1 2 4 1 1 2 4 1 1 4 4
seq2 <- sample(x,10000,replace=T,pi)
seq2[1:15]
# [1] 1 4 4 2 1 3 4 1 3 2 3 2 3 2 1


In Python
b = 'ACGT'
N = int(1E3)
L = [R.choice(b) for i in range(N)]
print L[:6]
# ['A', 'C', 'C', 'T', 'A', 'A']
print L.count('T')
# 259


In example 2.2, they generate a sequence from the binomial distribution
x <- rbinom(2000,1000,0.25)
mean(x)
# [1] 250.0715
var(x)
# [1] 196.1775
sd(x)^2
# [1] 196.1775


The math module doesn't have a mean function so we define some simple statistical functions:
def mean(L):  
return sum(L)*1.0/len(L)

def var(L):
m = mean(L)
sumsq = sum([(n-m)**2 for n in L])
return 1.0/(len(L)-1) * sumsq

def sd(L):
return math.sqrt(var(L))


We fake the binomial distribution like this:
import random, math
R = random.Random(31)
N = int(1E3)
b = 'ACGT'
x = list()
obs = 2000
for i in range(obs):
L = [R.choice(b) for i in range(N)]
x.append(L.count('T'))
print '%3.2f %3.2f %3.2f' % (
mean(x), var(x), sd(x))

# 250.03 187.47 13.69


And finally, we'd like to generate a histogram of the values (I'll plot the R results):
b = max(x) - min(x) + 1
r = c(min(x)-10,max(x)+10)
par('mar' = par('mar')+2)
hist(x,breaks=b,xlim=r,xlab='successes',
cex.lab=1.5,main='',col='magenta')

Here it is: