Python for Bioinformatics: Student's t-test again 6

Monday, November 8, 2010

Student's t-test again 6

Code

This is the sixth in a series of posts about Student's t test. The previous posts are here, here, here, here, and here.

In this post I'm going to talk about how I would test my home-grown functions to do the t-test in Python. I'm only going to do this for the one-sample test, because I'm running out of energy. But you could easily extend this to the other functions.

We could check using SciPy if it is installed, and that would be easy. But I'm going to use R, partly because it's the gold standard for statistics, and partly because it's a useful technique that I've used in my real work. While we could use RPy as I mentioned the other day, here I am going to use R to generate the sample data, run the test of interest and then write the results to disk.

What we're going to do is run R in batch mode. If it gives an error, which frequently happens when building such a test, we want to show the error. We also clean up extra files when we're done. The code to do this has been added to utils.py and is given below. It should be self-explanatory.

When we run R this way, we feed it a text file with the commands to run. This file is constructed by the function write_R_code. It's not pretty, but I hope you get the idea. Perhaps a better method, which I've also used, is to write the code to a text file, then load it into Python, substitute the file names and specific functions etc., then write that to disk as the file to give to R.

In the last step, we load the results that R has given us and evaluate how well our functions do on the same data. As you can see, we match very well as a rule, but have rounding errors at the extremes of the t-distribution. I believe this could be fixed by extending the range over which we built the distribution (see here). I think it looks pretty good.

[UPDATE: Just to make it clear what we did: we do 1000 runs and look at the details for the first five, then test the rest for deviation from the R results by more than ε = 0.0001. ]

Output:

run
(1626, 0)
clean
.RCode removed
.RCode.Rout removed
.RData removed
.RHistory not found

test results:
N =  1000
R gave:  t = -1.858651
I get:   t = -1.85865011386

R gave:  t = -4.250644
I get:   t = -4.25063601226

R gave:  t = -1.743078
I get:   t = -1.74307933578

R gave:  t = 0.1286087
I get:   t = 0.128612751636

R gave:  t = -9.36424
I get:   t = -9.36429217504

error with:
[ 8.946167  8.941678  8.940817] -1843.026 -1843.23012185

error with:
[ 8.895525  8.686258  8.761813] -52.60937 -52.6095381366

error with:
[ 10.54477  10.42989  10.66126] -21.77912 -21.7796812829

error with:
[ 11.05575  10.89208  10.68675] -10.50883 -10.5090054298

error with:
[ 10.02618  10.07471  10.16116] -48.45138 -48.4527009573

R_test.py

import numpy as np
from t_test import one_sample_t
import utils

def write_R_code(D):
    L = [ 'fn <- file("' + D['results_fn'] + '","w")',
          'n = ' + str(D['n']),
          'mu = ' + str(D['mu']),
          'sigma = ' + str(D['sigma']),
          'for (i in 1:' + str(D['N']) + ') {',
          'A = rnorm(n,mu,sigma)',
          'result = ' + D['func'] + str(D['mu_est']) + ')',
          'cat(A,file = fn,sep = "\n")',
          'cat(result$statistic,file = fn,sep="\n")',
          'cat("\n",file = fn)',
          '}',
          'close(fn)'
          ]
    FH = open(D['code_fn'], 'w')
    FH.write('\n'.join(L))
    FH.close()
    
def compare_to_my_result(L):
    print 'test results:'
    print 'N = ', len(L)
    for i,e in enumerate(L):
        A = np.array(e[:-1])
        result = one_sample_t(A,mu=12)
        if i < 5:
            print 'R gave:  t =', e[-1]
            print 'I get:   t =', result[0]
            print
        else:
            if e[-1] - result[0] > 0.0001:
                print 'error with:'
                print A, e[-1], result[0]
                print

if __name__ == '__main__':
    directory = '/Users/telliott_admin/Desktop/'
    D = { 'code_fn':directory + '.RCode',
          'results_fn':directory + 'results.txt',
          'func':'t.test(A,alternative="less",mu=',
          'n':3,'mu':10,'sigma':2,'mu_est':12,'N':1000 }
                  
    write_R_code(D)
    utils.runR(D['code_fn'],v=True)
    utils.cleanup_R_files(D['code_fn'],directory,v=True)
    L = utils.load_data(D['results_fn'])
    compare_to_my_result(L)

utils.py

from __future__ import division
import os,subprocess
import numpy as np

def unbiased_var(X):
    n = len(X)
    sample_SS = sum(X**2) - sum(X)**2 / n
    return sample_SS/ (n-1)
    
def unbiased_std(X):
    return np.sqrt(unbiased_var(X))

def runR(code_fn,v=False):
    if v:  print 'run'
    cmd = 'R CMD BATCH ' + code_fn
    p = subprocess.Popen(cmd, shell=True)
    sts = os.waitpid(p.pid, 0)
    if v:  print sts
    if sts[1] != 0:  show_R_error()

def show_R_error():
    fn = directory + '.Rcode.Rout'
    FH = open(fn, 'r')
    msg = FH.read()
    FH.close()
    print '\n'.join(msg.split('\n')[-3:]).strip()

def cleanup_R_files(code_fn,directory,v=False):
    if v:  print 'clean'
    L = ['.RCode.Rout','.RData', '.RHistory']
    L = [code_fn] + [directory + e for e in L]
    for fn in L:
        try:
            os.remove(fn)
            if v:  print fn.split('/')[-1], 'removed'
        except OSError:
            if v:  print fn.split('/')[-1], 'not found'
    print

def load_data(results_fn):
    def convert(s):
        L = s.split('\n')
        return [float(n) for n in L]
    FH = open(results_fn, 'r')
    data = FH.read().strip()
    FH.close()
    L = data.split('\n\n')
    rL = [convert(s) for s in L]
    return rL