Monday, April 30, 2012

KS test (2)


A bit more about the Kolmogorov-Smirnov test, following up on the first post here. I found a PDF at MIT OCW about this. I don't understand much of it, but it is clear that using the absolute value of the difference to get the statistic is correct.
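For reference, the standard definition of the statistic (which is what we computed) is the largest absolute vertical distance between the empirical CDF F_n and the hypothesized CDF F:

D = max over x of | F_n(x) - F(x) |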

Now we need to figure out the distribution of the statistic under the null hypothesis, so we can set our type I error at 0.05. From my perspective, the natural way to do this is a simulation in Python. The code below does the following N times: grab n = 20 numbers from the standard normal, sort them, compute the D statistic as we did in the previous post, and accumulate those values in a list.

We then sort that list and find the place where the statistic for our test case falls. (I used a function from the bisect module for this, but a naive linear scan gives the same result.)
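As a toy illustration of the lookup (made-up values, not the real simulation output):

from bisect import bisect_left

L = [0.10, 0.15, 0.18, 0.22, 0.30]   # sorted simulated statistics (made up)
d = 0.20
i = bisect_left(L, d)                # 3: the number of values below d
print(1 - i * 1.0 / len(L))          # 0.4: the fraction of values >= d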

I get a p-value of 0.25. This does not match what we had before (0.36), and I am not sure why that is.
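For comparison, SciPy has a built-in one-sample KS test. Something like this (the sample here is made up just to make the call concrete; substitute the actual data from the first post) gives the library's answer directly:

from scipy.stats import kstest
import numpy as np

# stand-in for the n = 20 sample from the first post
np.random.seed(42)
xL = np.random.normal(0, 1, 20)

D, p = kstest(xL, 'norm')   # two-sided test against the standard normal
print('D = %.3f p = %.3f' % (D, p))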

To do a little more checking, I found an online reference that gives p-values for various (statistic, n) combinations.

for n = 20

p-value        D
   0.20    0.210
   0.15    0.246
   0.10    0.264
   0.05    0.294
   0.01    0.356

According to this, a statistic of 0.20 (what we had) gives a p-value a bit larger than 0.20, but not much. So this source doesn't agree with what we got from R and Python either. Maybe the online source is for a one-sided test (it doesn't say clearly); that would give a smaller p-value for a given statistic.
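One way to probe that guess without the simulation is with the asymptotic formulas: SciPy exposes the limiting two-sided distribution, and the classical one-sided approximation is exp(-2nd^2). Both are only rough for n = 20, but the one-sided p-value should come out at roughly half the two-sided one for the same d:

import numpy as np
from scipy.special import kolmogorov   # survival function of the limiting KS distribution

n = 20
for d in [0.210, 0.246, 0.264, 0.294, 0.356]:
    two = kolmogorov(d * np.sqrt(n))   # asymptotic two-sided p-value
    one = np.exp(-2 * n * d ** 2)      # asymptotic one-sided p-value
    print('d = %.3f  two-sided p ~ %.3f  one-sided p ~ %.3f' % (d, two, one))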

I added calculations from our simulation of p given various d, and vice versa. The output:

> python script.py
d = 0.199 p = 0.25
d = 0.246 p = 0.10
d = 0.264 p = 0.06
d = 0.294 p = 0.03

p = 0.15 d = 0.226
p = 0.10 d = 0.245
p = 0.05 d = 0.275
p = 0.01 d = 0.336

Not sure what the correct answer is here, but I can't find a problem with the code, and I feel like it's time to move on. One last bit to think about is this quote:

The Kolmogorov-Smirnov test is based on a simple way to quantify the discrepancy between the observed and expected distributions. It turns out, however, that it is too simple, and doesn't do a good job of discriminating whether or not your data was sampled from a Gaussian distribution. An expert on normality tests, R.B. D’Agostino, makes a very strong statement: “The Kolmogorov-Smirnov test is only a historical curiosity. It should never be used.” (“Tests for Normal Distribution” in Goodness-of-Fit Techniques, Marcel Dekker, 1986).

Take that, Andrey!

from bisect import bisect_left as bl
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

# frozen standard normal, for its cdf
rv = norm()

def calc_stat(xL,yL):
    # D = max over i of |F(x_i) - i/n|, comparing the normal cdf
    # to the top of each step of the empirical cdf
    xL.sort()
    dL = list()
    for x,y0 in zip(xL,yL):
        y1 = rv.cdf(x)
        dL.append(abs(y1-y0))
    return max(dL)

def p_value(d,L):
    # L is sorted; p is the fraction of simulated statistics >= d
    # a naive linear scan gives the same result:
    # for i,e in enumerate(L):
    #     if d < e:
    #         break
    i = bl(L,d)
    p = 1 - i*1.0/len(L)
    return p
#-----------------------------------
# simulation under the null hypothesis
N = 10000
n = 20
yL = range(1,n+1)
yL = [1.0*y/n for y in yL]  # tops of the ecdf steps (the "steps" from before)
statL = list()              # simulated D statistics

for i in range(N):
    xL = np.random.normal(0,1,n)
    D = calc_stat(xL,yL)
    statL.append(D)

plt.hist(statL,bins=50)
plt.savefig('example.png')
#-----------------------------------
statL.sort()
for d in [0.199, 0.246, 0.264, 0.294]:
    p = p_value(d,statL)
    print('d = %.3f p = %.2f' % (d,p))

print()
for p in [0.15, 0.10, 0.05, 0.01]:
    i = int(N*(1-p))        # index of the (1-p) quantile
    d = statL[i]
    print('p = %.2f d = %.3f' % (p,d))
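As an aside, the quantile lookup in the last loop can also be done directly with NumPy. Appended to the script above, this should agree with the indexing version up to how the boundary is interpolated:

# the 95th percentile of the simulated statistics is the d for p = 0.05
print(np.percentile(statL, 95))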