Saturday, October 31, 2009

Simple Bayes example, doing the math

Let us continue with the example introduced in the last post, going on to do some calculations using Bayes rule. (If you want to see my original discussion, it's here). Rather than struggle with the layout, I think I'll just post screenshots from the slides I used in class. So, here is the first one:



If we generalize the notation for a minute to Hypothesis and Evidence, we can write the standard form of Bayes rule as:

P(H|E) = P(E|H) P(H) / P(E)

An easy way to remember this is to think of the H terms on the right-hand side cancelling, though of course they do not actually do so.

Look at the right-hand side. The first term, P(E|H), is computed as follows: under the hypothesis that the cookie jar in question is jar A, and given that jar A contains equal numbers of the two cookie types, the probability of drawing either type is 1/2. This is also called the likelihood (of H, in connection with this observation or evidence).

The second term P(H) is the probability that the cookie jar is jar A, given our current state of knowledge. Before the first draw, we might (I would at least) assume that P(A) = P(B), so this term is also 1/2. This is called the prior.

The third term, in the denominator, is P(E). This term is called the marginal probability or sometimes, the evidence. It looks simple but is actually compound: it is P(E) under all hypotheses. That is:

P(E) = P(E|H) P(H) + P(E|~H) P(~H)

where ~H is read as not H (the hypothesis is not true). Or, in cookie-land:

P(CC) = P(CC|A) P(A) + P(CC|B) P(B)

The left-hand term is, of course, the probability we seek. It is called the posterior probability, the probability of H (here, A) given the evidence observed.





Calculating the posterior is a simple matter, as you can see from the slide above. It is reassuring to get the same answer that we arrived at earlier by a simple, intuitive process. Are you underwhelmed?

The power of Bayes rule lies in its repeated application. Suppose we replace the cookie in the jar (you've eaten it?---well, go get another one from the package my sister sent). Mix well, then draw again. It's another chocolate chip cookie. Our lucky day!

We have to update two values in the equation before we can do the calculation. We do not need to change the likelihood (because we replaced the cookie that was drawn).

We update the prior by setting it equal to 2/3, the posterior that was calculated in the first step. Knowing that we got CC the first time, our prior is now 2/3. We must also update the marginal probability, because some of its underlying parameters have changed. It's like this:
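Here is a minimal Python sketch of this repeated updating. The jar contents are the ones from the earlier post (jar A is half CC, jar B is one-quarter CC); the function and variable names are just my own.

def update(prior_A, p_obs_A, p_obs_B):
    # one application of Bayes rule: returns the posterior P(A|obs)
    marginal = p_obs_A * prior_A + p_obs_B * (1 - prior_A)
    return p_obs_A * prior_A / marginal

p_CC_A, p_CC_B = 1/2.0, 1/4.0

post1 = update(0.5, p_CC_A, p_CC_B)      # first CC draw:  2/3
post2 = update(post1, p_CC_A, p_CC_B)    # second CC draw: 4/5
print post1, post2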







Bayesian analysis:

• is about updating expectations or "beliefs" based on evidence
• requires an estimate of prior probability for the hypothesis
(although we may be deliberately agnostic)

Next time: dice

UPDATE:

It is worth considering the case where the second draw is O rather than CC. Now we have:

P(A|O) = P(O|A) P(A) / P(O)
P(O|A) = 1 - P(CC|A) = 1/2
P(A) = 2/3 (unchanged from before)
P(O) = 1 - P(CC) = 7/12

And thus we calculate: 1/2 * 2/3 divided by 7/12 = 4/7.
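The little update function from the sketch above reproduces this, using P(O|A) = 1/2 and P(O|B) = 3/4:

print update(2/3.0, 1/2.0, 3/4.0)    # 0.571... = 4/7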

Notice that we do not return to the naive initial probability of 1/2. Instead, having seen one of each type, we favor A over B by 1 part in 7. The reason is that there are more O cookies overall, so the second observation of O does not carry as much information as the first observation of CC: P(E) is higher for O. We are more impressed by having drawn the CC cookie, and so we favor A.

The observation that the sun rose today in the east is not very helpful in deciding any hypothesis, even one about the sun, because it always rises in the east.

Simple Bayes example

I want to explore a device used by Durbin which they call the "occasionally dishonest casino," and use that to get into Hidden Markov Models. I'll explain the device next time. But the casino example is also a wonderful use of Bayes in action, so I want to talk about that again. I'm going to start by going through an elementary example that I used for my Bioinformatics class.

The example involves cookies. (Everybody loves cookies, right?) So imagine that we obtain two identical cookie jars, which we will call A and B. In the first jar, we put 20 chocolate chip cookies (just like my sister makes), and also 20 of some yucky store-bought brand, for a total of 40. The second jar (outwardly identical to the first) also contains 40 cookies, but the proportions are different: only 1/4 of them are the kind I prefer.



Now suppose I show you a single jar (you can even hold it, as long as you don't peek inside). I carefully draw out a cookie---it's a chocolate chip! :)

The question is: does this new information, that we obtained a CC cookie in the draw, help us to guess which jar we have in front of us? According to the Bayesian system it does. You would guess (at least if you reason like me) that before the draw there was an equal chance that I had chosen the A or the B jar. Or let's just say that I let you pick at random. The quantity we want is P(A|CC), the probability that the jar is A, given that the cookie was a chocolate chip. Since the proportion of CC in jar A is higher than in B, it seems reasonable to expect that P(A|CC) is now more than 1/2. How much more?

As before, we can set up a frequency table which includes the marginal probabilities. Before the draw, we have:



We reason as follows. We have observed a chocolate chip cookie in the draw. The observations that were initially possible in the second row of the table (drawing one of the other cookies) were not obtained, so we draw a line through that row and recalculate the total. There are 30 CC cookies in total. Since 20 of these are in jar A, and assuming the jar was chosen randomly, the probability that the jar is A = 20/30.
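Here is a quick Python version of that counting argument; the jar contents are exactly the ones given in the setup, nothing else is assumed:

jars = {'A': {'CC': 20, 'other': 20},
        'B': {'CC': 10, 'other': 30}}

total_CC = sum(jar['CC'] for jar in jars.values())    # 30
print jars['A']['CC'] / float(total_CC)               # 20/30 = 2/3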



My nephew Evan doesn't like this reasoning. But I ask him (and you) to consider: what if we knew in the beginning that the B jar contained zero CC cookies? It would be pathological not to update our probabilities in that case. So, what if 39/40 were CC in jar A? It's a slippery slope, and we're sliding.

As Ernst Blofeld tells James Bond: "They have a saying in Chicago..."

Next time: a full Bayesian analysis of the cookie problem.

Parsing image links in Blogger XML



I wrote a script that I explained briefly in a previous post (download here: bloggerScript.py), which uses ElementTree in Python to parse the content from a Blogger XML archive for image links. If you're editing a post, or just using your browser in the normal way and looking at the source of a page which contains an image, you'll see something like

<a onblur="try ...

The "a" element's name is short for "anchor", that is, a hyperlink. It has two "attributes" (I assume they are named the same in HTML as XML): some javascript code and href="http..." (the link to the full-res image), as well as child elements including <img... with its own attribute src="http...". Note that when displayed as content text, this anchor element has the < symbol rather than the escape code &lt;, whereas in the source XML it is the escaped form. So the code for parsing image links is:


def parseImageLinks(s):
    # first pass: collect the text of each anchor ('<a ...') element
    links = list()
    i = s.find('<a')
    while i != -1:
        j = s.find('/a>',i+1)
        link = s[i+1:j]
        if link.startswith('a onblur'):
            links.append(link)
        i = s.find('<a',i+1)
    # second pass: pull the href (the full-res image URL) out of each anchor
    retL = list()
    for link in links:
        i = link.find('href')
        j = link.find('>')
        retL.append(link[i+6:j-1])
    return retL


We use the string method "find" to get the indexes flanking the substring of interest. A subtle bug that I had in a previous version was that I failed to specify i+1 in the line:

j = s.find('/a>',i+1)

The result was code that apparently worked, but found only the first image if there was more than one. This happens because, on the second time through the loop, i was correctly set to the start of the second element, but j was still the first end-tag, and with j < i, we get an empty string with s[i+1:j].


>>> s = 'abcde'
>>> s[3:2]
''


The second part of the function finds the href for each link and saves it in a list to return. I end up with the title of each post that has at least one image, plus all the image links. I just saved them in a text file.
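Here is a quick sanity check of parseImageLinks with a made-up content string; the URLs and the onblur value are hypothetical, just shaped like what Blogger produces:

content = ('Some text '
           '<a onblur="try {}" href="http://example.com/s1600-h/pic1.png">'
           '<img src="http://example.com/s400/pic1.png"/></a> more text '
           '<a onblur="try {}" href="http://example.com/s1600-h/pic2.png">'
           '<img src="http://example.com/s400/pic2.png"/></a>')

print parseImageLinks(content)
# ['http://example.com/s1600-h/pic1.png', 'http://example.com/s1600-h/pic2.png']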

Friday, October 30, 2009

More about XML


This post is in the category of "Google helps Tom get organized." It is so elementary that if you know anything you'd be much better off reading Mark's book Dive Into Python. But, if you know just enough to be dangerous, and are trying to figure out XML for Blogger, you've come to the right place.

XML consists of layers of elements, arranged with a single element as the root, and then various child elements below that, similar to HTML. How is it different? I'm not really sure. I found this on the web:

HTML is about displaying information, while XML is about carrying information

Does that help? No, me neither...

[UPDATE: XML uses tags <stuff>...</stuff> to organize data. They have to be properly "nested," so the end-tag for the root can't appear before the end-tags of all of its child elements, or the document wouldn't be valid XML. And that's why all of the formatting tags in the HTML that you see in the content of a blog post as XML have been changed to escape characters---like < converted to &lt; ]

XML may be formatted to show the structure explicitly, as in Pubmed XML, or not, as in Google Data XML. A formatted version of the data for a blog would look like this:


<feed>
  <child1></child1>
  <child2></child2>
</feed>


Here there is a root element named <feed> with various child elements. The document closes with the end-tag (</feed>) for the root element. In the case of Blogger, the child elements are (at least in my case): id, updated, title, link (4 of them), author, generator, and then a bunch of <entry> elements. The first 40 or so entry elements are metadata. It would be nice if they were named differently, but that's how it is.

Any element can have (or not have) child elements, marked with a begin-tag and an end-tag, which may themselves be complex. They can also have attributes, one or more name/value pairs within the start-tag:


<entry>
  <id>tag:blogger.com,1999:blog-8953369623923024563.post-2518535825745482316</id>
  <published>2009-09-08T05:34:00.000-04:00</published>
  <updated>2009-10-01T10:18:10.585-04:00</updated>
  <category
    scheme='http://schemas.google.com/g/2005#kind'
    term='http://schemas.google.com/blogger/2008/kind#post'/>


Here, category is a child of entry, and it has two attributes, scheme and term, which can be accessed from a Python dictionary when parsing with ElementTree. If there are no attributes, the dictionary is empty. Notice the subtle forward-slash in "#post'/>". That's an abbreviated way of writing the end-tag rather than the full </category>.

Finally, an element may have a text value, called its content.


Using ElementTree



It helps a lot to know what you're looking for. If you don't, you can find out by perusing the XML. In the case of ElementTree, we traverse the tree structure by doing:


item.getchildren()


which returns a list of the child elements. (Item is the variable name I've given to this particular element).

Attributes are found by looking in the attribute dictionary, obtained with

item.attrib

Child elements can also be searched for by name. The following two calls find the next element named title, or all of the elements named link, but only at the next level; any sub-levels are opaque.


item.find(t + 'title')
item.findall(t + 'link')


The variable t above is a string we need to add to the search text, and it's explained below. The content of an element is obtained like this:


item.text


Notice the difference between methods, which use the call operator (), and plain attributes, which don't.

One confusing thing about the Blogger data is that it is in the "Atom" format, which specifies that the tag of an element, obtained with

item.tag

is always prefaced with its namespace (in this case):

{http://www.w3.org/2005/Atom}

although it doesn't actually appear as such in the XML. So, for example, if you grab bloggerScript.py referenced in the previous post, and do:

print root.tag

The output is not 'feed' but

{http://www.w3.org/2005/Atom}feed

Furthermore, and most important, if you search as we did with "title" above, you must add the namespace tag to the search string. So t has the value:

t = '{http://www.w3.org/2005/Atom}'

Is that clear? One more thing: the content of a post, obtained by:

c = item.find(t + 'content')

and then:

c.text

holds the links embedded in the post. There don't seem to be any child elements for the content element: getchildren() returns an empty list. The links have to be parsed by hand. But it's easy---more about that next time.
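Putting the pieces together, here is a minimal sketch; the file name blog.xml is hypothetical, everything else is just the ElementTree usage described above:

from xml.etree import ElementTree

t = '{http://www.w3.org/2005/Atom}'
root = ElementTree.parse('blog.xml').getroot()    # root.tag is t + 'feed'

for item in root.findall(t + 'entry'):
    title = item.find(t + 'title')
    content = item.find(t + 'content')
    if content is not None and content.text:
        print title.text
        print content.text[:60]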

Thursday, October 29, 2009

Archiving my blog posts




Last time I talked about looking at the Google Data API to archive my blog posts. Of course, there is a download link on the blog under the Settings tab, but I didn't like the look of the XML I get from them. However, I realized that (naturally enough) I get exactly the same data through the API. So I stopped with Google Data, and started trying to parse what I have. I found Mark Pilgrim's book very useful here, not only because it's so clearly written, but also because it's directly on point.

ElementTree is used to parse the XML. According to the Atom specification, the root element is a "feed", which is qualified with its "namespace." This is its "tag":

{http://www.w3.org/2005/Atom}feed

The child elements of the feed are varied. These are their tags (minus the namespace):

• id
• updated
• title
• link (x 4)
• author
• generator
• entry (many)

You might think when you got down to an entry it would be one of the blog entries. But nope. It's still metadata... Each entry has a child (id) with a long prefix and then its text value is BLOG_PUBLISHING_MODE or BLOG_NAME and so on.

The first authentic post is entry number 58. I distinguish the real entries from metadata by testing whether the last character of the id is a digit, as in:

tag:blogger.com,1999:blog-8953369623923024563.post-799567705527714853
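A sketch of that test with ElementTree (again, blog.xml is a hypothetical name for the downloaded archive):

from xml.etree import ElementTree

t = '{http://www.w3.org/2005/Atom}'
root = ElementTree.parse('blog.xml').getroot()
entries = root.findall(t + 'entry')

posts = [e for e in entries
         if e.find(t + 'id').text[-1].isdigit()]
print len(entries), len(posts)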

An item, whether it's an entry or metadata or a child of an entry, may have attributes, stored in a Python dictionary. It may also have a text value, or not. I wrote a script to look through all this stuff (bloggerScript.py).

This post is long enough that I think I'll quit here and talk about parsing for the URL for each of my images another time. Here is the output for one element:


item 100
id:
tag:blogger.com,1999:blog-8953369623923024563.post-2518535825745482316
title:
Heat Mapper rises from the ashes
tag:
entry
children:
id
t: tag:blogger.com,1999:blog-8953369623923024563.post...
{}
published
t: 2009-09-08T02:34:00.000-07:00
{}
updated
t: 2009-10-01T07:18:10.585-07:00
{}
category
t:
k: term
v: http://schemas.google.com/blogger/2008/kind#post
k: scheme
v: http://schemas.google.com/g/2005#kind
category
t:
k: term
v: Instant Cocoa
k: scheme
v: http://www.blogger.com/atom/ns#
title
t: Heat Mapper rises from the ashes
k: type
v: text
content
t: <a onblur="try {parent.deselectBloggerImageGracefu...
k: type
v: html
link
t:
k: href
v: http://telliott99.blogspot.com/feeds/2518535825745...
k: type
v: application/atom+xml
k: rel
v: replies
k: title
v: Post Comments
link
t:
k: href
v: https://www.blogger.com/comment.g?blogID=895336962...
k: type
v: text/html
k: rel
v: replies
k: title
v: 0 Comments
link
t:
k: href
v: http://www.blogger.com/feeds/8953369623923024563/p...
k: type
v: application/atom+xml
k: rel
v: edit
link
t:
k: href
v: http://www.blogger.com/feeds/8953369623923024563/p...
k: type
v: application/atom+xml
k: rel
v: self
link
t:
k: href
v: http://telliott99.blogspot.com/2009/09/heat-mapper...
k: type
v: text/html
k: rel
v: alternate
k: title
v: Heat Mapper rises from the ashes
author
t:
{}
{http://search.yahoo.com/mrss/}thumbnail
t:
k: url
v: http://1.bp.blogspot.com/_39NGKVWYg3o/SqYr9tfuDJI/...
k: width
v: 72
k: height
v: 72
{http://purl.org/syndication/thread/1.0}total
t: 0
{}
links:
http://www.blogger.com/feeds/8953369623923024563/posts/default/2518535825745482316
content:
k: type
v: html
<a onblur="try {parent.deselectBloggerImageGracefu
images:
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitZvwkD8rqQHLVk3Bohd0JaAuLTwvcW6JHvrYscyHzEdp6aFBF1nAk4jojHc9OIJRwrnlk0FIBdNhoczY9mFqQ5rnzzebZzPZUtLYpyDa683dLD1owrqYzevH-CmbAd2drSuwkxNzLFGbb/s1600-h/Screen+shot+2009-09-08+at+6.02.38+AM.png

Wednesday, October 28, 2009

Google Data API - baby steps



The Google Data API is designed to make it easy to get data from Google and use it in your application. I did not find it easy. In fact, I really haven't figured it out yet.

What I want to do is to archive my web log, eventually in an automated way. I'd like to have HTML for the individual posts, preferably human-readable HTML similar to what I originally typed in the editor window. The Atom feed would be perfect since the format is compact. Then I could parse it somehow and go grab the original images from Blogger at full resolution. This is important because I haven't kept copies of those images. They're usually just screenshots that I typically discard.

Perhaps I would try to organize everything into directories, say individual posts grouped by months or by topic, and maybe change the image links so they point to the local copies of the images. Ideas:

• Export from Blogger
the XML isn't displayed properly by Safari

• the Atom feed looks nice, but I haven't figured out how to get all the posts at once. This looks like it works, but then you see it's been broken into two pages:

http://telliott99.blogspot.com/feeds/posts/default?max-results=500

• The standard URL does work

http://telliott99.blogspot.com/search?max-results=500

I can save either of these from Safari as a Web Archive (though that has changed---it is actually in a special Apple format). I can grab the source and save it as HTML, but it's pretty ugly HTML.


Google Data API



This API is designed to make it easy to get data from Google and use it in your application. I did not find it easy but I think I got it working a little bit---baby steps. I would love to find tutorials for this stuff.

I grabbed the Python client library for the Blogger data API and installed as usual. I ran:

./tests/run_data_tests.py

and everything looked fine. I didn't run the sample BloggerExample.py because the version they have requires a log-in and I didn't want to chance screwing up the blog. Digging in the source to try to change that, I just got lost. Eventually I found an introductory video on YouTube, but it doesn't go far enough. What I learned how to do from the video is this:


import gdata.blogger.service,sys
client = gdata.blogger.service.BloggerService()

def test1():
    base = 'http://telliott99.blogspot.com/'
    base += 'feeds/posts/default?'
    url = base + 'max-results=5'
    #url = base + 'max-results=500'
    feed = client.Get(url)
    #print '# entries', len(feed.entry)

    for entry in feed.entry[:2]:
        print entry.title.text
        print entry.id.text
        for c in entry.category:
            print c.term
        # it's the last link that matters
        li = entry.link[-1]
        print li.title
        print li.href

test1()


After a deprecation warning for use of the sha module, I get:


Hidden Markov Models (1)
tag:blogger.com,1999:blog-8953369623923024563.post-2952608856988019032
bioinformatics
Hidden Markov Models (1)
http://telliott99.blogspot.com/2009/10/hidden-markov-models-1.html

The way of the program
tag:blogger.com,1999:blog-8953369623923024563.post-5554856078488748140
thinking aloud
The way of the program
http://telliott99.blogspot.com/2009/10/way-of-program.html


I suppose the numbers are ids for the blog and individual posts.

So now we need to go farther... After looking more carefully (patiently) at the instructions here, I see that what I'm supposed to do is this:

http://www.blogger.com/feeds/profileID/blogs

Where the profileID is obtained from the URL displayed when I click to display my profile.


def test2():
    base = 'http://www.blogger.com/feeds/'
    profileID = '01151844786921735933'
    url = base + profileID + '/blogs'
    feed = client.Get(url)
    print len(feed.entry)

test2()


This just prints the number of blogs I have!

By reading more in the instructions, I finally got some real data:


def test3():
    base = 'http://www.blogger.com/feeds/'
    blogID = '8953369623923024563'
    url = base + blogID
    url += '/posts/default?max-results=500'
    feed = client.Get(url)
    print '# entries', len(feed.entry)
    e = feed.entry[0]
    print e.title.text
    print e.id.text
    #print e.content.ToString()
    print dir(e)

test3()


Output from the dir call includes: 'author', 'category', 'content', 'contributor', 'control', 'extension_attributes', 'extension_elements', 'id', 'link', 'published', 'rights', 'source', 'summary', 'text', 'title', 'updated'.

What I need to do:

• Figure out the URL to send to request a particular entry
• Figure out how to work with the xml data format I'm getting

Hidden Markov Models (1)




I want to start a series of posts about Hidden Markov Models or HMMs. In the spirit of the blog, these will be reports from someone who is a biologist by training, who struggled a bit with the mathematical ideas, and then found his way to a basic understanding. Nothing fancy, just an explanation that makes sense to me. And some Python code to go with it.

The best introductions that I know of are (unfortunately not free):

• The first 80 pages of Durbin
• A short article by Sean Eddy (PMID 15470472)
• A series of Dave Swofford lectures that got disappeared off the web
• A chapter by Anders Krogh in this book

Before we get into HMMs, let's consider an example from Krogh explaining Position-Specific Scoring Matrices, or PSSMs. Here is a set of 5 aligned sequences:



Assume we have evidence that these are sites with the same biological function (binding data, mutations, or whatever). We want to find more examples in new sequence that we would predict have the same function.

We could use regular expressions to search for this motif. Krogh uses this example of a regular expression to match either C. elegans or Caenorhabditis elegans:

"C[\.a-z]* elegans"

The notation formalizes the logic that we require first a capital C, then some stuff, then a space, followed by elegans. The stuff in the middle is either a period or any lower case letter from a to z, repeated any number of times (that's the *). The backslash is just to remind ourselves that we really do mean a period.

For the sequence example, we would use [AT] [CG] [AC] [ACGT]* A [TG] [GC].

But, one problem with this approach is that we haven't used the frequency information. So, for example, our RE does not distinguish the highly implausible sequence TGCT--AGG from the consensus sequence ACAC--ATC.
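A quick check in Python makes the point; gap characters are dropped, since a real search string wouldn't contain them, and the spaces in the pattern above were only there for readability:

import re

pattern = re.compile('[AT][CG][AC][ACGT]*A[TG][GC]')
for seq in ['ACACATC', 'TGCTAGG']:
    print seq, bool(pattern.match(seq))
# both print True: the RE can't tell them apart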

Here is a PSSM for the first three positions in the sequence alignment:



We could do something similar with the last three. But a PSSM has no simple way to deal with the variable spacing, particularly if there are multiple variable regions. This leads us to a simple HMM:



We proceed through a series of states which are considered to "emit" nucleotides with different frequencies, resulting in a probability for emission of any particular sequence. The states may have transition probabilities as well. In more complicated models a given sequence may be emitted by more than one series of states.

Using the basic model, we can calculate expected frequencies for different sequences. For example, the probability of the sequence ACAC--AGC is

0.8 x 0.8 x 0.8 x 0.6 x 0.4 x 0.6 x 1.0 x 0.2 x 0.8 = 0.0118

while for the second sequence TGCT--AGG we get

0.2 x 0.2 x 0.2 x 0.6 x 0.2 x 0.6 x 1.0 x 0.2 x 0.2 = 0.000023

Its probability is (satisfyingly) about 500 times smaller than the first one. As you know, multiplication is tedious for us (and slow as well as error-prone for very small numbers, even on a computer). So typically the model's probabilities are specified as logarithms. And we would usually compare the probabilities according to the model with a second model based on random nucleotides at the frequencies observed in the data, although when comparing two different sequences such normalization is factored out.
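Here is the same pair of calculations done in log space; the per-position factors are just the ones multiplied out above, not the full emission table from the figure:

from math import log, exp

factors = {'ACAC--AGC': [0.8, 0.8, 0.8, 0.6, 0.4, 0.6, 1.0, 0.2, 0.8],
           'TGCT--AGG': [0.2, 0.2, 0.2, 0.6, 0.2, 0.6, 1.0, 0.2, 0.2]}

for seq, probs in factors.items():
    log_p = sum(log(p) for p in probs)
    print seq, log_p, exp(log_p)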

You might have noticed that it was really lucky :) that both sequences had an A at the position two from the end. Otherwise, we'd have multiplied by zero for zero total probability. It doesn't make biological sense to discard an otherwise excellent match for this reason; the zero is probably a consequence of having a limited set of authentic sequences. We say that the model is over-fitted to the data, and adjust the counts using pseudocounts. So, every nucleotide that had zero probability (because it was never observed in the small set of authentic sequences) might receive a pseudocount of 1 (or 0.1, or whatever the modeler decides is reasonable).
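For example, here is a sketch of how pseudocounts might be applied to a single column of counts; the column below is made up, just the kind of thing 5 aligned sequences might give:

def column_frequencies(counts, pseudo=1.0):
    # add the pseudocount to every cell before normalizing
    total = sum(counts.values()) + pseudo * len(counts)
    return dict((nt, (c + pseudo) / total)
                for nt, c in counts.items())

col = {'A': 4, 'C': 0, 'G': 1, 'T': 0}
print column_frequencies(col)    # nothing gets probability zero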

As an aside, PSSMs work reasonably well for many problems. I've used them a lot, and maybe I'll write about that another time.

Tuesday, October 27, 2009

The way of the program



You're a scientist---like maybe a biologist. Why should you become a Python coder?



Because the future of Biology is data, and lots of it. Unfortunately, data does not always come in the form you would wish, or that the software you want to use expects. What Python is great at is massaging data to get it the way you need it and like it.

And the secret about Python is that it's very easy to use. What you do every day is hard, and you're smart. Python is easy.

Now, if you've never thought at all about how computer programs do what they do, it'll take you a little while to wrap your head around that, a few hours perhaps. But learning to use Python productively is also just a matter of a few hours. An afternoon or two, tops. Can you spare them?

Here's an example. I'm in iTunes,



having just loaded a bunch of songs so that I can put them on an iPod to play in the car. I get to thinking, I wonder if iTunes can export a playlist? Of course it can!

From the File menu do Library > Export Playlist…






You could bring it up in Word (just kidding) or … How about TextEdit? Just drop the file on the application icon. It looks like this:



Hmm… lots of what look like they could be columns (because they are separated by what look like they could be tabs). How about Numbers? Drop the same file on that application icon (or Excel, if you insist) and it looks like this:



I notice that there is a column that I haven't seen in the playlist; it's called Year (year of recording). To make things simpler, select that column in Numbers, then copy and paste it into a new Numbers document. It looks like this:



Now what? Numbers can export too. It looks like this:



Now bring that file up in TextEdit. It looks like this:



That file sure has a lot of commas in it. Well, sh*t. This is a job for … Python. Put the following in a file called script.py in the same directory as the file with the commas (like your Desktop). I named the file "fileWithLotsOfCommas"; it has a .csv extension if you look using Get Info.

The script.py file has:


FH = open('fileWithLotsOfCommas.csv')
data = FH.read()
data = data.replace(',','')
data = data.strip()
print data


Save it in a file made with TextEdit that is on your Desktop, named script.py. You want plain text. (If you have what looks like this in Text Edit:



choose the appropriate item from the menu, then do Save.

Now go to the Terminal and do:

python script.py > results.txt

The file looks like this:



Remove the two lines at the top by hand. Now, you probably don't have R installed yet. But I do. I go to R and do:

setwd('Desktop')
v = read.table('results.txt', head=F)
hist(v[,1], col='magenta',
     breaks=50, xlim=c(1960,2010))




My formula for the user's age: take the first peak and subtract 18.

Here is what Numbers gives. It's not nearly so pretty.


Heat Mapper rises from the ashes (2)



Here is the latest incarnation of HeatMapper. When I installed Snow Leopard almost two months ago the previous version, written using PyObjC, didn't work any more. I took Bill Bumgarner's advice and have been practicing to speak Cocoa and Objective-C since then.

The app makes heat maps, such as what you might find in R. It has an inspector window that allows customization of colors. The standard ones:




Here are topo colors:



For reasons that escape me, it was not written as a document-based application. But it does allow saving of UserPrefs. It also has some stuff that is probably only useful to me. It will color the row names according to a dictionary, but I haven't yet generalized this to allow the User to specify her/his own values. It can load csv data from a file.



This is the app I think is most functional, and I would really appreciate any feedback. As before, the source and the app (64-bit for Snow Leopard) are on my download page hosted at .mac. If you want a different build, or whatever, just let me know. You can get a hold of me here.

[UPDATE: The app seems broken (1.4 years later). That says more about my lack of comprehensive testing than anything else, I guess. Sorry. There is a link in comments about how to use R for this purpose, and I've posted about it too. For my part, now that I have matplotlib, I use that instead. When I get time I'll look into this. ]

NSPredicate

Here are two examples of elementary usage of NSPredicate. This is good stuff to know: how to do in Cocoa something that is second-nature in Python.

In the first one, we search an array of strings with the predicate @"SELF like [c] 'g*'": the object passes if it is "like" (equals, in a stringy way) a string that starts with g and has zero or more characters following, with [c] making the comparison case-insensitive.


NSArray *A = [NSArray arrayWithObjects:
    @"Tom",@"Sam",@"George",nil];
NSPredicate *p = [NSPredicate
    predicateWithFormat:
    @"SELF like [c] 'g*' "];
NSLog(@"%@",
    [A filteredArrayUsingPredicate:p]);


 (
George
)


In the second example, we have an array of dictionaries and test them for any that have an object for the key "firstName" that is contained in an array (derived from the same array of names we used above). Since we eliminated the last name (the range runs from 0 up to but not including 2), only the first two dictionaries pass the filter, as shown in the output:


NSMutableArray *mA;
mA = [NSMutableArray arrayWithCapacity:5];
NSMutableDictionary *mD;
for (id obj in A) {
    mD = [NSMutableDictionary
        dictionaryWithObject:obj forKey:@"firstName"];
    [mA addObject:mD];
}

NSIndexSet *is = [NSIndexSet
    indexSetWithIndexesInRange:
    NSMakeRange(0,2)];
NSArray *A2 = [A objectsAtIndexes:is];

p = [NSPredicate
    predicateWithFormat:@"firstName in %@", A2];
NSLog(@"%@",
    [mA filteredArrayUsingPredicate:p]);


(
{
firstName = Tom;
},
{
firstName = Sam;
}
)

More sophistication with NSArray

This Stack Overflow post highlights another NSArray method I haven't used:

- (id)valueForKey:(NSString *)key

This method
"Returns an array containing the results of invoking valueForKey: using key on each of the receiver's objects."
The example is to make an array containing all the values obtained for a given key. The objects could be custom objects, dictionaries, or whatever. Another answer says to do it by using the method below on an NSArray of objects:

- (id)valueForKeyPath:

as in:

[people valueForKeyPath:@"@distinctUnionOfObjects.state"]

It wasn't clear to me how one would know that. It's not listed under the NSArray documentation. And what is this @distinctUnionOfObjects thingie? Ah...it is added under KVC.

"Array operators allow you to specify operations to be performed on a collection's items using a key path of the form keyPathToArray.@operator.keyPathToProperty."

Next time I want the object in an array with the maximum value for a particular key, I'll have to remember to use @max.

C arrays of id objects!



Stackoverflow seems like a very nice site for Cocoa questions. I found this today, which helped me understand that in the middle of a function doing Cocoa stuff, you can still make a standard array of objects of type id.

id arrayObjects[arrayCount]

I never thought about that. In fact, one of my weaknesses is a failure to integrate barely remembered C++ and C stuff with Objective-C. The example also uses an instance method of NSArray that I hadn't noticed:

getObjects:range:

which, according to the class reference,
"copies the objects contained in the receiver that fall within the specified range to aBuffer… the size of the buffer must therefore be at least the length of the range multiplied by the size of an object reference."

I think the example is probably overkill since we want all of the objects. Also, is it really better to construct two arrays and then use

NSDictionary dictionaryWithObjects:forKeys:count:

or simply add objects and keys one by one. More efficient? I don't know. I also saw a reference to Google Toolbox for Mac. That should be an awesome resource. Thanks guys.

Monday, October 26, 2009

Sliders come unstuck



I have a design issue for HeatMapper. We want to pick intervals like, say, these for integers:

0-1
2-3
4-5
6-8
9-12

and so on, and assign a color to each interval. The current method is to use a table view, but I thought it would be interesting to design a kind of "equalizer" like you see in iTunes or a music app.

So I set up two sliders in the nib and bound their values to instance variables in my app that have been set up with @property and @synthesize as usual. I also put in a stepper (also bound to the instance variable #1) to change the value of slider #1. (The behavior is the same if you move the slider by hand).

The idea was to bind the minimum value of slider #2 to the current value of the instance variable that is bound to slider #1. Unfortunately, it doesn't work. In fact, what it does is so weird that I can't explain it.



We start with slider #1 at 0 and slider #2 at 3.
Now move slider #1 up in steps to 1, 2, 3, 4, 5. The reported value of the variable n2 never changes. And what the slider does is just bizarre:








Here is the output:


2009-10-26 14:59:20.362 Equalizer[3291:80f] n1 = 1
2009-10-26 14:59:20.366 Equalizer[3291:80f] n2 = 3
2009-10-26 14:59:28.898 Equalizer[3291:80f] n1 = 2
2009-10-26 14:59:28.899 Equalizer[3291:80f] n2 = 3
2009-10-26 14:59:37.971 Equalizer[3291:80f] n1 = 3
2009-10-26 14:59:37.971 Equalizer[3291:80f] n2 = 3
2009-10-26 14:59:43.691 Equalizer[3291:80f] n1 = 4
2009-10-26 14:59:43.692 Equalizer[3291:80f] n2 = 3
2009-10-26 14:59:49.371 Equalizer[3291:80f] n1 = 5
2009-10-26 14:59:49.372 Equalizer[3291:80f] n2 = 3

Bar Grapher



Next up, a Bar plotting program. I don't think there is a good reason to use this for real. But it is fun to use the Inspector window and watch things shifting around. Almost everything from the Inspector is done with bindings, and the main window updates automatically.

Here is the link to a download page hosted at my .mac pages.


Links to Cocoa Projects



I have 4 or 5 Cocoa projects at the version 0.1 stage. I'll be posting about them (briefly) and putting up links to a download page hosted at my .mac pages. There will be a link to the zipped files of the project (build directory removed), as well as a dmg holding the application, built to run under Snow Leopard. These links should last as long as I pay Apple the $100 every February.

When I finish that, I'm going back to Bioinformatics.

These projects have some functionality, and no back-breaking bugs that I know of. I hope someone else finds them useful. The first one is a reference browser called PubmedHelper. It's a document-based application which allows the user to build and navigate a database of Pubmed references.



References can be added in two ways. First, there is an auxiliary window with a WebView that is pointed at Pubmed---activated by a toolbar item. Then, if an article is selected in the browser, the app parses it and shows the pmid, title, etc. in the main window. There is another toolbar item to add it to the database. The second method is to import an XML file. You can build a set of up to 500 references on the Clipboard at Pubmed. Download the XML to a file, the app can parse that file and import the references. The data is saved to disk in plist format, so it could be easily processed further by the user for some other purpose.

The most likely source of bugs is in parsing Genbank's XML. I've tested it some, notably with 500 references from the Clipboard. Please let me know if you can break it.

Navigation is via forward and back buttons in the toolbar, but there is also a stepper as well as a custom control that allows a quick jump around in the database. I like this control a lot.

Useful example code in the project includes:

• showing a WebView (don't forget to link to the WebKit!)
• parsing XML by the event-based NSXMLParser
• parsing XML by using the NSXMLDocument class
• composing a URL to fetch data for a PMID from Pubmed

Enjoy. Feedback welcomed.

Thursday, October 22, 2009

XML Reader (4)



This is the fourth post in a series. The others are (here, here and here). We're reading XML today.

Actually, we've finished reading it, and now we want to filter the elements to get just what we want. One thing of interest is the code for sorting all the keys in the dictionary (helpful for printing in the debug phase). It seems like there should be an instance method for that in NSArray, but there isn't, so I wrote one called sortedKeys.

Since we're going to be writing to a plist, I substitute the string @"None" when a value is nil. Of course, if I ever come across a record with that as an authentic value, I'll be screwed. The correct way to do this for collection objects is to use NSNull, then process it appropriately when displaying to the screen. But you still can't save it in a plist. Probably, I should use @"TESpecialValueThatSimulatesNullAndCanBeStoredInAPlist". You'd never run into that in any other context.

That's it. I need to test this a lot more, and then try to build an app that will take a PMID (say by reference to a PubMed record shown in a browser view), fetch the XML, process it, and then display it nicely in the GUI. If I were sadistic I would try to build a lightweight alternative to Reference Manager. (Just look at that website---it's so ugly only a mother could love it). Many years ago, they basically destroyed my system with one of their early efforts and I've never used it since. But, the thing is, for such an app to be useful you'd have to interface with Microsoft Word. I may be sadistic but I'm not crazy!




....

- (NSMutableArray *)sortedKeys {
    NSMutableArray *mA = [[mD allKeys] mutableCopy];
    SEL theSel = @selector(caseInsensitiveCompare:);
    [mA sortUsingSelector:theSel];
    return mA;
}

- (void)reportAll {
    for (cKey in [self sortedKeys]) {
        NSLog(@"%@", cKey);
        NSLog(@"%@", [mD objectForKey:cKey]);
    }
}

- (NSMutableDictionary *)processEntries {
    NSMutableDictionary *mD2;
    mD2 = [NSMutableDictionary
        dictionaryWithCapacity:10];
    NSArray *keyL;
    keyL = [NSArray arrayWithObjects:
        @"PMID",
        @"MedlineTA",
        @"Title",
        @"Volume",
        @"MedlinePgn",
        @"Year",
        @"Abstract",
        @"AbstractText",
        nil];
    NSString *s;
    NSString *k;
    NSString *k2;
    id obj;
    for (k in keyL) {
        obj = [mD objectForKey:k];
        if ([obj isEqualToString:@"multiples"]) {
            k2 = [k stringByAppendingString:
                [NSString stringWithFormat:@"%i",1]];
            obj = [mD objectForKey:k2];
        }
        if (!obj) { obj = @"None"; }
        [mD2 setObject:obj forKey:k];
    }
    // finally, build the author list
    int i;
    int N = [[keyCountD
        objectForKey:@"LastName"] intValue];
    keyL = [NSArray arrayWithObjects:
        // some have one, some the other!
        @"FirstName",
        @"ForeName",
        @"Initials",
        @"LastName",nil];

    NSMutableArray *mA;
    mA = [NSMutableArray arrayWithCapacity:5];
    NSMutableDictionary *mD3;

    for (i = 1; i < N + 1; i++) {
        mD3 = [NSMutableDictionary
            dictionaryWithCapacity:3];
        for (k in keyL) {
            k2 = [k stringByAppendingString:
                [NSString stringWithFormat:@"%i",i]];
            s = [mD objectForKey:k2];
            if (!s) { s = @"None"; }
            [mD3 setObject:s forKey:k];
        }
        [mA addObject:mD3];
    }
    [mD2 setObject:mA forKey:@"authorList"];
    return mD2;
}

@end

XML Reader (3)



This is the third post in a series (others here and here). If you're following along at home, you'll need the code for main.m. This app is built from the command line---no Xcode. We pass in a file name and read it using NSUserDefaults. No error checking---now that'll put hair on your chest!

Just make sure that you have a file called article.xml on your Desktop, and that it's formatted correctly.


// gcc -o test test.m Reader.m -framework Foundation -fobjc-gc-only
// ./test -ifile article.xml
#import <Foundation/Foundation.h>
#import "Reader.h"

int main (int argc, const char * argv[]) {
    NSUserDefaults *args = [NSUserDefaults
        standardUserDefaults];
    NSString *fn =
        [args stringForKey:@"ifile"];
    NSURL *xmlURL =
        [NSURL fileURLWithPath:fn];

    Reader *R = [[[Reader alloc] init] retain];
    [R read:xmlURL];

    //[R reportAll];
    NSMutableDictionary *mD = [R processEntries];
    NSLog(@"final version %@", mD);

    fn = NSHomeDirectory();
    fn = [fn stringByAppendingString:
        @"/Desktop/x.plist"];
    NSURL *fu;
    fu = [NSURL fileURLWithPath:fn];
    NSLog(@"write %@", fu);
    [mD writeToURL:fu atomically:YES];
    return 0;
}

XML Reader (2)



What we're going to do here is to convert a Genbank record (received as XML) into a plist with just the values we're interested in, like the example above. The first post in the series is here.

Below is the code for about the first half of the Reader class. We call it by sending it the message read:(NSURL *)xmlURL. Parsing is very straightforward. But how to deal with duplicates?

What I do is, when I get parser:didStartElement:..., I check to see whether the element name is already a key in the dictionary. If it is, we convert the key from, say, "FirstName" to "FirstNamei", where i is the number of these things we have seen, including the current one. But this only happens for elements that are multiples. So the abstract is there under its own key ("Abstract" or "AbstractText"), but for multiple authors "FirstName" is set to "multiples" and then there are "FirstName1", "FirstName2", etc. Simple, but effective.

Dealing with Genbank sure is fun. For example, one record has "ForeName" as the key, another has "FirstName." Why in the world would they do that?


#import <Foundation/Foundation.h>
#import "Reader.h"

@implementation Reader
@synthesize mD;
@synthesize keyCountD;

- (id)init {
    self = [super init];
    if (nil == self) return nil;
    [self setMD:[NSMutableDictionary
        dictionaryWithCapacity:20]];
    [self setKeyCountD:[NSMutableDictionary
        dictionaryWithCapacity:5]];
    return self;
}

- (void)read:(NSURL *)xmlURL {
    NSXMLParser *parser;
    parser = [[NSXMLParser alloc]
        initWithContentsOfURL:xmlURL];
    [parser setDelegate:self];
    [parser
        setShouldResolveExternalEntities:YES];
    BOOL success = [parser parse];
    if (success) { NSLog(@"yes"); }
    else { NSLog(@"no"); }
}

- (void)parser:(NSXMLParser *)parser
        didStartElement:(NSString *)elementName
        namespaceURI:(NSString *)namespaceURI
        qualifiedName:(NSString *)qName
        attributes:(NSDictionary *)attributeDict {
    //NSLog(@"didStart %@ %@ %@ %@",
    //elementName,namespaceURI,qName,attributeDict);
    NSString *k = elementName;
    int n;
    // we check for multiple values
    if ([[mD allKeys] containsObject:k]) {
        // how many have we seen
        n = [[keyCountD
            objectForKey:k] intValue];
        // special for the first dup item seen
        if (n == 1) {
            // fetch what's in the dict
            id obj = [mD objectForKey:k];
            // get new key for this first value and set obj
            cKey = [elementName stringByAppendingString:
                [NSString stringWithFormat:@"%i",n]];
            [mD setObject:obj forKey:cKey];
            // mark the original value as multiple
            [mD setObject:@"multiples" forKey:k];
        }
        // now get a value for the current key and count
        n = n + 1;
        cKey = [k stringByAppendingString:
            [NSString stringWithFormat:@"%i",n]];
    }
    else {
        n = 1;
        cKey = k;
    }
    // don't forget to save the count
    [keyCountD setObject:[NSNumber numberWithInt:n]
        forKey:k];
    cValue = [NSMutableString
        stringWithString:@""];
}

- (void)parser:(NSXMLParser *)parser
        foundCharacters:(NSString *)s {
    //NSLog(@"found: %@", s);
    [cValue appendString:s];
}

// this function is only called for a terminal leaf !!
// e.g. key 'Author' has no value to set
// but only for new articles!
- (void)parser:(NSXMLParser *)parser
        didEndElement:(NSString *)elementName
        namespaceURI:(NSString *)namespaceURI
        qualifiedName:(NSString *)qName {

    //NSLog(@"didEnd %@", elementName);
    NSCharacterSet *cs = [NSCharacterSet
        characterSetWithCharactersInString:@" \n"];
    NSString *s2;
    if (!(nil == cKey)) {
        s2 = [cValue
            stringByTrimmingCharactersInSet:cs];
        [mD setObject:s2 forKey:cKey];
        //NSLog(@"object %@ for key %@", cValue, cKey);
    }
}