Python for Bioinformatics: Xgrid: getting data in and out

My goal is to run BLAST on a small grid of lab machines to process Pyrosequencing data. Actually, I'm trying to get ready, even though we don't have funding to do the project yet. So far, I have a grid with a single machine in all three roles: client, controller and agent.

There are lots of resources available; see the list here. I described in that post how I started up the xgridcontroller daemon and obtained Xgrid Admin, which gives me a view of Xgrid jobs. The daemon is visible in Activity Monitor (choose "Other User Processes") or from Terminal with ps -A. It's curious that it comes back after a reboot. The Xgrid controller can be administered from the command line:

sudo xgridctl controller start
sudo xgridctl controller status
sudo xgridctl c stop
sudo xgridctl c off

or by using Xgrid Admin. Note that the earlier examples used sudo for xgrid commands, but this is not necessary. In browsing the manual for xgrid, I discovered that you can do this (for my bash shell):

export XGRID_CONTROLLER_HOSTNAME=localhost
export XGRID_CONTROLLER_PASSWORD=abcd
xgrid -job run /usr/bin/cal

Now I can skip the -h 127.0.0.1 -p <password> part of my commands.
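The same idea can be sketched from Python, for the case where xgrid is launched from a script rather than from the shell. The helper name xgrid_env and its defaults are made up for illustration; only the two variable names come from the xgrid manual. The sketch is written for Python 3.

```python
import os


def xgrid_env(hostname="localhost", password="abcd", base=None):
    """Return a copy of the environment with the two Xgrid variables set.

    Hypothetical helper: only the variable names come from the xgrid manual.
    """
    env = dict(os.environ if base is None else base)
    env["XGRID_CONTROLLER_HOSTNAME"] = hostname
    env["XGRID_CONTROLLER_PASSWORD"] = password
    return env


if __name__ == "__main__":
    env = xgrid_env()
    # This dictionary is what would be handed to a child process as env=...
    print(env["XGRID_CONTROLLER_HOSTNAME"])
```

On a machine with Xgrid installed, something like subprocess.call(["xgrid", "-job", "run", "/usr/bin/cal"], env=xgrid_env()) should then behave like the shell version, though that part is untested here.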

Before we get to the subject of this post, let's talk a little bit about where Xgrid runs and what you're allowed to do. According to the Apple docs, the jobs run on the agent as user "nobody." The docs discuss users and permissions here, but don't say much about user "nobody" beyond that it "provide(s) minimal permissions."

According to this FAQ from the mailing list:

The second ("nobody") provides only very minimal privileges, since it assumes that the agent doesn't trust the client. This is the most common reason why jobs that attempt to read or write outside of, e.g., /tmp, will get a permission error.

And if you look at this file:
/usr/share/sandbox/xgridagentd_task_nobody.sb

there are a couple of fairly complicated regular expressions

(allow file-read* (regex "^/(bin|dev|(private/)?(etc|tmp|var)|usr|System|Library)(/|$)"))
(allow file-read* file-write* (regex "^/(private/)?(tmp|var)(/|$)"))

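which I can't decipher completely, but I interpret them as follows. The first allows reading under /bin, /dev, /etc, /tmp, /var (the last three optionally prefixed by /private), /usr, /System and /Library. The second allows both reading and writing, but only under /tmp and /var (again optionally prefixed by /private). So these rules restrict where "nobody" can read, and restrict writing even more tightly. And according to the mailing list entry, "the best solution to this problem is to enable Kerberos."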
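One way to get a feel for the two rules is to try some paths against them. This is a rough sketch that assumes Python's re module reads these patterns the same way the sandbox does, and it ignores every other rule in the profile, so treat the answers as a guide only. The function name access is made up.

```python
import re

# Patterns copied from the two sandbox rules quoted above.
READ_RULE = re.compile(
    r"^/(bin|dev|(private/)?(etc|tmp|var)|usr|System|Library)(/|$)"
)
READ_WRITE_RULE = re.compile(r"^/(private/)?(tmp|var)(/|$)")


def access(path):
    """Classify a path under these two rules only (other rules are ignored)."""
    if READ_WRITE_RULE.search(path):
        return "read+write"
    if READ_RULE.search(path):
        return "read"
    return "no match"


if __name__ == "__main__":
    for p in ["/usr/bin/python", "/tmp/out.txt", "/Users/te/Desktop/temp"]:
        print(p, "->", access(p))
```

Under that assumption, a path in a home directory such as /Users/te/Desktop/temp matches neither rule, which fits the FAQ's remark about permission errors outside of /tmp.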

But we're getting ahead of ourselves. The question for today is: how do we get data and code to the agent, and data back again? According to the Apple docs:

You have the option of supplying an input file or a directory of files. If you supply an input directory, it is copied to each agent and becomes the working directory for the executable file.

In an example I used earlier, there is a one-line Python script, script.py, with:
print 'Hello Python world!'
in a directory temp on my Desktop. Working from the Desktop directory I did:

xgrid -job run -in temp /usr/bin/python temp/script.py

Hello Python world!

As the docs say:

Important: You have the option of providing a relative path or an absolute path when specifying executable files, input files and directories, and output files and directories. When a relative path is used, the executable and the input files or directories are copied to the agents, and the output files or directories are created for every agent and collected by the controller. If you specify an absolute path to the executable, input, or output files or directories, those files are assumed to exist on the agent computers, or to be available to the agents as part of a shared file system, at the path location specified. They are not copied or created.

Executive summary:

• relative path: copied to the agent
• absolute path: assumed to exist on the agent
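The rule is simple enough to write down as a tiny check. This is only a restatement of the summary above, not anything xgrid provides; the function name is hypothetical, and posixpath is used because the paths in question are Mac OS X paths.

```python
import posixpath


def xgrid_path_handling(path):
    """Summarize the documented rule for a path given on the xgrid command line."""
    if posixpath.isabs(path):
        return "assumed to exist on the agent"
    return "copied to the agent"


if __name__ == "__main__":
    for p in ["temp", "/Users/te/Desktop/temp"]:
        print(p, "->", xgrid_path_handling(p))
```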

The version of the above example that I posted the other day, and another here, are subtly different. They provided a full path, /Users/te/Desktop/temp, to the temp directory, which would therefore be "assumed to exist on the agent computers, or..." What we want is a relative path.

According to the man page for xgrid, there are more options:

-si stdin     for submit/run, file to use for standard input
-in indir     for submit/run, working directory to submit with job
-so stdout    for run/results, file to write the standard output stream to
-se stderr    for run/results, file to write the standard error stream to
-out outdir   for run/results, directory to store job results in
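If the command is going to be assembled from a script, the options above map naturally onto keyword arguments. Here is a minimal sketch; the function xgrid_run_args is hypothetical, it uses only the flags listed in the man page excerpt, and the order in which it emits them is a guess at what xgrid accepts rather than something checked against the tool.

```python
def xgrid_run_args(executable, args=(), indir=None, outdir=None,
                   stdin=None, stdout=None, stderr=None):
    """Build an argument list for `xgrid -job run` (hypothetical helper)."""
    cmd = ["xgrid", "-job", "run"]
    # Flag names are the ones listed in the man page excerpt above.
    for flag, value in (("-si", stdin), ("-in", indir), ("-so", stdout),
                        ("-se", stderr), ("-out", outdir)):
        if value is not None:
            cmd += [flag, value]
    return cmd + [executable] + list(args)


if __name__ == "__main__":
    print(" ".join(xgrid_run_args("/usr/bin/python", ["script.py"], indir="temp")))
```

The resulting list could be passed to subprocess.call on a machine where xgrid is installed.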

The docs say:

Use the -in parameter to pass an input directory. This directory is copied to each agent and becomes the working directory on the agent’s host computer. You can include anything needed in the working directory, such as additional input files, libraries, and executables. The executable file is run in this directory.

Thinking about this, I realized there is another issue with the xgrid command above. Even though it works, what it really should be is:

xgrid -job run -in temp /usr/bin/python script.py

which also works. Since the working directory on the agent is the copy of temp, we don't need the directory name before script.py.
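To make the "copied, then used as the working directory" behaviour concrete, here is a local stand-in that needs no grid at all. It is a sketch of what the docs describe, not of how Xgrid is implemented: the function name run_in_copied_dir is made up, the scratch location is just a temporary directory, and the code is Python 3.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_copied_dir(indir, script_name):
    """Copy indir to a scratch location and run script_name from inside the copy.

    A local stand-in for what the docs describe, not how Xgrid is implemented.
    """
    with tempfile.TemporaryDirectory() as scratch:
        workdir = Path(scratch) / Path(indir).name
        shutil.copytree(indir, workdir)
        result = subprocess.run(
            [sys.executable, script_name],
            cwd=workdir,  # the copy is the working directory
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        temp = Path(d) / "temp"
        temp.mkdir()
        (temp / "script.py").write_text("print('Hello Python world!')\n")
        print(run_in_copied_dir(temp, "script.py"), end="")
```

Because the interpreter starts inside the copy, the bare name script.py is enough to find the script.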

I placed a file with some DNA sequence named seq.txt in the temp directory. The script xgrid.script.py opens the file and prints the data.

fn = 'seq.txt'
FH = open(fn)
data = FH.read()
FH.close()
print data

xgrid -job run -in temp /usr/bin/python xgrid.script.py

>DA19
GGGAGAGTAGCCGTG...

What about getting data back again? Well, I mean, other than the way we've been doing it (through standard output) :) The script out.script.py writes a file instead:

fn = 'out.txt'
FH = open(fn,'w')
FH.write('got here')
FH.close()

xgrid -in temp -out temp -job run /usr/bin/python out.script.py

It works! We'll save stderr (-se) for next time, with BLAST.
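The round trip can also be imitated locally, which is handy for testing a script before sending it to the grid. This sketch assumes, purely for illustration, that everything left in the working directory is returned to the output directory; which files Xgrid really sends back is not something this code establishes. The name run_and_collect is hypothetical, and the code needs Python 3.8 or later for dirs_exist_ok.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_collect(indir, outdir, script_name):
    """Run script_name in a scratch copy of indir, then copy the results to outdir."""
    with tempfile.TemporaryDirectory() as scratch:
        workdir = Path(scratch) / "work"
        shutil.copytree(indir, workdir)
        subprocess.run([sys.executable, script_name], cwd=workdir, check=True)
        # Assumption: everything left in the working directory is returned.
        shutil.copytree(workdir, outdir, dirs_exist_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        temp = Path(d) / "temp"
        temp.mkdir()
        (temp / "out.script.py").write_text(
            "FH = open('out.txt', 'w')\nFH.write('got here')\nFH.close()\n"
        )
        # Same directory for input and output, as in the xgrid command above.
        run_and_collect(temp, temp, "out.script.py")
        print((temp / "out.txt").read_text())
```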

Monday, November 9, 2009
