Monday, November 29, 2010

Notes on Phylip file format

The file format for this well-known suite of programs (here) is described in the main documentation file (here) in the section labeled Data file format. It is, as Prof. Felsenstein says, a "rather stereotyped input and output format," and I thought I would explore it a bit to reinforce my understanding of some of the idiosyncratic rules that weren't completely clear to me at first. I did this by running the program I used for testing (protdist) as a Unix process, as described previously (here).

The most important thing I learned is that for some errors, the process does not return an error code; it returns 0. So you can't rely on the exit status, which would be the faster way to test files for correct formatting. You have to get a special lime-green terminal window up and type the commands in yourself.
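One possible workaround, which I have not verified against protdist itself, is to capture the program's output and search it for the word ERROR instead of trusting the exit status. The sketch below assumes the error message also shows up in the captured output when the program runs as a subprocess. The helper name run_and_check is made up for illustration, and a small Python one-liner stands in for a program that prints an error but still exits with 0.

```python
import subprocess
import sys


def run_and_check(cmd, stdin_text=""):
    """Run cmd and return (ok, output).

    ok is False if the exit status is nonzero OR the captured output
    contains 'ERROR' (assumption: the message reaches stdout/stderr).
    """
    result = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True)
    output = result.stdout + result.stderr
    ok = result.returncode == 0 and "ERROR" not in output
    return ok, output


if __name__ == "__main__":
    # Stand-in for a program that reports an error yet exits with status 0.
    fake = [sys.executable, "-c", "print('ERROR: end-of-line or end-of-file')"]
    ok, output = run_and_check(fake)
    print(ok)
    print(output.strip())
```

For a real run, cmd would be the path to protdist and stdin_text whatever keystrokes its menu expects; both depend on the local setup.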

Another important rule is that in sequential ('aligned') format (all of seq 1, followed by all of seq 2, etc.), this program at least does not allow blank lines between the sequences. The error is:

ERROR: end-of-line or end-of-file in the middle of species name for species 2


Yet, as mentioned, when run as a process this does not return an error. Other details:

  • The number of sequences and the number of characters go on line 1.
  • Aligning them at the left margin with a single space between them works.
  • Another character (';') between them does not, though the process returns 0.

  • The number of sequences must be correct.
  • The declared number of chars can be less than, but not more than, the number actually present.

  • As the docs say, the sequences must have names.
  • The names must have <=10 chars, and be filled out to 10 with spaces.
  • If a name is 10 chars, the sequence should start immediately, no space.

  • Spaces within sequences, like groups of 10, are optional.
  • Sequences don't have to start at the left-hand margin after the first line.

  • Species name cannot be on a separate line.
  • A blank line between groups of seqs is optional in interleaved format.
  • A blank line is not allowed in aligned format (one whole sequence, then another).
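
The rules above for the aligned (sequential) format can be collected into a small checker. This is only a sketch of my reading of those rules, not a reimplementation of what protdist does internally: the function name parse_sequential is invented here, and silently dropping characters beyond the declared count is my assumption about how "fewer than actually present" is handled.

```python
def parse_sequential(text):
    """Parse sequential-format text into a list of (name, sequence) pairs.

    Raises ValueError when one of the rules listed above is broken.
    """
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty input")

    # Line 1: two integers separated by whitespace (';' is rejected).
    fields = lines[0].split()
    if len(fields) != 2 or not all(f.isdigit() for f in fields):
        raise ValueError("line 1 must be two integers separated by whitespace")
    n_seqs, n_chars = int(fields[0]), int(fields[1])

    records = []
    i = 1
    for k in range(1, n_seqs + 1):
        # No blank lines between sequences, and every sequence must exist.
        if i >= len(lines) or not lines[i].strip():
            raise ValueError(f"missing or blank line where species {k} should start")
        # The name occupies columns 1-10, padded with spaces.
        name = lines[i][:10].strip()
        if not name:
            raise ValueError(f"species {k} has no name in columns 1-10")
        # Spaces inside the sequence are optional, so strip them all.
        residues = "".join(lines[i][10:].split())
        if not residues:
            raise ValueError(f"species {k}: sequence must start on the name line")
        i += 1
        # Continuation lines may be indented; keep reading until enough chars.
        while len(residues) < n_chars:
            if i >= len(lines) or not lines[i].strip():
                raise ValueError(f"species {k} has fewer than {n_chars} characters")
            residues += "".join(lines[i].split())
            i += 1
        # Assumption: characters beyond the declared count are ignored.
        records.append((name, residues[:n_chars]))

    if any(line.strip() for line in lines[i:]):
        raise ValueError("more data than the declared number of sequences")
    return records


if __name__ == "__main__":
    good = "2 6\nseq1      ACDEFG\nabcdefghijACD\n   EFG\n"
    print(parse_sequential(good))
```

The second record in the example uses a full 10-character name, so its sequence starts immediately in column 11, and it continues on an indented second line.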