FIRE: User Information (April 2009)

Program details and applications

For more information regarding the details and applications of FIRE, see the following publication (at the time or writing this, the manuscript is in review; once the manuscript is accepted a full citation will be provided): Durand PM, Hazelhurst S and Coetzer TL: The potential for evolutionary rates at codon sites to align sequences and infer protein domain function.

Implementing FIRE

To use the FIRE program all you need to have installed on your computer is fire.py (the FIRE program) and Python. There are also some optional features of FIRE that require additional installations. BUT if you do not use these optional features you do not need to install these

to produce pictorial views of the alignment, you need to install Gnuplot, the Python Gnuplot library (and dependencies like numpy and scipy). The program uses Gnuplot is imported into the program to produce a FIRE plot for the alignment. For information on downloading and using Python and Gnuplot on various operating systems, you will need to visit the relevant websites. It's fairly easy on Apple and Linux distributions and definitely possible on Windows.
to produces the actual sequence alignment, you must have installed the Biopython module

The fire.py program file is available at Scott Hazelhurst's webpage http://dept.ee.wits.ac.za/~scott/fire. The program is coded in Python and therefore requires that the Python software is installed on your operating system (Linux, Windows or Apple).

Run the program from the command line using the appropriate Linux/Unix or Windows convention. For example, to run it in a Unix-like operating system, you would say:

python fire.py file1 file2

where file1 and file2 are the input files.

The input files should be plain text files (not Word or RTF). FIRE assumes that the omega values are in one of the columns (see below for the options for specifying which column it's in).

FIRE produces the following output

a raw and normalised score of the optimal alignment
optionally, a histogram showing the quality of the alignment
optionally, a plot of the alignment of the two files (requires Gnuplot).

The x-axis shows the position of the codons in the two aligned sequences. The y-axis shows the percentage similarity.
optionally, an alignment of the sequences (requires biopython to be installed)

Options

(Note there have been some changes from an earlier version of the program. There is one change of substance: we have separate gap open and extension penalties. A superficial but important change is that the options have been changed so that FIRE can run without reliance on any external libraries. Before, you could switch off graphical ouptput if you wanted. Now, you have to opt-in -- this means that the program will run without any reliance on gnuplot.)

At the moment, it is possible to change a few parameters when running FIRE. The parameters that can be changed are: the window frame in the GNUPLOT, plotting histograms in output, the gap penalty, and the max value in calculating the alignment score.

`-h`	`fire.py -h` gives the usage. Note that the options can be given in two forms (e.g. -h / --help and -G / --gnuplot). In this document only the short form is shown, but the help option will show you the long form)

-G
Produce graphical output
-w "window-list"
What windows should be plotted. FIRE shows the similarity between the two sequences at each point. It is useful to be able to show not only at each point what the similarity is, but also at windows around each point. By default FIRE shows the similarity in window length of 1, 5 and 10. So, at position 20 on the x-axis, you can see (a) what the similarity is between the two sequences at position 20, (b) what the average similarity is between the two sequences from position 18 through 22 (window length 5), and (b) the average similarity between the two sequences from positions 15 through 24 (window length 10).

You can change which windows are shown with the -w option. You can have any number of windows you like. For example, to show windows of length 8, 16 and 24 you could say:
```
python fire.py  -w "8,16,24" file1 file
```
Note that for window lengths of less than size 4, the values are shown as points; for window lengths greater than this, the graph is shown with lines.
-e filename

By default, FIRE uses Gnuplot to show the graph on the screen (using X or Aqua). If you want to send output to a file using encapsulated PostSript, use the -e option:
```
python fire.py -e out.eps fire.py
```
send output to a file called out.eps. It is a known (Gnuplot) bug in some systems that when EPS output is used only the last window given in the -w option is shown.

-H

Produce histogram output. This produces a table summarising the similarity between the two sequences. It shows for each decile, how many of the codon pairs matched with a given level of similarity. For example:

 0%- 10%:    3
 10%- 20%:    3
 20%- 30%:    2
 30%- 40%:    0
 40%- 50%:    3
 50%- 60%:    7
 60%- 70%:   10
 70%- 80%:    4
 80%- 90%:   12
 90%-100%:   37

shows that in a given alignment, 37 of the pairs align with
similarity between omega values between 90 and 100%, 12 between 80%
and 90% and so on....

-g val

This is the gap open penalty. The default gap open penalty to represent indels is -0.5. You can use -g to change this.

NB: This value may change in future versions of the program
-x val

This is the gap extension penalty. The default gap extension penalty is -0.1

NB: this value may change in future versions of the program
-c val

The maximum omega cut-off value. Default is 1.5. We cap the maximum difference between two omega values at 1.5. The rationale for this is that a gap of 1.5 or more have similar evolutionary meaning: the two sites are under extremely different pressures. While allowing a larger cut-off might allow more accuracy in some cases, it makes calibrating smaller evolutionary differences difficult. But, you can use different values.
-s

Show the alignment. The input sequences are aligned based on the FIRE alignment. The input sequence data is taken from the omega input files. It is ASSUMED that the amino acid is the second column of the input file. If not, the -s option is unlikely to work.
-f sequenceformat

Which format the alignment should be produced in. The default alignment format is fasta. The allowable values are whatever values your current version of Biopython supports. Likely formats are: fasta, phylip, clustal, stockholm. Later versions may have genbank, nexus and tab.
-r regexp

This is used to specify where in the data file, the omega values can be found. There are some built-in possibilities
- 1 The omega value are found in column 1
- m2 The omega values are found just before a +-
  
  This is the format that PAML v4 using model M2 produces
- m3 The omega values are found immediately after a close
  
  parenthesis ")". This is the format that PAML using model M3 produces
The default is m3

(If you know Python, you can also enter any valid Python regexp. The regular expression must have two groupings: the first group should find the amino acid for the row (possibly null) and the second group the omega value for example "A d+ ([A-Z]).) +(d+.d+)")

Background

Omega values are aligned using a modified Needleman-Wunsch algorithm on the Omega estimates in the two .txt files. The two sequences of Omega estimates are aligned so as to maximise the FIRE score, which is the sum of all individual codon scores. A codon score, cs, measures the similarity between two aligned omega values in the range [0,1]. The maximum difference between two omega values has been capped to OMAX (default 1.5), but is parameterisable.

Suppose in an alignment of the two sequences at a particular position, the omega score in sequence 1 is o1 and the omega score in sequence 2 is o2.

Then, cs(o1, o2)= 0, if abs(o1 - o2)> OMAX and: cs(o1, o2)= 1 - abs(o1 - o2)/OMAX otherwise.

The score of alignment is the sum of cs values of all aligned codons. FIRE uses a dynamic programming algorithm to find the best alignment.

The raw score is computed, as well as a normalised score which is the raw score divided by the length of the smaller of the two sequences.