
| Top | Definitions | Samples | Notes | Release | Disclaimer |
Filo is a command-line Java program written by Michael Charleston and Katherine Holt at the School of IT, University of Sydney, distributed free of charge and with no warranty whatsoever. It was partly funded by a Discovery Project awarded by the Australian Research Council (DP0770991) and we are very grateful to the ARC for their support. It is designed to be used to simulate (molecular) sequence data used in phylogenetic analyses under very general conditions, including:
Filo can be run either entirely from command-line arguments or from one or more input files. The command line is more restrictive, because a long list of arguments quickly becomes awkward to write, whereas an input file expresses the same settings in a more readable way. The input file uses a simple syntax that is consistent with the NEXUS format, with the commands placed in a "Filo" block.
Filo can be used to run many simulations in batches. The input file can be thought of as a script of commands that are run in order. The idea is that you can define models very generally and apply them to any part of a tree: branch lengths of course, but also a different matrix or different indel rates on each branch. You can also combine trees that each have their own sets of parameters. The parameters are grouped into general, tree and branch parameters.
(alphabet | -a) {
( '<' ( "DNA" | "nucleotide"
| "RNA"
| "aminoacids" | "aa"
| "bin" | "binary" ) ,
<string>
where
( "p" | "pi" | "baseFreq" ) <doubleArray>
This defines the base frequencies used at the base of a tree. By default these are uniform, so if you are using 4 states (as with nucleotides) the base frequencies will be 0.25, 0.25, 0.25, 0.25.
( "birthrate" | "birth" ) <real>
This defines the birth rate per unit time of each lineage in the tree bifurcating into two lineages, as used in the Markov / Yule model of tree growth. By default this is 1. See also deathRate.
( "branchparams" | "branchparameters" | "bp" )
<brparamsLabel> ,
[ "=" ] {
<deletionRate> ,
<insertionRate> ,
<indelRate> ,
<sequenceLength> ,
<matrix> }
";"
This defines a set of parameters identified by the label
<brparamsLabel>. Details for these parameters are
given elsewhere.
( "deathrate" | "death" ) <real>
This defines the death rate per unit time of each lineage in the tree, that is, the rate at which a lineage goes extinct, as used in the birth/death model of tree growth. By default this is 0, corresponding to a pure birth process. See also birthRate. Note that both birth rate and death rate must be non-negative. Incidentally it is a standard result that the distribution of shapes of trees grown under a birth/death process is affected only by the ratio of birth rate to death rate, not by their individual values.
( "deletionLength" | "delLength" ) [ "=" ] <real>
This sets a parameter related to the mean deletion length. The actual length used in the simulations is drawn from a log-normal distribution, but with a minimum of 1, so this parameter is not the true mean length. It is, however, representative of the spread of deletion lengths.
( "deletionRate" | "delRate" ) [ "=" ] <real>
This sets the rate per unit time of a deletion event at each site in the sequence.
[ <real> [ "=" ] <real> ... <real> ]
This reads in a set of real numbers separated by whitespace and surrounded by square brackets (a <doubleArray>). It is often used to define base frequencies (see baseFreq) or matrices (see matrix).
"filename" [ "=" ] <string>
This sets the root of the name of the output files. For example, if you specify filename myfile here then all the output files will begin with the string "myfile", followed by indices showing the experiment and trial number, followed by a suffix to indicate the file type (see format, below).
( "fasta" | "nexus" | "phylip" | "raw" | "nogap" | "trees" | "treeview" )
These are the output formats (<filetype>) currently supported by Filo. FASTA, NEXUS and Phylip are very well known. The output for these files includes gaps. "raw" or "nogap" files are FASTA format but with gaps removed, for instance for use in alignment testing. "trees" or "treeview" format is for use in the TreeView programs, and is also readable by FigTree and many other programs. Such files do not contain sequence data.
"format" <filetype> { "," <filetype> }
This defines the format(s) of the output to be used. See
filetype.
"indelLength" [ "=" ] <real>
This sets the length parameter for both insertions and deletions. See deletionLength for more details.
( "indel" | "indelRate" ) [ "=" ] <real>
This sets the rate of both insertions and deletions. See deletionRate for more details.
( "insertionLength" | "inLength" ) [ "=" ] <real>
This sets the length parameter for insertions. See deletionLength for more details.
( "insertionRate" | "inRate" ) [ "=" ] <real>
This sets the rate parameter for insertions. See deletionRate for more details.
( "matrix" | "-m" ) <string> [ "=" ] ( <doubleArray> | <JC> | <K80> | <F81> | <HKY> | <F84> | <TN93> | <GTR> )
This defines a matrix in one of several ways. The first simply sets all the entries of the matrix, row by row. Some checking is done but the responsibility is on the user to ensure these make sense. The named matrices deserve a subsection of their own. Oh look, here's one:
"JC" <real>
This sets the one parameter of the one-parameter Jukes-Cantor model. Diagonals are set such that the row sums are zero, so JC x creates an n×n matrix whose non-diagonal entries are x and whose diagonals are -(n-1)x, where the alphabet has n entries.
"K80" <real> <real>
This defines a K80 matrix with parameters a and b. K80 is a special case of the more general TN93 matrix: it is created as a TN93 matrix with the base frequencies set to (0.25, 0.25, 0.25, 0.25) and arguments (a, a, b) (see TN93 for more details).
"F81" <real> <doubleArray>
This defines an F81 matrix with rate parameter mu and the base frequencies given. Again, F81 is a special case of the more general TN93 matrix, in this case created with arguments (mu, mu, mu) and the base frequencies defined. See TN93 for more details.
( "HKY85" | "HKY" ) <real> <real> <doubleArray>
This defines an HKY85 matrix with parameters a and b in order and base frequencies given by the doubleArray. As HKY85 is a special case of TN93, the matrix is created with the base frequencies given and the other arguments (a, a, b). See TN93 for more details.
"F84" <real> <real> <doubleArray>
This defines an F84 matrix with the parameters k and b in order and the base frequencies given. As F84 is a special case of TN93, the matrix is constructed with arguments derived from k and b and the base frequencies.
If the base frequencies are pi(A), pi(C), pi(G), pi(T), let pi(R) be pi(A) + pi(G) and pi(Y) be pi(C) + pi(T). The matrix is constructed as a TN93 matrix with arguments ((1 + k/pi(Y)) × b, (1 + k/pi(R)) × b, b) and base frequencies as given.
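The F84 arithmetic above can be sketched in a few lines. This is only an illustration in Python, not Filo's own code: the function name f84_to_tn93_args is hypothetical, and the sketch assumes the base frequencies are supplied in the order A, C, G, T.

```python
def f84_to_tn93_args(k, b, pi):
    """Return the TN93 arguments (a1, a2, b) implied by F84 parameters k and b.

    pi holds the base frequencies in the order A, C, G, T (an assumption
    of this sketch).
    """
    pi_a, pi_c, pi_g, pi_t = pi
    pi_r = pi_a + pi_g  # purine frequency, pi(R)
    pi_y = pi_c + pi_t  # pyrimidine frequency, pi(Y)
    a1 = (1 + k / pi_y) * b
    a2 = (1 + k / pi_r) * b
    return a1, a2, b


# Parameters taken from the sample input: F84 1 0.3 [ 0.1 0.4 0.1 0.4 ]
a1, a2, b = f84_to_tn93_args(1, 0.3, [0.1, 0.4, 0.1, 0.4])
```

With k = 0 the three arguments all reduce to b, whatever the base frequencies are.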
( "TN93" | "TN" ) <real> <real> <real> <doubleArray>
This defines a Tamura-Nei '93 matrix with the parameters a1, a2, b in order and base frequencies as given. The resulting matrix looks like this:
          A          C          G          T
Q = A [   .          b p(C)     a2 p(G)    b p(T)  ]
    C [   b p(A)     .          b p(G)     a1 p(T) ]
    G [   a2 p(A)    b p(C)     .          b p(T)  ]
    T [   b p(A)     a1 p(C)    b p(G)     .       ]
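To make the layout of the TN93 matrix concrete, here is a minimal Python sketch that fills it in from a1, a2, b and the base frequencies, and then expresses K80, F81 and HKY85 as the special cases described earlier. The function names are hypothetical and the code is not taken from Filo. It assumes the state order A, C, G, T and takes "." to mean the negative of the sum of the other entries in the row.

```python
def tn93_matrix(a1, a2, b, pi):
    """Build a 4x4 TN93 rate matrix with rows and columns ordered A, C, G, T."""
    # Rate multipliers before weighting by the target base frequency:
    # a2 for A<->G, a1 for C<->T, and b for every other change.
    rates = [
        [0.0, b, a2, b],
        [b, 0.0, b, a1],
        [a2, b, 0.0, b],
        [b, a1, b, 0.0],
    ]
    q = [[rates[i][j] * pi[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        # Diagonal: the negative of the sum of the other entries in the row.
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


UNIFORM = [0.25, 0.25, 0.25, 0.25]


def k80_matrix(a, b):
    """K80 as TN93 with uniform base frequencies and arguments (a, a, b)."""
    return tn93_matrix(a, a, b, UNIFORM)


def f81_matrix(mu, pi):
    """F81 as TN93 with arguments (mu, mu, mu)."""
    return tn93_matrix(mu, mu, mu, pi)


def hky85_matrix(a, b, pi):
    """HKY85 as TN93 with arguments (a, a, b)."""
    return tn93_matrix(a, a, b, pi)
```

Building the special cases through one function keeps the diagonal rule in a single place, so every matrix automatically has rows that sum to zero.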
( "GTR" | "REV" ) <real> <real> <real> <real> <real> <real>
This defines a General Time-Reversible model (also known as REV, but we believe that is not quite correct). The resulting matrix looks like this:
          A            C            G            T
Q = A [   .            a * pi(C)    b * pi(G)    c * pi(T) ]
    C [   a * pi(A)    .            d * pi(G)    e * pi(T) ]
    G [   b * pi(A)    d * pi(C)    .            f * pi(T) ]
    T [   c * pi(A)    e * pi(C)    f * pi(G)    .         ]
where "." means "the negative of the sum of the other elements in
this row".
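The structure of the GTR matrix, and the reason it is called time-reversible, can be checked with a short Python sketch. This is an illustration only: the function names are hypothetical, the base frequencies are passed in explicitly as an extra argument, and the state order is assumed to be A, C, G, T.

```python
def gtr_matrix(a, b, c, d, e, f, pi):
    """Build a 4x4 GTR rate matrix with rows and columns ordered A, C, G, T."""
    # Symmetric exchangeability parameters, one per unordered pair of states.
    exch = [
        [0.0, a, b, c],
        [a, 0.0, d, e],
        [b, d, 0.0, f],
        [c, e, f, 0.0],
    ]
    q = [[exch[i][j] * pi[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        # "." in the layout: the negative of the sum of the other entries.
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


def is_time_reversible(q, pi, tol=1e-12):
    """Check detailed balance: pi(i) * Q[i][j] equals pi(j) * Q[j][i]."""
    return all(
        abs(pi[i] * q[i][j] - pi[j] * q[j][i]) <= tol
        for i in range(4)
        for j in range(4)
    )


pi = [0.1, 0.2, 0.3, 0.4]
q = gtr_matrix(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, pi)
reversible = is_time_reversible(q, pi)
```

Because each off-diagonal entry is a symmetric parameter multiplied by the frequency of the target state, the detailed-balance check holds for any choice of the six parameters.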
( "molclock" | "molecularclock" ) [ "=" ] <bool>
where <bool> is "true", "1" or "yes" for true and "false", "0" or "no" for false. This sets the root-to-tip distance in simulated trees to be the same for all tips (if true) or allows it to vary (if false). The default is that the molecular clock is assumed.
( <string> | "(" <node> { "," <node> } ")" )
[ "[" <matrixLabel> "]" ]
[ ":" ( <brlen> | <brparamLabel> ) ]
where <matrixLabel> is the label of a matrix defined
elsewhere, <brlen> is a non-negative real number
corresponding to the length of the branch immediately ancestral to the
node just defined, and <brparamLabel> is the label of a
set of branch parameters defined elsewhere. See examples later
for more details on how to define trees with different matrices and
lengths on each branch.
Note that if no matrix is defined for a given branch, the matrix that was defined for its ancestral branch will be used, if there is one. If there is no such matrix then the global default matrix will be used. If no matrix is defined at all then this global default is the Jukes-Cantor model. Any time a matrix is defined it becomes the current default, so if you are only using one matrix you need not identify it explicitly in the tree definition.
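The fallback rule for matrices can be sketched as a small recursive function. This is a hypothetical illustration in Python, not Filo's implementation: the nested-dictionary tree representation, the function name assign_matrices and the global_default argument (standing for whichever matrix is the current default, or "JC" if none has been defined) are all assumptions of the sketch.

```python
def assign_matrices(node, inherited=None, global_default="JC"):
    """Return {node name: matrix label} for a tree of nested dicts.

    Each node is a dict with a "name", an optional "matrix" label and an
    optional list of "children" (a representation assumed for this sketch).
    """
    explicit = node.get("matrix") or inherited
    # Use the branch's own matrix, else the nearest ancestral one,
    # else the global default.
    result = {node["name"]: explicit or global_default}
    for child in node.get("children", []):
        # Pass down only a matrix that was explicitly named on this
        # branch or above it, never the global fallback itself.
        result.update(assign_matrices(child, explicit, global_default))
    return result


tree = {
    "name": "root",
    "children": [
        {"name": "tA", "matrix": "K80"},
        {
            "name": "inner",
            "matrix": "F81",
            "children": [
                {"name": "tB"},
                {"name": "tC", "matrix": "HKY"},
            ],
        },
    ],
}
labels = assign_matrices(tree)
```

In this example tB has no matrix of its own, so it takes F81 from the branch above it, while the root falls back to the global default.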
"nosequences" [ "=" ] <bool>where <bool> is defined as above. This enables users to turn off sequence generation. By default, nosequences is false.
( "ntrials" | "nreps" ) <int>
where <int> is a positive integer. This sets the number of trials to do with the current settings.
( "ntax" | "numtaxa" ) <int>
where <int> is a positive integer as above; this sets the number of tips to grow to when growing trees. Note that there is a philosophical difficulty in the current implementation, which effectively assumes that the tree growth process is birth-only in that the last bifurcation event is the first one leading to the correct number of taxa. This rules out the possibility that under a birth-death process the number of tips may be exceeded and then return to the desired number. The correct implementation is on my to-do list.
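The birth-only stopping rule described above can be illustrated with a minimal Python sketch of a Yule process. It is not Filo's code: the function name yule_split_times is hypothetical, and the sketch assumes growth starts from a single lineage and records only the times of the bifurcations, not the tree itself.

```python
import random


def yule_split_times(ntax, birth_rate=1.0, seed=None):
    """Simulate the times of successive bifurcations in a pure-birth tree.

    Growth stops at the first bifurcation that brings the number of
    lineages up to ntax.
    """
    if ntax < 1 or birth_rate <= 0:
        raise ValueError("ntax must be >= 1 and birth_rate must be positive")
    rng = random.Random(seed)
    time = 0.0
    times = []
    lineages = 1
    while lineages < ntax:
        # With k lineages each splitting at rate birth_rate, the waiting
        # time to the next split is exponential with rate k * birth_rate.
        time += rng.expovariate(lineages * birth_rate)
        times.append(time)
        lineages += 1
    return times


times = yule_split_times(8, birth_rate=1.0, seed=42)
```

Growing from one lineage to n tips always takes exactly n - 1 bifurcations in this scheme, which is what makes the "first time the count is reached" rule unambiguous for a pure birth process.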
( "dp" | "precision" ) <int>
Sets the number of decimal places to use in output.
"seed" <int>
where <int> is any integer; this sets the random number seed explicitly.
( "l" | "seqlength" | "sequencelength" ) <int>
where <int> is a positive integer; this sets the length of the initial sequence to use at the root of the simulated trees.
( "tree" | "-t" ) <string> [ "[" <int> "]" ] [ "=" ] ( <node> | <treeParams> ) [ ";" ]
This defines a tree with label <string> or a set of trees with labels <string>0, <string>1, etc. To define a set of trees use the optional "[" <int> "]" to give the number of trees that share the definition that follows. See node for more details.
( "height" | "root_to_tip" ) [ "=" ] <real>
This sets the total root-to-tip distance for simulated trees. Once the tree is complete the maximum distance from root to tips is set to the required value by normalising the branch lengths in the tree.
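The normalisation step for height can be sketched as follows. This is a hypothetical Python illustration rather than Filo's implementation: a tree is represented here as a (branch_length, children) tuple, and the function names are invented for the example.

```python
def max_root_to_tip(node):
    """Return the largest root-to-tip distance in a (length, children) tree."""
    length, children = node
    return length + max((max_root_to_tip(c) for c in children), default=0.0)


def rescale(node, factor):
    """Return a copy of the tree with every branch length multiplied by factor."""
    length, children = node
    return (length * factor, [rescale(c, factor) for c in children])


def set_height(root, height):
    """Scale all branch lengths so the maximum root-to-tip distance is height."""
    current = max_root_to_tip(root)
    if current <= 0:
        raise ValueError("tree has no positive branch lengths to rescale")
    return rescale(root, height / current)


# Same shape and branch lengths as tree T1 in the sample input.
t1 = (0.0, [
    (0.02, [(0.2, []), (0.25, [])]),
    (0.01, [(0.2, []), (0.5, [])]),
])
scaled = set_height(t1, 0.32)
```

Because every branch is multiplied by the same factor, the relative branch lengths are preserved and only the overall scale of the tree changes.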
#NEXUS
% Input for many experiments:
% n = 2, 4, 8, 16, 32, 64
% c = 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
% root_to_tip = 0.1, 0.125, 0.16, 0.2, 0.25, 0.32, 0.4, 0.5, 0.64, 0.8, 1, ..., 10
% trees grown in pure birth process under Yule model
% tree T1 = ((tA[F84]:0.2, tB[F84]:0.25)[F84]:0.02, (tC[HKY]:0.2,tD[HKY]:0.5)[F81]:0.01);
% tree T2 = (tA[K80]:0.2, (tB[K80]:0.25, (tC[HKY]:0.2, tD[HKY]:0.5)[F81]:0.01)[F81]:0.25);
% tree T3 = (tA[JC]:0.2, (tB:0.25, (tC:0.2, tD:0.5):0.01):0.25);
begin filo;
output
format = fasta, nexus, raw
filename = sampleoutput
precision 5
;
matrix JC = JC 0.1;
matrix K80 = K80 0.1 0.2;
matrix F81 = F81 0.3 [ 0.2 0.3 0.15 0.35 ];
matrix F84 = F84 1 0.3 [ 0.1 0.4 0.1 0.4 ];
matrix HKY = HKY85 0.2 0.5 [ 0.25 0.05 0.25 0.45 ];
tree T1 = ((tA[F84]:0.2, tB[F84]:0.25)[F84]:0.02, (tC[HKY]:0.2,tD[HKY]:0.5)[F81]:0.01);
tree T2 = (tA[K80]:0.2, (tB[K80]:0.25, (tC[HKY]:0.2, tD[HKY]:0.5)[F81]:0.01)[F81]:0.25);
tree T3 = (tA[JC]:0.2, (tB:0.25, (tC:0.2, tD:0.5):0.01):0.25);
tree T4 = (((tA[JC]:0.175, tB:0.175):0.025,(tC:0.15,tD:0.15):0.05):0.05,((tE[JC]:0.175, tF:0.175):0.025,(tG:0.15,tH:0.15):0.05):0.05);
tree T5 = (((tA[HKY]:0.75, tB[HKY]:0.75)[HKY]:0.25,(tC[HKY]:0.5,tD[HKY]:0.5)[HKY]:0.5)[HKY]:0.5,((tE[HKY]:0.75, tF[HKY]:0.75)[HKY]:0.25,(tG[HKY]:0.5,tH[HKY]:0.5)[HKY]:0.5)[HKY]:0.5);
params
l 200
indel 0.05
ntrials = 10
;
treeparams T1
l 50
;
treeparams T4
pi [ 0.25 0.25 0.25 0.25 ]
l 100
indelrate 0.25
;
treeparams T5
pi [ 0.45 0.05 0.05 0.45 ]
l 50
;
run;
end;
begin allconsistent;
% All these give exactly the same matrix:
matrix JC = JC 0.3;
matrix K80 = K80 0.3 0.3;
matrix F81 = F81 0.3 [ 0.25 0.25 0.25 0.25 ];
matrix F84 = F84 0 0.3 [ 0.25 0.25 0.25 0.25 ];
matrix HKY = HKY85 0.3 0.3 [ 0.25 0.25 0.25 0.25 ];
end;
begin generalexperiments;
matrix M = HKY 4 [ 0.25 0.25 0.25 0.25 ];
tree T = (((((((taxonA:0.01, taxonB:0.01):0.01, taxonC:0.03):0.01, taxonD:0.04):0.01,
taxonE:0.05):0.01, taxonF:0.06):0.01, taxonG:0.07):0.01, taxonH:0.08);
params
ntrials 10
height 0.32
l 1000
indel 0.0
insertionRate 0.001
;
run;
params deletionRate 0.001;
run;
params deletionRate 0.002;
run;
end;
The above file, when run with
java -jar Filo.jar -f sampleinput.txt
has the following output to console:
This is Filo, version v1.1, released 2009.02
Written by Michael A. Charleston, with contributions by Katharine Holt
Warranty: No warranty of any kind. The author(s) take no responsibility for any damage incurred to anything at all through the use of this software, and do not make any claims as to its usefulness or applicability to any particular problem.
Distribution: Available from http://www.it.usyd.edu.au/~mcharles
Funding: Supported by the Australian Research Council
Starting time: Wed Mar 04 16:40:52 EST 2009
Opening file sampleinput.txt for parsing.
Generating data..........done.
Output files are "sampleoutput0_1.*"to "sampleoutput0_9.*"
File "sampleinput.txt" has all been processed. No drama.
Command-line arguments have all been processed. No drama.
Stopping time: Wed Mar 04 16:41:02 EST 2009
The output files are collected as raw, FASTA and Nexus format .zip files.
Invocation:
java -jar Filo.jar <args>
<args> ::= { <cl_infile> | <cl_alphabet> | <cl_seqlength> |
<cl_matrix> | <cl_nreps> | <cl_output> |
<cl_params> | <cl_basefreq> | <cl_run> |
<cl_tree> | <cl_treeparams> | <cl_verbosity> |
<cl_version> | <cl_help> }
<cl_infile> ::= '-f' <ident>
<cl_alphabet> ::= '-a' <alphabet>
<cl_seqlength> ::= '-l' <int>
<cl_matrix> ::= '-m' <matrix>
<cl_nreps> ::= '-n' <int>
<cl_output> ::= '-o' <output>
<cl_params> ::= '-p' <params>
<cl_basefreq> ::= '-pi' <basefreq>
<cl_run> ::= '-r'
<cl_tree> ::= '-t' <tree>
<cl_treeparams> ::= '-tp' <treeparams>
<cl_verbosity> ::= '-v' <bool>
<cl_version> ::= '--v'
<cl_help> ::= '-?' | '-h'
<alphabet> ::= <dna> | <rna> | <aa> | <binary>
<dna> ::= "DNA" | "nucleotide"
<rna> ::= "RNA"
<aa> ::= "AA" | "aminoacids"
<binary> ::= "bin" | "binary"
<matrix> ::= <matrixid> <eq> ( <jc> | <k80> | <f81> | <hky>
| <f84> | <tn93> | <gtr> )
<eq> ::= [ "=" ]
<jc> ::= "JC" <real>
<k80> ::= "K80" <real> <real>
<f81> ::= "f81" <real> <realarray>
<hky> ::= ( "HKY" | "HKY85" ) <real> <real> <realarray>
<f84> ::= "F84" <real> <real> <realarray>
<tn93> ::= ( "TN" | "TN93" ) <real> <real> <real> <realarray>
<gtr> ::= "GTR" <real> <real> <real> <real> <real> <real> <realarray>
<realarray> ::= "[" { <real> } "]"
<real> ::= <int> | <float>
<output> ::= { <outfile> | <outformat> | <outprecision> }
<outfile> ::= "filename" <ident>
<outformat> ::= "format" | { "fasta" | "phylip" | "raw" |
"retroml" | "trees" | "treeview" }
<outprecision> ::= ( "dp" | "precision" ) <int>
<params> ::= { <delrate> | <insertrate> | <indelrate> | <seqlength> |
<treeheight> | <molclock> | <nosequences> |
<nreps> | <basefreq> | <dellength> | <insertlength> }
<delrate> ::= ( "deletionrate" | "delrate" ) <eq> <real>
<insertrate> ::= ( "insertionrate" | "inrate" ) <eq> <real>
<indelrate> ::= ( "indel" | "indelrate" ) <eq> <real>
<seqlength> ::= ( "l" | "seqlength" | "sequencelength" ) <eq> <int>
<treeheight> ::= ( "height" | "root_to_tip" ) <eq> <real>
<molclock> ::= ( "molclock" | "molecularclock" ) <eq> <bool>
<bool> ::= ( "1" | "t" | "true" | "y" | "yes" ) | <str>
<nosequences> ::= "nosequences" <eq> <bool>
<nreps> ::= ( "nreps" | "ntrials" ) <eq> <int>
<basefreq> ::= ( "p" | "pi" | "basefreq" ) <eq> <realarray>
<dellength> ::= ( "dellength" | "deletionlength" ) <eq> <int>
<insertlength> ::= ( "insertlength" | "insertionlength" ) <eq> <int>
<treedef> ::= <treeset> | <tree>
<treeset> ::= <treeid> "[" <int> "]" <eq> <node>
<tree> ::= <treeid> <eq> <node>
<node> ::= ( <internalnode> | <leaf> )
[ "[" <matrixid> "]" ] [ ":" ( <real> | <brparamsid> ) ]
<internalnode> ::= "(" <node> "," <node> { "," <node> } ")" [ <matrixtag> ]
[ <brdescriptor> ]
<matrixtag> ::= "[" <matrixid> "]"
<brdescriptor> ::= ":" ( <real> | <brparamsid> )
<leaf> ::= <taxonid> [ <matrixtag> ] [ <brlength> ]
<treeparams> ::= <treeparamsid> <eq> { <birthrate> | <branchparamsid> |
<seqlength> | <delrate> | <insertrate> |
<dellength> | <insertlength> | <growth> |
<matrix> | <molclock> | <ntax> | <basefreq> |
<treeparamsid> }
<birthrate> ::= ( "br" | "birthrate" ) <real>
(NB: no reason to put in death rate as it's set to unity)
<growth> ::= ( "grow" | "growth" ) <eq> ( "atel" | "yule" )
<brparams> ::= ( "branchparams" | "branchparameters" ) <brparamsid> <eq>
{ <delrate> | <insertrate> | <indelrate> | <brlength> |
<matrix> }
<brlength> ::= ( "l" | "len" | "length" ) <eq> <real>
<ntax> ::= "ntax" <eq> <int>