Filo Logo
(This image is copyright M Charleston, 2008)
Top Definitions Samples Notes Release Disclaimer

Download Filo

Filo is a command-line Java program written by Michael Charleston and Katherine Holt at the School of IT, University of Sydney, distributed free of charge and with no warranty whatsoever. It was partly funded by a Discovery Project awarded by the Australian Research Council (DP0770991) and we are very grateful to the ARC for their support. It is designed to be used to simulate (molecular) sequence data used in phylogenetic analyses under very general conditions, including:

Filo can be run either entirely from command-line arguments or from one or more input files. The command line is more restrictive, because a long list of arguments quickly becomes awkward to write, whereas an input file holds the same settings in a more easily read form. The input file uses a simple syntax that is consistent with the NEXUS format, and uses a "Filo" block.

Filo can be used to run many simulations in batches. The input file can be thought of as a script of commands that are run in order. The idea is that you can define models very generally on any part of a tree: branch lengths of course, but also a different matrix or different indel rates on each branch, and you can combine several trees, each with its own set of parameters. The parameters are grouped into general, tree and branch parameters.

general parameters
are those which apply across all the trees (overall rates, base frequencies, root-to-tip distances etc.), and which also govern simulation settings (e.g., number of trials, output format);
tree parameters
are those parameters which are to be assumed for a given tree, identified by a label you choose. In general it makes sense to use the same label for a set of tree parameters as you chose for the tree itself. These parameters are such things as initial sequence length (which can change through the tree if insertions or deletions are permitted), birth/death rate of lineages in the tree if the tree itself is to be grown through time, insertion/deletion parameters of rate and mean length, number of tips to which to grow, whether the molecular clock should be used, the initial base frequency, etc. The individual parameters are detailed later.
branch parameters
are a selection of the tree parameters above, applied to a single branch (identified by a label you choose) rather than to a whole tree. These are indel parameters, branch length, and the transition matrix used (see later).
Here is a description of how the different parts of a Filo "block" are dealt with by Filo:
alphabet
followed by a string of characters, defines the set of character states available. By default this is the set of nucleotides ACGT.
branchparameters
followed by definitions for any of deletionRate, insertionRate, indelRate, sequenceLength, or a matrix definition. See below for details on these.
matrix
followed by a matrix definition. This is very handy for being able to pick & choose among matrices later by assigning them to branches.
output
followed by any of filename, format or precision definitions (see below).
parameters
followed by any of the in/del parameters: {deletionRate, insertionRate, indelRate, sequenceLength, deletionLength, insertionLength, indelLength}, treeHeight, molclock, nosequences, ntrials, baseFreq, seed, alphabet. See below for details.
run
is a command to run the currently defined simulation and output results.
tree
followed by a tree definition. See below.
treeParameters
followed by any of birthRate, "br" or "branch" followed by branchParameters, sequenceLength, in/del parameters as above, treeHeight, matrix, numTaxa, baseFreq, or a previously defined set of tree parameters, as identified by a label you chose.


Definitions

First some notes on the parsing of input files or command-line arguments.
  1. The parser is case insensitive: "tree" is the same as "Tree" and "tReE" and "TREE"
  2. The parser is intended to be flexible so many items have alternatives for your own ease of use or legibility (a common conflict), and some punctuation is optional. For instance "params" is often used as a shortening of "parameters", and in some cases this can be further shortened to "-br". The last of these is aimed at ease of use in command-line invocation of Filo.
  3. When setting values the equals "=" is optional.
  4. Throughout, the notation "[" blah "]" is used to imply that blah is optional.
  5. Throughout, "(" and ")" are used to group options together, and "|" is used to indicate "or", so "this | that" means "this" or "that" are acceptable.
  6. The notation "{" red "," fish "," blue "}" is used to indicate that a sequence of zero or more of the enclosed terms is expected, in no particular order.
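Rules 1 to 3 can be illustrated with a small sketch. This is not Filo's parser: the function name `normalise_tokens` and the alias table are hypothetical, and only a handful of the alternatives listed on this page are included.

```python
# Hypothetical alias table: a few of the alternative spellings listed on this page.
ALIASES = {
    "params": "parameters",
    "treeparams": "treeparameters",
    "branchparams": "branchparameters",
    "bp": "branchparameters",
    "nreps": "ntrials",
    "seqlength": "sequencelength",
    "l": "sequencelength",
}


def normalise_tokens(text):
    """Lower-case every token, expand known aliases and drop optional '=' signs."""
    # ASSUMPTION: everything is lower-cased for simplicity; whether Filo
    # preserves the case of user-chosen labels is not stated on this page.
    tokens = text.replace("=", " ").split()
    return [ALIASES.get(tok.lower(), tok.lower()) for tok in tokens]


print(normalise_tokens("Params L = 200 NTRIALS 10"))
```

Treating "=" as whitespace is the simplest way to make it optional; a real parser would need to take more care with punctuation inside tree definitions.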
alphabet
	( "alphabet" | "-a" ) {
		( '<' ( "DNA" | "nucleotide"
			| "RNA"
			| "aminoacids" | "aa"
			| "bin" | "binary" ) ) ,
		<string> }
	
If for some strange reason you want to use the characters "D", "N" and "A" as the alphabet itself, you must leave out the < above.
baseFreq
	( "p" | "pi" | "baseFreq" ) <doubleArray>
	
The above defines the base frequencies used at the base of a tree. By default these are uniform, so if you are using 4 states (as with nucleotides) the base frequencies will be 0.25, 0.25, 0.25, 0.25.
birthRate
	( "birthrate" | "birth" ) <real>
	
This defines the birth rate per unit time of each lineage in the tree bifurcating into two lineages, as used in the Markov / Yule model of tree growth. By default this is 1. See also deathRate.
branchParameters
	( "branchparams" | "branchparameters" | "bp" )
		<brparamsLabel> ,
		[ "=" ] {
			<deletionRate> ,
			<insertionRate> ,
			<indelRate> ,
			<sequenceLength> ,
			<matrix> }
		";"
	
This defines a set of parameters identified by the label <brparamsLabel>. Details for these parameters are given elsewhere.
deathRate
	( "deathrate" | "death" ) <real>
	
This defines the death rate per unit time of each lineage in the tree, that is, the rate at which a lineage goes extinct, as used in the birth/death model of tree growth. By default this is 0, corresponding to a pure birth process. See also birthRate. Note that both birth rate and death rate must be non-negative. Incidentally it is a standard result that the shape distribution of the tree grown under a birth/death process is only affected by the ratio of birth rate to death rate, not by their individual values.
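With the default death rate of 0 the growth process is pure birth. The sketch below shows one way such a process can be simulated, assuming each lineage bifurcates independently at the birth rate, so that with k lineages the waiting time to the next bifurcation is exponential with rate k × birthRate. It is an illustration only, not Filo's code, and the function name `yule_split_times` is hypothetical.

```python
import random


def yule_split_times(num_taxa, birth_rate=1.0, seed=None):
    """Return the times of the bifurcations that grow 1 lineage into num_taxa."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    lineages = 1
    while lineages < num_taxa:
        # k independent lineages, each splitting at birth_rate, give a
        # combined rate of k * birth_rate for the next bifurcation.
        t += rng.expovariate(lineages * birth_rate)
        times.append(t)
        lineages += 1
    return times


times = yule_split_times(8, birth_rate=1.0, seed=1)
print(len(times), "bifurcations, the last at time", round(times[-1], 3))
```

Growing to n tips always takes n - 1 bifurcations; only their timing is random.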
deletionLength
	( "deletionLength" | "delLength" ) [ "=" ] <real>
	
This sets a parameter related to the mean deletion length. The actual length used in the simulations is drawn from a log-normal distribution, but with a minimum of 1, so this parameter is definitely not the true mean length. However this parameter is representative of the spread of deletion lengths.
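How the parameter maps onto the log-normal distribution is not documented here, so the sketch below makes two loud assumptions: the parameter is treated as the median of the distribution (mu = ln(parameter)), and sigma is fixed at 1. It only illustrates why a draw that is floored at 1 has a mean different from the parameter; `draw_indel_length` is a hypothetical name.

```python
import math
import random


def draw_indel_length(length_param, sigma=1.0, rng=random):
    """Draw one indel length: a rounded log-normal draw with a minimum of 1."""
    # ASSUMPTION: length_param is used as the median, so mu = ln(length_param).
    mu = math.log(length_param)
    return max(1, round(rng.lognormvariate(mu, sigma)))


rng = random.Random(42)
lengths = [draw_indel_length(3.0, rng=rng) for _ in range(10000)]
print("min:", min(lengths), "mean:", sum(lengths) / len(lengths))
```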
deletionRate
	( "deletionRate" | "delRate" ) [ "=" ]  <real>
	
This sets the rate per unit time of having a deletion event at each site in the sequence.
doubleArray
	"[" <real> <real> ... <real> "]"
	
The above reads in a set of real numbers separated by whitespace and surrounded by square brackets. It is often used to define base frequencies (see baseFreq) or matrices (see matrix).
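As an illustration of the format, here is a minimal reader for such an array. It is a sketch rather than Filo's parser, and `parse_double_array` is a hypothetical name.

```python
def parse_double_array(text):
    """Parse a string such as '[ 0.2 0.3 0.15 0.35 ]' into a list of floats."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("a doubleArray must be surrounded by square brackets")
    # Values inside the brackets are separated by whitespace.
    return [float(tok) for tok in text[1:-1].split()]


freqs = parse_double_array("[ 0.2 0.3 0.15 0.35 ]")
print(freqs)
```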
filename
	"filename" [ "=" ] <string>
	
The above sets the root of the name of the output files. For example if you call filename myfile here then all the output files will begin with the string "myfile" and be followed with indices showing the experiment and trial number, followed by a suffix to indicate the file type (see format, below).
filetype
	(
		"fasta" 
		| "nexus" 
		| "phylip" 
		| "raw" | "nogap" 
		| "trees" | "treeview"
	)
	
These are the current output formats supported by Filo. FASTA, NEXUS and Phylip are very well known. The output for these files includes gaps. "raw" or "nogap" files are FASTA format but with gaps removed, for instance for use in alignment testing. "trees" or "treeview" format is for use in the TreeView programs, also readable by FigTree and many other programs. Such files do not contain sequence data.
format
	"format" <filetype> { "," <filetype> }
	
This defines the format(s) of the output to be used. See filetype.
indelLength
	"indelLength" [ "=" ] <real>
	
This sets the length parameter for both insertions and deletions. See deletionLength for more details.
indelRate
	( "indel" | "indelRate" ) [ "=" ] <real>
	
This sets the rate of both insertions and deletions. See deletionRate for more details.
insertionLength
	( "insertionLength" | "inLength" )  [ "=" ] <real>
	
This sets the length parameter for insertions. See deletionLength for more details.
insertionRate
	( "insertionRate" | "inRate" )  [ "=" ] <real>
	
This sets the rate parameter for insertions. See deletionRate for more details.
matrix
	( "matrix" | "-m" ) <string> [ "=" ]
	( <doubleArray>
		| <JC>
		| <K80>
		| <F81>
		| <HKY>
		| <F84>
		| <TN93>
		| <GTR>
	)
	
This defines a matrix in numerous ways. The first of these simply sets all the parameters in the matrix, row by row. Some checking is done but the responsibility is on the user to ensure these make sense. The next matrices deserve a subsection of their own. Oh look, here's one:
<JC>
		"JC" <real>
		
This sets the one parameter of the one-parameter Jukes-Cantor model. Diagonals are set such that the row sums are zero, so JC x creates an n×n matrix whose off-diagonal entries are x and whose diagonal entries are -(n-1)x, where the alphabet has n entries.
<K80>
		"K80"  <real> <real>
		
This defines a K80 matrix with parameters a and b. K80 is a special case of the more general TN93 matrix, in particular is defined by creating a TN93 matrix with the base frequencies set to (0.25, 0.25, 0.25, 0.25) and arguments (a, a, b) (see TN93 for more details).
<F81>
		"F81" <real> <doubleArray>
		
This defines a F81 matrix with rate parameter mu and base frequencies given. Again, F81 is a special case of the more general TN93 matrix, in this case created with arguments (mu, mu, mu) and the base frequencies defined. See TN93 for more details.
<HKY>
		( "HKY85" | "HKY" ) <real> <real> <doubleArray>
		
This defines an HKY85 matrix with parameters a and b in order and base frequencies given by the doubleArray. As HKY85 is a special case of TN93, the matrix is created with the base frequencies given and the other arguments (a, a, b). See TN93 for more details.
<F84>
		"F84" <real> <real> <doubleArray>
		
This defines an F84 matrix with the parameters k and b in order and the base frequencies given. As F84 is a special case of TN93, the matrix is constructed with arguments derived from k and b and the base frequencies.

If the base frequencies are pi(A), pi(C), pi(G), pi(T), let pi(R) be pi(A) + pi(G) and pi(Y) be pi(C) + pi(T). The matrix is constructed as a TN93 matrix with arguments ((1 + k/pi(Y)) × b, (1 + k/pi(R)) × b, b) and base frequencies as given.
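The conversion can be written out in a few lines, reading the formula as a1 = (1 + k/pi(Y)) × b and a2 = (1 + k/pi(R)) × b. The function name `f84_to_tn93` is hypothetical and this is a sketch of the arithmetic only, not Filo's code.

```python
def f84_to_tn93(k, b, pi):
    """Return the TN93 arguments (a1, a2, b) for F84; pi is ordered A, C, G, T."""
    pi_a, pi_c, pi_g, pi_t = pi
    pi_r = pi_a + pi_g  # purines
    pi_y = pi_c + pi_t  # pyrimidines
    a1 = (1 + k / pi_y) * b
    a2 = (1 + k / pi_r) * b
    return a1, a2, b


# The F84 line from the sample input file: F84 1 0.3 [ 0.1 0.4 0.1 0.4 ]
print(f84_to_tn93(1, 0.3, [0.1, 0.4, 0.1, 0.4]))
```

With k = 0 both a1 and a2 reduce to b, so the model collapses to F81 with rate b.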

<TN93>
		( "TN93" | "TN" )  <real> <real> <real> <doubleArray>
		
This defines a Tamura-Nei '93 matrix with the parameters a1, a2, b in order and base frequencies as given. The resulting matrix looks like this:
           A        C        G        T          
Q = A [    .      b p(C)  a2 p(G)   b p(T)  ]
    C [  b p(A)     .      b p(G)  a1 p(T)  ]
    G [ a2 p(A)   b p(C)     .      b p(T)  ]
    T [  b p(A)  a1 p(C)   b p(G)     .     ]
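The layout above, and the way K80, F81 and HKY85 are described as special cases of it, can be sketched as follows. The sketch assumes that "." stands for the negative of the sum of the other elements in the row, so that rows sum to zero. All function names are hypothetical and this is not Filo's implementation.

```python
def tn93_matrix(a1, a2, b, pi):
    """Build the 4x4 TN93 rate matrix; states and pi are ordered A, C, G, T."""
    # rate[i][j] is the multiplier applied to pi[j] in row i.
    rate = [
        [0, b, a2, b],   # A: a2 on the A<->G transition
        [b, 0, b, a1],   # C: a1 on the C<->T transition
        [a2, b, 0, b],   # G
        [b, a1, b, 0],   # T
    ]
    q = [[rate[i][j] * pi[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        # ASSUMPTION: the diagonal is minus the sum of the rest of the row.
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


def k80_matrix(a, b):
    return tn93_matrix(a, a, b, [0.25, 0.25, 0.25, 0.25])


def hky_matrix(a, b, pi):
    return tn93_matrix(a, a, b, pi)


def f81_matrix(mu, pi):
    return tn93_matrix(mu, mu, mu, pi)


for row in k80_matrix(0.1, 0.2):
    print(row)
```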
		
<GTR>
		( "GTR" | "REV" ) <real> <real> <real> <real> <real> <real>
		
This defines a General Time-Reversible model (also known as REV, but we believe that is not quite correct). The resulting matrix looks like this:
             A            C            G           T       
Q = A [      .        a * pi(C)    b * pi(G)    c * pi(T)  ]
    C [  a * pi(A)        .        d * pi(G)    e * pi(T)  ]
    G [  b * pi(A)    d * pi(C)        .        f * pi(T)  ]
    T [  c * pi(A)    e * pi(C)    f * pi(G)       .       ]
		
where "." means "the negative of the sum of the other elements in this row".
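A short sketch of this construction, together with a check of the property that gives the model its name (pi(i) × Q[i][j] = pi(j) × Q[j][i] for every pair of states). The base frequencies are passed in explicitly here; how Filo obtains them for a GTR matrix is not stated above, so that part is an assumption. The function names are hypothetical.

```python
def gtr_matrix(a, b, c, d, e, f, pi):
    """Build the 4x4 GTR rate matrix; states and pi are ordered A, C, G, T."""
    # Symmetric table of exchange parameters, laid out as in the matrix above.
    s = [
        [0, a, b, c],
        [a, 0, d, e],
        [b, d, 0, f],
        [c, e, f, 0],
    ]
    q = [[s[i][j] * pi[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        # "." is the negative of the sum of the other elements in the row.
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


def is_time_reversible(q, pi, tol=1e-12):
    """Check detailed balance: pi[i] * q[i][j] equals pi[j] * q[j][i]."""
    return all(
        abs(pi[i] * q[i][j] - pi[j] * q[j][i]) <= tol
        for i in range(4)
        for j in range(4)
    )


pi = [0.1, 0.2, 0.3, 0.4]
q = gtr_matrix(1, 2, 3, 4, 5, 6, pi)
print(is_time_reversible(q, pi))
```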
molclock
	( "molclock" | "molecularclock" ) [ "=" ] <bool>
	
where <bool> is "true", "1" or "yes" for true and "false", "0" or "no" for false. This sets the root-to-tip distance in simulated trees to be the same for all tips (if true) or allows it to vary (if false). The default is that the molecular clock is assumed.
node
	( <string> | "(" <node> { "," <node> } ")" )
		[ "[" <matrixLabel> "]" ] 
		[ ":" ( <brlen> | <brparamLabel> ) ]
	
where <matrixLabel> is the label of a matrix defined elsewhere, <brlen> is a non-negative real number corresponding to the length of the branch immediately ancestral to the node just defined, and <brparamLabel> is the label of a set of branch parameters defined elsewhere. See examples later for more details on how to define trees with different matrices and lengths on each branch.

Note that if no matrix is defined for a given branch, the matrix that was defined for its ancestral branch will be used, if there is one. If there is no such matrix then the global default matrix will be used. If no matrix is defined at all then this global default is the Jukes-Cantor model. Any time a matrix is defined it becomes the current default, so if you are only using one matrix you need not identify it explicitly in the tree definition.
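The lookup rule in this note can be sketched as a walk from a branch towards the root. The data layout (a `parent_of` map and a `matrix_of` map keyed by node name) and the function name `resolve_matrix` are hypothetical choices made for this illustration, not a description of Filo's internals.

```python
def resolve_matrix(branch, parent_of, matrix_of, global_default="JC"):
    """Return the label of the matrix used on the branch above `branch`."""
    node = branch
    while node is not None:
        if node in matrix_of:
            # The nearest tagged branch on the path to the root wins.
            return matrix_of[node]
        node = parent_of.get(node)
    # No tagged ancestor: fall back to the global default matrix.
    return global_default


# Hypothetical tree ((x, y)[M1], z): only the branch above (x, y) is tagged.
parent_of = {"x": "xy", "y": "xy", "xy": "root", "z": "root"}
matrix_of = {"xy": "M1"}

for name in ("x", "xy", "z"):
    print(name, "->", resolve_matrix(name, parent_of, matrix_of))
```

Here `global_default` stands in for whichever matrix is the current default at the time the tree is defined.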

nosequences
	"nosequences" [ "=" ] <bool>
	
where <bool> is defined as above. This enables users to turn off sequence generation. By default, nosequences is false.
ntrials
	( "ntrials" | "nreps" ) <int>
	
where <int> is a positive integer. This sets the number of trials to do with the current settings.
numTaxa
	( "ntax" | "numtaxa" ) <int>
	
where <int> is a positive integer as above; this sets the number of tips to grow to when growing trees. Note that there is a philosophical difficulty in the current implementation, which effectively assumes that the tree growth process is birth-only in that the last bifurcation event is the first one leading to the correct number of taxa. This rules out the possibility that under a birth-death process the number of tips may be exceeded and then return to the desired number. The correct implementation is on my to-do list.
precision
	( "dp" | "precision" ) <int>
	
Sets the number of decimal places to use in output.
<real>
A real number, which may be written as a floating-point number or as an integer.
seed
	"seed" <int>
	
where <int> is any integer; this sets the random number seed explicitly.
sequenceLength
	( "l" | "seqlength" | "sequencelength" ) <int>
	
where <int> is a positive integer; this sets the initial sequence length to use at the root of the simulated trees.
tree
	( "tree" | "-t" ) <string> [ "[" <int> "]" ] [ "=" ] 
		( <node> | <treeParams> )
		[ ";" ]
	
This defines a tree with label <string> or set of trees with labels <string>0, <string>1, etc. To define a set of trees use the optional "[" <int> "]" to define the number of trees with the same definition following. See node for more details.
treeHeight
	( "height" | "root_to_tip" ) [ "=" ] <real>
	
This sets the total root-to-tip distance for simulated trees. Once the tree is complete the maximum distance from root to tips is set to the required value by normalising the branch lengths in the tree.
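The normalisation step can be sketched as follows: find the longest root-to-tip path, then multiply every branch length by the same factor. The tree representation (a `children` map and a `length` map keyed by node name) and the function name `rescale_to_height` are hypothetical, chosen only for this illustration.

```python
def rescale_to_height(children, length, root, height):
    """Scale all branch lengths so the longest root-to-tip path equals height."""

    def depth(node):
        # Length of the branch above this node plus the deepest path below it.
        below = max((depth(kid) for kid in children.get(node, [])), default=0.0)
        return length.get(node, 0.0) + below

    current = depth(root)
    if current <= 0:
        raise ValueError("tree has no positive branch lengths to rescale")
    factor = height / current
    return {node: value * factor for node, value in length.items()}


# Hypothetical tree (A:1.0, (B:0.5, C:1.5):0.5); the longest path is root to C, 2.0.
children = {"root": ["A", "BC"], "BC": ["B", "C"]}
length = {"A": 1.0, "BC": 0.5, "B": 0.5, "C": 1.5}
print(rescale_to_height(children, length, "root", 1.0))
```

Because a single factor is applied to every branch, relative branch lengths are preserved.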

Samples

An example input file is shown below (and can be downloaded here).
#NEXUS

%	Input for many experiments:
%	n = 2, 4, 8, 16, 32, 64
%	c = 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
%	root_to_tip = 0.1, 0.125, 0.16, 0.2, 0.25, 0.32, 0.4, 0.5, 0.64, 0.8, 1, ..., 10
%	trees grown in pure birth process under Yule model

% 	tree T1 = ((tA[F84]:0.2, tB[F84]:0.25)[F84]:0.02, (tC[HKY]:0.2,tD[HKY]:0.5)[F81]:0.01);
% 	tree T2 = (tA[K80]:0.2, (tB[K80]:0.25, (tC[HKY]:0.2, tD[HKY]:0.5)[F81]:0.01)[F81]:0.25);
% 	tree T3 = (tA[JC]:0.2, (tB:0.25, (tC:0.2, tD:0.5):0.01):0.25);

begin filo; 
     
	output 
		format = fasta, nexus, raw
		filename = sampleoutput
		precision 5
 	;
 	matrix JC = JC 0.1;
 	matrix K80 = K80 0.1 0.2;
 	matrix F81 = F81 0.3 [ 0.2 0.3 0.15 0.35 ];
 	matrix F84 = F84 1 0.3 [ 0.1 0.4 0.1 0.4 ];
 	matrix HKY = HKY85 0.2 0.5 [ 0.25 0.05 0.25 0.45 ];
 	tree T1 = ((tA[F84]:0.2, tB[F84]:0.25)[F84]:0.02, (tC[HKY]:0.2,tD[HKY]:0.5)[F81]:0.01);
 	tree T2 = (tA[K80]:0.2, (tB[K80]:0.25, (tC[HKY]:0.2, tD[HKY]:0.5)[F81]:0.01)[F81]:0.25);
 	tree T3 = (tA[JC]:0.2, (tB:0.25, (tC:0.2, tD:0.5):0.01):0.25);
	tree T4 = (((tA[JC]:0.175, tB:0.175):0.025,(tC:0.15,tD:0.15):0.05):0.05,((tE[JC]:0.175, tF:0.175):0.025,(tG:0.15,tH:0.15):0.05):0.05);
	tree T5 = (((tA[HKY]:0.75, tB[HKY]:0.75)[HKY]:0.25,(tC[HKY]:0.5,tD[HKY]:0.5)[HKY]:0.5)[HKY]:0.5,((tE[HKY]:0.75, tF[HKY]:0.75)[HKY]:0.25,(tG[HKY]:0.5,tH[HKY]:0.5)[HKY]:0.5)[HKY]:0.5);
	params
		l 200
		indel 0.05
		ntrials = 10
	;
	treeparams T1
		l 50
	;
	treeparams T4
		pi [ 0.25 0.25 0.25 0.25 ]
		l 100
		indelrate 0.25
	;
	treeparams T5
		pi [ 0.45 0.05 0.05 0.45 ]
		l 50
	;
 	run;
end;

begin allconsistent;
% All these give exactly the same matrix:
 	matrix JC = JC 0.3;
 	matrix K80 = K80 0.3 0.3;
 	matrix F81 = F81 0.3 [ 0.25 0.25 0.25 0.25 ];
 	matrix F84 = F84 0 0.3 [ 0.25 0.25 0.25 0.25 ];
 	matrix HKY = HKY85 0.3 0.1 [ 0.25 0.25 0.25 0.25 ]; 

end;

begin generalexperiments;
 	matrix M = HKY 4 [ 0.25 0.25 0.25 0.25 ]; 
	tree T = (((((((taxonA:0.01, taxonB:0.01):0.01, taxonC:0.03):0.01, taxonD:0.04):0.01, 
			taxonE:0.05):0.01, taxonF:0.06):0.01, taxonG:0.07):0.01, taxonH:0.08);
	params  
		ntrials 10
		height 0.32 
		l 1000
		indel 0.0
		insertionRate 0.001
	;
	run;
	params deletionRate 0.001;
	run;
	params deletionRate 0.002;
	run;
end;

The above file, when run with
java -jar Filo.jar -f sampleinput.txt
has the following output to console:
This is Filo, version v1.1, released 2009.02
Written by Michael A. Charleston, with contributions by Katharine Holt
Warranty: No warranty of any kind.
The author(s) take no responsibility for any damage incurred
to anything at all through the use of this software, and do
not make any claims as to its usefulness or applicability to
any particular problem.
Distribution: Available from http://www.it.usyd.edu.au/~mcharles
Funding: Supported by the Australian Research Council
Starting time: Wed Mar 04 16:40:52 EST 2009
Opening file sampleinput.txt for parsing.
Generating data..........done.
Output files are "sampleoutput0_1.*"to "sampleoutput0_9.*"
File "sampleinput.txt" has all been processed.  No drama.
Command-line arguments have all been processed.  No drama.
Stopping time: Wed Mar 04 16:41:02 EST 2009

The output files are collected as raw, FASTA and Nexus format .zip files.


Notes:



Grammar

This is a draft of the EBNF grammar that Filo uses.
Invocation:

java -jar Filo.jar <args>

<args> ::= { <cl_infile> | <cl_alphabet> | <cl_seqlength> |
	 <cl_matrix> | <cl_nreps> | <cl_output> |
	 <cl_params> | <cl_basefreq> | <cl_run> |
	 <cl_tree> | <cl_treeparams> | <cl_verbosity> |
	 <cl_version> | <cl_help> }
<cl_infile> ::= '-f' <ident>
<cl_alphabet> ::= '-a' <alphabet>
<cl_seqlength> ::= '-l' <int>
<cl_matrix> ::= '-m' <matrix>
<cl_nreps>  ::= '-n' <int>
<cl_output> ::= '-o' <output>
<cl_params> ::= '-p' <params>
<cl_basefreq> ::= '-pi' <basefreq>
<cl_run> ::= '-r'
<cl_tree> ::= '-t' <tree>
<cl_treeparams> ::= '-tp' <treeparams>
<cl_verbosity> ::= '-v' <bool>
<cl_version> ::= '--v'
<cl_help> ::= '-?' | '-h'

<alphabet> ::= <dna> | <rna> | <aa> | <binary>
<dna> ::= "DNA" | "nucleotide"
<rna> ::= "RNA"
<aa> ::= "AA" | "aminoacids"
<binary> ::= "bin" | "binary"

<matrix> ::= <matrixid> <eq> ( <jc> | <k80> | <f81> | <hky> 
				| <f84> | <tn93> | <gtr> )
<eq> ::= [ "=" ]
<jc> ::= "JC" <real>
<k80> ::= "K80" <real> <real>
<f81> ::= "f81" <real> <realarray>
<hky> ::= ( "HKY" | "HKY85" ) <real> <real> <realarray>
<f84> ::= "F84" <real> <real> <realarray>
<tn93> ::= ( "TN" | "TN93" ) <real> <real> <real> <realarray>
<gtr> ::= "GTR" <real> <real> <real> <real> <real> <real> <realarray>

<realarray> ::= "[" { <real> } "]"
<real> ::= <int> | <float>

<output> ::= { <outfile> | <outformat> | <outprecision> }
<outfile> ::= "filename" <ident>
<outformat> ::= "format" { "fasta" | "phylip" | "raw" |
					"retroml" | "trees" | "treeview" }
<outprecision> ::= ( "dp" | "precision" ) <int>

<params> ::= { <delrate> | <insertrate> | <indelrate> | <seqlength> |
	<treeheight> | <molclock> | <nosequences> |
	<nreps> | <basefreq> | <dellength> | <insertlength> }
<delrate> ::= ( "deletionrate" | "delrate" ) <eq> <real>
<insertrate> ::= ( "insertionrate" | "inrate" ) <eq> <real>
<indelrate> ::= ( "indel" | "indelrate" ) <eq> <real>
<seqlength> ::= ( "l" | "seqlength" | "sequencelength" ) <eq> <int>
<treeheight> ::= ( "height" | "root_to_tip" ) <eq> <real>
<molclock> ::= ( "molclock" | "molecularclock" ) <eq> <bool>

<bool> ::= ( "1" | "t" | "true" | "y" | "yes" ) | <str>

<nosequences> ::= "nosequences" <eq> <bool>
<nreps> ::= ( "nreps" | "ntrials" ) <eq> <int>
<basefreq> ::= ( "p" | "pi" | "basefreq" ) <eq> <realarray>
<dellength> ::= ( "dellength" | "deletionlength" ) <eq> <int>
<insertlength> ::= ( "insertlength" | "insertionlength" ) <eq> <int>

<treedef> ::= <treeset> | <tree>
<treeset> ::= <treeid> "[" <int> "]" <eq> <node>
<tree> ::= <treeid> <eq> <node>
<node> ::= ( <internalnode> | <leaf> ) 
	[ "[" <matrixid> "]" ] [ ":" ( <real> | <brparamsid> ) ]

<internalnode> ::= "(" <node> "," <node> { "," <node> } ")" [ <matrixtag> ]
	[ <brdescriptor> ]
<matrixtag> ::= "[" <matrixid> "]"
<brdescriptor> ::= ":" ( <real> | <brparamsid> )
<leaf> ::= <taxonid> [ <matrixtag> ] [ <brdescriptor> ]

<treeparams> ::= <treeparamsid> <eq> { <birthrate> | <branchparamsid> |
	<seqlength> | <delrate> | <insertrate> |
	<dellength> | <insertlength> | <growth> |
	<matrix> | <molclock> | <ntax> | <basefreq> |
	<treeparamsid> }
<birthrate> ::= ( "br" | "birthrate" ) <real>
(NB: no reason to put in death rate as it's set to unity)
<growth> ::= ( "grow" | "growth" ) <eq> ( "atel" | "yule" )
<brparams> ::= ( "branchparams" | "branchparameters" ) <brparamsid> <eq>
	{ <delrate> | <insertrate> | <indelrate> | <brlength> |
	<matrix> }
<brlength> ::= ( "l" | "len" | "length" ) <eq> <real>
<ntax> ::= "ntax" <eq> <int>

Disclaimer:

This software is intended for use in generating simulated sequence data for phylogenetic analysis. It is released as is, and with no warranty of any kind. Use it at your own risk!

Release Notes:

Build 523
Version 1.1. Corrected bug causing insertions and deletions to get out of sync; improved definition of parameterised matrices
Version 1.0
Initial release