Honorary Associate Professor Lars Jermiin


Hetero: A program designed to simulate the evolution of nucleotide sequences on a rooted, 4-tipped tree (Version 1.0)

Description

Hetero is a computer program designed to simulate the evolution of a nucleotide sequence across a tree with four tips. It allows its user to specify the lineage-specific nucleotide substitution models that are used in the simulation, together with information on the ancestral sequence and the order and timing of the divergence events. It has a simple user-interface and output, making it equally useful in the teaching and research of phylogenetics.

Author

Lars Jermiin

Citation

Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD (2003). Hetero: a program to simulate the evolution of DNA on a four-taxon tree. Applied Bioinformatics 2, 159-163.


Copyright & Disclaimer Notice

Hetero is Copyright© to the University of Sydney, 2003 - 2010. All rights reserved. It may not be used to imply anything else but what is presented in Jermiin et al. (2003), and in these web-pages.

Permission is granted to use Hetero, free of charge, for non-commercial use, subject to appropriate acknowledgement of the source. Commercial use of the program requires prior written permission from the University of Sydney, which may incur a fee.

The University of Sydney does not invite reliance upon, nor accept responsibility for, the information it provides through Hetero.

The University of Sydney gives no guarantees, undertakings or warranties, either expressed or implied, concerning the accuracy, completeness or up-to-date nature of the information provided or suitability of the information for a particular purpose or application. Users should confirm information from another source if it is of sufficient importance for them to do so.

Under no circumstances will the University of Sydney be liable for indirect or consequential damages, including, without limitation, loss of income or use of information.

Hetero is made available by the University of Sydney through the auspices of its Biological Sciences (the 'host').

Whilst all care is taken to ensure a high degree of accuracy, users are invited to notify the author of any discrepancies.

Downloads

You can download the executable code of hetero to your own computer from this page; simply click on the appropriate hyperlink, and follow the instructions.

  • Download hetero for Sun Computers (compiled on a SunBlade 1000, running Solaris 5.8, using the GNU C 3.3 compiler).
  • Download hetero for PCs (compiled on a Compaq Presario 1550AP, running Windows XP, using Metrowerks CodeWarrior 8.0).
  • Download hetero for Macs (compiled on a Mac G4, running OSX, using Metrowerks CodeWarrior 8.0).

Installing Hetero on Sun Workstations

On Sun workstations and similar types of computer platforms, download Hetero to the directory of your choice, and type:

uncompress hetero.tar.Z

followed by

tar xvf hetero.tar

The result is the appearance of the executable file, hetero. If you have superuser access to the computer, then you may wish to change the ownership and the group to the relevant choice, e.g.:

chown root:staff hetero

After having done so, place the executable in the appropriate directory, e.g.:

/usr/local/bin

or

/user/biotools/bin

Remember to make the chosen directory accessible to your group members. By now, hetero is ready to be used (all documentation is included in these web pages).

Installing Hetero on PCs

On PCs and similar types of computer platforms, download Hetero to the directory of your choice. Double-click on the file to run the installation program. The installation program allows you to specify the directory in which you would like to install hetero; the default directory is

c:\Program Files\Hetero

The installation program will also place a shortcut to the program on your Desktop. You may then delete the original installation program, as it is no longer required by the program. By now, hetero is ready to be used (all documentation is included in these web pages).

Installing Hetero on Macs

On Macintosh computers (G3, G4 and G5 running either MacOS 9.2 or MacOS X), download Hetero to the folder of your choice, preferably the one in which you have also stored programs from the PHYLIP program package, and click on the self-extracting archive to unpack the program. By now, hetero is ready to be used (all documentation is included in these web pages).

Source Code Availability

Anyone wishing to obtain a copy of the source code (ANSI C compatible) must have a software license to Numerical Recipes because the source code includes a licensed algorithm from Numerical Recipes in C (Cambridge University Press). A software license can be ordered from Numerical Recipes' web site.


Running Hetero on Sun Workstations

On Sun workstations and similar types of platforms, simply type:

hetero

in a terminal window, and answer the questions as they appear on the monitor. One of the two output files contains the information that was used to generate the sequence alignments stored in the other output file. Both files can be viewed using a text editor, and the file containing the aligned nucleotide sequences can be analysed directly, without any reformatting, using the relevant phylogenetic programs from PHYLIP.

Running Hetero on PCs

On PCs, simply double-click on the program's icon, or type:

hetero.exe

in a terminal window, and answer the questions as they appear on the monitor. One of the two output files contains the information that was used to generate the sequence alignments stored in the other output file. Both files can be viewed using a text editor, and the file with aligned nucleotide sequences can be analysed directly, without any reformatting, using the relevant programs from PHYLIP.

Points of Consideration

Running hetero is straightforward - you simply follow the guidelines provided in the program.
Whenever you enter an answer, the program will, in most cases, assess whether the answer conforms to its expectations. If the answer is not acceptable, the program will in most instances tell you what is wrong. If you keep on entering unexpected answers, the program will abort after three attempts.

Only in one known instance is it possible to cause the program to hang; this can occur when (i) you enter parameters that produce a tree with a length just below 1.0 substitution per site, and (ii) you disallow multiple substitutions at the same site. When this happens, you should:

  1. Quit the program;
  2. Start hetero again, and enter the same edge lengths and evolutionary parameters;
  3. Use a different randomly-chosen number to seed the random number generator.

Stress Testing of Hetero

Hetero was stress tested on four computer systems: (i) SunBlade 1000 running SunOS 5.8; (ii) Macintosh PowerBook G3 running MacOS 9.2; (iii) Macintosh G4 running MacOSX; (iv) Compaq Presario 1550AP running Windows XP.


Interpretation of the Results produced by Hetero

The output of hetero appears in two files. The first output file contains all the details that were entered before the simulation was begun, and it also contains most of the results. An example of the first output file is given below - all details in the output file are indented and written in a different typeface.

The first few lines (below) are largely self-explanatory - the output file contains the aligned nucleotide sequences that can be analysed using many of the programs from PHYLIP, and the time gives the exact point in time when the simulation was done.

DETAILS OF PROGRAM    
Program Hetero 1.0
Copyright Hetero is Copyright to the University of Sydney, 2003
Output file test.aln
Time Fri Nov 7 19:42:40 2003

The next few lines (below) show the length of the six edges in the tree, and that the length of these edges was entered in terms of time units (and not in terms of average substitutions per site).

PROPERTIES OF THE TREE    
 Edges entered in terms of time (units)  
   
Length of a 0.95000
Length of b 0.95000
Length of c 0.95000
Length of d 0.95000
Length of e 0.05000
Length of f 0.05000

The next few lines (below) show the tree with the lengths of all its edges - the tree length is obtained by adding the six edge lengths together. If the tree length is equal to or larger than 1.0, then multiple substitutions will always be allowed (see below).

The tree used with the Monte Carlo simulation,
written in the Newick format with edge lengths
given in terms of (1) time or (2) average rate
of nucleotide substitution per site: 
 (1) - ((SeqA:0.9500,SeqB:0.9500):0.0500,(SeqC:0.9500,SeqD:0.9500):0.0500);
 (2) - ((SeqA:0.1425,SeqB:0.1425):0.0075,(SeqC:0.1425,SeqD:0.1425):0.0075);

The following few lines are largely self-explanatory.

AVERAGE RATES OF CHANGE ALONG THE EDGES  
 Rate along edge a  0.15000
 Rate along edge b  0.15000
 Rate along edge c  0.15000
 Rate along edge d  0.15000
 Rate along edge e  0.15000
 Rate along edge f  0.15000
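
The two Newick trees and the average rates listed above are consistent with a simple relationship: the edge length in substitutions per site equals the edge length in time units multiplied by the average rate along that edge (0.95 x 0.15 = 0.1425 and 0.05 x 0.15 = 0.0075). The short Python sketch below is not part of hetero; it merely reproduces these numbers under that assumption. It also assumes that edges a-d lead to SeqA-SeqD and that edges e and f lead to the (SeqA,SeqB) and (SeqC,SeqD) clades, respectively.

```python
# Edge durations (time units) and average rates, copied from the example output.
times = {"a": 0.95, "b": 0.95, "c": 0.95, "d": 0.95, "e": 0.05, "f": 0.05}
rates = {edge: 0.15 for edge in times}

# Assumption: expected substitutions per site on an edge = time * average rate.
subs_per_site = {edge: times[edge] * rates[edge] for edge in times}

# Tree length: the sum of the six edge lengths.
tree_length = sum(subs_per_site.values())


def newick(lengths):
    """Write the rooted four-tip tree in the layout used in the output file.

    Assumed mapping: a-d are the tip edges of SeqA-SeqD, e and f the two
    internal edges.
    """
    return ("((SeqA:{a:.4f},SeqB:{b:.4f}):{e:.4f},"
            "(SeqC:{c:.4f},SeqD:{d:.4f}):{f:.4f});").format(**lengths)


print(newick(times))
print(newick(subs_per_site))
print(f"tree length = {tree_length:.4f}")
```

With the values of this example the tree length is 0.585, i.e. below 1.0, so the choice of whether to allow multiple substitutions is left to the user.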

The following few lines outline the other relevant information.

OTHER RELEVANT INFORMATION    
 Sequence length  100
 Number of cycles  5
 Seed  -57
 Multiple hits  yes
 Order of nucleotides  A, C, G & T

The sequence length is the number of nucleotides in the alignment. The number of cycles is the number of times that the ancestral sequence at the root of the tree is allowed to evolve towards the tips of the tree; each cycle produces one alignment. The seed is the number needed to prime the random number generator (note: in this example the number entered was 57, but it appears as -57 in the printout; if the number in the printout is not negative, then you have entered an invalid seed). Multiple hits records whether multiple substitutions at the same site were accepted, in which case the answer is "yes". The order of nucleotides is given because it is needed to read the properties of the substitution models (see below) and those of the threshold matrices (see below).

The following few lines are largely self-explanatory.

PROPERTIES OF THE ANCESTRAL SEQUENCE   
 Frequency of A 0.25000 
 Frequency of C 0.25000
 Frequency of G 0.25000
 Frequency of T 0.25000

The next section of the output requires some explanation. For each of the six edges, a matrix of rates of change is derived from the equilibrium nucleotide frequencies and the conditional rates of change of the model. The rates of change are measured in terms of time units.

PROPERTIES OF THE SUBSTITUTION MODELS

Model Ra – Nucleotide frequencies

0.25000 0.25000  0.25000  0.25000 

Model Ra – Conditional rates

 ——— 0.20000 0.20000 0.20000
0.20000   ——— 0.20000 0.20000
0.20000  0.20000  ——— 0.20000
0.20000 0.20000 0.20000  ———

Model Ra – Rates of change along edge a

-0.15000   0.05000  0.05000  0.05000
 0.05000 -0.15000   0.05000  0.05000
 0.05000  0.05000 -0.15000  0.05000
 0.05000  0.05000  0.05000 -0.15000 

Model Rb – Nucleotide frequencies

0.25000 0.25000  0.25000  0.25000 

Model Rb – Conditional rates

 ——— 0.20000 0.20000 0.20000
0.20000   ——— 0.20000 0.20000
0.20000  0.20000  ——— 0.20000
0.20000 0.20000 0.20000  ———

Model Rb – Rates of change along edge b

-0.15000   0.05000  0.05000  0.05000
 0.05000 -0.15000   0.05000  0.05000
 0.05000  0.05000 -0.15000  0.05000
 0.05000  0.05000  0.05000 -0.15000 

Model Rc – Nucleotide frequencies

0.25000 0.25000  0.25000  0.25000 

Model Rc – Conditional rates

 ——— 0.20000 0.20000 0.20000
0.20000   ——— 0.20000 0.20000
0.20000  0.20000  ——— 0.20000
0.20000 0.20000 0.20000  ———

Model Rc – Rates of change along edge c

-0.15000   0.05000  0.05000  0.05000
 0.05000 -0.15000   0.05000  0.05000
 0.05000  0.05000 -0.15000  0.05000
 0.05000  0.05000  0.05000 -0.15000 

Model Rd – Nucleotide frequencies

0.25000 0.25000  0.25000  0.25000 

Model Rd – Conditional rates

 ——— 0.20000 0.20000 0.20000
0.20000   ——— 0.20000 0.20000
0.20000  0.20000  ——— 0.20000
0.20000 0.20000 0.20000  ———

Model Rd – Rates of change along edge d

-0.15000   0.05000  0.05000  0.05000
 0.05000 -0.15000   0.05000  0.05000
 0.05000  0.05000 -0.15000  0.05000
 0.05000  0.05000  0.05000 -0.15000 

Model Re – Nucleotide frequencies

0.25000 0.25000  0.25000  0.25000 

Model Re – Conditional rates

 ——— 0.20000 0.20000 0.20000
0.20000   ——— 0.20000 0.20000
0.20000  0.20000  ——— 0.20000
0.20000 0.20000 0.20000  ———

Model Re – Rates of change along edge e

-0.15000   0.05000  0.05000  0.05000
 0.05000 -0.15000   0.05000  0.05000
 0.05000  0.05000 -0.15000  0.05000
 0.05000  0.05000  0.05000 -0.15000 

Model Rf – Nucleotide frequencies

0.25000 0.25000  0.25000  0.25000 

Model Rf – Conditional rates

 ——— 0.20000 0.20000 0.20000
0.20000   ——— 0.20000 0.20000
0.20000  0.20000  ——— 0.20000
0.20000 0.20000 0.20000  ———

Model Rf – Rates of change along edge f

-0.15000   0.05000  0.05000  0.05000
 0.05000 -0.15000   0.05000  0.05000
 0.05000  0.05000 -0.15000  0.05000
 0.05000  0.05000  0.05000 -0.15000 
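
The numbers listed above can be reproduced with a few lines of Python. This is only a sketch, not hetero's source code. It assumes that each off-diagonal rate is the conditional rate multiplied by the equilibrium frequency of the target nucleotide (0.20 x 0.25 = 0.05), that each diagonal entry is minus the sum of the other entries in its row, and that the average rate along an edge is the frequency-weighted sum of the rates of leaving each nucleotide. With uniform frequencies the example cannot distinguish this from other conventions, so treat the formula as an assumption. The function names are mine.

```python
def rate_matrix(freqs, cond_rates):
    """Build a 4x4 rate matrix for nucleotides ordered A, C, G, T.

    Assumption: off-diagonal entry (i, j) = cond_rates[i][j] * freqs[j];
    each diagonal entry is minus the sum of the other entries in its row.
    """
    n = len(freqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = cond_rates[i][j] * freqs[j]
        matrix[i][i] = -sum(matrix[i][j] for j in range(n) if j != i)
    return matrix


def average_rate(freqs, matrix):
    """Average rate of change: minus the frequency-weighted sum of the diagonal."""
    return -sum(f * matrix[i][i] for i, f in enumerate(freqs))


# Parameters of Model Ra in the example output.
freqs = [0.25, 0.25, 0.25, 0.25]
cond = [[0.0 if i == j else 0.2 for j in range(4)] for i in range(4)]

R = rate_matrix(freqs, cond)
for row in R:
    print(" ".join(f"{x:8.5f}" for x in row))
print(f"average rate = {average_rate(freqs, R):.5f}")
```

Under these assumptions the sketch returns off-diagonal rates of 0.05, diagonal entries of -0.15 and an average rate of 0.15, which agrees with the values printed for the six edges.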

The next section of the output requires some explanation. To understand how this program generates nucleotide sequences, it is necessary to explain an integral component of the Monte Carlo simulation, i.e. the generation of random mutations. The rate matrix for a given edge in the tree defines how random mutations are generated along that edge, and it does so through its corresponding threshold matrix. The threshold matrix is produced by (i) adding the identity matrix, I, to the rate matrix, and (ii) replacing each value in the resulting matrix by the sum of that value and the values in the preceding columns of the same row (i.e., a running sum along each row). A comparison of this method of generating nucleotide sequences with other such methods is presented in Ababneh et al. (2006). The relationship between the rate matrix and its corresponding threshold matrix is illustrated below:

Rate Matrix

-0.15000   0.05000  0.05000  0.05000
 0.05000 -0.15000   0.05000  0.05000
 0.05000  0.05000 -0.15000  0.05000
 0.05000  0.05000  0.05000 -0.15000 

Identity matrix (I):

 1.00000   0.00000  0.00000  0.00000
 0.00000  1.00000   0.00000  0.00000
 0.00000  0.00000  1.00000  0.00000
 0.00000  0.00000  0.00000  1.00000 

Sum of rate matrix and identity matrix (R + I):

 0.85000  0.05000  0.05000  0.05000
 0.05000  0.85000  0.05000  0.05000
 0.05000  0.05000  0.85000  0.05000
 0.05000  0.05000  0.05000  0.85000

Threshold matrix:

 0.85000   0.90000  0.95000  1.00000
 0.05000  0.90000   0.95000  1.00000
 0.05000  0.10000  0.95000  1.00000
 0.05000  0.10000  0.15000  1.00000 
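
The two steps that turn a rate matrix into a threshold matrix can be written as a short Python sketch (my own illustration, not the program's source code): add 1.0 to every diagonal entry, then take the running sum along each row.

```python
from itertools import accumulate


def threshold_matrix(rate):
    """Add the identity matrix to `rate`, then take running sums along each row."""
    n = len(rate)
    plus_identity = [[rate[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                     for i in range(n)]
    return [list(accumulate(row)) for row in plus_identity]


# Rate matrix from the example (nucleotide order A, C, G, T).
rate = [[-0.15, 0.05, 0.05, 0.05],
        [0.05, -0.15, 0.05, 0.05],
        [0.05, 0.05, -0.15, 0.05],
        [0.05, 0.05, 0.05, -0.15]]

for row in threshold_matrix(rate):
    print(" ".join(f"{x:.5f}" for x in row))
```

Because every row of a rate matrix sums to zero, every row of R + I sums to one, so the last threshold in each row is 1.0 (up to floating-point rounding).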

Remembering the order of the four nucleotides (i.e., A, C, G & T), it is now possible to use randomly generated numbers between 0.0 and 1.0 to determine whether a nucleotide at a given site will remain the same or change to one of the three other nucleotides. For example, if the nucleotide at a given site is an A, then we must focus on the first row in the threshold matrix. Suppose the random number returned by the random number generator equals 0.3567, then the A will remain an A because 0.0000 < 0.3567 <= 0.8500. If, on the other hand, the random number was 0.9232, then the A will change to a G because 0.9000 < 0.9232 <= 0.9500. The six threshold matrices are listed below.

THRESHOLD MATRICES

Thresholds used under Model Ra (edge a)

0.85000  0.90000  0.95000  1.00000 
0.05000  0.90000  0.95000  1.00000 
0.05000  0.10000  0.95000 1.00000 
0.05000  0.10000  0.15000  1.00000

Thresholds used under Model Rb (edge b)

0.85000  0.90000  0.95000  1.00000 
0.05000  0.90000  0.95000  1.00000 
0.05000  0.10000  0.95000 1.00000 
0.05000  0.10000  0.15000  1.00000

Thresholds used under Model Rc (edge c)

0.85000  0.90000  0.95000  1.00000 
0.05000  0.90000  0.95000  1.00000 
0.05000  0.10000  0.95000 1.00000 
0.05000  0.10000  0.15000  1.00000

Thresholds used under Model Rd (edge d)

0.85000  0.90000  0.95000  1.00000 
0.05000  0.90000  0.95000  1.00000 
0.05000  0.10000  0.95000 1.00000 
0.05000  0.10000  0.15000  1.00000

Thresholds used under Model Re (edge e)

0.85000  0.90000  0.95000  1.00000 
0.05000  0.90000  0.95000  1.00000 
0.05000  0.10000  0.95000 1.00000 
0.05000  0.10000  0.15000  1.00000

Thresholds used under Model Rf (edge f)

0.85000  0.90000  0.95000  1.00000 
0.05000  0.90000  0.95000  1.00000 
0.05000  0.10000  0.95000 1.00000 
0.05000  0.10000  0.15000  1.00000
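
The way a random number is looked up in a threshold matrix, as described above, can be sketched in Python as follows. The sketch is illustrative only: the function names are mine, and `mutate` applies a single round of the lookup to every site, whereas the number of rounds that hetero applies along an edge is not described here.

```python
import random
from bisect import bisect_left

NUCLEOTIDES = "ACGT"  # order used in the output file: A, C, G & T

# Threshold matrix of the example (identical for all six edges).
THRESHOLDS = [
    [0.85, 0.90, 0.95, 1.00],
    [0.05, 0.90, 0.95, 1.00],
    [0.05, 0.10, 0.95, 1.00],
    [0.05, 0.10, 0.15, 1.00],
]


def next_nucleotide(current, u, thresholds=THRESHOLDS):
    """Return the nucleotide selected by random number u (0 < u <= 1).

    The row belonging to `current` is searched for the first threshold t
    that satisfies u <= t; the column of t gives the new nucleotide.
    """
    row = thresholds[NUCLEOTIDES.index(current)]
    return NUCLEOTIDES[bisect_left(row, u)]


def mutate(sequence, rng, thresholds=THRESHOLDS):
    """Apply one round of the threshold lookup to every site of `sequence`."""
    return "".join(next_nucleotide(base, rng.random(), thresholds)
                   for base in sequence)


print(next_nucleotide("A", 0.3567))  # A stays A: 0.0000 < 0.3567 <= 0.8500
print(next_nucleotide("A", 0.9232))  # A becomes G: 0.9000 < 0.9232 <= 0.9500
print(mutate("ACGTACGTACGT", random.Random(57)))
```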

The next many lines contain some of the results that can be gleaned from analysing the aligned nucleotide sequences - these results are stored in three tables.
The first table contains for each simulation the differences in the GC content between the four sequences, the number of constant sites, and the number of sites with different types of splits in the data.

RESULTS OF SIMULATION

Column:

  1. Dif. in GC content (SeqA vs SeqB)
  2. Dif. in GC content (SeqA vs SeqC)
  3. Dif. in GC content (SeqA vs SeqD)
  4. Dif. in GC content (SeqB vs SeqC)
  5. Dif. in GC content (SeqB vs SeqD)
  6. Dif. in GC content (SeqC vs SeqD)
  7. Constant sites
  8. Split A (A|BCD)
  9. Split B (B|ACD)
  10. Split C (C|ABD)
  11. Split D (D|ABC)
  12. Split E (AB|CD)
  13. Split F (AC|BD)
  14. Split G (AD|BC)
  15. Hypervariable sites

    [1]    [2]    [3]    [4]    [5]    [6]   [7]   [8]   [9]  [10]  [11]  [12]  [13]  [14]  [15]
  0.010  0.030  0.000  0.020 -0.010 -0.030  57.0  11.0  10.0   8.0   7.0   1.0   1.0   0.0   5.0
 -0.020 -0.020 -0.030  0.000 -0.010 -0.010  68.0   2.0  11.0   5.0   4.0   2.0   0.0   0.0   8.0
  0.050  0.040  0.000 -0.010 -0.050 -0.040  53.0   6.0  10.0   8.0  11.0   2.0   1.0   1.0   8.0
  0.000 -0.060 -0.060 -0.060 -0.060  0.000  56.0  12.0  10.0   5.0   5.0   4.0   3.0   1.0   4.0
 -0.010  0.030  0.000  0.040  0.010 -0.030  63.0   2.0   3.0  15.0  12.0   0.0   1.0   1.0   3.0
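
For readers who wish to recompute such a table from an alignment, the following Python sketch classifies every site of a four-sequence alignment into the categories of columns 7 to 15 and computes the GC content of each sequence. It is an illustration, not hetero's code. Since columns 7 to 15 add up to the sequence length in every row of the example, I assume that 'hypervariable' covers all remaining sites, i.e. those with three or four different nucleotides. The toy alignment is made up for the demonstration.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def classify_site(a, b, c, d):
    """Name the pattern at one site of a four-sequence alignment (A, B, C, D)."""
    states = len({a, b, c, d})
    if states == 1:
        return "constant"
    if states == 2:
        if b == c == d:
            return "A|BCD"
        if a == c == d:
            return "B|ACD"
        if a == b == d:
            return "C|ABD"
        if a == b == c:
            return "D|ABC"
        if a == b:  # two states and no 3:1 pattern, so c == d as well
            return "AB|CD"
        if a == c:
            return "AC|BD"
        return "AD|BC"
    return "hypervariable"  # assumption: three or four different nucleotides


def summarise(seqs):
    """Count the site patterns in seqs = {'SeqA': ..., ..., 'SeqD': ...}."""
    order = ("SeqA", "SeqB", "SeqC", "SeqD")
    columns = zip(*(seqs[name] for name in order))
    return Counter(classify_site(*column) for column in columns)


# Made-up alignment with six sites, one for each of six different patterns.
toy = {"SeqA": "ACGTAC", "SeqB": "AGGACC", "SeqC": "AGTTGC", "SeqD": "AGTATG"}

print(summarise(toy))
# Difference in GC content, here taken as SeqA minus SeqB (assumed sign convention).
print(round(gc_content(toy["SeqA"]) - gc_content(toy["SeqB"]), 3))
```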


The next table contains the averages over all simulations: the average differences in the GC content between the four sequences, the average number of constant sites, and the average number of sites with each type of split in the data.

AVERAGE VALUES

    [1]    [2]    [3]    [4]    [5]    [6]   [7]   [8]   [9]  [10]  [11]  [12]  [13]  [14]  [15]
  0.006  0.004 -0.018 -0.002 -0.024 -0.022  59.4   6.6   8.8   8.2   7.8   1.8   1.2   0.6   5.6


The next table contains the average number of sites that have changed X times - these values can only be generated through simulation and can only be guessed or estimated through analysis of real data.

Average number of sites with X hits

 X   # Sites  Percentage
 0    58.600      58.600
 1    29.800      29.800
 2     8.800       8.800
 3     1.800       1.800
 4     0.200       0.200
 5     0.000       0.000
 6     0.000       0.000
 7     0.000       0.000
 8     0.000       0.000
 9     0.000       0.000
10     0.000       0.000
11     0.000       0.000
12     0.000       0.000
13     0.000       0.000
14     0.000       0.000
15     0.000       0.000
16     0.000       0.000
17     0.000       0.000
18     0.000       0.000
19     0.000       0.000

The second output file contains the aligned nucleotide sequences in a sequential PHYLIP format, and since five cycles were done in this simulation, there are five alignments in this file. The file content is illustrated below:

 4 100
 SeqA   ACTCACGTGATTCGAGGAATCTCTGGTAGTGCTGCACCAGTTTCTCGTGCGCACTCCAGCCATAAGTCTAGGAGCTGCAAGCTTGGGAATCGAGATAGTC 
 SeqB   AGTCAGGTGATTCGAGGAATGTCGCGTGGTGATGAAACACCTACGCGTACGCACTCCAAGTATGAGTCTAAGGGCTGCTTGCTTGAGATTCGAGTTCGTC
 SeqC   ACTCAGGTGATTCGACGAATTTGTGGTGGTGCTGCTACAGATTCGCTTACGCTCAACAGGCATCAGTCTAAGAGCCGAATGATTGGGATTCGAGTTCGTC 
 SeqD   ACTCAGGTGATTCGAGGAATTTCCGGCGGTGCCGCAAAAGATTCGCTTAAGCACTCCAGGCATCAGTCTAACAGCAGCTTGCTTGGGATTCAGGTTCGTC 

 4 100
 SeqA   GCTGAAACGGAGCGTTGCTTGGTTCGCGTGACTAGGCAGCTACATGTCGCATCGACGACTGTAACTAGAATGCTTCGTCGTTAGTATGCGGATGTTCACT 
 SeqB   GCAGATACGGATCGTTCTTTGGTTCCTGTGGCTAGGCAACGACATGTCGCGTCGACCACTGAAACTAGTATGCCACCTCGTTACTGTGCGGATGTTCCCA
 SeqC   GCTGAGACGGAGAGTTCCTTGGATCGCTTGCCTAGGTACCTACATGTCGGATCGACGACTGTTAGTAGTATGCCCCGTCGTTAGTATGCGGATGTTCGCA 
 SeqD   GCTGAGACGGAGAGTTTCTTGGTTCGCGTGGCTAGGTAACTACATCTCGCATCGACGACTGTAACGAGCATGCCACGTAGTTAGTCCGCGGATGTTCGCA 

 4 100
 SeqA   AACGTCCTGGAAACTCGCTTCGACGCATTACATTGAGAGCTCACCCGCGTATTGAATGGTTAAACCCCGGAAACTTATTAATGCAACCGCGGCAGCTCCT 
 SeqB   AACAATCTCGTAACTCGCTTTCACGCAGTACATCGAGGGATCACCAGCATAGTGAATGGTTATAACCTGTTAACTGATTGATGCATCCACGGTAGCCACT 
 SeqC   AAGGGCCTAGAAACTCTCTTCAACGCATTACAGAGCGGGCTCACCAACATATTGAATGGTAAAAACTCATTAGCTTATTCATGCATCCGCGGCAACCGCT 
 SeqD   TACGTCATCGAAACTTGCTTCCACGCCTTACATAGAGGGCTCAGCAACATATTGAACGGTTTGACCTCGTCAACTGATTCGTGGATCCGTAGCAGCCGCT 

 4 100
 SeqA   ATTTCAGTCTAACGCGGCCAGGCCTAGAGGGTTTGTATCGGCGGCTCCTGCGGTTTTGGTACCCTCAATGTACTACTCACACCAGACACTCGGGCCCATG 
 SeqB   ATTCCAATCGGACTGGTCTACGCCGAGCGGGCTTGTCCCGGCGGCTCCTTGGGATTTGGTAACCTCAGTGTATGGCTCAAGCAAGAGATAAGGGCCCATA 
 SeqC   CTTTCAGTGGGACTCGGCCAGGCCGAGAGGGCTTGTCTCGGCGGCTCCTTGGGACTTGGTACCCTCAATCCATGACACAAGACAGCGTCACGGGCCCATG 
 SeqD   ATACCCGTCGGACTCGTCCAGGGCGAGAGGGCTTGTCTCGGCGGCTCCTCGAGATTTGGTACCCTCACTCTATAGCACCAGCCAGCGTCTCGGGCCCATT 

 4 100
 SeqA   TTCTCGTTTGGGTCGTCCAGGAGCTGAAGAATTTGGCCTCATGACGATTTTCCCTCTGTAGGGGATAGATCGCGGTATACACTCCGAAAGTCGCTTGCTG 
 SeqB   TTCTCGTTTGGTTCGTCCAGGGGCTGGAGAATTTGGCCTCATGACGAGTTTACCTCTGTAGGGGATAGATCGCGATATACACTCCGAAAGCGGCTTGCTG 
 SeqC   TTCTCGTTTGGGTCTTACAGTAACGGAAGAATCAGGCCTAATGACGATTTTACCTATGCACGGAGTGGATCGCGTTATACACTCAGAAACCGGCTTGCCG 
 SeqD   TTCCGGTTTGGGTCGTCCACGAGCATATGAATTTGGCCTCATCACGAGTTTACCTCTGTGGTGGCTATATCGCGATAAACACACCGAAAGCCGCTTCCTG 

This file can easily be analysed using many of the programs available from PHYLIP.
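
If you prefer to inspect the alignments with your own scripts rather than with PHYLIP, a file of this kind is easy to read. The Python sketch below is one possible reader, not a tool distributed with hetero. It splits the text on white space, so it assumes that sequence names contain no blanks and that each sequence is written as one unbroken string, as in the example above. The short alignments in `example` are made up.

```python
def read_alignments(text):
    """Parse concatenated sequential-PHYLIP alignments from a string.

    Returns a list with one dict per alignment, mapping name to sequence.
    """
    tokens = text.split()
    alignments = []
    pos = 0
    while pos < len(tokens):
        # Header of each alignment: number of sequences, number of sites.
        n_seqs, n_sites = int(tokens[pos]), int(tokens[pos + 1])
        pos += 2
        alignment = {}
        for _ in range(n_seqs):
            name, seq = tokens[pos], tokens[pos + 1]
            if len(seq) != n_sites:
                raise ValueError(f"{name}: expected {n_sites} sites, found {len(seq)}")
            alignment[name] = seq
            pos += 2
        alignments.append(alignment)
    return alignments


example = """
 4 8
 SeqA   ACGTACGT
 SeqB   ACGTACGA
 SeqC   ACGAACGT
 SeqD   ACGAACGA

 4 8
 SeqA   TTGTACGT
 SeqB   TTGTACGA
 SeqC   TTGAACGT
 SeqD   TTGAACGA
"""

alignments = read_alignments(example)
print(len(alignments), "alignments;", sorted(alignments[0]))

# For a real run, read the output file instead, e.g. the file named test.aln:
# with open("test.aln") as handle:
#     alignments = read_alignments(handle.read())
```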


Using Hetero to Teach Phylogenetics

The advantage of using hetero in an educational setting is that it allows the user to simulate evolution under complex but known conditions, so the answer to a phylogenetic analysis is already known. If a phylogenetic analysis of the simulated data does not lead to the correct answer (i.e., the tree that was used to generate the data), then there is likely to be a discordance between the conditions under which the data were generated and the assumptions of the phylogenetic method.
Students may wish to work in pairs to improve their performance and understanding of the subject - if so, then I recommend that they simulate their data under conditions that are known only to themselves; after having generated the data, the students exchange data sets, and the task then becomes to determine the evolutionary pattern and process. This often turns out to be harder than expected, and because it emulates a real-case scenario, it will sharpen the students' approach to phylogenetics.

There are many interesting questions that can be addressed using hetero, and some of them are currently being investigated at the University of Sydney, and elsewhere. The results from some of these experiments have been published (e.g., Jermiin et al. 2004; Ho and Jermiin 2004). Below I outline two simple examples that illustrate the use of hetero - many more examples are possible, but I leave it to the imagination of the inquisitive investigator to find out what they are.

Long Branch Attraction

Jermiin et al. (2003) outline an example of how hetero might be used to study the problem that is commonly referred to as the 'long branch attraction' effect. The problem has been studied extensively by others (e.g., Felsenstein 1978; Hasegawa et al. 1991; Steel et al. 1993), and can easily be illustrated using two sets of parameters, which I outline below.

Follow the ordered list, and analyse the alignments using the phylogenetic method of your choice.

1. Use hetero to generate 100 alignments of 1000 nucleotides using the following steps and parameters:
2. Use time to measure the length of edges in the phylogenetic tree.
3. Enter the following set of edge lengths:

Length of a ... 0.95
Length of b ... 0.95
Length of c ... 0.95
Length of d ... 0.95
Length of e ... 0.05
Length of f ... 0.05

4. Enter the following nucleotide content in ancestral sequence:

0.25000 0.25000  0.25000  0.25000 

5. Enter the following nucleotide content for the six models:

0.25000 0.25000  0.25000  0.25000 

6. Enter the following conditional rates for the six models:

 ——— 0.20000 0.20000 0.20000
0.20000   ——— 0.20000 0.20000
0.20000  0.20000  ——— 0.20000
0.20000 0.20000 0.20000  ———

7. Allow multiple substitutions at the same site to occur.
8. Save the information that describes the input in a file called sim_1.xls and the corresponding 100 alignments in a file called sim_1.aln.
9. Analyse the data in sim_1.aln using your favourite phylogenetic program, and record how many times the phylogenetic program recovers the correct phylogenetic tree (which is listed in sim_1.xls).
10. Repeat the simulation experiment but this time with different conditional rates.
11. Enter the following conditional rates for the models Re and Rf:

 ——— 0.20000 0.20000 0.20000
0.20000   ——— 0.20000 0.20000
0.20000  0.20000  ——— 0.20000
0.20000 0.20000 0.20000  ———

12. Enter the following conditional rates for the models Ra and Rd:

 ——— 0.20000 0.20000 0.20000
0.20000   ——— 0.20000 0.20000
0.20000  0.20000  ——— 0.20000
0.20000 0.20000 0.20000  ———

13. Enter the following conditional rates for the models Rb and Rc:

 ——— 0.38000 0.38000 0.38000
0.38000   ——— 0.38000 0.38000
0.38000  0.38000  ——— 0.38000
0.38000 0.38000 0.38000  ———

14. Save the information that describes the input in a file called sim_2.xls and the corresponding 100 alignments in a file called sim_2.aln.
15. Analyse the data in sim_2.aln using the same phylogenetic program as before, and record how many times the program recovers the correct phylogenetic tree (which is listed in sim_2.xls).
16. Compare the phylogenetic results and discuss why they may or may not be different.
17. Repeat the phylogenetic analysis with another phylogenetic program and discuss why the two programs produce similar or different phylogenetic results.
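
Counting how often the correct tree is recovered (steps 9 and 15) is easy to automate. As a very simple stand-in for a phylogenetic program, the Python sketch below applies maximum parsimony to four sequences: only sites with two nucleotides, each present in exactly two sequences, are informative, and the preferred tree is the one whose split is supported by the most such sites. This is my own illustration and not a substitute for PHYLIP; the function names and the toy alignments are made up. The correct tree in this exercise groups SeqA with SeqB, i.e. the split AB|CD.

```python
def parsimony_choice(a, b, c, d):
    """Return the four-taxon split with the most supporting sites.

    Returns None when two or more splits tie for the highest support.
    """
    support = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for w, x, y, z in zip(a, b, c, d):
        if w == x and y == z and w != y:
            support["AB|CD"] += 1
        elif w == y and x == z and w != x:
            support["AC|BD"] += 1
        elif w == z and x == y and w != x:
            support["AD|BC"] += 1
    best = max(support.values())
    winners = [split for split, count in support.items() if count == best]
    return winners[0] if len(winners) == 1 else None


def recovery_count(alignments, true_split="AB|CD"):
    """Count the alignments (SeqA, SeqB, SeqC, SeqD) that recover true_split."""
    return sum(parsimony_choice(*aln) == true_split for aln in alignments)


# Three made-up alignments: correct split, wrong split, and a tie.
toy = [
    ("AAC", "AAC", "GGC", "GGC"),
    ("AG", "GA", "AG", "GA"),
    ("AA", "AG", "GA", "GG"),
]
print(recovery_count(toy), "of", len(toy), "alignments recover AB|CD")
```

Ties are counted as failures in this sketch; you may prefer to count them separately.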

Convergence of Nucleotide Content

Jermiin et al. (2004), among others (e.g., Galtier and Gouy 1995; Conant and Lewis 2001), have studied the effects of convergence of nucleotide content on phylogenetic estimates, and it is clear that as the nucleotide content becomes more and more heterogeneous, some phylogenetic methods are increasingly unable to recover the correct tree. This problem can easily be illustrated using two sets of parameters, which I outline below.

Follow the ordered list, and analyse the alignments using the phylogenetic method of your choice.

1. Generate 100 alignments of 1000 nucleotides using the following steps and parameters:
2. Use time to measure the length of edges in the phylogenetic tree.
3. Enter the following set of edge lengths:

Length of a ... 0.475
Length of b ... 0.475
Length of c ... 0.475
Length of d ... 0.475
Length of e ... 0.025
Length of f ... 0.025

4. Enter the following nucleotide content in ancestral sequence:

0.25000 0.25000  0.25000  0.25000 

5. Enter the following nucleotide content for the six models:

0.25000 0.25000  0.25000  0.25000 

6. Enter the following conditional rates for the six models:

 ——— 0.40000 0.40000 0.40000
0.40000   ——— 0.40000 0.40000
0.40000  0.40000  ——— 0.40000
0.40000 0.40000 0.40000  ———

7. Allow multiple substitutions at the same site to occur.
8. Save the information that describes the input in a file called sim_3.xls and the corresponding 100 alignments in a file called sim_3.aln.
9. Analyse the data in sim_3.aln using your favourite phylogenetic program, and record how many times the phylogenetic program recovers the correct phylogenetic tree (which is listed in sim_3.xls).
10. Repeat the simulation experiment but this time with different nucleotide frequencies for the six models.
11. Enter the following nucleotide content for models Re and Rf:

0.25000 0.25000  0.25000  0.25000 

12. Enter the following nucleotide content for models Ra and Rd:

0.50000 0.00000  0.00000  0.50000 

13. Enter the following nucleotide content for models Rb and Rc:

0.00000 0.50000  0.50000  0.00000 

14. Save the information that describes the input in a file called sim_4.xls and the corresponding 100 alignments in a file called sim_4.aln.
15. Analyse the data in sim_4.aln using the same phylogenetic program as before, and record how many times the phylogenetic program recovers the correct phylogenetic tree (which is listed in sim_4.xls).
16. Compare the phylogenetic results and discuss why they may or may not be different.
17. Repeat the phylogenetic analysis with another phylogenetic program, and discuss why the programs may produce similar or different phylogenetic results.

I have outlined two exercises above but have deliberately not included the results because it would defeat the purpose of doing them. There are a number of other exercises that are readily available using hetero; many of these involve using other parameter values for the same exercises as those outlined above. Another exercise, which will not be outlined here, involves combining convergence in the nucleotide content and rate heterogeneity among the diverging lineages - this combination formed the basis for several unexpected results (Ho and Jermiin 2004).

Acknowledgment

The author is grateful for the feedback that many of his students have provided on Hetero during their education at the University of Sydney.


References

  1. Ababneh F, Jermiin LS, Robinson J (2006). Generation of the exact distribution and simulation of matched nucleotide sequences on a phylogenetic tree. Journal of Mathematical Modelling and Algorithms 5, 291-308.
  2. Conant GC, Lewis PO (2001). Effects of nucleotide composition bias on the success of the parsimony criterion on phylogenetic inference. Molecular Biology and Evolution 18, 1024-1033.
  3. Felsenstein J (1978). Cases in which parsimony and compatibility methods will be positively misleading. Systematic Zoology 27, 401-410.
  4. Galtier N, Gouy M (1995). Inferring phylogenies from DNA sequences of unequal base compositions. Proceedings of the National Academy of Sciences USA 92, 11317-11321.
  5. Hasegawa M, Kishino H, Saitou N (1991). On the maximum likelihood method in molecular phylogenetics. Journal of Molecular Evolution 32, 443-445.
  6. Ho SYW, Jermiin LS (2004). Tracing the decay of the historical signal in biological sequence data. Systematic Biology 53, 623-637.
  7. Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD (2003). Hetero: a program to simulate the evolution of DNA on a four-taxon tree. Applied Bioinformatics 2, 159-163.
  8. Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD (2004). The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Systematic Biology 53, 638-643.
  9. Steel MA, Hendy MD, Penny D (1993). Parsimony can be consistent! Systematic Biology 42, 581-587.