Honours projects 2009

Projects supervised by Michael Charleston

NOTE: I am also interested in hearing from students who have bioinformatics projects in mind but which don't fit any of those below.

Phylogenetics Projects

Introduction: Since Darwin's Origin of Species, scientists have debated the origin of the major animal groups. The molecular data suggest a truly ancient origin, over 750 million years ago (Ma), but the fossil record has led some people to suggest a “Cambrian Explosion”: the sudden emergence, about 540 Ma, of most of the animal diversity we see today. Though the Cambrian Explosion hypothesis has been largely dismissed, it remains a hotly debated topic. Understanding the evolution of life on our planet is crucial to making informed decisions about how to preserve the world we live in for the future.

Figure 1: a phylogenetic tree denoting the relationships among several plant species

The following projects all pertain to the area of molecular phylogenetics, that is, the reconstruction of evolutionary relationships among species of organisms from aligned molecular sequences such as DNA, mitochondrial DNA (mtDNA) or protein sequences. There are few tougher computational problems in modern biology: not only is the solution computationally complex, but it is in general impossible to go back in time and check our answers. The problem is not merely important but crucial: without knowledge of the evolutionary relationships among existing species, it is impossible to make sensible decisions about the best way forward to conserve biodiversity and life on our planet.

Requirements: interest in bioinformatics; good programming skills in C, C++ or Java; basic mathematics and statistics

Project 1: Characterising the space of evolutionary history
The estimation of evolutionary trees (a.k.a. phylogenies) is generally achieved by assigning some optimality criterion such as maximum likelihood (ML) or maximum parsimony (MP) to trees, and finding the tree or trees with the best score. Tree space (the set of all possible trees for a given set of species) grows super-exponentially with the number of species involved, so exhaustive search for the best one(s) is prohibitively expensive, but we can uncover characteristics of the search space either before or during a heuristic search. These characteristics, such as the proportion of times a given relationship (like "species A is more closely related to species B than it is to species C") occurs in a locally optimal tree, help bioinformaticians design more efficient search methods, and thus help biologists better understand the evolutionary processes involved in life on Earth.
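To give a feel for how quickly tree space grows, the short Python sketch below counts the distinct unrooted, fully resolved (binary) trees on n labelled species using the standard formula (2n-5)!!, the product of the odd numbers from 3 to 2n-5. The function name is chosen here for illustration only.

```python
def num_unrooted_trees(n_species: int) -> int:
    """Number of distinct unrooted binary trees on n labelled leaves: (2n-5)!!."""
    if n_species < 3:
        # With fewer than three leaves there is only one possible tree shape.
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n_species - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, num_unrooted_trees(n))
```

Ten species already give about two million trees, which is why exhaustive search stops being an option for data sets of even modest size.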

This project will require the student to create a program in C, C++ or Java to characterise tree space, given input molecular sequence data in standard formats, for different optimality criteria (e.g., MP, ML). Research will be into the characteristics of “tree space” to aid heuristic search methods in finding optimal trees for particular problems of interest, including the origin and evolution of the major animal phyla.
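As an illustration of what "scoring a tree" means under maximum parsimony, here is a minimal Python sketch of Fitch's algorithm, which counts the minimum number of character changes a rooted binary tree requires. The nested-tuple tree representation, the function names and the toy alignment are assumptions made for this example; real projects would read standard tree and alignment formats instead.

```python
def fitch_site(tree, states):
    """Fitch's algorithm for one site.

    tree:   a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping leaf name -> character state at this site.
    Returns (possible ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch cost over all sites of an alignment (dict leaf -> sequence)."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total


if __name__ == "__main__":
    toy_alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}
    tree_1 = (("A", "B"), ("C", "D"))
    tree_2 = (("A", "C"), ("B", "D"))
    print(parsimony_score(tree_1, toy_alignment))  # 2
    print(parsimony_score(tree_2, toy_alignment))  # 4
```

In this toy case the tree grouping A with B needs fewer changes, so it is the better tree under MP; a heuristic search repeats this kind of scoring over many candidate trees.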

Project 2: Parallel heuristics for estimating evolutionary trees
There are few problems in biology that are computationally harder than trying to uncover evolutionary relationships among species. The number of possible evolutionary trees increases super-exponentially with the number of species involved, so heuristic search must be used to find the one(s) that, we hope, best describe(s) their historical relationships. This is a clear candidate for parallelization, simply by delegating searches to multiple processors, but we can be a bit cleverer than that by permitting different searches to communicate with each other about their progress.

This project will implement and assess an existing parallel heuristic search strategy similar to genetic algorithms, to quickly find sets of optimal trees by operating in concert.

Project 3: Information Content-based phylogenetics
Molecular data sets are increasingly large and heterogeneous: they typically now contain multiple genes, and many species. Rather than assume that all the molecular data correspond to the same evolutionary process, we must allow there to be some variation across species and/or genes (such as some genes evolving faster than others); therefore we must be able to find which models correspond to which regions of our data, and partition the data accordingly. If we don't, then we risk using too-general models and introducing error into our phylogenetic estimation. However, it is essential that we do this reliably: we don't want to over- or under-fit our data. We use the concepts of information theory to ensure that we use the same amount of information from the data to suggest each partition.

This project will therefore use the information content and other characteristics of molecular sequence data to determine partitions of the data into regions that correspond with high probability to different evolutionary processes. The output will be a methodology that researchers can apply to their molecular data, to find and use these partitions.
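One simple way to quantify the information in an alignment, shown here purely as an illustration and not necessarily the measure this project will adopt, is the Shannon entropy of each alignment column. The Python sketch below computes it; the function names and the toy sequences are assumptions made for the example.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the character states in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def site_entropies(sequences):
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy("".join(col)) for col in zip(*sequences)]


if __name__ == "__main__":
    toy_alignment = ["AACA", "AACC", "AAGG", "AAGT"]
    for position, h in enumerate(site_entropies(toy_alignment), start=1):
        print(position, round(h, 3))
```

Invariant columns score zero and highly variable columns score close to the maximum (2 bits for DNA), so runs of columns with similar entropy are one possible starting point for proposing partitions.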

Cophylogenetics Projects (as if the phylogeny problem wasn't hard enough...)

Figure 2: (a) a "tanglegram" representing a host tree H (left) and parasite tree P (right) for a ver

Introduction: Cophylogeny is the study of how groups of ecologically linked species have evolved with each other. Parasites (of which there are many more types than there are non-parasites) evolve with their hosts, and pathogens (disease agents) evolve in a constant arms race with their "victims". As many as three quarters of emergent diseases in humans have come from other species. Genes can be considered to 'parasitise' their 'hosts', because they undergo the same kinds of processes as do parasites and pathogens. Even languages, in a sense, parasitise people.

The problem is computationally very tough, as well as being statistically hard, and there have been several different approaches. One is by cophylogeny mapping, which is easily the most intuitive method, and happens to have other nice properties as well. In this method the dependent phylogeny (evolutionary tree) P is mapped into the independent phylogeny H, in order to show how the two phylogenies have been associated with each other in the past.

Requirements: interest in bioinformatics; good programming skills in C, C++ or Java; basic mathematics and statistics

Project 4: Epidemiological modelling at the evolutionary time-scale
Around 75% of emergent human diseases have come from other species by zoonosis. Uncovering where, how and when such events occurred is a key issue in identifying the risk of zoonotic events in the future. One way to do this is by considering the evolutionary trees of pathogens and their hosts. The idea is that by finding convincing or statistically significant congruence between two evolutionary trees we can hypothesize ancient associations, episodes of cospeciation (also called codivergence), host-switching (zoonosis) and extinction. However, not all host species are known, and not all parasites that infect each host are known, so it becomes a serious problem to identify these ancient associations, given patchy information about the present.

This project will investigate quantitatively the effect of taxon sampling - that is, which species are missed out - on reconstruction of coevolutionary history, through simulation and analysis, using current cophylogenetic methods developed by Charleston and others.

Project 5: (Fast) Statistical Tests of Cophylogeny
How much codivergence is a lot? It's possible to determine the minimum number of codivergence (a.k.a. cospeciation) events that can be attributed to a host/parasite system, but the statistical testing of this number can be very laborious. Present methods create many random parasite trees and map them into the host tree, in order to determine whether the "true" parasite tree P is statistically significant in the degree of its agreement with the host tree H. Such analyses can be very slow, particularly when P and H don't agree that well in the first place.
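The randomisation test described above can be sketched in Python as follows. The sketch covers only the statistical bookkeeping: the expensive step of mapping a random parasite tree into the host tree is represented by a caller-supplied function, here called score_random_tree, which is a hypothetical placeholder and not part of any existing cophylogeny software.

```python
import random


def monte_carlo_p_value(observed, null_scores):
    """Proportion of null scores at least as large as the observed score.

    The +1 in numerator and denominator counts the observed value itself,
    so the estimate can never be exactly zero.
    """
    as_extreme = sum(1 for s in null_scores if s >= observed)
    return (as_extreme + 1) / (len(null_scores) + 1)


def randomisation_test(observed, score_random_tree, n_replicates=999, seed=0):
    """Estimate the significance of an observed codivergence count.

    score_random_tree(rng) is assumed to build one random parasite tree,
    map it into the host tree and return its codivergence count.
    """
    rng = random.Random(seed)
    null_scores = [score_random_tree(rng) for _ in range(n_replicates)]
    return monte_carlo_p_value(observed, null_scores)


if __name__ == "__main__":
    # Stand-in scorer for demonstration only: random counts between 0 and 5.
    p = randomisation_test(10, lambda rng: rng.randint(0, 5), n_replicates=99)
    print(p)  # 0.01, since no random score reaches 10
```

Each call to the scoring function stands for one full tree mapping, so the cost of the test is roughly the number of replicates times the cost of one mapping.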

The aim of this project will be to find useful "rules of thumb" for measuring whether two phylogenies (say hosts and pathogens) have some congruence that is due to a coevolutionary history, without such computational costs. It will require some thinking, software development and some simulation in silico, but no contact with actual pathogens!