Which of the following is used by phylogenetics specifically when constructing a Cladogram?

Learning outcomes

When you have read Chapter 16, you should be able to:

  • Recount how taxonomy led to phylogeny and discuss the reasons why molecular markers are important in phylogenetics

  • Describe the key features of a phylogenetic tree and distinguish between inferred trees, true trees, gene trees and species trees

  • Explain how phylogenetic trees are reconstructed, including a description of DNA sequence alignment, the methods used to convert alignment data into a phylogenetic tree, and how the accuracy of a tree is assessed

  • Discuss, with examples, the applications and limitations of molecular clocks

  • Give examples of the use of phylogenetic trees in studies of human evolution and the evolution of the human and simian immunodeficiency viruses

  • Describe how molecular phylogenetics is being used to study the origins of modern humans, and the migrations of modern humans into Europe and the New World

If genomes evolve by the gradual accumulation of mutations, then the amount of difference in nucleotide sequence between a pair of genomes should indicate how recently those two genomes shared a common ancestor. Two genomes that diverged in the recent past would be expected to have fewer differences than a pair of genomes whose common ancestor is more ancient. This means that by comparing three or more genomes it should be possible to work out the evolutionary relationships between them. These are the objectives of molecular phylogenetics.

16.1. The Origins of Molecular Phylogenetics

Molecular phylogenetics predates DNA sequencing by several decades. It is derived from the traditional method for classifying organisms according to their similarities and differences, as first practiced in a comprehensive fashion by Linnaeus in the 18th century. Linnaeus was a systematicist not an evolutionist, his objective being to place all known organisms into a logical classification which he believed would reveal the great plan used by the Creator - the Systema Naturae. However, he unwittingly laid the framework for later evolutionary schemes by dividing organisms into a hierarchic series of taxonomic categories, starting with kingdom and progressing down through phylum, class, order, family and genus to species. The naturalists of the 18th and early 19th centuries likened this hierarchy to a ‘tree of life’ (Figure 16.1), an analogy that was adopted by Darwin (1859) in The Origin of Species as a means of describing the interconnected evolutionary histories of living organisms. The classificatory scheme devised by Linnaeus therefore became reinterpreted as a phylogeny indicating not just the similarities between species but also their evolutionary relationships.

Figure 16.1

The tree of life. An ancestral species is at the bottom of the ‘trunk’ of the tree. As time passes, new species evolve from earlier ones so the tree repeatedly branches until we reach the present time, when there are many species descended (more...)

Whether the objective is to construct a classification or to infer a phylogeny, the relevant data are obtained by examining variable characters in the organisms being compared. Originally, these characters were morphological features, but molecular data were introduced at a surprisingly early stage. In 1904 Nuttall used immunological tests to deduce relationships between a variety of animals, one of his objectives being to place humans in their correct evolutionary position relative to other primates, an issue that we will return to in Section 16.3.1. Nuttall's work showed that molecular data can be used in phylogenetics, but the approach was not widely adopted until the late 1950s, the delay being due largely to technical limitations, but also partly because classification and phylogenetics had to undergo their own evolutionary changes before the value of molecular data could be fully appreciated. These changes came about with the introduction of phenetics and cladistics (Box 16.1), two novel phylogenetic methods which, although quite different in their approach, both place emphasis on large datasets that can be analyzed by rigorous mathematical procedures. The difficulty in obtaining large mathematical datasets when morphological characters are used was one of the main driving forces behind the gradual shift towards molecular data, which have three advantages compared with other types of phylogenetic information:

  • When molecular data are used, a single experiment can provide information on many different characters: in a DNA sequence, for example, every nucleotide position is a character with four character states, A, C, G and T. Large molecular datasets can therefore be generated relatively quickly.

  • Molecular character states are unambiguous: A, C, G and T are easily recognizable and one cannot be confused with another. Some morphological characters, such as those based on the shape of a structure, can be less easy to distinguish because of overlaps between different character states.

  • Molecular data are easily converted to numerical form and hence are amenable to mathematical and statistical analysis.

Box 16.1

Phenetics and cladistics. Phenetics, when first introduced (Michener and Sokal, 1957), challenged the prevailing view that classifications should be based on comparisons between a limited number of characters that taxonomists believed to be important (more...)

The sequences of protein and DNA molecules provide the most detailed and unambiguous data for molecular phylogenetics, but techniques for protein sequencing did not become routine until the late 1960s, and rapid DNA sequencing was not developed until 10 years after that. Early studies therefore depended largely on indirect assessments of DNA or protein variations, using one of three methods:

  • Immunological data, such as those obtained by Nuttall (1904), involve measurements of the amount of cross-reactivity seen when an antibody specific for a protein from one organism is mixed with the same protein from a different organism. Remember that in Section 12.2.1 we learned that antibodies are immunoglobulin proteins that help to protect the body against invasion by bacteria, viruses and other unwanted substances by binding to these ‘antigens’. Proteins also act as antigens, so if human β-globin, for example, is injected into a rabbit then the rabbit makes an antibody that binds specifically to that protein. The antibody will also cross-react with β-globins from other vertebrates, because these β-globins have similar structures to the human version. The degree of cross-reactivity depends on how similar the β-globin being tested is to the human protein, providing the similarity data used in the phylogenetic analysis.

  • Protein electrophoresis is used to compare the electrophoretic properties, and hence degree of similarity, of proteins from different organisms. This technique has proved useful for comparing closely related species and variations between members of a single species.

  • DNA-DNA hybridization data are obtained by hybridizing DNA samples from the two organisms being compared. The DNA samples are denatured and mixed together so that hybrid molecules form. The stability of these hybrid molecules depends on the degree of similarity between the nucleotide sequences of the two DNAs, and is measured by determining the melting temperature (see Figure 5.8), a stable hybrid having a higher melting temperature than a less stable one. The melting temperatures obtained with DNAs from different pairs of organisms provide the data used in the phylogenetic analysis.

By the end of the 1960s these indirect methods had been supplemented with an increasing number of protein sequence studies (e.g. Fitch and Margoliash, 1967) and during the 1980s DNA-based phylogenetics began to be carried out on a large scale. Protein sequences are still used today in some contexts, but DNA has now become by far the predominant molecule. This is mainly because DNA yields more phylogenetic information than protein: the nucleotide sequences of a pair of homologous genes have a higher information content than the amino acid sequences of the corresponding proteins, because mutations that result in synonymous changes alter the DNA sequence but do not affect the amino acid sequence (Figure 16.2). Entirely novel information can also be obtained by DNA sequence analysis because variability in both the coding and non-coding regions of the genome can be examined. The ease with which DNA samples for sequence analysis can be prepared by PCR (Section 4.3) is another key reason behind the predominance of DNA in modern molecular phylogenetics.

Figure 16.2

DNA yields more phylogenetic information than protein. The two DNA sequences differ at three positions but the amino acid sequences differ at only one position. These positions are indicated by green dots. Two of the nucleotide substitutions are therefore (more...)
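
The comparison summarized in Figure 16.2 can be tried out with a short calculation. The Python sketch below is an illustration only: it translates two aligned, in-frame coding sequences with the standard genetic code and counts the nucleotide and amino acid differences. The two sequences and the function names are invented for this example and are not the ones shown in the figure.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def compare_coding_sequences(seq1, seq2):
    """Return (nucleotide differences, amino acid differences)."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, equal in length and in frame")
    nt_diffs = sum(a != b for a, b in zip(seq1, seq2))
    aa_diffs = sum(a != b for a, b in zip(translate(seq1), translate(seq2)))
    return nt_diffs, aa_diffs


# Hypothetical sequences: three nucleotide substitutions, two of them synonymous.
print(compare_coding_sequences("ATGGCTCTGAAA", "ATGGCCTTGAGA"))  # (3, 1)
```

In this invented pair the DNA comparison supplies three informative differences, whereas the protein comparison supplies only one.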

As well as DNA sequences, molecular phylogenetics also makes use of DNA markers such as RFLPs, SSLPs and SNPs (Section 5.2.2), particularly for intraspecific studies such as those aimed at understanding migrations of prehistoric human populations (Section 16.3.2). Later in this chapter we will consider various examples of the use of both DNA sequences and DNA markers in molecular phylogenetics, but first we must make a more detailed study of the methodology used in this area of genome research.

16.2. The Reconstruction of DNA-based Phylogenetic Trees

The objective of most phylogenetic studies is to reconstruct the tree-like pattern that describes the evolutionary relationships between the organisms being studied. Before examining the methodology for doing this we must first take a closer look at a typical tree in order to familiarize ourselves with the basic terminology used in phylogenetic analysis.

16.2.1. The key features of DNA-based phylogenetic trees

A typical phylogenetic tree is shown in Figure 16.3A. This tree could have been reconstructed from any type of comparative data, but as we are interested in DNA sequences we will assume that the tree shows the relationships between four homologous genes, called A, B, C and D. The topology of this tree comprises four external nodes, each representing one of the four genes that we have compared, and two internal nodes representing ancestral genes. The lengths of the branches indicate the degree of difference between the genes represented by the nodes. The degree of difference is calculated when the sequences are compared, as described in Section 16.2.2.

Figure 16.3

Phylogenetic trees. (A) An unrooted tree with four external nodes. (B) The five rooted trees that can be drawn from the unrooted tree shown in part A. The positions of the roots are indicated by the numbers on the outline of the unrooted tree.

The tree in Figure 16.3A is unrooted, which means that it is only an illustration of the relationships between A, B, C and D and does not tell us anything about the series of evolutionary events that led to these genes. Five different evolutionary pathways are possible, each depicted by a different rooted tree, as shown in Figure 16.3B. To distinguish between them the phylogenetic analysis must include at least one outgroup, this being a homologous gene that we know is less closely related to A, B, C and D than these four genes are to each other. The outgroup enables the root of the tree to be located and the correct evolutionary pathway to be identified. The criteria used when choosing an outgroup depend very much on the type of analysis that is being carried out. As an example, let us say that the four homologous genes in our tree come from human, chimpanzee, gorilla and orangutan. We could then use as an outgroup the homologous gene from another primate, such as the baboon, which we know from paleontological evidence branched away from the lineage leading to human, chimpanzee, gorilla and orangutan before the time of the common ancestor of those four species (Figure 16.4).

Figure 16.4

The use of an outgroup to root a phylogenetic tree. The tree of human, chimpanzee, gorilla and orangutan genes is rooted with a baboon gene because we know from the fossil record that baboons split away from the primate lineage before the time of the (more...)

We refer to the rooted tree that we obtain by phylogenetic analysis as an inferred tree. This is to emphasize that it depicts the series of evolutionary events that are inferred from the data that were analyzed, and may not be the same as the true tree, the one that depicts the actual series of events that occurred. Sometimes we can be fairly confident that the inferred tree is the true tree, but most phylogenetic data analyses are prone to uncertainties which are likely to result in the inferred tree differing in some respects from the true tree. In Section 16.2.2 we will look at the various methods used to assign degrees of confidence to the branching pattern in an inferred tree, and later in the chapter we will examine some of the controversies that have arisen as a result of the imprecise nature of phylogenetic analysis.

Gene trees are not the same as species trees

The tree shown in Figure 16.4 illustrates a common type of molecular phylogenetics project, where the objective is to use a gene tree, reconstructed from comparisons between the sequences of orthologous genes (those derived from the same ancestral sequence; see page 196), to make inferences about the evolutionary history of the species from which the genes are obtained. The assumption is that the gene tree, based on molecular data with all its advantages, will be a more accurate and less ambiguous representation of the species tree than that obtainable by morphological comparisons. This assumption is often correct, but it does not mean that the gene tree is the same as the species tree. For that to be the case, the internal nodes in the gene and species trees would have to be precisely equivalent. However, they are not equivalent, because:

  • An internal node in a gene tree represents the divergence of an ancestral gene into two genes with different DNA sequences: this occurs by mutation (Figure 16.5A).

  • An internal node in a species tree represents a speciation event (Figure 16.5B): this occurs by the population of the ancestral species splitting into two groups that are unable to interbreed, for example, because they are geographically isolated.

Figure 16.5

The difference between a gene tree and a species tree.

The important point is that these two events - mutation and speciation - are not expected to occur at the same time. For example, the mutation event could precede the speciation. This would mean that, to begin with, both alleles of the gene are present in the unsplit population of the ancestral species (Figure 16.6). When the population split occurs, it is likely that both alleles will still be present in each of the two resulting groups. After the split, the new populations evolve independently. One possibility is that the results of random genetic drift (see Box 16.3) lead to one allele being lost from one population and the other being lost from the other population. This establishes the two separate genetic lineages that we infer from phylogenetic analysis of the gene sequences present in the modern species resulting from the continued evolution of the two populations.

Figure 16.6

Mutation might precede speciation, giving an incorrect time for a speciation event if a molecular clock is used. See the text for details. Based on Li (1997).

Box 16.3

Genes in populations. New alleles and haplotypes appear in a population because of mutations that occur in the reproductive cells of individual organisms. This means that many genes are polymorphic, two or more alleles being present in the population (more...)

How do these considerations affect the equivalence of the gene and species trees? There are various implications, two of which are as follows:

  • If a molecular clock (Section 16.2.2) is used to date the time at which the gene divergence took place, then it cannot be assumed that this is also the time of the speciation event. If the node being dated is ancient, say 50 million or more years ago, then the error may not be noticeable. But if the speciation event is recent, as when primates are being compared, then the date for the gene divergence might be significantly different from that for the speciation event.

  • If the first speciation event is quickly followed by a second speciation event in one of the two resulting populations, then the branching order of the gene tree might be different from that of the species tree. This can occur if the genes in the modern species are derived from alleles that had already appeared before the first of the two speciation events, as illustrated in Figure 16.7.

Figure 16.7

A gene tree can have a different branching order from a species tree. In this example, the gene has undergone two mutations in the ancestral species, the first mutation giving rise to the ‘blue’ allele and the second to the ‘green’ (more...)

16.2.2. Tree reconstruction

In this section we will look at how tree reconstruction is carried out with DNA sequences, concentrating on the four steps in the procedure:

  • Aligning the DNA sequences and obtaining the comparative data that will be used to reconstruct the tree;

  • Converting the comparative data into a reconstructed tree;

  • Assessing the accuracy of the reconstructed tree;

  • Using a molecular clock to assign dates to branch points within the tree.

Sequence alignment is the essential preliminary to tree reconstruction

The data used in reconstruction of a DNA-based phylogenetic tree are obtained by comparing nucleotide sequences. These comparisons are made by aligning the sequences so that nucleotide differences can be scored. This is the critical part of the entire enterprise because if the alignment is incorrect then the resulting tree will definitely not be the true tree.

The first issue to consider is whether the sequences being aligned are homologous. If they are homologous then they must, by definition, be derived from a common ancestral sequence (Section 7.2.1) and so there is a sound basis for the phylogenetic study. If they are not homologous then they do not share a common ancestor. The phylogenetic analysis will find a common ancestor because the methods used for tree reconstruction always produce a tree of some description, even if the data are completely erroneous, but the resulting tree will have no biological relevance. With some DNA sequences - for example, the β-globin genes of different vertebrates - there is no difficulty in being sure that the sequences being compared are homologous, but this is not always the case, and one of the commonest errors that arises during phylogenetic analysis is the inadvertent inclusion of a non-homologous sequence.

Once it has been established that two DNA sequences are indeed homologous, the next step is to align the sequences so that homologous nucleotides can be compared. With some pairs of sequences this is a trivial exercise (Figure 16.8A), but it is not so easy if the sequences are relatively dissimilar and/or have diverged by the accumulation of insertions and deletions as well as point mutations. Insertions and deletions cannot be distinguished when pairs of sequences are compared so we refer to them as indels. Placing indels at their correct positions is often the most difficult part of sequence alignment (Figure 16.8B).

Figure 16.8

Sequence alignment. (A) Two sequences that have not diverged to any great extent can be aligned easily by eye. (B) A more complicated alignment in which it is not possible to determine the correct position for an indel. If errors in indel placement are (more...)

Some pairs of sequences can be aligned reliably by eye. For more complex pairs, alignment might be possible by the dot matrix method (Figure 16.9). The two sequences are written out on the x- and y-axes of a graph, and dots placed in the squares of the graph paper at positions corresponding to identical nucleotides in the two sequences. The alignment is indicated by a diagonal series of dots, broken by empty squares where the sequences have nucleotide differences, and shifting from one column to another at places where indels occur.

Figure 16.9

The dot matrix technique for sequence alignment. The correct alignment stands out because it forms a diagonal of continuous dots, broken at point mutations and shifting to a different diagonal at indels.
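
The dot matrix method is simple enough to sketch in a few lines of Python. The version below is a minimal illustration, with invented sequences and function names: it marks every position at which the two sequences carry the same nucleotide, so the alignment appears as a diagonal run of asterisks against a background of chance matches.

```python
def dot_matrix(seq_x, seq_y):
    """Return a grid of booleans: grid[row][col] is True when
    seq_y[row] and seq_x[col] are the same nucleotide."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Draw the grid as text, with seq_x along the top and seq_y down the side."""
    lines = ["  " + seq_x]
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(y + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Hypothetical sequences: the second lacks the T at position 4 of the first,
# so the diagonal should shift by one column at that point.
print(render("ACGTACGG", "ACGACGG"))
```

Because single nucleotides match by chance one time in four, practical dot plots usually score short windows rather than single positions. That refinement is omitted here to keep the sketch short.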

More rigorous mathematical approaches to sequence alignment have also been devised. The first of these is the similarity approach (Needleman and Wunsch, 1970), which aims to maximize the number of matched nucleotides - those that are identical in the two sequences. The complementary approach is the distance method (Waterman et al., 1976), in which the objective is to minimize the number of mismatches. Often the two procedures will identify the same alignment as being the best one.
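
As an illustration of the similarity approach, the following Python sketch carries out a global alignment by dynamic programming in the manner of Needleman and Wunsch. The scoring values (+1 for a match, -1 for a mismatch, -1 for each indel position) are assumptions chosen for the example, not values taken from the original paper, and the function name is our own.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b.

    Returns (aligned_a, aligned_b, score), with '-' marking indel positions.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Hypothetical sequences differing by a single indel.
print(needleman_wunsch("ACGT", "AGT"))  # ('ACGT', 'A-GT', 2)
```

When several alignments share the best score, this sketch reports only one of them. As Figure 16.8B shows, such ties are exactly the cases in which indel placement is uncertain.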

Usually the comparison involves more than just two sequences, meaning that a multiple alignment is required. This can rarely be done effectively with pen and paper so, as in all steps in a phylogenetic analysis, a computer program is used. For multiple alignments, Clustal is a popular choice (Jeanmougin et al., 1998). Clustal and other software packages for phylogenetic analysis are described in Technical Note 16.1.

Technical Note 16.1

Phylogenetic analysis. Software packages for construction of phylogenetic trees. Few sets of DNA sequences are simple enough to be converted into phylogenetic trees entirely by hand. Virtually all research in this area is carried out by computer with (more...)

Converting alignment data into a phylogenetic tree

Once the sequences have been aligned accurately, an attempt can be made to reconstruct the phylogenetic tree. To date nobody has devised a perfect method for tree reconstruction, and several different procedures are used routinely. Comparative tests have been run with artificial data, for which the true tree is known, but these have failed to identify any particular method as being better than any of the others (Felsenstein, 1988).

The main distinction between the different tree-building methods is the way in which the multiple sequence alignment is converted into numerical data that can be analyzed mathematically in order to reconstruct the tree. The simplest approach is to convert the sequence information into a distance matrix, which is simply a table showing the evolutionary distances between all pairs of sequences in the dataset (Figure 16.10). The evolutionary distance is calculated from the number of nucleotide differences between a pair of sequences and is used to establish the lengths of the branches connecting these two sequences in the reconstructed tree.

Figure 16.10

A simple distance matrix. The matrix shows the evolutionary distance between each pair of sequences in the alignment. In this example the evolutionary distance is expressed as the number of nucleotide differences per nucleotide site for each sequence (more...)
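
A distance matrix of this kind is easily computed. The Python sketch below is a minimal example that expresses the evolutionary distance as the number of nucleotide differences per site compared. The alignment is invented, and the decision to skip positions containing an indel is an assumption of the sketch rather than a rule stated in the text.

```python
def p_distance(seq1, seq2):
    """Proportion of compared sites at which two aligned sequences differ.

    Positions where either sequence has an indel ('-') are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must come from the same alignment")
    compared = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in compared) / len(compared)


def distance_matrix(alignment):
    """Return matrix[x][y] for every pair of sequence names in the alignment."""
    names = list(alignment)
    return {x: {y: p_distance(alignment[x], alignment[y]) for y in names}
            for x in names}


# Hypothetical alignment of four short sequences.
alignment = {
    "seq1": "AGGCCAAGCC",
    "seq2": "AGGCAAAGAC",
    "seq3": "AGCCAAAGAC",
    "seq4": "TGCCATAGAC",
}
for name, row in distance_matrix(alignment).items():
    print(name, [round(row[other], 2) for other in alignment])
```

The simple proportion of differences underestimates the true number of substitutions when sequences are very divergent, because repeated changes at the same site are counted only once.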

The neighbor-joining method (Saitou and Nei, 1987) is a popular tree-building procedure that uses the distance matrix approach. To begin the reconstruction, it is initially assumed that there is just one internal node from which branches leading to all the DNA sequences radiate in a star-like pattern (Figure 16.11A). This is virtually impossible in evolutionary terms but the pattern is just a starting point. Next, a pair of sequences is chosen at random, removed from the star, and attached to a second internal node, connected by a branch to the center of the star, as shown in Figure 16.11B. The distance matrix is then used to calculate the total branch length in this new ‘tree’. The sequences are then returned to their original positions and another pair attached to the second internal node, and again the total branch length is calculated. This operation is repeated until all the possible pairs have been examined, enabling the combination that gives the tree with the shortest total branch length to be identified. This pair of sequences will be neighbors in the final tree; in the interim, they are combined into a single unit, creating a new star with one branch fewer than the original one. The whole process of pair selection and tree-length calculation is now repeated so that a second pair of neighboring sequences is identified, and then repeated again so that a third pair is located, and so on. The result is a complete reconstructed tree.

Figure 16.11

Manipulations carried out when using the neighbor-joining method for tree reconstruction. See the text for details.
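
The pair-by-pair procedure described above can be sketched in Python. Rather than rebuilding the star for every trial pair, the version below uses the Q-criterion that is common in software implementations: for each pair it computes (n - 2) times the pairwise distance minus the summed distances of each member to all other nodes, and joins the pair with the lowest value. To the best of our understanding this selects the same pair as the shortest-total-branch-length test, but the sketch should be read as an illustration under that assumption. Branch lengths are not calculated, and the distances are invented.

```python
from itertools import combinations


def neighbor_joining(names, pair_distances):
    """Return an unrooted topology as nested tuples.

    names: list of sequence labels.
    pair_distances: dict mapping (label1, label2) tuples to distances.
    """
    d = {frozenset(pair): value for pair, value in pair_distances.items()}
    nodes = list(names)
    while len(nodes) > 3:
        n = len(nodes)
        # Sum of distances from each node to every other node in the star.
        total = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
                 for x in nodes}

        def q(pair):
            i, j = pair
            return (n - 2) * d[frozenset(pair)] - total[i] - total[j]

        i, j = min(combinations(nodes, 2), key=q)
        joined = (i, j)  # the pair is combined into a single unit
        for k in nodes:
            if k != i and k != j:
                d[frozenset((joined, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))]
                    - d[frozenset((i, j))])
        nodes = [k for k in nodes if k != i and k != j] + [joined]
    return tuple(nodes)


# Hypothetical distances that fit a tree in which A pairs with B and C with D.
distances = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
             ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
print(neighbor_joining(["A", "B", "C", "D"], distances))
# ('C', 'D', ('A', 'B'))
```

The result is read as an unrooted tree: A and B are neighbors, and their common internal node is connected to C and D.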

The advantage of the neighbor-joining method is that the data handling is relatively easy to carry out, largely because the information content of the multiple alignment has been reduced to its simplest form. The disadvantage is that some of the information is lost, in particular that pertaining to the identities of the ancestral and derived nucleotides (equivalent to ancestral and derived character states, defined in Box 16.1) at each position in the multiple alignment. The maximum parsimony method (Fitch, 1977) takes account of this information, utilizing it to recreate the series of nucleotide changes that resulted in the pattern of variation revealed by the multiple alignment. The assumption, possibly erroneous, is that evolution follows the shortest possible route and that the correct phylogenetic tree is therefore the one that requires the minimum number of nucleotide changes to produce the observed differences between the sequences. Trees are therefore constructed at random and the number of nucleotide changes that they involve calculated until all possible topologies have been examined and the one requiring the smallest number of steps identified. This is presented as the most likely inferred tree.
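
To show what counting the nucleotide changes involves, the following Python sketch scores candidate trees with the set-based counting procedure usually attributed to Fitch. It is a minimal illustration: the trees are written as nested pairs of sequence names, the alignment is invented, and only the three possible topologies for four sequences are compared.

```python
def fitch(tree, site):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    tree: a sequence name, or a pair (left_subtree, right_subtree).
    site: dict mapping sequence name -> nucleotide at this column.
    """
    if isinstance(tree, str):
        return {site[tree]}, 0
    left_states, left_changes = fitch(tree[0], site)
    right_states, right_changes = fitch(tree[1], site)
    shared = left_states & right_states
    if shared:
        # The two lineages can share a state, so no extra change is needed.
        return shared, left_changes + right_changes
    # Otherwise at least one substitution must have occurred here.
    return left_states | right_states, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Total minimum number of nucleotide changes over all columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical four-sequence alignment.
alignment = {"A": "AGAG", "B": "AGCG", "C": "CTAG", "D": "CTCG"}
candidates = [(("A", "B"), ("C", "D")),
              (("A", "C"), ("B", "D")),
              (("A", "D"), ("B", "C"))]
for tree in candidates:
    print(tree, tree_length(tree, alignment))
# The first topology needs 4 changes, the second 5 and the third 6.
```

With these data the tree grouping A with B and C with D is the most parsimonious. With more sequences the same scoring function applies, but the list of candidate topologies grows very rapidly.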

The maximum parsimony method is more rigorous than the neighbor-joining method, but this increase in rigor inevitably extends the amount of data handling that is involved. This is a significant problem because the number of possible trees that must be scrutinized increases rapidly as more sequences are added to the dataset. With just five sequences there are only 15 possible unrooted trees, but for ten sequences there are 2 027 025 unrooted trees and for 50 sequences the number is approximately 3 × 10^74, a figure often likened to the number of atoms in the universe (Eernisse, 1998). Even with a high-speed computer it is not possible to check every one of these trees in a reasonable time, if at all, so often the maximum parsimony method is unable to carry out a comprehensive analysis. The same is true with many of the other more sophisticated methods for tree reconstruction.
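
These numbers follow from a simple product. For n sequences the number of unrooted, fully bifurcating trees is 3 × 5 × 7 × ... × (2n - 5), and each unrooted tree has 2n - 3 branches on which a root could be placed. The Python sketch below evaluates both quantities; the function names are our own.

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n sequences (n >= 3)."""
    if n < 3:
        raise ValueError("at least three sequences are needed")
    count = 1
    for factor in range(3, 2 * n - 4, 2):  # 3, 5, ..., 2n - 5
        count *= factor
    return count


def count_rooted_trees(n):
    """Each unrooted tree has 2n - 3 branches, any of which can carry the root."""
    return count_unrooted_trees(n) * (2 * n - 3)


print(count_unrooted_trees(5))    # 15
print(count_unrooted_trees(10))   # 2027025
print(count_rooted_trees(4) // count_unrooted_trees(4))  # 5 rooted trees per unrooted tree
print(f"{count_unrooted_trees(50):.1e}")
```

The third line reproduces the five rooted trees that can be drawn from the four-sequence unrooted tree in Figure 16.3.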

Assessing the accuracy of a reconstructed tree

The limitations to the methods used in phylogenetic reconstruction lead inevitably to questions about the veracity of the resulting trees. Statistical tests of the accuracy of a reconstructed tree have been devised (Hillis, 1997; Whelan et al., 2001) but these are necessarily complex because a tree is geometric rather than numeric, and the accuracy of one part of the topology may be greater or lesser than the accuracy of the other parts.

The routine method for assigning confidence limits to different branch points within a tree is to carry out a bootstrap analysis. To do this we need a second multiple alignment that is different from, but equivalent to, the real alignment. This new alignment is built up by taking columns, at random, from the real alignment, as illustrated in Figure 16.12. The new alignment therefore comprises sequences that are different from the original, but it has a similar pattern of variability. This means that when we use the new alignment in tree reconstruction we do not simply reproduce the original analysis, but we should obtain the same tree.

Figure 16.12

Constructing a new multiple alignment in order to bootstrap a phylogenetic tree. The new alignment is built up by taking columns at random from the real alignment. Note that the same column can be sampled more than once.
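
The resampling step can be sketched in Python as follows. This is a minimal example with an invented alignment: columns are drawn at random, with replacement, until the new alignment has the same length as the real one. The function name and the use of a seeded random number generator are choices made for the illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build a pseudo-replicate by sampling alignment columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence.
    rng: a random.Random instance, so that results can be reproduced.
    """
    length = len(next(iter(alignment.values())))
    # The same column index may be drawn more than once.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


# Hypothetical alignment.
alignment = {"seq1": "AGGCCAAGCC", "seq2": "AGGCAAAGAC", "seq3": "AGCCAAAGAC"}
replicate = bootstrap_alignment(alignment, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

Whole columns are sampled, never individual nucleotides, so every column in the replicate is a genuine column of the real alignment.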

In practice, 1000 new alignments are created so 1000 replicate trees are reconstructed. A bootstrap value can then be assigned to each internal node in the original tree, this value being the number of times that the branch pattern seen at that node was reproduced in the replicate trees. If the bootstrap value is greater than 700/1000 then we can assign a reasonable degree of confidence to the topology at that particular internal node.
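
Tallying the bootstrap values is a counting exercise, sketched below in Python. As a simplifying assumption the trees are treated as rooted nested tuples and an internal node is identified by the set of sequences that descend from it. The replicate trees are invented, and only ten are used so that the arithmetic is easy to follow.

```python
def clades(tree):
    """Return the set of leaf groups, one per internal node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    walk(tree)
    return found


def bootstrap_support(original, replicates):
    """Count, for each internal node of the original tree, how many
    replicate trees contain a node with the same set of descendants."""
    counts = {clade: 0 for clade in clades(original)}
    for replicate in replicates:
        present = clades(replicate)
        for clade in counts:
            if clade in present:
                counts[clade] += 1
    return counts


# Hypothetical result: 8 of 10 replicates agree with the original tree.
original = (("A", "B"), ("C", "D"))
replicates = [original] * 8 + [(("A", "C"), ("B", "D"))] * 2
for clade, count in bootstrap_support(original, replicates).items():
    print(sorted(clade), f"{count}/{len(replicates)}")
```

Here the nodes uniting A with B and C with D would each receive a value of 8/10, which is above the 70% level suggested in the text.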

Molecular clocks enable the time of divergence of ancestral sequences to be estimated

When we carry out a phylogenetic analysis our primary objective is to infer the pattern of the evolutionary relationships between the DNA sequences that are being compared. These relationships are revealed by the topology of the tree that is reconstructed. Often we also have a secondary objective: to discover when the ancestral sequences diverged to give the modern sequences. This information is interesting in the context of genome evolution, as we discovered when we looked at the evolutionary history of the human globin genes (see Figure 15.9). The information is even more interesting on occasions when we are able to equate a gene tree with a species tree, because now the times at which the ancestral sequences diverged approximate to the dates of speciation events.

To assign dates to branch points in a phylogenetic tree we must make use of a molecular clock. The molecular clock hypothesis, first proposed in the early 1960s, states that nucleotide substitutions (or amino acid substitutions if protein sequences are being compared) occur at a constant rate. This means that the degree of difference between two sequences can be used to assign a date to the time at which their ancestral sequence diverged. However, to be able to do this the molecular clock must be calibrated so that we know how many nucleotide substitutions to expect per million years. Calibration is usually achieved by reference to the fossil record. For example, fossils suggest that the most recent common ancestor of humans and orangutans lived 13 million years ago. To calibrate the human molecular clock we therefore compare human and orangutan DNA sequences to determine the amount of nucleotide substitution that has occurred, and then divide this figure by 13, followed by 2, to obtain a rate of substitution per million years (Figure 16.13).

Figure 16.13

Calculating a human molecular clock. The number of substitutions is determined for a pair of homologous genes from human and orangutan: call this number ‘x’. The number of substitutions per lineage is therefore x/2, and the number per (more...)

At one time it was thought that there might be a universal molecular clock that applied to all genes in all organisms (Ochman and Wilson, 1987). Now we realize that molecular clocks are different in different organisms and are variable even within a single organism (Strauss, 1999). The differences between organisms might be the result of generation times, because a species with a short generation time is likely to accumulate DNA replication errors at a faster rate than a species with a longer generation time. This probably explains the observation that rodents have a faster molecular clock than primates (Gu and Li, 1992). Within an organism the variations are as follows:

  • Non-synonymous substitutions occur at a slower rate than synonymous ones. This is because a mutation that results in a change in the amino acid sequence of a protein might be deleterious to the organism, so the accumulation of non-synonymous mutations in the population is reduced by the processes of natural selection (see Box 16.3). This means that when gene sequences in two species are compared, there are usually fewer non-synonymous than synonymous substitutions.

  • The molecular clock for mitochondrial genes is faster than that for genes in the nuclear genome. This is probably because mitochondria lack many of the DNA repair systems that operate on nuclear genes (Section 14.2; Gibbons, 1998).
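
The distinction between synonymous and non-synonymous substitutions can be made concrete with a small Python sketch that compares two aligned coding sequences codon by codon. It assumes the standard genetic code and gap-free, in-frame sequences, and it makes one deliberate simplification: a codon pair is classified as a whole, so codons differing at more than one position are counted as a single change. The two sequences are invented for the example.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def count_substitutions(seq1, seq2):
    """Return (synonymous, non_synonymous) codon differences between two
    aligned, gap-free, in-frame coding sequences."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned, in frame and equal in length")
    synonymous = non_synonymous = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 == c2:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            synonymous += 1      # same amino acid, different codon
        else:
            non_synonymous += 1  # amino acid has changed
    return synonymous, non_synonymous

# Invented sequences: the second codon differs silently (CTT and CTC both
# specify leucine); the fourth changes aspartic acid (GAT) to glutamic acid (GAA).
a = "ATGCTTAAAGATTGG"
b = "ATGCTCAAAGAATGG"
print(count_substitutions(a, b))
```

Published methods for estimating synonymous and non-synonymous rates also correct for the number of sites of each type and for multiple substitutions; the sketch above does neither.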

Despite these complications, molecular clocks have become an immensely valuable adjunct to tree reconstruction, as we will see in the next section when we look at some typical molecular phylogenetics projects.

16.3. The Applications of Molecular Phylogenetics

Molecular phylogenetics has grown in stature since the start of the 1990s, largely because of the development of more rigorous methods for tree building, combined with the explosion of DNA sequence information obtained initially by PCR analysis and more recently by genome projects. The importance of molecular phylogenetics has also been enhanced by the successful application of tree reconstruction and other phylogenetic techniques to some of the more perplexing issues in biology. In this final section we will survey some of these successes.

16.3.1. Examples of the use of phylogenetic trees

First, we will consider two projects that illustrate the various ways in which conventional tree reconstruction is being used in modern molecular biology.

DNA phylogenetics has clarified the evolutionary relationships between humans and other primates

Darwin (1871) was the first biologist to speculate on the evolutionary relationships between humans and other primates. His view - that humans are closely related to the chimpanzee, gorilla and orangutan - was controversial when it was first proposed and fell out of favor, even among evolutionists, in the following decades. Indeed, biologists were among the most ardent advocates of an anthropocentric view of our place in the animal world (Goodman, 1962).

From studies of fossils, paleontologists had concluded prior to 1960 that chimpanzees and gorillas are our closest relatives but that the relationship was distant: the split leading to humans on the one hand and chimpanzees and gorillas on the other was thought to have occurred some 15 million years ago. The first detailed molecular data, obtained by immunological studies in the 1960s (Goodman, 1962; Sarich and Wilson, 1967), confirmed that humans, chimpanzees and gorillas do indeed form a single clade (see Box 16.2) but suggested that the relationship is much closer, a molecular clock indicating that this split occurred only 5 million years ago. This was one of the first attempts to apply a molecular clock to phylogenetic data and the result was, quite naturally, treated with some suspicion. In fact, an acrimonious debate opened up between paleontologists, who believed in the ancient split indicated by the fossil evidence, and biologists, who had more confidence in the recent date suggested by the molecular data. This debate was eventually ‘won’ by the molecular biologists, whose view that the split occurred about 5 million years ago became generally accepted.

Box 16.2

Terminology for molecular phylogenetics. The text includes definitions of most of the important terms used in molecular phylogenetics. Here are a few additional definitions that you may find useful when reading research articles on this subject: Operational (more...)

As more and more molecular data were obtained, the difficulties in establishing the exact pattern of the evolutionary events that led to humans, chimpanzees and gorillas became apparent. Comparisons of the mitochondrial genomes of the three species by restriction mapping (Section 5.3.1) and DNA sequencing suggested that the chimpanzee and gorilla are more closely related to each other than either is to humans (Figure 16.14A), whereas DNA-DNA hybridization data supported a closer relationship between humans and chimpanzees (Figure 16.14B). The reason for these conflicting results is the close similarity between DNA sequences in the three species, the differences being less than 3% for even the most divergent regions of the genomes (Section 15.4). This makes it difficult to establish relationships unambiguously.

Figure 16.14

Different interpretations of the evolutionary relationships between humans, chimpanzees and gorillas. See the text for details. Abbreviation: Myr, million years.

The solution to the problem has been to make comparisons between as many different genes as possible and to target those loci that are expected to show the greatest amount of dissimilarity. By 1997, 14 different molecular datasets had been obtained, including sequences of variable loci such as pseudogenes and non-coding sequences (Ruvolo, 1997). Analysis of these datasets confirmed that the chimpanzee is the closest relative to humans, with our lineages diverging 4.6–5.0 million years ago. The gorilla is a slightly more distant cousin, its lineage having diverged from the human-chimp one between 0.3 and 2.8 million years earlier (Figure 16.14C).

The origins of AIDS

The global epidemic of acquired immune deficiency syndrome (AIDS) has touched everyone's lives. AIDS is caused by human immunodeficiency virus 1 (HIV-1), a retrovirus (Section 2.4.2) that infects cells involved in the immune response. The demonstration in the early 1980s that HIV-1 is responsible for AIDS was quickly followed by speculation about the origin of the disease. Speculation centered around the discovery that similar immunodeficiency viruses are present in primates such as the chimpanzee, sooty mangabey, mandrill and various monkeys. These simian immunodeficiency viruses (SIVs) are not pathogenic in their normal hosts but it was thought that if one had become transferred to humans then within this new species the virus might have acquired new properties, such as the ability to cause disease and to spread rapidly through the population.

Retrovirus genomes accumulate mutations relatively quickly because reverse transcriptase, the enzyme that copies the RNA genome contained in the virus particle into the DNA version that integrates into the host genome (see Section 2.4.2), lacks an efficient proofreading activity (Section 13.2.2) and so tends to make errors when it carries out RNA-dependent DNA synthesis. This means that the molecular clock runs rapidly in retroviruses, and genomes that diverged quite recently display sufficient nucleotide dissimilarity for a phylogenetic analysis to be carried out. Even though the evolutionary period we are interested in is less than 100 years, HIV and SIV genomes contain sufficient data for their relationships to be inferred by phylogenetic analysis.

The starting point for this phylogenetic analysis is RNA extracted from virus particles. RT-PCR (see Technical Note 4.4) is therefore used to convert the RNA into a DNA copy and then to amplify the DNA so that sufficient amounts for nucleotide sequencing are obtained. Comparison between virus DNA sequences has resulted in the reconstructed tree shown in Figure 16.15 (Leitner et al., 1996; Wain-Hobson, 1998). This tree has a number of interesting features. First it shows that different samples of HIV-1 have slightly different sequences, the samples as a whole forming a tight cluster, almost a star-like pattern, that radiates from one end of the unrooted tree. This star-like topology implies that the global AIDS epidemic began with a very small number of viruses, perhaps just one, which have spread and diversified since entering the human population. The closest relative to HIV-1 among primates is the SIV of chimpanzees, the implication being that this virus jumped across the species barrier between chimps and humans and initiated the AIDS epidemic. However, this epidemic did not begin immediately: a relatively long uninterrupted branch links the center of the HIV-1 radiation with the internal node leading to the relevant SIV sequence, suggesting that after transmission to humans, HIV-1 underwent a latent period when it remained restricted to a small part of the global human population, presumably in Africa, before beginning its rapid spread to other parts of the world. Other primate SIVs are less closely related to HIV-1, but one, the SIV from sooty mangabey, clusters in the tree with the second human immunodeficiency virus, HIV-2. It appears that HIV-2 was transferred to the human population independently of HIV-1, and from a different simian host. HIV-2 is also able to cause AIDS, but has not, as yet, become globally epidemic.

Figure 16.15

The phylogenetic tree reconstructed from HIV and SIV genome sequences. The AIDS epidemic is due to the HIV-1M type of immunodeficiency virus. ZR59 is positioned near the root of the star-like pattern formed by genomes of this type. Based on Wain-Hobson (more...)

An intriguing addition to the HIV/SIV tree was made in 1998 when an HIV-1 isolate from a blood sample taken in 1959 from an African male was sequenced (Zhu et al., 1998). The RNA was highly fragmented and only a short DNA sequence could be obtained, but this was sufficient for the sequence to be placed on the phylogenetic tree (see Figure 16.15). This sequence, called ZR59, attaches to the tree by a short branch that emerges from near the center of the HIV-1 radiation. The positioning indicates that the ZR59 sequence represents one of the earliest versions of HIV-1 and shows that the global spread of HIV-1 was already underway by 1959. A later and more comprehensive analysis of HIV-1 sequences has suggested that the spread began in the period between 1915 and 1941, with a best estimate of 1931 (Korber et al., 2000). Pinning down the date in this way has enabled epidemiologists to begin an investigation of the historic and social conditions that might have been responsible for the start of the AIDS epidemic.

16.3.2. Molecular phylogenetics as a tool in the study of human prehistory

Now we will turn our attention to the use of molecular phylogenetics in intraspecific studies: the study of the evolutionary history of members of the same species. We could choose any one of several different organisms to illustrate the approaches and applications of intraspecific studies, but many people look on Homo sapiens as the most interesting organism so we will investigate how molecular phylogenetics is being used to deduce the origins of modern humans and the geographic patterns of their recent migrations in the Old and New Worlds.

Intraspecific studies require highly variable genetic loci

In any application of molecular phylogenetics, the genes chosen for analysis must display variability in the organisms being studied. If there is no variability then there is no phylogenetic information. This presents a problem in intraspecific studies because the organisms being compared are all members of the same species and so share a great deal of genetic similarity, even if the species has split into populations that interbreed only intermittently. This means that the DNA sequences that are used in the phylogenetic analysis must be the most variable ones that are available. In humans there are three main possibilities.

  • Multiallelic genes, such as members of the HLA family (Section 5.2.1), which exist in many different sequence forms.

  • Microsatellites, which evolve not through mutation but by replication slippage (Section 14.1.1). Cells do not appear to have any repair mechanism for reversing the effects of replication slippage, so new microsatellite alleles are generated relatively frequently.

  • Mitochondrial DNA which, as mentioned in Section 16.2.2, accumulates nucleotide substitutions relatively rapidly because mitochondria lack many of the repair systems that slow down the molecular clock in the human nucleus. The mitochondrial DNA variants present in a single species are called haplotypes.

It is important to note that what is critical to the application of these loci in phylogenetic analysis is not their potential for change but the fact that different alleles or haplotypes of the locus coexist in the population as a whole. The loci are therefore polymorphic (see Box 16.3) and information pertaining to the relationships between different individuals can be obtained by comparing the combinations of alleles and/or haplotypes that those individuals possess.

The origins of modern humans - out of Africa or not?

It seems reasonably certain that the origin of humans lies in Africa because it is here that all of the oldest pre-human fossils have been found. The paleontological evidence reveals that hominids first moved outside of Africa over 1 million years ago, but these were not modern humans; they were members of an earlier species called Homo erectus. These were the first hominids to become geographically dispersed, eventually spreading to all parts of the Old World.

The events that followed the dispersal of Homo erectus are controversial. From comparisons using fossil skulls and bones, paleontologists have concluded that the Homo erectus populations that became located in different parts of the Old World gave rise to the modern human populations of those areas by a process called multiregional evolution (Figure 16.16A). There may have been a certain amount of interbreeding between humans from different geographic regions, but, to a large extent, these various populations remained separate throughout their evolutionary history.

Figure 16.16

Two competing hypotheses for the origins of modern humans. (A) The multiregional hypothesis states that Homo erectus left Africa over 1 million years ago and then evolved into modern humans in different parts of the Old World. (B) The Out of Africa hypothesis (more...)

Doubts about the multiregional hypothesis were first raised by re-interpretations of the fossil evidence and were subsequently brought to a head by publication in 1987 of a phylogenetic tree reconstructed from mitochondrial RFLP data obtained from 147 humans representing populations from all parts of the World (Cann et al., 1987). The tree (Figure 16.17) confirmed that the ancestors of modern humans lived in Africa but suggested that they were still there about 200 000 years ago. This inference was made by applying the mitochondrial molecular clock to the tree, which showed that the ancestral mitochondrial DNA, the one from which all modern mitochondrial DNAs are descended, existed between 140 000 and 290 000 years ago. The tree showed that this mitochondrial genome was located in Africa, so the person who possessed it, the so-called mitochondrial Eve (she had to be female because mitochondrial DNA is only inherited through the female line), must have been African.

Figure 16.17

Phylogenetic tree reconstructed from mitochondrial RFLP data obtained from 147 modern humans. The ancestral mitochondrial DNA is inferred to have existed in Africa because of the split in the tree between the seven modern African mitochondrial genomes (more...)

The discovery of mitochondrial Eve prompted a new scenario for the origins of modern humans. Rather than modern humans evolving in parallel throughout the world, as suggested by the multiregional hypothesis, the Out of Africa hypothesis states that Homo sapiens originated in Africa, with members of this species then moving into the rest of the Old World between 100 000 and 50 000 years ago and displacing the descendants of Homo erectus that they encountered (see Figure 16.16B).

Such a radical change in thinking inevitably did not go unchallenged. When the RFLP data obtained by Cann et al. (1987) were examined by other molecular phylogeneticists it became clear that the original computer analysis had been flawed, and that several quite different trees could be reconstructed from the data, some of which did not have a root in Africa. These criticisms were countered by more detailed mitochondrial DNA sequence datasets, most of which are compatible with a relatively recent African origin and so support the Out of Africa hypothesis rather than multiregional evolution (e.g. Ingman et al., 2000). An interesting complement to ‘mitochondrial Eve’ has been provided by studies of the Y chromosome, which suggest that ‘Y chromosome Adam’ also lived in Africa some 200 000 years ago (Pääbo, 1999). Of course, this Eve and Adam were not equivalent to the biblical characters and were by no means the only people alive at that time: they were simply the individuals who carried the ancestral mitochondrial DNA and Y chromosomes that gave rise to all the mitochondrial DNAs and Y chromosomes in existence today. The important point is that these ancestral DNAs were still in Africa well after the spread of Homo erectus into Eurasia.

The mitochondrial DNA and Y chromosome studies appear to provide strong evidence in support of the Out of Africa theory. But complications have arisen from studies of nuclear genes other than those on the Y chromosome. For example, β-globin sequences give a much earlier date, 800 000 years ago, for the common ancestor (Harding et al., 1997), and studies of an X chromosome gene, PDHA1, place the ancestral sequence at 1 900 000 years ago (Harris and Hey, 1999). Molecular anthropologists are currently debating the significance of these results (Pääbo, 1999). More datasets, and hopefully some sort of Grand Synthesis, are eagerly awaited.

The patterns of more recent migrations into Europe are also controversial

By whatever evolutionary pathway, modern humans were present throughout most of Europe by 40 000 years ago. This is clear from the fossil and archaeological records. The next controversial issue in human prehistory concerns whether these populations were displaced about 30 000 years later by other humans migrating into Europe from the Middle East.

The question centers on the process by which agriculture spread into Europe. The transition from hunting and gathering to farming occurred in the Middle East some 9000–10 000 years ago, when early Neolithic villagers began to cultivate crops such as wheat and barley. After becoming established in the Middle East, farming spread into Asia, Europe and North Africa. By searching for evidence of agriculture at archaeological sites, for example by looking for the remains of cultivated plants or for implements used in farming, it has been possible to trace the expansion of farming along two routes through Europe, one around the coast to Italy and Spain and the second through the Danube and Rhine valleys to northern Europe (Figure 16.18).

Figure 16.18

The spread of agriculture from the Middle East to Europe. The dark-green area is the ‘Fertile Crescent’, the area of the Middle East where many of today's crops - wheat, barley, etc. - grow wild and where these plants are thought to have (more...)

How did farming spread? The simplest explanation is that farmers migrated from one part of Europe to another, taking with them their implements, animals and crops, and displacing the indigenous pre-agricultural communities that were present in Europe at that time. This wave of advance model was initially favored by geneticists because of the results of a large-scale phylogenetic analysis of the allele frequencies for 95 nuclear genes in populations from across Europe (Cavalli-Sforza, 1998). Such a large and complex dataset cannot be analyzed in any meaningful way by conventional tree building but instead has to be examined by more advanced statistical methods, ones based more in population biology than phylogenetics. One such procedure is principal component analysis, which attempts to identify patterns in a dataset corresponding to the uneven geographic distribution of alleles, these uneven distributions possibly being indicative of past population migrations. The most striking pattern within the European dataset, accounting for about 28% of the total genetic variation, is a gradation of allele frequencies across Europe (Figure 16.19). This pattern implies that a migration of people occurred either from the Middle East to northeast Europe, or in the opposite direction. Because the former coincides with the expansion of farming, as revealed by the archaeological record, this first principal component was looked upon as providing strong support for the wave of advance model.
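
To give a feel for what a first principal component is, the following Python sketch extracts it from a small table of allele frequencies. Everything in it is an assumption made for illustration: the five populations and three alleles are invented, and power iteration is used only because it needs no external library; it is not the method used in the study described above. The function returns the direction of greatest variation, the fraction of the total variance that it accounts for, and the position (score) of each population along it.

```python
import math

def first_principal_component(table, iterations=200):
    """Return (axis, fraction_of_variance, scores) for the first principal
    component of a populations-by-alleles frequency table."""
    n, m = len(table), len(table[0])
    means = [sum(row[j] for row in table) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in table]
    # Covariance between alleles across populations.
    cov = [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(m)]
           for j in range(m)]
    # Power iteration converges on the eigenvector with the largest eigenvalue.
    axis = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        nxt = [sum(cov[j][k] * axis[k] for k in range(m)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in nxt))
        axis = [x / norm for x in nxt]
    eigenvalue = sum(axis[j] * cov[j][k] * axis[k]
                     for j in range(m) for k in range(m))
    fraction = eigenvalue / sum(cov[j][j] for j in range(m))
    scores = [sum(c[j] * axis[j] for j in range(m)) for c in centered]
    return axis, fraction, scores

# Invented frequencies of three alleles in five populations sampled along
# a hypothetical transect; the first two alleles grade in opposite directions.
frequencies = [
    [0.90, 0.10, 0.50],
    [0.70, 0.30, 0.52],
    [0.50, 0.50, 0.48],
    [0.30, 0.70, 0.51],
    [0.10, 0.90, 0.49],
]

axis, fraction, scores = first_principal_component(frequencies)
print([round(x, 3) for x in axis])
print(round(fraction, 3))
print([round(s, 3) for s in scores])
```

In this artificial dataset almost all of the variation lies along one gradient, so the scores rise or fall steadily along the transect. In a real dataset the first component accounts for a much smaller share, such as the 28% quoted above. Note also that the sign of the axis is arbitrary, which mirrors the point made in the text that the gradation by itself cannot reveal the direction of migration.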

Figure 16.19

A genetic gradation across modern Europe. See the text for details.

The analysis looked convincing but two criticisms were raised. The first was that the data provided no indication of when the inferred migration took place, so the link between the first principal component and the spread of agriculture was based solely on the pattern of the allele gradation, not on any complementary evidence relating to the period when this gradation was set up. The second criticism arose because of the results of a second study of European human populations, one that did include a time dimension (Richards et al., 1996). This study looked at mitochondrial DNA haplotypes in 821 individuals from various populations across Europe. It failed to confirm the gradation of allele frequencies detected in the nuclear DNA dataset, and instead suggested that European populations have remained relatively static over the last 20 000 years. A refinement of this work led to the discovery that eleven mitochondrial DNA haplotypes predominate in the modern European population, each with a different time of origin, thought to indicate the date at which the haplotype entered Europe (Figure 16.20; Richards et al., 2000). The most ancient haplotype, called U, first appeared in Europe approximately 50 000 years ago, coinciding with the period when, according to the archaeological record, the first modern humans moved into the continent as the ice sheets withdrew to the north at the end of the last major glaciation. The youngest haplotypes, J and T1, which at 9000 years in age could correspond to the origins of agriculture, are possessed by just 8.3% of the modern European population, suggesting that the spread of farming into Europe was not the huge wave of advance indicated by the principal component study. Instead, it is now thought that farming was brought into Europe by a smaller group of ‘pioneers’ who interbred with the existing pre-farming communities rather than displacing them.

Figure 16.20

The eleven major European mitochondrial haplotypes. The calculated time of origin for each haplotype is shown, the closed and open parts of each bar indicating different degrees of confidence. The percentages refer to the proportions of the modern European (more...)

Box 16.1

Neandertal DNA. Sequence analysis of ‘ancient DNA’ extracted from a fossil bone between 30 000 and 100 000 years old provides support for the Out of Africa hypothesis. Neandertals are extinct hominids who lived in Europe between 300 000 (more...)

Prehistoric human migrations into the New World

Finally, we will examine the quite different set of controversies surrounding the patterns of human migration that led to the first entry of people into the New World. There is no evidence for the spread of Homo erectus into the Americas, so it is presumed that humans did not enter the New World until after modern Homo sapiens had evolved in, or migrated into, Asia. The Bering Strait between Asia and North America is quite shallow and if the sea level dropped by 50 meters it would be possible to walk across from one continent to the other. It is believed that this was the route taken by the first humans to venture into the New World (Figure 16.21).

Figure 16.21

The route by which humans first entered the New World.

The sea was 50 meters or more below its current level for most of the last Ice Age, between about 60 000 and 11 000 years ago, but for most of this time the route would have been impassable because of the build-up of ice. Also, the northern parts of America would have been arctic during much of this period, providing few game animals for the migrants to hunt and very little wood with which they could make fires. These considerations, together with the absence of archaeological evidence of humans in North America before 11 500 years ago, led to the adoption of ‘about 12 000 years ago’ as the date for the first entry of humans into the New World. Recent discoveries of evidence of human occupation at sites dating to 20 000 years ago, in both North and South America, have prompted some rethinking, but it is still generally assumed that a substantial population migration into North America, possibly the one from which all modern Native Americans are descended, occurred about 12 000 years ago.

What information does molecular phylogenetics provide? The first relevant studies were carried out in the late 1980s using RFLP data. These indicated that Native Americans are descended from Asian ancestors and identified four distinct mitochondrial haplotypes among the population as a whole (Wallace et al., 1985; Schurr et al., 1990). Linguistic studies had already shown that American languages can be divided into three different groupings, suggesting that modern Native Americans are descended from three sets of people, each speaking a different language. The inference from the molecular data that there may in fact have been four ancestral populations was not too disquieting. The first significant dataset of mitochondrial DNA sequences was obtained in 1991, enabling the rigorous application of a molecular clock. This indicated that the migrations into North America occurred between 15 000 and 8000 years ago (Ward et al., 1991), which is consistent with the archaeological evidence that humans were absent from the continent before 11 500 years ago.

These early phylogenetic analyses confirmed, or at least were not too discordant with, the complementary evidence provided by archaeological and linguistic studies. However, the additional molecular data that have been acquired since 1992 have tended to confuse rather than clarify the issue. For example, different datasets have provided a variety of estimates for the number of migrations into North America. The most comprehensive analysis, based on mitochondrial DNA (Forster et al., 1996), puts this figure at just one migration, and suggests that it occurred between 25 000 and 20 000 years ago, much earlier than the traditional date. Studies of Y chromosomes have assigned a date of approximately 22 500 years ago to the ‘Native American Adam’, the carrier of the Y chromosome that is ancestral to most, if not all, of the Y chromosomes in modern Native Americans (De Mendoza and Braginski, 1999). The implication from these studies is that humans became established in North America about 20 000 years ago, much earlier than indicated by the archaeological and early genetic evidence. This hypothesis is still being evaluated by other molecular biologists and archaeologists.

Study Aids For Chapter 16

Key terms

Give short definitions of the following terms:

  • Apomorphy

  • Character states

  • Maximum parsimony

  • Multiple hits

  • Multiple substitutions

  • Plesiomorphy

Self study questions

  1. Describe how taxonomy gradually led to phylogeny.

  2. Define the terms ‘phenetics’ and ‘cladistics’ and outline the important features of each of these approaches to phylogenetics.

  3. List the various types of molecular data that have been used in phylogenetics, indicating how each type of data is obtained.

  4. Explain why DNA sequences are the principal type of molecular data used in modern molecular phylogenetics.

  5. Draw and annotate a typical unrooted tree.

  6. Explain how an outgroup can be used to convert an unrooted tree into a rooted one.

  7. Distinguish between the terms ‘inferred tree’, ‘true tree’, ‘gene tree’ and ‘species tree’. Explain why gene trees and species trees are not equivalent.

  8. Describe how alignment of DNA sequences is used as a preliminary to tree reconstruction.

  9. Outline the key features of the neighbor-joining and maximum parsimony methods of tree reconstruction.

  10. How is the accuracy of a reconstructed tree assessed?

  11. Describe how a molecular clock is calibrated and explain why there is no universal molecular clock.

  12. Outline how molecular phylogenetics has contributed to our understanding of the evolutionary relationships between humans and other primates.

  13. Describe how molecular phylogenetics has been used to investigate the origins of AIDS.

  14. What types of variable loci are used when molecular phylogenetics is applied to intraspecific studies?

  15. Distinguish between the multiregional and Out of Africa hypotheses for the origins of modern humans. What evidence is there for either hypothesis?

  16. Describe how molecular phylogenetics has been used to trace the migrations of modern humans into Europe.

  17. Describe the current models for the migration of modern humans into the New World.

Problem-based learning

  1. Can a gene tree ever be equivalent to a species tree?

  2. How reliable are molecular clocks?

  3. Write a report on the science described in Ruvolo M (1997) Molecular phylogeny of the hominoids: inferences from multiple independent DNA sequence data sets. Mol. Biol. Evol., 14, 248–265.

  4. Evaluate the genetic evidence in support of the Out of Africa hypothesis.

  5. Explore how molecular phylogenetics has been used to study the mitochondrial DNA haplotypes present in modern European populations. What equivalent work has been done on mitochondrial DNA haplotypes among Native Americans?

  6. Phylogenetic studies of mitochondrial DNA assume that this genome is inherited through the maternal line and that there is no recombination between maternal and paternal genomes. Assess the validity of this assumption and describe how the hypotheses regarding the origins and migrations of modern humans would be affected if recombination between maternal and paternal genomes was shown to occur. Possible starting points for your research into this problem are: Ladoukakis ED and Zouros E (2001) Recombination in animal mitochondrial DNA: evidence from published sequences. Mol. Biol. Evol., 18, 2127–2131; Meunier J and Eyre-Walker A (2001) The correlation between linkage disequilibrium and distance: implications for recombination in hominid mitochondria. Mol. Biol. Evol., 18, 2132–2135.

References

  1. Cann RL, Stoneking M, Wilson AC. Mitochondrial DNA and human evolution. Nature. (1987);325:31–36. [PubMed: 3025745]

  2. Cavalli-Sforza LL. The DNA revolution in population genetics. Trends Genet. (1998);14:60–65. [PubMed: 9520599]

  3. Darwin C (1859) The Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. Penguin Books, London. [PMC free article: PMC5184128] [PubMed: 30164232]

  4. Darwin C (1871) The Descent of Man, and Selection in Relation to Sex. Princeton University Press, Princeton, NJ.

  5. De Mendoza DH, Braginski R. Y chromosomes point to Native American Adam. Science. (1999);283:1439–1440. [PubMed: 10206869]

  6. Eernisse DJ. A brief guide to phylogenetic software. Trends Genet. (1998);14:473–475. [PubMed: 9825676]

  7. Felsenstein J. Phylogenies from molecular sequences: inference and reliability. Ann. Rev. Genet. (1988);22:521–565. [PubMed: 3071258]

  8. Felsenstein J. PHYLIP - Phylogeny Inference Package (Version 3.20). Cladistics. (1989);5:164–166.

  9. Fitch WM. On the problem of discovering the most parsimonious tree. Am. Nat. (1977);111:223–257.

  10. Fitch WM, Margoliash E. Construction of phylogenetic trees. A method based on mutation distances as estimated from cytochrome c sequences is of general applicability. Science. (1967);155:279–284. [PubMed: 5334057]

  11. Forster P, Harding R, Torroni A, Bandelt HJ. Origin and evolution of native American mtDNA variation: a reappraisal. Am. J. Hum. Genet. (1996);59:935–945. [PMC free article: PMC1914796] [PubMed: 8808611]

  12. Gibbons A. Calibrating the molecular clock. Science. (1998);279:28–29. [PubMed: 9441404]

  13. Goodman M. Immunochemistry of the primates and primate evolution. Ann. N. Y. Acad. Sci. (1962);102:219–234. [PubMed: 13949097]

  14. Gu X, Li W-H. Higher rates of amino acid substitution in rodents than in humans. Mol. Phylogenet. Evol. (1992);1:211–214. [PubMed: 1342937]

  15. Harding RM, Fullerton SM, Griffiths RC. et al. Archaic African and Asian lineages in the genetic ancestry of modern humans. Am. J. Hum. Genet. (1997);60:772–789. [PMC free article: PMC1712470] [PubMed: 9106523]

  16. Hennig W (1966) Phylogenetic Systematics. University of Illinois Press, Urbana, IL.

  17. Hillis DM. Biology recapitulates phylogeny. Science. (1997);276:218–219. [PubMed: 9132943]

  18. Ingman M, Kaessmann H, Pääbo S, Gyllensten U. Mitochondrial genome variation and the origin of modern humans. Nature. (2000);408:708–713. [PubMed: 11130070]

  19. Jeanmougin F, Thompson JD, Gouy M, Higgins DG, Gibson TJ. Multiple sequence alignment with Clustal X. Trends Biochem Sci. (1998);23:403–405. [PubMed: 9810230]

  20. Korber B, Muldoon M, Theiler J. et al. Timing the ancestor of the HIV-1 pandemic strains. Science. (2000);288:1789–1796. [PubMed: 10846155]

  21. Leitner T, Escanilla D, Franzen C, Uhlen M, Albert J. Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis. Proc. Natl Acad. Sci. USA. (1996);93:10864–10869. [PMC free article: PMC38248] [PubMed: 8855273]

  22. Li W-H (1997) Molecular Evolution. Sinauer, Sunderland, MA.

  23. Lincoln R, Boxshall G and Clark P (1998) A Dictionary of Ecology, Evolution and Systematics, 2nd edition. Cambridge University Press, Cambridge.

  24. Michener CD, Sokal RR. A quantitative approach to a problem in classification. Evolution. (1957);11:130–162.

  25. Needleman SB, Wunsch CD. A general method applicable to the search of similarities in the amino acid sequences of two proteins. J. Mol. Biol. (1970);48:443–453. [PubMed: 5420325]

  26. Nuttall GHF (1904) Blood Immunity and Blood Relationship. Cambridge University Press, Cambridge.

  27. Ochman H, Wilson AC. Evolution in bacteria: evidence for a universal substitution rate in cellular genomes. J. Mol. Evol. (1987);26:74–86. [PubMed: 3125340]

  28. Pääbo S. Human evolution. Trends Genet. (1999);15:M13–M16.

  29. Richards M, Côrte-Real H, Forster P. et al. Paleolithic and Neolithic lineages in the European mitochondrial gene pool. Am. J. Hum. Genet. (1996);59:185–203. [PMC free article: PMC1915109] [PubMed: 8659525]

  30. Richards M, Macaulay V, Hickey E. et al. Tracing European founder lineages in the Near Eastern mtDNA pool. Am. J. Hum. Genet. (2000);67:1251–1276. [PMC free article: PMC1288566] [PubMed: 11032788]

  31. Ruvolo M. Molecular phylogeny of the hominoids: inferences from multiple independent DNA sequence data sets. Mol. Biol. Evol. (1997);14:248–265. [PubMed: 9066793]

  32. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. (1987);4:406–425. [PubMed: 3447015]

  33. Sarich VM, Wilson AC. Immunological time scale for hominid evolution. Science. (1967);158:1200–1203. [PubMed: 4964406]

  34. Schurr TG, Ballinger SW, Gan YY. et al. Amerindian mitochondrial DNAs have rare Asian mutations at high frequencies suggesting they are derived from four primary maternal lineages. Am. J. Hum. Genet. (1990);46:613–623. [PMC free article: PMC1683611] [PubMed: 1968708]

  35. Strauss E. Can mitochondrial clocks keep time? Science. (1999);283:1435–1438. [PubMed: 10206868]

  36. Swofford DL (1993) PAUP: Phylogenetic Analysis Using Parsimony. Illinois Natural History Survey, Champaign, IL.

  37. Wain-Hobson S. 1959 and all that. Nature. (1998);391:531–532. [PubMed: 9468129]

  38. Wallace DC, Garrison K, Knowler WC. Dramatic founder effects in Amerindian mitochondrial DNAs. Am. J. Phys. Anthropol. (1985);68:149–155. [PubMed: 2998196]

  39. Ward RH, Frazier BL, Dew-Jager K, Pääbo S. Extensive mitochondrial diversity within a single Amerindian tribe. Proc. Natl Acad. Sci. USA. (1991);88:8720–8724. [PMC free article: PMC52581] [PubMed: 1681540]

  40. Waterman MS, Smith TF, Beyer WA. Some biological sequence metrics. Adv. Math. (1976);20:367–387.

  41. Whelan S, Liò P, Goldman N. Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet. (2001);17:262–272. [PubMed: 11335036]

  42. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS. (1997);13:555–556. [PubMed: 9367129]

  43. Zhu T, Korber BT, Nahmias AJ, Hooper E, Sharp PM, Ho DD. An African HIV-1 sequence from 1959 and implications for the origin of the epidemic. Nature. (1998);391:594–597. [PubMed: 9468138]

Further Reading

  1. Avise JC (1994) Molecular Markers, Natural History and Evolution. Chapman & Hall, New York. —A detailed description of the use of molecular data in studies of evolution.

  2. Doolittle WF. Phylogenetic classification and the universal tree. Science. (1999);284:2124–2128. —Discusses the strengths and weaknesses of molecular phylogenetics as a means of inferring species trees. [PubMed: 10381871]

  3. Futuyma DJ (1998) Evolutionary Biology, 3rd edition. Sinauer, Sunderland, MA.

  4. Hall BG (2001) Phylogenetic Trees Made Easy: A How-To Manual for Molecular Biologists. Sinauer, Sunderland, MA.

  5. Hartl DL and Clark AG (1997) Principles of Population Genetics, 3rd edition. Sinauer, Sunderland, MA. —A good introduction to population genetics, emphasizing the relevance of this subject to evolutionary biology.

  6. Hillis DM, Moritz C and Mable BK (eds) (1996) Molecular Systematics, 2nd edition. Sinauer, Sunderland, MA. —Comprehensive coverage of techniques for phylogenetic tree reconstruction.

  7. Li WH and Graur D (1991) Fundamentals of Molecular Evolution. Sinauer, Sunderland, MA. —Another excellent introduction.

  8. Nei M. Phylogenetic analysis in molecular evolutionary genetics. Ann. Rev. Genet. (1996);30:371–403. —Brief review of tree-building techniques. [PubMed: 8982459]

  9. Thornton JW, Desalle R. Gene family evolution and homology: genomics meets phylogenetics. Annu. Rev. Genomics Hum. Genet. (2000);1:41–73. —Stresses the changes that need to occur in molecular phylogenetics in order to deal with genomic sequences. [PubMed: 11701624]

What is included in the construction of phylogenetic trees and cladograms?

A phylogenetic tree may be built using morphological (body shape), biochemical, behavioral, or molecular features of species or other groups. In building a tree, we organize species into nested groups based on shared derived traits (traits different from those of the group's ancestor).
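The grouping step described above can be illustrated with a minimal Python sketch. The trait table is invented for illustration, and the helper names `clades_from_traits` and `is_nested` are hypothetical; the sketch only checks that the groups defined by shared derived traits nest inside one another, which is the pattern a tree requires. It is not a full tree-building method.

```python
from itertools import combinations

# Hypothetical example data: each derived trait maps to the taxa that show it.
DERIVED_TRAITS = {
    "vertebrae": {"lamprey", "trout", "lizard", "mouse", "human"},
    "jaws": {"trout", "lizard", "mouse", "human"},
    "four limbs": {"lizard", "mouse", "human"},
    "hair": {"mouse", "human"},
}


def clades_from_traits(traits):
    """Return the distinct taxon groups defined by shared derived traits, largest first."""
    groups = {frozenset(taxa) for taxa in traits.values()}
    return sorted(groups, key=len, reverse=True)


def is_nested(clades):
    """True if every pair of groups is either nested or disjoint (a tree-like hierarchy)."""
    return all(a <= b or b <= a or not (a & b) for a, b in combinations(clades, 2))


clades = clades_from_traits(DERIVED_TRAITS)
for clade in clades:
    print(sorted(clade))
print("nested:", is_nested(clades))
```

If two traits defined overlapping but non-nested groups, `is_nested` would return `False`, signalling conflicting characters that a real analysis would have to resolve, for example by parsimony.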

What is used to construct phylogeny?

Any DNA, RNA, or protein sequence can be used to build a phylogenetic tree, but DNA sequences are the most widely used.
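As a rough illustration of how sequence data feed into a tree, the sketch below computes the proportion of differing sites (the p-distance) for each pair of aligned DNA sequences. The sequences and the function name `p_distance` are invented for illustration. Distance-based methods such as neighbor-joining start from a matrix of this kind, usually after correcting for multiple substitutions, which this sketch does not attempt.

```python
from itertools import combinations

# Hypothetical, already-aligned sequences of equal length (invented for illustration).
ALIGNMENT = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACTTAT",
    "D": "TCGAACTTGT",
}


def p_distance(seq1, seq2):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return differences / len(seq1)


# Pairwise distance matrix stored as {(name1, name2): distance}.
distances = {
    (a, b): p_distance(ALIGNMENT[a], ALIGNMENT[b])
    for a, b in combinations(sorted(ALIGNMENT), 2)
}

for pair, dist in sorted(distances.items()):
    print(pair, round(dist, 2))

# The pair with the smallest distance is the most likely to share a recent ancestor.
closest = min(distances, key=distances.get)
print("closest pair:", closest)
```

Gap characters are treated here like any other symbol; real analyses usually handle gaps explicitly.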

What is a cladogram in a phylogenetic tree?

The terms evolutionary tree, phylogenetic tree, and cladogram are often used interchangeably to mean the same thing, that is, the evolutionary relationships among taxa. The term dendrogram is also used interchangeably with cladogram, although there are subtle differences between the two.

What kind of data can be used to construct a phylogenetic tree?

Phylogenetic trees are constructed using data derived from studies of homologous traits, analogous traits, and molecular evidence, in which relationships are established by comparing polymeric molecules (DNA, RNA, and proteins).