MBE Advance Access published October 1, 2010Title: A Bayesian phylogenetic method to estimate unknown sequence agesFor consideration as a Research ArticleBeth Shapiro1, Simon Y. W. Ho2, Alexei J. Drummond3, Marc A. Suchard4 , Oliver G.Pybus5 and Andrew Rambaut6,71.Department of Biology, The Pennsylvania State University, University Park, PA16801 USA2.School of Biological Sciences, University of Sydney, Sydney NSW 2006,Australia3.Department of Computer Science and Bioinformatics Institute, University ofAuckland, Auckland, New Zealand4.Departments of Biomathematics, Biostatistics and Human Genetics, University ofCalifornia, Los Angeles, CA, 90095, United States5.Department of Zoology, University of Oxford, Oxford OX1 3PS, United Kingdom6.Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, EH9 3JT,United Kingdom7.Fogarty International Center, National Institutes of Health, Bethesda, MD, USAAuthor for correspondence:Beth ShapiroTel: 1 814 863 [email protected] The Author 2010. Published by Oxford University Press on behalf of the Society for Molecular Biologyand Evolution. All rights reserved. For permissions, please e-mail: [email protected]
Keywords: heterochronous sequences, ancient DNA, molecular clock, viral evolution,measurably evolving populationsRunning head: Bayesian estimation of unknown sequence agesAbstractHeterochronous data sets comprise molecular sequences sampled at different pointsin time. If the temporal range of the sampled sequences is large relative to the rate ofmutation, the sampling times can directly calibrate evolutionary rates to calendartime. Here, we extend this calibration process to provide a full probabilistic methodthat utilizes temporal information in heterochronous data sets to estimate samplingtimes (leaf-ages) for sequenced for which this information unavailable. Our methodis similar to relaxing the constraints of the molecular clock on specific lineages withina phylogenetic tree. Using a combination of synthetic and empirical data sets, wedemonstrate that the method estimates leaf-ages reliably and accurately. Potentialapplications of our approach include incorporating samples of uncertain orradiocarbon-infinite age into ancient DNA analyses, evaluating the temporal signal ina particular sequence or data set, and exploring the reliability of sequence ages thatare somehow contentious.
Introduction:The incorporation of temporal information into molecular phylogenetic and genealogicanalyses means that evolutionary processes can be investigated on a natural timescale ofyears, centuries or millennia. This is commonly achieved by one of two methods. If thesequences of interest are sampled at effectively the same time (‘isochronous’ data) then anevolutionary timescale can be calibrated by assigning a date, or date range, to one or moredivergence events in the tree. If data sets comprise sequences sampled at different timepoints (‘heterochronous’ data), these can be calibrated by fixing the age of each sequence(the leaves of the tree) to the known age of the specimen from which the sequence wasamplified (e.g. Rambaut 2000). Both approaches result in an estimated rate of evolutionand a corresponding phylogenetic timescale, which can then be used to test hypothesesabout the timing and nature of evolutionary and demographic events, such as dates ofdivergence among lineages, or changes in population size or structure (Drummond et al.2003).The two major sources of heterochronous sequence data are rapidly-evolving RNA andDNA viruses, whose high mutation rates enable the generation of phylogeneticallyinformative sequence diversity within historical time-frames (Drummond, Pybus,Rambaut 2003), and ancient DNA data isolated from preserved remains (Hofreiter et al.2001b) that may be up to several hundred thousands of years old (Willerslev et al. 2007).In both cases, the period over which sequences are isolated is sufficiently long relative tothe mutation rate to allow estimation of the evolutionary rate (Drummond et al. 2003).
The temporal information associated with RNA virus sequences typically represents theday, month, or year of sample isolation and/or storage (Taubenberger et al. 1997).Heterochronous viral data sets have been used to estimate rates of mutation for specificviruses (Jenkins et al. 2002), to investigate rates of evolution and adaptation within hosts(e.g. Lemey, Rambaut, Pybus 2006; Lemey et al. 2007), and to infer epidemic dynamicswithin and between populations of susceptible hosts (Rambaut 2000; Pybus, Rambaut2009).For ancient DNA data sets, ages of the genetic sequences are most often approximatedusing radiocarbon dates that are estimated from the same specimens from which the DNAsequences are amplified (e.g. Shapiro et al. 2004; Bunce et al. 2009; Campos et al. 2010).Dates from material associated with stratigraphic context have also been used ascalibrating information, however (e.g. Lambert et al. 2002; Valdiosera et al. 2008). Forancient DNA sequences, leaf-ages are normally assigned some number of thousands ofyears before the present (ka BP). Molecular clock analyses of these data have been used toidentify periods of population turnover (e.g. Hadly et al. 1998; Barnes et al. 2002;Hofreiter et al. 2004) to estimate divergence times within species (e.g. Shapiro et al. 2004;Debruyne et al. 2008; Stiller et al. 2010) and to correlate population-level changes ingenetic diversity with external events, such as climate change (e.g. Hadly et al. 2004;Chan, Anderson, Hadly 2006; Barnett et al. 2009; Campos et al. 2010). For example, therehas been considerable debate about the relative roles of climate change associated with thelast glacial maximum (LGM; ca 21 ka BP) and the appearance and increase in humanpopulations (ca 14 ka BP) in the recent disappearance of the North American megafauna(Alroy 2001; Barnosky et al. 2004; Stuart et al. 2004). Heterochronous data sets
comprising sequences directly dated to before, during and after these events allow explicittests of these alternative hypotheses.Despite significant technical and chemical pre-treatment advances in Accelerator MassSpectrometry (AMS) radiocarbon dating (Bronk Ramsey et al. 2004), the oldest samplesfor which finite radiocarbon dates can be routinely generated are around 50-55 ka BP (e.g.Barnett et al. 2009). The period 0-55 ka BP includes several specific, large-scaleenvironmental events that are likely to have affected the distribution and abundance ofplant and animal species. However, there are a number of circumstances under which theleaf ages of ancient DNA or viral sequences may be unknown, or at best, highly uncertain.First, ancient mitochondrial DNA sequences are routinely amplified from permafrostpreserved specimens older than the 50-55 ka BP radiocarbon limit. For example, nearly100 of the bison sequences reported in Shapiro et al (2004) were too old to be assignedfinite radiocarbon ages, and ancient DNA sequences have been reported that are perhapsas old as 800 ka BP (Willerslev et al 2007). In this case only censored temporalinformation is often available (i.e. age 55ka BP). Second, ancient DNA samples areroutinely recovered from situations in which the stratigraphic context provides calibratinginformation (e.g. Lambert et al. 2002; Coolen, Overmann 2007; Valdiosera et al. 2008) butonly within a wide range of uncertainty, such that assigning a specific mean or mediandate to such sequences is statistically inappropriate (Ho, Phillips 2009). Third, for rapidlyevolving viral sequences, the date of sampling may simply be unknown due to the loss orabsence of accurate archival information. Even if the viral sampling date is known to thenearest year, it may be important to know the isolation date more accurately. Fourth, itmay also be important to independently verify posited sampling dates due to their extremeage (Zhu et al. 1998; Taubenberger et al. 2005; Worobey et al. 2008) or because they are
in some way contentious (Sonoda et al. 2000; Coolen, Overmann 2007). Since frozen viralisolates do not accumulate mutations whilst in storage, a leaf-dating method also has thepotential to identify transmitting viruses that, after a period of storage, have been releasedinto the environment (Worobey 2008).Despite these obvious problems, few studies have attempted to estimate the unknown ageor sampling date of heterochronous sequences using molecular clock methods, and nonehave investigated the statistical reliability of such methods. Perhaps the most similaranalysis was that undertaken by Korber et al. (1998), who validated their molecular clockof HIV-1 by testing whether it could accurately predict the date of an ‘old’ isolate sampledin 1959. However, in that case the approach consists of the visual inspection of the fit to alinear regression of viral sampling date against genetic distance rather than a statisticalanalysis or estimation procedure.Here, we investigate a statistical framework for the estimation of leaf-dates usingmolecular clock models when the sample age or isolation date is either unknown or highlyuncertain. Following Drummond (2002) we estimate the age of individual DNAsequences using the temporal calibrating information from other sequences in the data set.Methodologically, the leaf-dating method is similar to relaxing the constraints of themolecular clock on specific lineages within a phylogenetic tree. Using a combination ofsimulations and empirical analyses, we show that leaf-ages can be estimated reliably andaccurately using our approach. While the analyses presented here only perform leaf-datingon one sequence within any given data set, the method can be readily extended to multiplesequences within the same analysis.
MethodsWe developed and implemented a leaf-dating method that estimates the age or date ofisolation of individual sequences within the Bayesian Markov chain Monte Carlo(MCMC) framework provided by the software package BEAST (Drummond, Rambaut2007). BEAST allows dates/times to be specified for each sequence in a sample alignmentand uses this information to estimate a timescale for the evolutionary history of thesample. The models implemented in BEAST accomplish this by fixing the external nodesof the tree (the leaves) to the specified dates and then sampling the times of the internalnodes of the tree from their posterior probability distribution using MCMC. The length ofeach branch in units of time is mapped to an expected number of substitutions per siteusing a vector of molecular evolutionary rates. The simplest model assigns the same singlerate to every branch (the strict molecular clock model). BEAST also implements methodsthat allow the evolutionary rate to vary among branches (relaxed molecular clock models)such that the vector of rates follows a specified parametric distribution (Drummond et al.2006). Under these models, BEAST can simultaneously infer the tree topology, the timesof the internal nodes, the rate of evolution and any parameters of the associated coalescentand substitution models (Drummond et al. 2002). As is required in Bayesian inference, allof these parameters are assigned one of a wide variety of possible prior probabilitydistributions. MCMC sampling is then used to obtain estimates of marginal posteriorprobability distributions for any parameters of interest.In previous molecular clock implementations all nodes in the tree are given dates/ages,that is, internal nodes are treated as unknown parameters whilst the dates/ages of the treeexternal nodes (leaves) are assumed to be known. To estimate the age of an individualsequence, we extend the framework introduced above to estimate the time associated with
the sequence, that is, the sequence’s leaf-age is treated as a random variable. Thus, anadditional parameter for the age of the external node is introduced, and is treatedidentically to the internal node age parameters in terms of proposals made by the MCMCkernel. See Drummond et al. (2006) for further specifications and parameterisations of themolecular clock models used here.1. Synthetic data setsTo explore the ability of our leaf-dating model to recover sample ages, we first estimatedthe dates of randomly chosen sequences within synthetic heterochronous data sets. Wegenerated sequence alignments of 1,000 nucleotides (nt) in length, each including 50 taxasampled at different times, by simulating sequence evolution down 1,000 random trees.We simulated sequences using a Jukes-Cantor model of nucleotide substitution under astrict molecular clock with a rate of 2.5 10-7 subs/site/year (Rambaut, Grassly 1997). Eachtree represents a random sample from the constant-size serial-sample coalescent model(Rodrigo et al. 1999), with a population size equal to 105 haploid individuals. The ages ofthe 50 individuals were 0, 0, 2000, 2000, , 48000, 48000 years. The evolutionary rateand sampling times used in these simulations are representative of those found in typicalanalyses of ancient mitochondrial DNA (mtDNA) data sets.For each of the 1,000 simulated data sets, a single leaf was chosen at random and itsknown date was removed. Each data set was analysed separately in BEAST using themethod outlined above and the age of undated leaf was estimated. This procedure thusrepresents a “leave-one-out cross validation” design and is an effective approach toexamining estimator performance. BEAST analyses were performed under the true model,
i.e. a strict clock, a constant-size coalescent model, and the Jukes-Cantor model ofnucleotide substitution. For each analysis, we ran a single MCMC chain for 5 milliongenerations, with samples drawn from the chain every 5000 generations, of which the first10% was discarded as burn-in.2. Empirical data setsWe selected two empirical heterochronous data sets for further validation of the leafdating method, enabling us to test our approach when the true evolutionary model isunknown