Bayes' Theorem

Let B1, B2, ..., Bk be a set of mutually exclusive and exhaustive states. Let A represent some event that happens. Then, the probability of Bi on the condition that A occurs is given by Bayes' Theorem as:

Pr(Bi | A) = Pr(A | Bi) Pr(Bi) / [ sum from j = 1 to k of Pr(A | Bj) Pr(Bj) ]

Example:

Pr(Bi) values are the prior probabilities. Let
B1 = House Finch
B2 = Western Tanager
B3 = Ladder-backed Woodpecker
Pr(B1) = 0.97, Pr(B2) = 0.02, Pr(B3) = 0.01
Pr(Bi) is the probability of seeing species i on campus during spring migration.

Let A = a report from a student of seeing a bird with a red head on campus during spring migration.

Pr(A | Bi) are the likelihoods (not to be confused with maximum likelihood). Let
Pr(A | B1) = 0.1, Pr(A | B2) = 0.9, Pr(A | B3) = 0.2
Pr(A | Bi) is the probability of a student giving report A if they have seen species Bi.

Pr(Bi | A) are the posterior probabilities. From Bayes' Theorem, the posterior probability of a bird being a House Finch if a student gives report A is:

Pr(B1 | A) = 0.1 * 0.97 / [0.1 * 0.97 + 0.9 * 0.02 + 0.2 * 0.01] = 0.097 / 0.117 = 0.829

The probability of the bird being a Western Tanager is:

Pr(B2 | A) = 0.9 * 0.02 / 0.117 = 0.018 / 0.117 = 0.154

The probability of the bird being a Ladder-backed Woodpecker is:

Pr(B3 | A) = 0.2 * 0.01 / 0.117 = 0.002 / 0.117 = 0.017

Bayes, Thomas (b. 1702, London - d. 1761, Tunbridge Wells, Kent), mathematician who first used probability inductively and established a mathematical basis for probability. He set down his findings on probability in "Essay Towards Solving a Problem in the Doctrine of Chances" (1763), published posthumously in the Philosophical Transactions of the Royal Society of London. He was a Presbyterian minister in Tunbridge Wells from 1731. It is thought that his election to the Royal Society might have been based on a tract of 1736 in which Bayes defended the views and philosophy of Sir Isaac Newton. A notebook of his exists, and includes a method of finding the time and place of conjunction of two planets, notes on weights and measures, a method of differentiation, and logarithms.
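The arithmetic in the example above can be checked with a short script (a sketch; the species names and probabilities are those given in the example):

```python
# Posterior probabilities for the red-headed-bird report, via Bayes' Theorem.
priors = {"House Finch": 0.97,
          "Western Tanager": 0.02,
          "Ladder-backed Woodpecker": 0.01}
likelihoods = {"House Finch": 0.1,
               "Western Tanager": 0.9,
               "Ladder-backed Woodpecker": 0.2}

# Pr(A) = sum over species of Pr(A | Bi) * Pr(Bi) -- the denominator, 0.117
evidence = sum(likelihoods[s] * priors[s] for s in priors)

# Pr(Bi | A) = Pr(A | Bi) * Pr(Bi) / Pr(A)
posteriors = {s: likelihoods[s] * priors[s] / evidence for s in priors}
for species, p in posteriors.items():
    print(f"{species}: {p:.3f}")
```

Running this reproduces the values above: 0.829, 0.154, and 0.017, which also shows that the common House Finch remains the most probable explanation despite the higher likelihood for the Western Tanager.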
Overview of Bayesian Analysis – most of the material is from SAS 9.2 Documentation

The most frequently used statistical methods are known as frequentist (or classical) methods. These methods assume that unknown parameters are fixed constants, and they define probability by using limiting relative frequencies. It follows from these assumptions that probabilities are objective and that you cannot make probabilistic statements about parameters because they are fixed.

Bayesian methods offer an alternative approach; they treat parameters as random variables and define probability as "degrees of belief" (that is, the probability of an event is the degree to which you believe the event is true). It follows from these postulates that probabilities are subjective and that you can make probability statements about parameters. The term "Bayesian" comes from the prevalent usage of Bayes' theorem, which was named after the Reverend Thomas Bayes, an eighteenth-century Presbyterian minister. Bayes was interested in solving the question of inverse probability: after observing a collection of events, what is the probability of one event?

Suppose you are interested in estimating a parameter θ from data y by using a statistical model described by a density p(y | θ). Bayesian philosophy states that θ cannot be determined exactly, and uncertainty about the parameter is expressed through probability statements and distributions. You can say that θ follows a normal distribution with a given mean and variance, if it is believed that this distribution best describes the uncertainty associated with the parameter. The following steps describe the essential elements of Bayesian inference:

1. A probability distribution for θ is formulated as π(θ), which is known as the prior distribution, or just the prior. The prior distribution expresses your beliefs (for example, on the mean, the spread, the skewness, and so forth) about the parameter before you examine the data.

2. Given the observed data y, you choose a statistical model p(y | θ) to describe the distribution of y given θ.

3. You update your beliefs about θ by combining information from the prior distribution and the data through the calculation of the posterior distribution, π(θ | y).

Simply put, Bayes' theorem tells you how to update existing knowledge with new information. You begin with a prior belief π(θ), and after learning information from data y, you change or update your belief about θ and obtain π(θ | y). These are the essential elements of the Bayesian approach to data analysis.

In theory, Bayesian methods offer simple alternatives to statistical inference: all inferences follow from the posterior distribution. In practice, however, you can obtain the posterior distribution with straightforward analytical solutions only in the most rudimentary problems. Most Bayesian analyses require sophisticated computations, including the use of simulation methods. You generate samples from the posterior distribution and use these samples to estimate the quantities of interest.

Prior Distributions

A prior distribution of a parameter is the probability distribution that represents your uncertainty about the parameter before the current data are examined. Multiplying the prior distribution and the likelihood function together leads to the posterior distribution of the parameter. You use the posterior distribution to carry out all inferences. You cannot carry out any Bayesian inference or perform any modeling without using a prior distribution.

Objective Priors versus Subjective Priors

Bayesian probability measures the degree of belief that you have in a random event. By this definition, probability is highly subjective. It follows that all priors are subjective priors. Not everyone agrees with this notion of subjectivity when it comes to specifying prior distributions. There has long been a desire to obtain results that are objectively valid.
Within the Bayesian paradigm, this can be somewhat achieved by using prior distributions that are "objective" (that is, that have a minimal impact on the posterior distribution). Such distributions are called objective or noninformative priors. However, while noninformative priors are very popular in some applications, they are not always easy to construct.
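The prior-to-posterior update described in the steps above can be sketched with a conjugate Beta-Binomial model, one of the rudimentary cases where the posterior has a closed form. This is an illustrative sketch: the flat Beta(1, 1) prior and the 7-successes-in-10-trials data are assumptions, not values from these notes.

```python
# Conjugate Beta-Binomial update: a Beta(a, b) prior combined with
# k successes in n binomial trials yields a Beta(a + k, b + n - k) posterior.

def beta_binomial_update(a, b, successes, trials):
    """Return the posterior Beta parameters after observing the data."""
    return a + successes, b + (trials - successes)

# Flat (noninformative) prior, then observe 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(1, 1, 7, 10)
posterior_mean = a_post / (a_post + b_post)  # mean of Beta(8, 4)
print(a_post, b_post, round(posterior_mean, 3))  # 8 4 0.667
```

Note how the posterior, Beta(8, 4), could itself serve as the prior for the next batch of observations, which is the sequential updating described above.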
Bayesian Inference

Bayesian inference about θ is primarily based on the posterior distribution of θ. There are various ways in which you can summarize this distribution. For example, you can report your findings through point estimates. You can also use the posterior distribution to construct hypothesis tests or probability statements.

If you know the distributional form of the posterior density of interest, you can report the exact posterior point estimates. When models become too difficult to analyze analytically, you have to use simulation algorithms, such as the Markov chain Monte Carlo (MCMC) method, to obtain posterior estimates.

Markov Chain Monte Carlo Method

The Markov chain Monte Carlo (MCMC) method is a general simulation method for sampling from posterior distributions and computing posterior quantities of interest. MCMC methods sample successively from a target distribution. Each sample depends on the previous one, hence the notion of the Markov chain. A Markov chain is a sequence of random variables, θ^1, θ^2, ..., θ^t, for which the random variable θ^t depends on all previous values only through its immediate predecessor, θ^(t-1). You can think of a Markov chain applied to sampling as a mechanism that traverses randomly through a target distribution without having any memory of where it has been. Where it moves next is entirely dependent on where it is now.

Algorithms that implement the MCMC method include the Metropolis, Metropolis-Hastings, and Gibbs sampler algorithms. (Note: Metropolis refers to the American physicist and computer scientist Nicholas C. Metropolis.)

Burn-In and Thinning

Burn-in refers to the practice of discarding an initial portion of a Markov chain sample so that the effect of initial values on the posterior inference is minimized. For example, suppose the target distribution is N(0, 1) and the Markov chain was started at a value far from 0. The chain might quickly travel to regions around 0 in a few iterations. However, including the early samples near the starting value in the posterior mean calculation can produce substantial bias in the mean estimate. In theory, if the Markov chain is run for an infinite amount of time, the effect of the initial values decreases to zero. In practice, you do not have the luxury of infinite samples. Instead, you assume that after n iterations the chain has reached its target distribution, and you throw away the early portion and use the good samples for posterior inference. The value of n is the burn-in number.

With some models you might experience poor mixing (or slow convergence) of the Markov chain. This can happen, for example, when parameters are highly correlated with each other. Poor mixing means that the Markov chain slowly traverses the parameter space and the chain has high dependence. High sample autocorrelation can result in biased Monte Carlo standard errors. A common strategy is to thin the Markov chain in order to reduce sample autocorrelations. You thin a chain by keeping every kth simulated draw from each sequence. You can safely use a thinned Markov chain for posterior inference as long as the chain converges. It is important to note that thinning a Markov chain can be wasteful because you are throwing away a (k - 1)/k fraction of all the posterior samples generated. You always get more precise posterior estimates if the entire Markov chain is used. However, other factors, such as computer storage or plotting time, might prevent you from keeping all samples.

Advantages and Disadvantages

Bayesian methods and classical methods both have advantages and disadvantages, and there are some similarities. When the sample size is large, Bayesian inference often provides results for parametric models that are very similar to the results produced by frequentist methods.
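The Metropolis sampling, burn-in, and thinning ideas described above can be sketched in a few lines. This is a minimal illustration, not production MCMC code: the N(0, 1) target, the deliberately bad starting value, the burn-in of 2000, and the thinning interval of 10 are all assumed settings.

```python
import math
import random

# Random-walk Metropolis sampler targeting a standard normal distribution.
def log_target(x):
    return -0.5 * x * x  # log density of N(0, 1), up to an additive constant

def metropolis(n_iter, start, step=1.0, seed=42):
    rng = random.Random(seed)
    x = start
    chain = [x]
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)  # symmetric random-walk proposal
        # Accept with probability min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, log_target(proposal) - log_target(x))):
            x = proposal
        chain.append(x)
    return chain

chain = metropolis(20000, start=50.0)  # start far from the target's bulk
burn_in = 2000                         # discard the first n = 2000 draws
thin = 10                              # keep every kth (here, 10th) draw
kept = chain[burn_in::thin]
print(len(kept), round(sum(kept) / len(kept), 2))
```

After discarding the burn-in, the retained draws average near 0, whereas including the early samples near the starting value of 50 would bias the mean estimate upward.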
Some advantages to using Bayesian analysis include the following:

- It provides a natural and principled way of combining prior information with data, within a solid decision theoretical framework. You can incorporate past information about a parameter and form a prior distribution for future analysis. When new observations become available, the previous posterior distribution can be used as a prior. All inferences logically follow from Bayes' theorem.

- It provides inferences that are conditional on the data and are exact, without reliance on asymptotic approximation. Small-sample inference proceeds in the same manner as if one had a large sample.
- It obeys the likelihood principle. If two distinct sampling designs yield proportional likelihood functions for θ, then all inferences about θ should be identical from these two designs. Classical inference does not in general obey the likelihood principle.

- It provides interpretable answers, such as "the true parameter θ has a probability of 0.95 of falling in a 95% credible interval."

- It provides a convenient setting for a wide range of models, such as hierarchical models and missing-data problems. MCMC, along with other numerical methods, makes computations tractable for virtually all parametric models.

There are also disadvantages to using Bayesian analysis:

- It does not tell you how to select a prior. There is no correct way to choose a prior. Bayesian inferences require skills to translate subjective prior beliefs into a mathematically formulated prior. If you do not proceed with caution, you can generate misleading results.

- It can produce posterior distributions that are heavily influenced by the priors. From a practical point of view, it might sometimes be difficult to convince subject matter experts who do not agree with the validity of the chosen prior.

- It often comes with a high computational cost, especially in models with a large number of parameters. In addition, simulations provide slightly different answers unless the same random seed is used. Note that slight variations in simulation results do not contradict the earlier claim that Bayesian inferences are exact: the posterior distribution of a parameter is exact, given the likelihood function and the priors, while simulation-based estimates of posterior quantities can vary due to the random number generator used in the procedures.

[ We will skip to the example (below) at this point, and return here after examining the example. ]

Bayesian Analysis of Phylogenies – most of the material is from Hall, BG, 2004.
Phylogenetic Trees Made Easy: A How To Manual, Sinauer Assoc., Sunderland, MA.

In phylogenetics, Bayesian analysis (B) is related to the maximum likelihood method (ML). You select a model of evolution, and the computer searches for the best trees relative to the model and the data (the alignment).

ML seeks the tree that maximizes the probability of observing the alignment given the tree.

B seeks the tree that maximizes the probability of the tree, given the alignment and the model of evolution. This "rescales" likelihoods to true probabilities (the sum over all trees = 1). This permits using probability to analyze the data.

ML seeks the single most likely tree. As it searches a "landscape" of possible trees, it continually seeks higher points on the landscape (more likely trees). If there is more than one hill, ML can get trapped on a peak, even though there may be a higher peak somewhere else. ML cannot traverse the "valleys" to get to the higher peak.

B seeks the best set of trees, therefore B may consider the same tree many times. B searches the landscape by MCMC methods (the MrBayes program uses a Metropolis-coupled algorithm to implement the MCMC). The probability of trees usually cannot be calculated analytically. The MCMC method allows the calculation of probabilities by sampling the posterior distribution of tree probabilities.

More detail...

B begins with either a randomly chosen or user-specified tree. This tree has a combination of branch lengths, nucleotide substitution parameters, and a rate-variation-across-sites parameter. This de