
Bayesian Inference in Astronomy & Astrophysics
A Short Course
Tom Loredo
Dept. of Astronomy, Cornell University

Five Lectures

- Overview of Bayesian Inference
- Why Try Bayesian Methods?
- From Gaussians to Periodograms
- Learning How To Count: Poisson Processes
- Miscellany: Frequentist Behavior, Experimental Design

Overview of Bayesian Inference

- What to do
- What's different about it
- How to do it: tools for Bayesian calculation

What To Do: The Bayesian Recipe

Assess hypotheses by calculating their probabilities p(H_i | ...) conditional on known and/or presumed information, using the rules of probability theory.

But... what does p(H_i | ...) mean?

What is distributed in p(x)?

Frequentist: Probability describes "randomness"
(Venn, Boole, Fisher, Neyman, Pearson...)

x is a random variable if it takes different values throughout an infinite (imaginary?) ensemble of "identical" systems/experiments. p(x) describes how x is distributed throughout the ensemble.

[Figure: x is distributed across the ensemble]

Probability = frequency (pdf = histogram).

Bayesian: Probability describes uncertainty
(Bernoulli, Laplace, Bayes, Gauss...)

p(x) describes how probability (plausibility) is distributed among the possible choices for x in the case at hand.
Analog: a mass density, ρ(x).

[Figure: p is distributed; x has a single, uncertain value]

Relationships between probability and frequency were demonstrated mathematically (laws of large numbers, Bayes's theorem).

Interpreting Abstract Probabilities

Symmetry/Invariance/Counting:
- Resolve possibilities into equally plausible "microstates" using symmetries
- Count microstates in each possibility

Frequency from probability:
Bernoulli's law of large numbers: in repeated trials, given P(success), predict

    N_success / N_total → P   as N → ∞

Probability from frequency

Bayes's "An Essay Towards Solving a Problem in the Doctrine of Chances" → Bayes's theorem

Probability ≠ Frequency!

Bayesian Probability: A Thermal Analogy

Intuitive notion | Quantification  | Calibration
-----------------|-----------------|------------------------------------------
Hot, cold        | Temperature, T  | Cold as ice = 273 K
                 |                 | Boiling hot = 373 K
Uncertainty      | Probability, P  | Certainty = 0, 1
                 |                 | p = 1/36: as plausible as "snake's eyes"
                 |                 | p = 1/1024: as plausible as 10 heads

The Bayesian Recipe

Assess hypotheses by calculating their probabilities p(H_i | ...) conditional on known and/or presumed information, using the rules of probability theory.

Probability Theory Axioms ("grammar"):

'OR' (sum rule):

    P(H_1 + H_2 | I) = P(H_1 | I) + P(H_2 | I) - P(H_1, H_2 | I)

'AND' (product rule):

    P(H_1, D | I) = P(H_1 | I) P(D | H_1, I) = P(D | I) P(H_1 | D, I)

Direct Probabilities ("vocabulary"):

- Certainty: If A is certainly true given B, P(A | B) = 1
- Falsity: If A is certainly false given B, P(A | B) = 0

Other rules exist for more complicated types of information: for example, invariance arguments, maximum (information) entropy, limit theorems (CLT; tying probabilities to frequencies), bold (or desperate!) presumption...

Three Important Theorems

Normalization: For exclusive, exhaustive H_i,

    Σ_i P(H_i | ···) = 1

Bayes's Theorem:

    P(H_i | D, I) = P(H_i | I) P(D | H_i, I) / P(D | I)

    posterior ∝ prior × likelihood
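The two theorems above can be sketched in a few lines of code. This is a minimal illustration, not from the course: a hypothetical set of three coin-bias hypotheses, with the evidence P(D | I) supplying the normalization.

```python
# Bayes's theorem for a discrete, exhaustive set of hypotheses.
# Hypothetical example: three coins with different head probabilities;
# data D = 7 heads in 10 flips.
from math import comb

def posterior(priors, likelihoods):
    """P(H_i | D) = P(H_i) P(D | H_i) / P(D), with P(D) = sum_i P(H_i) P(D | H_i)."""
    joint = [p * L for p, L in zip(priors, likelihoods)]
    evidence = sum(joint)               # P(D | I): prior predictive, normalizes the sum
    return [j / evidence for j in joint]

theta = [0.3, 0.5, 0.7]                 # head probability under each hypothesis
priors = [1 / 3, 1 / 3, 1 / 3]          # equally plausible a priori
likes = [comb(10, 7) * t**7 * (1 - t)**3 for t in theta]
post = posterior(priors, likes)         # normalized: sums to 1; theta = 0.7 favored
```

The posterior automatically satisfies the normalization theorem because the evidence in the denominator is exactly the sum over the exhaustive hypothesis set.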

Marginalization: Note that for exclusive, exhaustive {B_i},

    Σ_i P(A, B_i | I) = Σ_i P(B_i | A, I) P(A | I) = P(A | I)
                      = Σ_i P(B_i | I) P(A | B_i, I)

→ We can use {B_i} as a "basis" to get P(A | I).

Example: Take A = D, B_i = H_i; then

    P(D | I) = Σ_i P(D, H_i | I) = Σ_i P(H_i | I) P(D | H_i, I)

prior predictive for D = average likelihood for H_i

Inference With Parametric Models

Parameter Estimation

- I = model M with parameters θ (+ any add'l info)
- H_i = statements about θ; e.g. "θ ∈ [2.5, 3.5]," or "θ > 0"
- Probability for any such statement can be found using a probability density function (pdf) for θ:

    P(θ ∈ [θ, θ + dθ] | ···) = f(θ) dθ = p(θ | ···) dθ

Posterior probability density:

    p(θ | D, M) = p(θ | M) L(θ) / ∫ dθ p(θ | M) L(θ)

Summaries of posterior:
- "Best fit" values: mode, posterior mean
- Uncertainties: credible regions (e.g., HPD regions)
- Marginal distributions:
  - Interesting parameters ψ, nuisance parameters φ
  - Marginal dist'n for ψ:

        p(ψ | D, M) = ∫ dφ p(ψ, φ | D, M)

  Generalizes "propagation of errors"
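A grid evaluation makes the normalization and the posterior summaries concrete. This is a sketch under assumed inputs (hypothetical data, known σ, flat prior), not a course example: the denominator integral becomes a Riemann sum over the grid.

```python
# Posterior for a Gaussian mean mu with known sigma and a flat prior,
# normalized by a Riemann sum over a grid of mu values.
import math

data = [4.8, 5.1, 5.3, 4.9, 5.2]    # hypothetical measurements
sigma = 0.2

def log_like(mu):
    return sum(-0.5 * ((x - mu) / sigma) ** 2 for x in data)

step = 0.001
grid = [4.5 + step * i for i in range(1001)]         # mu in [4.5, 5.5]
w = [math.exp(log_like(mu)) for mu in grid]          # flat prior: posterior ∝ L(mu)
norm = sum(w) * step                                 # ≈ ∫ dmu p(mu|M) L(mu)
post = [wi / norm for wi in w]                       # posterior density on the grid

# Summaries: posterior mean and standard deviation
mean = sum(mu * p * step for mu, p in zip(grid, post))
sd = math.sqrt(sum((mu - mean) ** 2 * p * step for mu, p in zip(grid, post)))
```

With a flat prior and Gaussian likelihood the posterior mean reproduces the sample mean and the posterior width is σ/√N, so the grid result can be checked against the familiar propagation-of-errors answer.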

Model Uncertainty: Model Comparison

- I = (M_1 + M_2 + ...): specify a set of models.
- H_i = M_i: hypothesis chooses a model.

Posterior probability for a model:

    p(M_i | D, I) = p(M_i | I) p(D | M_i, I) / p(D | I) ∝ p(M_i) L(M_i)

But L(M_i) = p(D | M_i) = ∫ dθ_i p(θ_i | M_i) p(D | θ_i, M_i).

Likelihood for model = average likelihood for its parameters:

    L(M_i) = ⟨L(θ_i)⟩
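The averaged likelihood can be computed by direct quadrature when the parameter space is one-dimensional. A sketch with assumed, hypothetical data (7 heads in 10 flips), comparing a zero-parameter "fair coin" model against a one-parameter model with uniform prior on the bias:

```python
# L(M) = ∫ dθ p(θ|M) L(θ), by simple midpoint quadrature, for a coin-flip
# model with a uniform prior, versus a zero-parameter fair-coin model.
from math import comb

n, k = 10, 7                                   # hypothetical data

def like(theta):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# M1: fair coin, no free parameters: L(M1) = L(theta = 0.5)
L_M1 = like(0.5)

# M2: unknown bias, uniform prior on [0, 1]: average the likelihood
m = 10000
L_M2 = sum(like((i + 0.5) / m) for i in range(m)) / m   # ∫_0^1 dθ · 1 · L(θ)

bayes_factor = L_M2 / L_M1                     # < 1 here: data mildly favor M1
```

For this likelihood the integral is a Beta function, ∫ C(10,7) θ⁷(1-θ)³ dθ = 1/11, so the quadrature result is easy to verify; the fair-coin model wins slightly even though θ = 0.7 fits better, a first taste of the Occam's razor discussed below.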

Model Uncertainty: Model Averaging

- Models have a common subset of interesting parameters, ψ.
- Each has a different set of nuisance parameters φ_i (or different prior info about them).
- H_i = statements about ψ.

Calculate posterior PDF for ψ:

    p(ψ | D, I) = Σ_i p(M_i | D, I) p(ψ | D, M_i)
                ∝ Σ_i L(M_i) ∫ dφ_i p(ψ, φ_i | D, M_i)

The model choice is itself a (discrete) nuisance parameter here.

An Automatic Occam's Razor

Predictive probabilities can favor simpler models:

    p(D | M_i) = ∫ dθ_i p(θ_i | M) L(θ_i)

[Figure: P(D | H) vs. D for a simple H (tall, narrow) and a complicated H (low, broad); D_obs marked]

The Occam Factor:

[Figure: likelihood (width δθ) and prior vs. θ]

    p(D | M_i) = ∫ dθ_i p(θ_i | M) L(θ_i) ≈ p(θ̂_i | M) L(θ̂_i) δθ_i

               ≈ L(θ̂_i) δθ_i / Δθ_i = Maximum Likelihood × Occam Factor

Models with more parameters often make the data more probable for the best fit.
The Occam factor penalizes models for "wasted" volume of parameter space.

What's the Difference?

Bayesian Inference (BI):
- Specify at least two competing hypotheses and priors
- Calculate their probabilities using probability theory
  - Parameter estimation:

        p(θ | D, M) = p(θ | M) L(θ) / ∫ dθ p(θ | M) L(θ)

  - Model comparison:

        O ∝ ∫ dθ_1 p(θ_1 | M_1) L(θ_1) / ∫ dθ_2 p(θ_2 | M_2) L(θ_2)

Frequentist Statistics (FS):
- Specify null hypothesis H_0 such that rejecting it implies an interesting effect is present
- Specify statistic S(D) that measures departure of the data from null expectations
- Calculate p(S | H_0) = ∫ dD p(D | H_0) δ[S - S(D)]
  (e.g., by Monte Carlo simulation of data)
- Evaluate S(D_obs); decide whether to reject H_0 based on, e.g., ∫_{S_obs}^∞ dS p(S | H_0)
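The Monte Carlo route mentioned in the recipe can be sketched directly. An assumed toy setup, not from the course: H_0 is a fair coin, the statistic is the head count, and the tail probability beyond S(D_obs) is estimated by simulating data under the null.

```python
# Frequentist recipe by Monte Carlo: simulate D under H_0, histogram the
# statistic S(D), and estimate the tail probability beyond S_obs.
import random

random.seed(4)
n, s_obs = 10, 9                   # hypothetical data: 9 heads in 10 flips

def S(flips):                      # departure-from-null statistic: head count
    return sum(flips)

trials, exceed = 200000, 0
for _ in range(trials):
    flips = [random.random() < 0.5 for _ in range(n)]
    if S(flips) >= s_obs:
        exceed += 1
p_value = exceed / trials          # exact tail prob = (C(10,9) + C(10,10)) / 2^10
```

The exact tail probability here is 11/1024 ≈ 0.011, so the simulation is easy to check; note that the integral runs over hypothetical data sets, not over parameters, which is exactly the contrast drawn on the next slide.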

Crucial Distinctions

The role of subjectivity:
BI exchanges (implicit) subjectivity in the choice of null & statistic for (explicit) subjectivity in the specification of alternatives.
- Makes assumptions explicit
- Guides specification of further alternatives that generalize the analysis
- Automates identification of statistics:
  - BI is a problem-solving approach
  - FS is a solution-characterization approach

The types of mathematical calculations:
- BI requires integrals over hypothesis/parameter space
- FS requires integrals over sample/data space

Complexity of Statistical Integrals

Inference with independent data:
Consider N data, D = {x_i}, and model M with m parameters (m ≪ N).
Suppose L(θ) = p(x_1 | θ) p(x_2 | θ) ··· p(x_N | θ).

Frequentist integrals:

    ∫ dx_1 p(x_1 | θ) ∫ dx_2 p(x_2 | θ) ··· ∫ dx_N p(x_N | θ) f(D)

Seek integrals with properties independent of θ. Such rigorous frequentist integrals usually can't be found. Approximate (e.g., asymptotic) results are easy via Monte Carlo (due to independence).

Bayesian integrals:

    ∫ d^m θ g(θ) p(θ | M) L(θ)

Such integrals are sometimes easy if analytic (especially in low dimensions).
Asymptotic approximations require ingredients familiar from frequentist calculations.
For large m (m ≳ 4 is often enough!) the integrals are often very challenging because of correlations (lack of independence) in parameter space.

How To Do It: Tools for Bayesian Calculation

- Asymptotic (large N) approximation: Laplace approximation
- Low-D models (m ≲ 10):
  - Randomized quadrature: quadrature + dithering
  - Subregion-adaptive quadrature: ADAPT, DCUHRE, BAYESPACK
  - Adaptive Monte Carlo: VEGAS, miser
- High-D models (m ~ 5 to 10^6): posterior sampling
  - Rejection method
  - Markov Chain Monte Carlo (MCMC)

Laplace Approximations

Suppose the posterior has a single dominant (interior) mode at θ̂, with m parameters:

    p(θ | M) L(θ) ≈ p(θ̂ | M) L(θ̂) exp[-(1/2)(θ - θ̂)ᵀ I (θ - θ̂)]

where

    I = -∂² ln[p(θ | M) L(θ)] / ∂θ² |_θ̂ ,  the information matrix

Bayes Factors:

    ∫ dθ p(θ | M) L(θ) ≈ p(θ̂ | M) L(θ̂) (2π)^{m/2} |I|^{-1/2}

Marginals:

Profile likelihood: L_p(θ) = max_φ L(θ, φ)

    p(θ | D, M) ∝≈ L_p(θ) |I(θ)|^{-1/2}
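In one dimension the Bayes-factor formula above reduces to L(θ̂) √(2π/I) for a flat unit prior. A sketch under assumed inputs (the same hypothetical coin data used earlier), checked against the exact Beta integral:

```python
# 1-D Laplace approximation to the evidence,
# ∫ dθ p(θ|M) L(θ) ≈ p(θ̂|M) L(θ̂) (2π)^{1/2} I^{-1/2},
# for a binomial likelihood with a flat prior on [0, 1].
from math import comb, pi, sqrt

n, k = 10, 7                                       # hypothetical data

def L(t):
    return comb(n, k) * t**k * (1 - t)**(n - k)

t_hat = k / n                                      # mode of L (flat prior)
info = k / t_hat**2 + (n - k) / (1 - t_hat)**2     # I = -d² ln L / dθ² at the mode
evidence_laplace = L(t_hat) * sqrt(2 * pi / info)  # p(θ̂|M) = 1 on [0, 1]

# Exact: ∫_0^1 C(n,k) θ^k (1-θ)^{n-k} dθ = 1/(n+1)
evidence_exact = 1 / (n + 1)
```

Even at N = 10 the approximation is within about 10% of the exact evidence, illustrating why the Laplace route is attractive at large N.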

The Laplace approximation:
- Uses the same ingredients as common frequentist calculations
- Uses ratios → approximation is often O(1/N)
- Using a "unit info prior" in an i.i.d. setting → Schwarz criterion; Bayesian Information Criterion (BIC):

    ln B ≈ ln L(θ̂) - ln L(θ̂, φ̂) + (1/2)(m_2 - m_1) ln N

Bayesian counterpart to adjusting χ² for d.o.f., but accounts for parameter-space volume.
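The BIC formula above can be exercised on a toy nested comparison. A sketch with assumed, hypothetical data: model 1 fixes a Gaussian mean at zero (m_1 = 0), model 2 frees it (m_2 = 1); the ln N penalty outweighs the small improvement in fit.

```python
# Schwarz/BIC estimate of ln B = ln(evidence_1 / evidence_2) for nested
# Gaussian-mean models: ln B ≈ ln L1(θ̂) − ln L2(θ̂, φ̂) + ½ (m2 − m1) ln N.
from math import log, pi

data = [0.3, -0.1, 0.2, 0.0, 0.1, -0.2, 0.4, 0.1, -0.3, 0.2]  # hypothetical
N = len(data)

def log_like(mu):               # Gaussian likelihood, sigma = 1
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * log(2 * pi) for x in data)

mu_hat = sum(data) / N
lnL1 = log_like(0.0)            # M1: mu fixed at 0 (m1 = 0 free parameters)
lnL2 = log_like(mu_hat)         # M2: mu free (m2 = 1), at its MLE

ln_B = lnL1 - lnL2 + 0.5 * (1 - 0) * log(N)
# ln_B > 0: the simpler model is favored despite its slightly worse fit
```

Here lnL2 exceeds lnL1 by only N·μ̂²/2 ≈ 0.02, while the dimensional penalty is ½ ln 10 ≈ 1.15, so ln B comes out positive: the volume adjustment, not the raw fit, decides the comparison.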

Low-D (m ≲ 10): Quadrature & Monte Carlo

Quadrature/Cubature Rules:

    ∫ dθ f(θ) ≈ Σ_i w_i f(θ_i) + O(n^{-2}) or O(n^{-4})

Smoothness → fast convergence in 1-D
Curse of dimensionality → O(n^{-2/m}) or O(n^{-4/m}) in m-D

Monte Carlo Integration:

    ∫ dθ g(θ) p(θ) ≈ (1/n) Σ_{θ_i ∼ p(θ)} g(θ_i) + O(n^{-1/2})   [O(n^{-1}) with quasi-MC]

- Ignores smoothness → poor performance in 1-D
- Avoids curse: O(n^{-1/2}) regardless of dimension
- Practical problem: the multiplier is large (variance of g) → hard if m > 6 (need a good "importance sampler" p)
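The estimator above is a one-liner once samples from p are available. A sketch with an assumed toy target (standard normal p, g(θ) = θ², exact answer 1), chosen so the O(n^{-1/2}) error is easy to see against a known value:

```python
# Monte Carlo estimate of ∫ dθ g(θ) p(θ) as (1/n) Σ g(θ_i), θ_i ~ p.
# Target p: standard normal; g(θ) = θ², so the exact integral is Var[θ] = 1.
import random

random.seed(1)
n = 100000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
estimate = sum(t * t for t in samples) / n    # error ~ sqrt(Var[g]/n) ≈ 0.004
```

The same code works unchanged in any dimension; what changes in practice is how hard it is to draw the θ_i from p, which is the motivation for the sampling methods that follow.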

Randomized Quadrature:
Quadrature rule + random dithering of abscissas → get the benefits of both methods. Most useful in settings resembling Gaussian quadrature.

Subregion-Adaptive Quadrature/MC:
Concentrate points where most of the probability lies, via recursion.
- Adaptive quadrature: use a pair of lattice rules (for error estimation), subdivide regions w/ large error (ADAPT, DCUHRE, BAYESPACK by Genz et al.)
- Adaptive Monte Carlo: build the importance sampler on-the-fly (e.g., VEGAS, miser in Numerical Recipes)

Subregion-Adaptive Quadrature

Concentrate points where most of the probability lies, via recursion. Use a pair of lattice rules (for error estimation), subdivide regions w/ large error.

[Figure: ADAPT in action (galaxy polarizations)]

Posterior Sampling

General Approach:
Draw samples of θ, φ from p(θ, φ | D, M); then:
- Integrals, moments easily found via (1/n) Σ_i f(θ_i, φ_i)
- {θ_i} are samples from p(θ | D, M)

But how can we obtain {θ_i, φ_i}?

Rejection Method:

[Figure: comparison function bounding P(θ)]

Hard to find an efficient comparison function if m > 6.
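The rejection method fits in a few lines. A sketch with an assumed toy target, not from the course: an unnormalized Beta(3, 2)-shaped density on [0, 1], bounded by a constant comparison function.

```python
# Rejection method: propose θ under a constant comparison function that
# bounds the (unnormalized) target P(θ); accept with probability P(θ)/bound.
import random

random.seed(2)

def target(t):                 # unnormalized Beta(3,2): θ²(1 − θ)
    return t * t * (1 - t)

bound = target(2 / 3)          # maximum of θ²(1 − θ) on [0, 1], at θ = 2/3

samples = []
while len(samples) < 20000:
    t = random.random()                        # propose uniformly on [0, 1]
    if random.random() * bound <= target(t):   # accept with prob target/bound
        samples.append(t)

mean = sum(samples) / len(samples)             # Beta(3,2) mean = 3/5
```

The acceptance rate equals the ratio of the target's area to the comparison function's area; in high dimensions that ratio collapses unless the comparison function hugs the target, which is exactly the difficulty noted above for m > 6.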

Markov Chain Monte Carlo (MCMC)

Let Λ(θ) = -ln[p(θ | M) p(D | θ, M)].

Then

    p(θ | D, M) = e^{-Λ(θ)} / Z,   Z = ∫ dθ e^{-Λ(θ)}

Bayesian integration looks like problems addressed in computational stat mech and Euclidean QFT.

Markov chain methods are standard: Metropolis; Metropolis-Hastings; molecular dynamics; hybrid Monte Carlo; simulated annealing; thermodynamic integration.

A Complicated Marginal Distribution

Nascent neutron star properties inferred from neutrino data from SN 1987A.

[Figure: two variables derived from a 9-dimensional posterior distribution]

The MCMC Recipe:

Create a "time series" of samples θ_i from p(θ):
- Draw a candidate θ_{i+1} from a kernel T(θ_{i+1} | θ_i)
- Enforce "detailed balance" by accepting with probability

    α(θ_{i+1} | θ_i) = min[1, T(θ_i | θ_{i+1}) p(θ_{i+1}) / (T(θ_{i+1} | θ_i) p(θ_i))]

Choosing T to minimize "burn-in" and correlations is an art. Coupled, parallel chains eliminate this for select problems ("exact sampling").
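The recipe above can be sketched for the simplest case, random-walk Metropolis, where the kernel T is symmetric and cancels in the acceptance ratio. An assumed toy target (standard normal, Λ(θ) = θ²/2), not a course example:

```python
# Random-walk Metropolis: symmetric T, so α = min(1, p(θ')/p(θ))
#                                           = min(1, exp(Λ(θ) − Λ(θ'))).
import random
from math import exp

random.seed(3)

def Lam(t):                    # Λ(θ) = −ln p(θ) up to a constant; target: N(0, 1)
    return 0.5 * t * t

theta, chain = 0.0, []
for _ in range(50000):
    prop = theta + random.gauss(0.0, 1.0)          # symmetric proposal kernel
    if random.random() < min(1.0, exp(Lam(theta) - Lam(prop))):
        theta = prop                               # accept; else keep theta
    chain.append(theta)

burned = chain[5000:]                              # discard burn-in
mean = sum(burned) / len(burned)                   # ≈ 0
var = sum((t - mean) ** 2 for t in burned) / len(burned)   # ≈ 1
```

The burn-in cut and the proposal width (here 1.0, roughly the target's scale) are exactly the "art" the slide refers to: a proposal that is too wide is rarely accepted, one that is too narrow explores the posterior slowly and inflates correlations.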

Summary

Bayesian/frequentist differences:
- Probabilities for hypotheses vs. for data
- Problem solving vs. solution characterization
- Integrals: parameter space vs. sample space

Computational techniques for Bayesian inference:
- Large N: Laplace approximation
- Exact:
  - Adaptive quadrature for low-d
  - Posterior sampling for high-d