## Notes On Statistics From A Bayesian Perspective


Notes on Statistics
From a Bayesian Perspective
(last update May 9, 2007)
Brian S. Blais

Preface: Why Make These Notes?

I’m writing these notes to clarify statistics for myself, never having had a formal class in statistics. It turns out that there is an interesting reason why I, as a physicist, have never taken a statistics course. This reason is best described in such references as (Loredo, 1990) and (Jaynes, 2003).

I have been hearing about “p-values” and “t-tests” for many years, and had only a vague understanding of them. I understood probability theory reasonably well, but somehow these terms that I was hearing didn’t quite fit into the way I was thinking. After reading some recent work on Bayesian approaches, I realized that the standard (orthodox) approach to statistics made no sense to me, and the Bayesian way was entirely intuitive. Further, it seemed as if there were serious problems with some of the orthodox approaches, even on some trivial problems (see (Lindley and Phillips, 1976), and Section 4.1 for an illustrative example).

Still, many of the Bayesian articles and books that I read went over examples that were never covered in standard statistics books, and so their similarities and differences were not always clear. So I decided I would pick up a standard statistics book (in this case, (Bowerman and O’Connell, 2003)) and go through the examples and exercises in the book, but from a Bayesian perspective. Along the way, I leaned on such sources as (Jeffreys, 1939; Sivia, 1996; Jaynes, 2003), as well as numerous other articles, to get me through the Bayesian approach. These notes are the (work-in-progress) results.

I am also including Python (www.python.org) code for the examples, so I invite any readers to reproduce anything I’ve done.

A Suggestion

If you are familiar with the orthodox perspectives, and are dubious of the Bayesian methods, I urge you to read in its entirety (Jaynes, 1976). It presents in detail six easy problems for which it is clear that the orthodox methods are lacking, and there are two responses from orthodox statisticians along with replies to those. It’s an excellent read.

Contents

1 Introduction
  1.1 History
    1.1.1 Derivation of Bayes’ Theorem
    1.1.2 Further History
    1.1.3 Response
  1.2 Procedure
    1.2.1 Orthodox Hypothesis Tests
  1.3 Numerical Analysis
    1.3.1 Plotting Distributions and Histograms

2 One Sample Inferences
  2.1 Unknown µ, Known σ
    2.1.1 Exercises
  2.2 Unknown µ, Unknown σ
    2.2.1 Exercises
  2.3 Unknown proportion
    2.3.1 Confidence
    2.3.2 Median and Percentiles
    2.3.3 Numerical Examples

3 Two Sample Inferences
  3.1 Paired Data Difference of Means, $\delta_k = x_k - y_k$
  3.2 Difference of Means, $\delta = \mu_x - \mu_y$, known $\sigma_x$ and $\sigma_y$
  3.3 Difference of Means, $\delta = \mu_x - \mu_y$, unknown $\sigma_x$ and $\sigma_y$
    3.3.1 Jaynes 1976 Difference of Means
  3.4 Ratio of Two Variances $\kappa = \sigma_x^2/\sigma_y^2$
  3.5 Simple Linear Regression, $y_k = m x_k + b$
  3.6 Linear Regression with Errors on both x and y
  3.7 Goodness of Fit
    3.7.1 Jaynes’ Alternative to $\chi^2$

4 Orthodox versus Bayesian Approaches
  4.1 Flipping a Tack
    4.1.1 Orthodox Statistics
    4.1.2 Bayesian Statistics
  4.2 Type A Stars
  4.3 Cauchy Distribution
    4.3.1 Orthodox estimator?

5 Misc
  5.1 Max Entropy Derivations of Priors
    5.1.1 Mean
    5.1.2 Mean and Second Moment
  5.2 Derivation of Maximum Entropy
  5.3 Problem from Loredo
    5.3.1 Estimating the Amplitude of a Signal
    5.3.2 Measuring a Weak Counting Signal
  5.4 Anova and T-distributions

A Supplementary Code
  A.1 utils.py

B Derivations for Single Samples
  B.1 Unknown µ, Known σ
    B.1.1 Introducing the Distributions
    B.1.2 An Aside on Log Posterior
    B.1.3 Continuing
  B.2 Unknown µ, Unknown σ
    B.2.1 Setting up the Problem
    B.2.2 Estimating the Mean
    B.2.3 A More Convenient Form
    B.2.4 Estimating σ
  B.3 Unknown proportion
    B.3.1 Max, Mean, Variance

C Derivations for Two Samples
  C.1 Paired Data Difference of Means, $\delta_k = x_k - y_k$
    C.1.1 Changing Variables
    C.1.2 Continuing with Paired Data
  C.2 Difference of Means, $\delta = \mu_x - \mu_y$, known $\sigma_x$ and $\sigma_y$
  C.3 Difference of Means, $\delta = \mu_x - \mu_y$, unknown $\sigma_x$ and $\sigma_y$
  C.4 Ratio of Two Variances $\kappa = \sigma_x^2/\sigma_y^2$
  C.5 Simple Linear Regression, $y_k = m x_k + b$
    C.5.1 Quick recipe for solving $2 \times 2$ equations
    C.5.2 Solution to the Simple Least Squares Problem

D Data
  D.1 Bank Waiting Times (in Minutes)
  D.2 Customer Satisfaction (7-pt Likert Scale 7 responses)

E Probability Distributions and Integrals
  E.1 Binomial
    E.1.1 Normalization
    E.1.2 Mean
    E.1.3 Variance
    E.1.4 Gaussian Approximation
  E.2 Negative Binomial
  E.3 Beta
  E.4 Gaussian
    E.4.1 Aside about Cool Tricks for Gaussian Integrals

Chapter 1

Introduction

Probability is often defined as the “long-run relative frequency of occurrence of an event, either in a sequence of repeated experiments or in an ensemble of ‘identically prepared’ systems” (Loredo, 1990). This definition is referred to as the “frequentist” view, or the “orthodox” view of probability. The definition of probability used in these notes is the Bayesian one, where “probability is regarded as the real-number valued measure of the plausibility of a proposition when incomplete knowledge does not allow us to establish its truth or falsehood with certainty” (Loredo, 1990). In this way, it is a real-number extension of Boolean logic, where 0 represents certainty of falsehood and 1 represents certainty of truth.

Although this may seem like an argument for philosophers, there are real and demonstrable differences between the Bayesian and orthodox methods. In many cases, however, they lead to identical predictions. Most of the results in an introductory statistics text are derivable with either perspective. In some cases, the difference reflects different choices of problems under consideration.

Regarding Brian’s comments:

> I would refer everyone on this thread to the wonderful article "From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics" by Tom Loredo, located at l.html
>
> From my personal experience trying to learn basic statistics, I always got hung up on the notion of a population, and of the standard deviation of the mean. I found the Bayesian approach to be more intuitive, easier to apply to real data, and more mathematically sound (there is a great article by E.T. Jaynes at http://bayes.wustl.edu/etj/articles/confidence.pdf where he outlines several pathologies in standard stats).

I too, want to second Brian’s endorsement of the Bayesian approach to probability theory--especially as interpreted by Jaynes and his school of the maximum entropy procedure for determining probability distributions based on incomplete data.

> Bottom line: there is no population in the Bayesian approach.

True.

> Probability is a measure of one’s state of knowledge, not a property of the system.

Whoa, there. It’s true that this approach does not interpret the fundamental meaning of probabilities as the asymptotic relative frequencies of particular outcomes for infinite ensembles of copies of the system (or random process) at hand. But a Bayesian interpretation of probability is not really based on anything as subjective as the state of knowledge in any particular person’s mind. Such a characterization is really a straw man that opponents of that approach tend to level at it. Rather, this interpretation of probability is just as objectively defined as the relative frequency approach. In the Bayesian approach probability is a measure of the ideal degree of confidence that an ideal perfectly rational mind would have about the state (or outcome) of the system (or random process) given only all the objectively available data extant for that system/process upon which to assign such a confidence measure on the various possible outcomes. The Bayesian approach is about the procedure of most rationally assigning a measure of the various degrees of confidence to the possible outcomes of some random process armed with just all the objectively available data.

> In doing so, all of the strained attempts at creating a fictitious population out of measurements vanish (such as, say, analyzing measurements of the mass of the moon by imagining many hypothetical universes of "identical" measurements). One instead is quantifying your state of knowledge.

Again, it’s not the subjective state of the actual knowledge of an actual less-than-completely-rational mind that is relevant. Rather it would be better considered to be the state of knowledge of an ‘ideal’ perfectly rational mind that is supplied with *only* *all* the objectively available data of the situation about which there is only partial information extant.

> In almost all easy cases, the Bayesian approach yields the *exact same* numerical result as the standard approach. The interpretation is a lot easier, and a lot easier to communicate to students.

Brian Blais
David Bowman
Forum for Physics

1.1 History

For a much more detailed account, please see (Loredo, 1990; Jaynes, 2003).

The first formal account of the calculation of probabilities is from Bernoulli (Bernoulli, 1713), who defined probability as a “degree of certainty”. His theorem states that, if the probability of an event is p, then the limiting frequency of that event converges to p. It was later that Bayes and Laplace solved the inverse problem: given n occurrences out of N trials, what is the probability p of a single occurrence?

The solution was published posthumously by Rev. Thomas Bayes (1763), and soon rediscovered, generalized, and applied to astrophysics by Laplace. It is Laplace who really brought probability theory to a mature state, applying it to problems in astrophysics, geology, meteorology, and others. One famous application was the determination of the masses of Jupiter and Saturn and the quantification of the uncertainties.

1.1.1 Derivation of Bayes’ Theorem

Laplace took as axioms the sum and product rules for probability:

$$p(A|C) + p(\bar{A}|C) = 1$$

$$p(AB|C) = p(A|BC)\,p(B|C)$$

From there, given the obvious symmetry $p(AB|C) = p(BA|C)$, we get

$$p(A|BC)\,p(B|C) = p(B|AC)\,p(A|C)$$

$$p(A|BC) = \frac{p(B|AC)\,p(A|C)}{p(B|C)}$$

which is Bayes’ Theorem.

1.1.2 Further History

After Laplace’s death, his ideas came under attack by mathematicians. They criticized two aspects of the theory:

1. The axioms, although reasonable, were not clearly unique for a definition of probability as vague as “degrees of plausibility”. The definition seemed vague, and thus the axioms which support the theory seemed arbitrary. If one defines probabilities as limiting frequencies of events, then these axioms are justified.

2. It was unclear how to assign the prior probabilities in the first place. Bernoulli introduced the Principle of Insufficient Reason, which states that if the evidence does not provide any reason to choose $A_1$ or $A_2$, then one assigns equal probability to