Fast Search And Estimation Of Bayesian Nonparametric .

6d ago
1.53 MB
12 Pages

JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS2020, VOL. 00, NO. 0, 5Fast Search and Estimation of Bayesian Nonparametric Mixture Models Using aClassification Annealing EM AlgorithmGeorge KarabatsosDepartment of Educational Psychology, and Department of Mathematics, Statistics, and Computer Science, University of Illinois-Chicago, Chicago, ILABSTRACTARTICLE HISTORYBayesian nonparametric (BNP) infinite-mixture models provide flexible and accurate density estimation,cluster analysis, and regression. However, for the posterior inference of such a model, MCMC algorithmsare complex, often need to be tailor-made for different BNP priors, and are intractable for large datasets. Weintroduce a BNP classification annealing EM algorithm which employs importance sampling estimation. Thisnew fast-search algorithm, for virtually any given BNP mixture model, can quickly and accurately calculatethe posterior predictive density estimate (by posterior averaging) and the maximum a-posteriori clusteringestimate (by simulated annealing), even for datasets containing millions of observations. The algorithm canhandle a wide range of BNP priors because it primarily relies on the ability to generate prior samples. Thealgorithm can be fast because in each iteration, it performs a sampling step for the (missing) clusteringof the data points, instead of a costly E-step; and then performs direct posterior calculations in the Mstep, given the sampled (imputed) clustering. The new algorithm is illustrated and evaluated through BNPGaussian mixture model analyses of benchmark simulated data and real datasets. MATLAB code for the newalgorithm is provided in the supplementary materials. Supplementary materials for this article are availableonline.Received June 2019Revised May 20201. IntroductionBayesian nonparametric (BNP) infinite-mixture models provideflexible and accurate methods of density estimation, clusteranalysis, and regression, for many scientific fields (e.g., DauméIII 2007; Hjort et al. 2010; Mitra and Müller 2015, and referencestherein). A typical BNP mixture model has a mixing distributiondefined by a completely random measure (CRM) (introduced byKingman 1967), an infinite mixture of point mass distributionsassigned a BNP (CRM) prior distribution (Lijoi and Prünster 2010). For example, the popular Dirichlet process mixture(DPM) model (Lo 1984) has a mixing distribution defined by aDirichlet process (DP) (Ferguson 1973), a random probabilitymeasure (BNP prior) which can be obtained by normalizing theincrements of a gamma CRM. Indeed, other more general andflexible classes of BNP priors are available.The increasing availability of big data through cheap computing power has motivated developments of various deterministic fast-search algorithms for estimating BNP mixturemodels (e.g., Daumé III 2007; Raykov, Boukouvalas, and Little 2016; Fuentes-García, Meña, and Walker 2019; Zuanettiet al. 2019, and references therein). Such an algorithm provides faster alternative to MCMC, sequential Monte Carlo(SMC), and related algorithms which can compute or converge slowly for such data. Certain fast-search algorithms,including variational Bayes, predictive recursion, and sequential methods, also aim to improve computational speed, butKEYWORDSBayesian nonparametrics;Clustering; Completelyrandom measures; Densityestimation; Normalizedrandom measuresdo so by relying on factorization or prediction rule assumptions which depart from the underlying probabilistic properties of the general BNP mixture model (Raykov, Boukouvalas, and Little 2016; Fortini and Petrone 2020; Zuanetti et al.2019).A typical fast-search algorithm can rapidly produce theapproximate maximum-a-posteriori (MAP) estimate of theclustering of the data points, and the posterior predictive densityestimate for the given BNP mixture model, within seconds orminutes, even from a large dataset, while making one or fewpasses over all the data points. For the model, in research practice, these are the estimates of main interest from the posteriordistribution (e.g., Daumé III 2007; Raykov, Boukouvalas, andLittle 2016), while the MAP estimator of the clusters is coherent(Fuentes-García, Meña, and Walker 2019).The current fast-search algorithms do not easily apply to theentire class of BNP priors, but instead are limited to BNP priorswhich admit tractable representations, such as the class of Gibbstype priors (DeBlasi et al. 2015), and most stick-breaking priors(Ishwaran and James 2001), including the normalized generalized gamma process (see Lijoi, Meña, and Prünster 2007), thePoisson–Dirichlet process, and submodels such as the DP. However, other BNP priors can provide better fit and more realisticclustering of data (Lijoi and Prünster 2010). But the currentfast-search algorithms are not easily applicable to less-tractableBNP (CRM) priors, such as the generalized Dirichlet process(Lijoi, Meña, and Prünster 2005a), the stable 3-parameter betaCONTACT George [email protected] of Illinois-Chicago, Chicago, IL 60607.Supplementary materials for this article are available online. Please go to 2020 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America

2G. KARABATSOS(Teh and Gorür 2009) process, other priors with explicit series(e.g., inverse Lévy, species sampling) or superposition representations (Campbell et al. 2019), or any future novel BNPpriors.Also, the current fast-search algorithms are deterministic.Thus, for research practice, the literature has suggested restarting such an algorithm for several random starting parametervalues or permutations of the data ordering, to produce estimates that are not trapped in a suboptimal local posterior modeor overly influenced by poor starting values. In contrast, MCMCand SMC algorithms have been developed for many BNP priors.However, such algorithms typically need to be tailor-made tothe specific BNP prior considered, and appear complex and verydifferent across the BNP priors. Further, MAP clustering estimation from MCMC (or SMC) posterior samples is nontrivial(Rastelli and Friel 2018).This article introduces the BNP-CAEM algorithm, a novelfast-search algorithm for clustering and posterior predictivedensity estimation, which employs importance sampling. Thisextends the classification annealing EM (CAEM) algorithm, aK-means type algorithm for maximum likelihood estimation(MLE) of the Gaussian mixture model (Celeux and Govaert1992, sec. 4.2) without importance sampling. The BNP-CAEMalgorithm is based on the complete likelihood function for themixture model, which, for any set of “missing” cluster assignments of the data points, is defined by the component densitylikelihood of the data points (resp.) times a multinomial kerneldensity for the cluster groups frequencies. This algorithm easilyapplies to any BNP mixture model defined by any chosen BNPprior, with simple changes to the algorithm; relies on any suitable, fixed finite-dimensional truncation approximation of theprior (see Arbel and Prünster 2017); all while merely relying onthe ability to sample from the chosen BNP prior distribution.The default BNP-CAEM algorithm applies to any usual BNPprior with conjugate baseline measure for the mixture component parameters. In each algorithm iteration, the C-step generates a sample from the current posterior predictive distributionof the clustering of the data points; and the M-step directlycalculates posterior updates of the component predictive densities and the mixture weight parameters, given the sampled(imputed) clustering. The C-step replaces the computationallycostly E-step of the original EM algorithm (Dempster, Laird,and Rubin 1977) with a less-costly sampling step. The M-stepupdates the mixture weights using an importance sampling estimator, based on a large number of BNP prior (proposal) samplesof the mixture weights, which are (resp.) assigned multinomial(kernel) importance weights given the updated cluster groupfrequencies. The same large sample can be reused in the M-stepover BNP-CAEM iterations, a computational savings feature ofthe general importance sampling method (Beckman and McKay1987). The BNP-CAEM algorithm can be extended to handlenon-conjugate priors for the component parameters, with someextra computational cost, by adding an importance samplingestimator for the component predictive densities based on aprior (proposal) sample of these parameters.Over initial iterations of the general BNP-CAEM algorithm,the temperature is at the maximum value of 1. During this time,the algorithm acts as a stochastic EM algorithm (Celeux andGovaert 1992) which eventually produces a sequence of ergodictime-homogeneous Markov chain of posterior predictive density estimates. This sequence converges to a random densityestimate variable, which has the stationary distribution of theMarkov chain as the number of iterations grows, and has anasymptotic normal distribution centered on the true densitywhen the data sample size is large (from Nielsen 2000). Hence,the density estimates produced over the initial converged BNPCAEM iterations can be averaged to provide the final densityestimate, a multiple (missing clustering) imputation estimate.In later BNP-CAEM algorithm iterations, using annealing, thetemperature is gradually decreased toward 0, so that the amountof randomness in the simulations decreases with the iterations,ending up with an approximate MAP clustering estimate of thedata points. Yet, because this algorithm is stochastic, it can produce clustering and density estimates that can randomly escapesuboptimal local posterior modes. Thus, it is not necessary torestart this algorithm for several random starting values, unlikethe other fast search algorithms.The BNP-CAEM algorithm is described next. Details aregiven by Section 2.4, after Section 2.1 concisely reviews CRMsand BNP priors, Section 2.2 describes the -approximationtruncation methods used for BNP (CRM) priors, and Section 2.3represents the posterior distribution for general BNP mixturemodels. Appendices A1–A6 in the supplementary materials givemore technical details. In Section 3, the BNP-CAEM algorithmis illustrated through the BNP Gaussian mixture model analysisof simulated and real benchmark datasets, and evaluated andcompared with standard MCMC, VB, and EM MLE algorithms,in terms of density and clustering estimation accuracy, and computation time. Finally, Section 4 discusses how the BNP-CAEMalgorithm can be used to accelerate MCMC convergence, usedin a distributed parallel computing scheme for massive dataanalysis, and used for real-time streaming data analysis.2. Methodology2.1. Review of CRMsLet Y be a complete and separable metric space, and MY be thespace of boundedly finite measures on Y, with Borel σ -algebrasY and MY , respectively. (Then μ MY implies μ(A) for any bounded set A.) A basic object in BNP modeling is theCRM, Jj δθ j (·),(2.1) μ(·) j 1an almost-surely discrete random measure that is definedon some probability space ( , F, P) and takes on values inμ(A1 ), . . . , μ(AK ) mutually independent for(MY , MY ); with any K 1 pairwise-disjoint sets A1 , . . . , AK in Y; and definedby random jumps (masses) (Jj )j 1 at random locations (θ j )j 1 ,where δθ is a unit point mass measure at θ (Kingman 1967).The distribution of the CRM (2.1) has expectation (E)determined by its Laplace functional transform, with Lévy–Khintchine representation: μ(dy)}](2.2)E[exp{ Y ϕ(y) [1 exp{ υϕ(y)}]ν(dυ, dy) , exp R Y

JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS for anyμ measurable function ϕ : Y R, with Y ϕ d and R Y min{υ, 1}ν(dυ, dy) . Above, ν is the Lévyintensity measure on R Y that describes the distribution ofthe jump sizes Jj ’s and locations θ j ’s of the CRM, and can beconveniently rewritten as ν(dυ, dy) ρ(dυ y) α(dy), whereρ is a transition kernel on R Y controlling the jump intensity,and α is a measure on Y determining the locations of the jumps.The Lévy intensity measure ν is homogeneous if it has the formν(dυ, dy) ρ(dυ) α(dy), which implies that the jumps andlocations are independent.Any BNP prior is uniquely defined by its Lévy intensity νfor the corresponding CRM, and thus can be denoted by μ CRM(ν) (e.g., Lijoi and Prünster 2010). Virtually any standardBNP (CRM) prior (almost-surely) supports positive and finiteCRMs, such that 0 μ(Y) with probability 1, based onthe conditions that ρ(R ) and α(Y) (0, ). The lattercondition allows α to be treated as a probability measure, and isusually rewritten as α(dy) aG0 (dy), with positive parametera 0, and with G0 a probability measure on Y (Regazzini, Lijoi,and Prünster 2003).A normalized random measure with independent increments (NRMI) refers to a CRM μ (with BNP prior) that istransformed to the random probability measure (r.p.m.), G(·) μ(·)/ μ(Y) (Regazzini, Lijoi, and Prünster 2003). NRMIs definea large class of BNP priors. Table 1 presents important examplesof CRM (BNP), among others. Inhomogeneous BNP (CRM)3priors (not shown) include neutral to the right and extendedgamma processes (see Lijoi and Prünster 2010).Any BNP (CRM) prior can admit another more tractablerepresentation that implies μ CRM(ν) based on the priordefining intensity measure ν (e.g., Campbell et al. 2019). Forexample, an inverse-Lévy (series) representation of the CRM μ(BNP prior) is given by (2.1), with decreasing jumps J1 J2 J3 · · · obtained by Jj ν (ξj ) inf{υ : ν([υ, ), Y) jindiidξj } and θ j G(t Jj ), where ξj 1 E , E Exp(1),are the ordered jumps times of a standard Poisson process ofunit rate on R (Ferguson and Klass 1972). Here, ν([υ, ), Y)is the Lévy jump intensity of CRM(ν), a decreasing function ofυ. And ψ (J1 , J2 , . . .) (Jj ) j 1 .Some normalized BNP priors admit a series representationin terms of a species