Inference Over Heterogeneous Finite-/Infinite-Dimensional Systems Using Factor Graphs and Gaussian Processes


David M. Rosen, Guoquan Huang, and John J. Leonard

The authors are with the Computer Science and Artificial Intelligence Laboratory of the Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Email: {dmrosen,gqhuang,jleonard}@mit.edu.

Abstract—The ability to reason over partially observable networks of interacting states is a fundamental competency in probabilistic robotics. While the well-known factor graph and Gaussian process models provide flexible and computationally efficient solutions for this inference problem in the special cases in which all of the hidden states are either finite-dimensional parameters or real-valued functions, respectively, in many cases we are interested in reasoning about heterogeneous networks whose hidden states are comprised of both finite-dimensional parameters and functions. To that end, in this paper we propose a novel probabilistic generative model that incorporates both factor graphs and Gaussian processes to model these heterogeneous systems. Our model improves upon prior approaches to inference within these networks by removing the assumption of any specific set of conditional independences amongst the modeled states, thereby significantly expanding the class of systems that can be represented. Furthermore, we show that inference within this model can always be performed by means of a two-stage procedure involving inference within a factor graph followed by inference over a Gaussian process; by exploiting fast inference methods for the individual factor graph and Gaussian process models to solve each of these subproblems in succession, we thus obtain a general framework for computationally efficient inference over heterogeneous finite-/infinite-dimensional systems.

I. INTRODUCTION

Many fundamental problems in robotics can be formulated as instances of inference over a network of interacting random states in which the goal is to estimate the values of some subset of the states given noisy observations of others; for example, the canonical problems of filtering, smoothing, localization, and mapping all belong to this class [1]. The development of computationally efficient inference methods for solving problems of this type is thus of significant practical import.

For the common special case in which all of the states are finite-dimensional parameters, factor graphs [2] have proven to be particularly useful: these models generalize and encompass both Bayesian networks and Markov random fields [3] (thus providing a unified theoretical framework for inference), and recent work in the robotics community has led to the development of efficient inference algorithms for these models that can solve problems involving tens of thousands of continuous random variables in real-time on a single processor [4]–[7].

More recently, Gaussian processes [8] have emerged as another useful class of models for the special case in which the hidden states are infinite collections of real values (i.e. real-valued functions); for example, recent work has applied these models to develop generalized Bayes filters [9] and Rauch-Tung-Striebel smoothers [10] in which the entire process and observation models can be estimated nonparametrically directly from data.

However, we are often interested in modeling heterogeneous networks whose states are comprised of both finite-dimensional parameters and entire functions (this is the case, for example, when modeling hybrid discrete-/continuous-time systems).
While some special cases of this problem have recently been addressed in the robotics literature (e.g. in [9]–[11]), the approaches presented therein are developed for networks that assume the conditional independence of specific subsets of the modeled states (e.g. the hidden Markov models in [9], [10]), and therefore may not generalize to broader classes of networks in which these relations no longer hold. Indeed, to the best of our knowledge, there has been no discussion in the robotics literature of efficient inference methods for heterogeneous finite-/infinite-dimensional systems in the general case.

To that end, in this paper we develop a probabilistic generative model incorporating factor graphs and Gaussian processes to represent generic heterogeneous finite-/infinite-dimensional systems. In contrast to prior work, our model does not assume any specific set of conditional independences amongst the modeled states; the only requirement is that any interaction between states must be mediated by a finite set of finite-dimensional values (a significantly weaker condition). Furthermore, we show that inference within this model can always be performed by means of a two-stage procedure involving inference within a factor graph followed by inference over a Gaussian process; by applying efficient inference methods (based upon recent advances in sparse numerical linear algebra [12]–[14]) for the individual factor graph and Gaussian process models to solve each of these subproblems in succession, we are thus still able to opportunistically exploit whatever conditional independences do hold in a particular application (in the form of sparse linear systems, as described in Section II) to achieve fast computation without the need to explicitly require these independences in the design of our model itself. Through this approach we thus obtain a flexible and computationally efficient framework for inference over heterogeneous finite-/infinite-dimensional systems in the general case.

II. REVIEW OF MATHEMATICAL PRELIMINARIES

In this section we review the formulations of the factor graph and Gaussian process models together with their associated inference algorithms.

A. Factor graphs

1) Model formulation: A factor graph [2] is a bipartite graph that encodes the factorization of a probability distribution: given a distribution p : Ω → ℝ over several variables Θ = (θ_1, ..., θ_n) ∈ Ω with factorization

    p(\Theta) = \prod_{i=1}^{m} p_i(\Theta_i),    (1)

where Θ_i ⊆ {θ_1, ..., θ_n} for all 1 ≤ i ≤ m, the corresponding factor graph G = (F, Θ, E) is:

    F = \{p_1, \dots, p_m\},
    \Theta = \{\theta_1, \dots, \theta_n\},    (2)
    E = \{(p_i, \theta_j) \mid \theta_j \in \Theta_i\}.

By (2), the factor nodes p_i ∈ F of G are in one-to-one correspondence with the factors of p in (1), the variable nodes θ_j ∈ Θ are in one-to-one correspondence with the arguments of p, and factor node p_i and variable node θ_j share an edge e_ij = (p_i, θ_j) ∈ E if and only if the variable θ_j appears as an argument to factor p_i in (1).

2) Inference in factor graphs: In general, we will be interested in the problem of Bayesian inference: given a hidden parameter Θ, an observable variable Z, and the joint distribution p(Z, Θ) = p(Z | Θ) · p(Θ) relating the two, we wish to obtain the posterior belief p(Θ | Z) for Θ given a measured value of Z:

    p(\Theta \mid Z) = \frac{p(Z, \Theta)}{p(Z)}.    (3)

Unfortunately, without assuming special structure in the joint distribution p(Z, Θ) (e.g. prior conjugacy, etc.), the computation of the exact posterior p(Θ | Z) is generally intractable, since this requires the evaluation of a (possibly very high-dimensional) integral to obtain the evidence p(Z):

    p(Z) = \int p(Z, \Theta) \, d\Theta.    (4)

On the other hand, it is often relatively straightforward to obtain the maximum a posteriori (MAP) point estimate Θ̂_MAP:

    \hat{\Theta}_{MAP} = \operatorname{argmax}_{\Theta} p(\Theta \mid Z);    (5)

for even if p(Z) in (3) is unknown, it is constant with respect to Θ, and therefore (by virtue of (3) and (5)),

    \hat{\Theta}_{MAP} = \operatorname{argmax}_{\Theta} \frac{p(Z, \Theta)}{p(Z)} = \operatorname{argmax}_{\Theta} p(Z, \Theta).    (6)

Computation of Θ̂_MAP thus only requires that it be tractable to optimize the joint distribution p(Z, Θ) as a function of Θ.

Let us assume in the sequel (as is commonly the case in practice) that p(Z, Θ) is a twice-differentiable probability density function, and that it factors as

    p(Z, \Theta) = \prod_{i=1}^{m} p_i(Z_i, \Theta_i).    (7)

Under these conditions, the computational cost of optimizing (6) using Newton-type methods [15] turns out to be determined by the sparsity pattern of the factor graph corresponding to (7). To see this, observe that

    \hat{\Theta}_{MAP} = \operatorname{argmax}_{\Theta} p(Z, \Theta)
                       = \operatorname{argmin}_{\Theta} -\ln p(Z, \Theta)    (8)
                       = \operatorname{argmin}_{\Theta} -\sum_{i=1}^{m} \ln p_i(Z_i, \Theta_i),

so that (by virtue of the final line of (8)) the Hessian

    H = \frac{\partial^2}{\partial \Theta^2} \left[ -\ln p(Z, \Theta) \right] = (H_{jk})_{j,k=1}^{\dim(\Theta)}    (9)

has H_jk ≢ 0 only if θ_j, θ_k ∈ Θ_i for some factor p_i in (7), i.e., only if variable nodes θ_j and θ_k are connected to a common factor node p_i in G. The edge set E of the factor graph G corresponding to (7) thus directly encodes the sparsity pattern of the Hessian H, and this in turn determines the cost of computing the update step during each iteration of the optimization (8) (cf. e.g. [5]–[7] for details).

After computing the point estimate Θ̂_MAP in (8), one can additionally recover an approximation for the entire posterior p(Θ | Z) by means of the Laplace approximation [16, Sec. 4.4]:

    \Theta \mid Z \approx \mathcal{N}\left( \hat{\Theta}_{MAP}, \Lambda_{\Theta \mid Z}^{-1} \right);    (10a)
    \Lambda_{\Theta \mid Z} = \frac{\partial^2}{\partial \Theta^2} \left[ -\ln p(Z, \Theta) \right] \Big|_{\Theta = \hat{\Theta}_{MAP}};    (10b)

this approach approximates the true posterior using a Gaussian distribution that is locally fitted to p(Θ | Z) at Θ̂_MAP (more precisely, it fits a second-order Taylor series expansion to −ln p(Z, Θ) at Θ̂_MAP). We observe that the information matrix Λ_{Θ|Z} defined in (10b) is simply the Hessian in (9) evaluated at Θ̂_MAP; since Newton-type methods compute this matrix (or an approximation to it) as part of solving the optimization (8), it is thus available at no additional computational cost beyond that needed to compute Θ̂_MAP itself.
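To make the optimization (8) and the Laplace approximation (10) concrete, the following short Python sketch works through a hypothetical toy example; the chain length, measurements, and noise values are invented for illustration and do not come from the paper. It builds a small linear-Gaussian factor graph over a chain of scalar states (one prior factor plus Gaussian relative-measurement factors), computes the MAP estimate with a Gauss-Newton iteration that uses the standard J^T J approximation to the Hessian (9), and reads off the Laplace approximation from the resulting information matrix. Note how the nonzero pattern of that matrix reproduces the edge structure of the factor graph, as described above.

import numpy as np

# Hypothetical toy factor graph: a chain of n scalar states theta_1, ..., theta_n with
# a prior factor on theta_1 and Gaussian "odometry" factors on consecutive pairs.
n = 5
prior_mean, prior_sigma = 0.0, 0.1
odom_meas = np.array([1.0, 1.1, 0.9, 1.0])      # invented relative measurements z_i
odom_sigma = 0.2

def residuals(theta):
    """Stacked, whitened residuals r(theta); -ln p(Z, theta) = 0.5 * ||r||^2 + const."""
    r_prior = (theta[0] - prior_mean) / prior_sigma
    r_odom = (theta[1:] - theta[:-1] - odom_meas) / odom_sigma
    return np.concatenate(([r_prior], r_odom))

def jacobian(theta):
    """Jacobian of the stacked residuals; its sparsity mirrors the factor graph edges."""
    J = np.zeros((n, n))
    J[0, 0] = 1.0 / prior_sigma
    for i in range(n - 1):
        J[i + 1, i] = -1.0 / odom_sigma
        J[i + 1, i + 1] = 1.0 / odom_sigma
    return J

# Gauss-Newton iteration for the MAP estimate (8); this problem is linear, so a single
# step converges, but the loop shows the general structure.
theta = np.zeros(n)
for _ in range(10):
    r, J = residuals(theta), jacobian(theta)
    H = J.T @ J                                  # Gauss-Newton approximation of (9)
    delta = np.linalg.solve(H, -J.T @ r)
    theta += delta
    if np.linalg.norm(delta) < 1e-10:
        break

theta_map = theta
Lambda = jacobian(theta_map).T @ jacobian(theta_map)   # information matrix, cf. (10b)
print("MAP estimate:", theta_map)
print("Laplace marginal std devs:", np.sqrt(np.diag(np.linalg.inv(Lambda))))
print("Hessian sparsity pattern (1 = variables share a factor):")
print((np.abs(Lambda) > 1e-12).astype(int))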
B. Gaussian processes

1) Model formulation: Formally, a Gaussian process is a collection of random variables {F_x}_{x ∈ 𝒳} ⊆ ℝ^d indexed by some (possibly infinite) set 𝒳, any finite subset of which have a jointly Gaussian distribution [8], [17].

Since finite-dimensional Gaussian distributions are completely determined by their means and (co)variances, the joint distribution for a finite subset {F_{x_i}}_{i=1}^{n} of Gaussian-process-distributed random variables can be completely specified by first and second moments of the form E[F_x] and E[(F_x − E[F_x])(F_{x'} − E[F_{x'}])^T] for x, x' ∈ {x_i}_{i=1}^{n}. Since these moments exist for any choices of x, x' ∈ 𝒳, we may define functions according to

    m : \mathcal{X} \to \mathbb{R}^d
    m(x) = \mathbb{E}[F_x]    (11)

and

    k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{d \times d}
    k(x, x') = \mathbb{E}\left[ (F_x - \mathbb{E}[F_x])(F_{x'} - \mathbb{E}[F_{x'}])^T \right]
             = \mathbb{E}\left[ (F_x - m(x))(F_{x'} - m(x'))^T \right];    (12)

the functions m and k defined in (11) and (12) are called the mean function and covariance function for the Gaussian process, respectively. Conversely, given any function m : 𝒳 → ℝ^d and any positive-definite matrix-valued kernel k : 𝒳 × 𝒳 → ℝ^{d×d}, m and k determine (by means of (11) and (12)) a Gaussian process {F_x}_{x ∈ 𝒳} ⊆ ℝ^d for which they are the mean and covariance functions [8], [17]. We write

    f \sim \mathcal{GP}(m, k)    (13)

to denote that f ≜ {F_x}_{x ∈ 𝒳} is a collection of random variables distributed according to the Gaussian process with mean function m and covariance function k.

Since the Gaussian process (13) assigns to each x ∈ 𝒳 a Gaussian-distributed random value F_x ∈ ℝ^d, the entire collection of random variables f = {F_x}_{x ∈ 𝒳} can be thought of as a random function f : 𝒳 → ℝ^d. In this view (the so-called function-space view [8]), a Gaussian process GP(m, k) specifies a probability distribution over the entire set of functions {f : 𝒳 → ℝ^d} whose domain is 𝒳. Gaussian processes thus provide a very useful class of priors for performing nonparametric regression and interpolation/extrapolation in a Bayesian setting when the true parametric form of the regression function is itself uncertain.
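The function-space view can be made tangible with a few lines of code: under a Gaussian process prior, the values of f on any finite grid of inputs are jointly Gaussian with mean m(X) and covariance K(X, X), so sample "functions" can be drawn by sampling from that finite-dimensional Gaussian. The sketch below is a hypothetical illustration only; the zero mean function, squared-exponential kernel, and grid are invented here and are not prescribed by the paper.

import numpy as np

def m(x):
    """Zero prior mean function, cf. (11)."""
    return np.zeros_like(x)

def k(x, y, length_scale=0.4):
    """Scalar (d = 1) squared-exponential covariance function, cf. (12)."""
    diff = x[:, None] - y[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

x_grid = np.linspace(0.0, 1.0, 50)                   # finite subset of the index set
M = m(x_grid)                                        # mean vector m(X)
K = k(x_grid, x_grid) + 1e-8 * np.eye(x_grid.size)   # Gram matrix of pairwise covariances (jittered)

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(M, K, size=3)      # three sample "functions" f(x_grid)

print(samples.shape)      # (3, 50): each row is one draw of f evaluated on the grid
print(samples[:, :5])     # first few values of each sample function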

2) Inference in Gaussian processes: When performing inference with Gaussian process models, we will generally be interested in determining the posterior belief for a function with a Gaussian process prior given observations of its values at several points. More precisely, we wish to determine the belief for f | F, where f ∼ GP(m, k) and

    X = (x_1, \dots, x_{n_X}) \in \mathcal{X}^{n_X},
    F = f(X) = (f(x_1), \dots, f(x_{n_X})) \in \mathbb{R}^{d n_X}    (14)

denote a vector of n_X points in 𝒳 and the corresponding vector of f's values at those points, respectively.

First, we observe that in order to recover the posterior belief for f, it suffices to determine the posterior belief F_* | F for f's values F_* = f(X_*) on some second finite subset of test points X_* ∈ 𝒳^{n_{X_*}}, since we can then obtain f | F from F_* | F by simply taking X_* = x_* ∈ 𝒳 to be a single test point and then allowing it to vary pointwise over the entire domain 𝒳.

To that end, consider the joint distribution for (F, F_*):

    \begin{pmatrix} F \\ F_* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} M \\ M_* \end{pmatrix},\; \begin{pmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix} \right),    (15)

where

    M = m(X) = (m(x_1), \dots, m(x_{n_X})) \in \mathbb{R}^{d n_X},    (16)

M_* = m(X_*) analogously, and K(X, Y) denotes the Gram matrix

    K(X, Y) = \begin{pmatrix} k(x_1, y_1) & \cdots & k(x_1, y_{n_Y}) \\ \vdots & \ddots & \vdots \\ k(x_{n_X}, y_1) & \cdots & k(x_{n_X}, y_{n_Y}) \end{pmatrix} \in \mathbb{R}^{d n_X \times d n_Y}    (17)

for two vectors X ∈ 𝒳^{n_X} and Y ∈ 𝒳^{n_Y}. Since F and F_* are jointly Gaussian-distributed according to (15), the conditional distribution F_* | F is also Gaussian-distributed according to

    F_* \mid F \sim \mathcal{N}\left( M_{F_* \mid F}, \Sigma_{F_* \mid F} \right),
    M_{F_* \mid F} = M_* + K(X_*, X) K_X^{-1} (F - M),    (18)
    \Sigma_{F_* \mid F} = K(X_*, X_*) - K(X_*, X) K_X^{-1} K(X, X_*),

where we have introduced the notation K_X ≜ K(X, X) since this quantity is constant with respect to X_*.

Now since (18) holds for every choice of X and X_*, then letting X_* = x_* ∈ 𝒳 be a single point (so that F_* = f(x_*) is just the value of f at x_*), we find that the posterior belief for f is again Gaussian process-distributed according to

    f \mid F \sim \mathcal{GP}(\bar{m}, \bar{k}),    (19a)
    \bar{m}(x) = m(x) + k_X(x) K_X^{-1} (F - M),    (19b)
    \bar{k}(x, x') = k(x, x') - k_X(x) K_X^{-1} k_X(x')^T,    (19c)

where k_X is the single-variable function defined by

    k_X : \mathcal{X} \to \mathbb{R}^{d \times d n_X}
    k_X(x) = K(x, X)    (20)

for fixed X. Gaussian processes thus provide a class of priors on the set of functions {f : 𝒳 → ℝ^d} that is closed under posterior updates given observations of the function values F = f(X) on some finite subset of points X ∈ 𝒳^{n_X}.
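As a concrete illustration of the posterior update (19), the following sketch performs scalar (d = 1) Gaussian process regression with a zero mean function and a squared-exponential kernel; the training inputs, observed values, and kernel hyperparameters are invented for this example and are not taken from the paper.

import numpy as np

def rbf_kernel(x, y, length_scale=0.5, signal_var=1.0):
    """Squared-exponential covariance k(x, x') evaluated between two point sets."""
    diff = x[:, None] - y[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)

# Training points X and observed function values F = f(X) (zero prior mean, M = 0).
X = np.array([-1.0, -0.3, 0.4, 1.2])
F = np.sin(2.0 * X)

# Test points X* at which we want the posterior belief for f.
X_star = np.linspace(-2.0, 2.0, 9)

K_X = rbf_kernel(X, X) + 1e-9 * np.eye(len(X))   # K_X = K(X, X), jittered for stability
k_star = rbf_kernel(X_star, X)                   # rows are k_X(x*) = K(x*, X), cf. (20)

# Posterior mean and covariance, eqs. (19b)-(19c) with m = 0:
#   m_bar(x*)      = k_X(x*) K_X^{-1} (F - M)
#   k_bar(x*, x*') = k(x*, x*') - k_X(x*) K_X^{-1} k_X(x*')^T
alpha = np.linalg.solve(K_X, F)
m_bar = k_star @ alpha
k_bar = rbf_kernel(X_star, X_star) - k_star @ np.linalg.solve(K_X, k_star.T)

for x, mu, var in zip(X_star, m_bar, np.diag(k_bar)):
    print(f"x* = {x:+.2f}   posterior mean = {mu:+.3f}   posterior std = {np.sqrt(max(var, 0.0)):.3f}")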
3) A word on the design of Gaussian process models: As in all kernel-based methods, the covariance (i.e. kernel) function k(x, x') plays a crucial role in Gaussian process models. In this subsection, we show how to characterize several practically-important properties of Gaussian processes (e.g. sample function differentiability class) in terms of easily ascertained properties of their kernel functions, and provide a few guidelines for the design of k
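One property of this kind is the smoothness of sample functions, which is governed by the behavior of k near its diagonal. As a rough, hypothetical illustration (the kernels, length scale, and grid below are chosen only for this sketch), prior draws under a squared-exponential kernel are smooth, whereas draws under an exponential (Ornstein-Uhlenbeck) kernel are continuous but nowhere differentiable, and the difference shows up clearly in the size of their increments.

import numpy as np

def se_kernel(x, y, ell=0.3):
    """Squared-exponential kernel: smooth sample paths."""
    return np.exp(-0.5 * ((x[:, None] - y[None, :]) / ell) ** 2)

def ou_kernel(x, y, ell=0.3):
    """Exponential (Ornstein-Uhlenbeck) kernel: continuous but non-differentiable paths."""
    return np.exp(-np.abs(x[:, None] - y[None, :]) / ell)

x = np.linspace(0.0, 1.0, 200)
rng = np.random.default_rng(1)
jitter = 1e-6 * np.eye(x.size)

f_smooth = rng.multivariate_normal(np.zeros_like(x), se_kernel(x, x) + jitter)
f_rough = rng.multivariate_normal(np.zeros_like(x), ou_kernel(x, x) + jitter)

# Average squared increment over the grid: much larger for the OU draw, reflecting
# its rougher sample paths.
print("mean squared increment, SE kernel:", np.mean(np.diff(f_smooth) ** 2))
print("mean squared increment, OU kernel:", np.mean(np.diff(f_rough) ** 2))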