Bayesian Approximate Kernel Regression With Variable Selection


Journal of the American Statistical Association
ISSN: 0162-1459 (Print) 1537-274X (Online)

To cite this article: Lorin Crawford, Kris C. Wood, Xiang Zhou & Sayan Mukherjee (2018) Bayesian Approximate Kernel Regression With Variable Selection, Journal of the American Statistical Association, 113:524, 1710-1721, DOI: 10.1080/01621459.2017.1361830

Accepted author version posted online: 18 Aug 2017. Published online: 19 Jun 2018.

Theory and Methods

Bayesian Approximate Kernel Regression With Variable Selection

Lorin Crawford (a, b, c), Kris C. Wood (d), Xiang Zhou (e, f), and Sayan Mukherjee (g, h, i, j)

(a) Department of Biostatistics, Brown University, Providence, RI; (b) Center for Statistical Sciences, Brown University, Providence, RI; (c) Center for Computational Molecular Biology, Brown University, Providence, RI; (d) Department of Pharmacology & Cancer Biology, Duke University, Durham, NC; (e) Department of Biostatistics, University of Michigan, Ann Arbor, MI; (f) Center for Statistical Genetics, University of Michigan, Ann Arbor, MI; (g) Department of Statistical Science, Duke University, Durham, NC; (h) Department of Computer Science, Duke University, Durham, NC; (i) Department of Mathematics, Duke University, Durham, NC; (j) Department of Bioinformatics & Biostatistics, Duke University, Durham, NC

CONTACT: Lorin Crawford, lorin [email protected], Department of Biostatistics, Brown University, Providence, RI. Contact information for the co-corresponding authors for this article follows: Xiang Zhou, Department of Biostatistics, University of Michigan, Ann Arbor, MI. E-mail: [email protected]; Sayan Mukherjee, Department of Statistical Science, Duke University, Durham, NC. E-mail: [email protected]

Color versions of one or more of the figures in the article and supplementary materials for this article are available online.

ARTICLE HISTORY: Received April; Revised June

KEYWORDS: Effect size; Epistasis; Kernel regression; Variable selection; Random Fourier features; Statistical genetics

ABSTRACT

Nonlinear kernel regression models are often used in statistics and machine learning because they are more accurate than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this article, we propose a novel framework that provides an effect size analog for each explanatory variable in Bayesian kernel regression models when the kernel is shift-invariant (for example, the Gaussian kernel). We use function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that: (i) captures nonlinear structure, and (ii) can be projected onto the original explanatory variables. This projection onto the original explanatory variables serves as an analog of effect sizes. The specific function analytic property we use is that shift-invariant kernel functions can be approximated via random Fourier bases. Based on the random Fourier expansion, we propose a computationally efficient class of Bayesian approximate kernel regression (BAKR) models for both nonlinear regression and binary classification for which one can compute an analog of effect sizes. We illustrate the utility of BAKR by examining two important problems in statistical genetics: genomic selection (i.e., phenotypic prediction) and association mapping (i.e., inference of significant variants or loci). State-of-the-art methods for genomic selection and association mapping are based on kernel regression and linear models, respectively. BAKR is the first method that is competitive in both settings. Supplementary materials for this article are available online.

1. Introduction

In this article, we formulate a nonlinear regression framework that simultaneously achieves the predictive accuracy of the most powerful nonlinear regression methods in machine learning and statistics, and provides an analog of effect sizes and probability of association for regression coefficients, which are standard quantities in linear regression models.

Methodology and theory for variable selection are far more developed for linear regression models than for nonlinear regression models. In linear models, regression coefficients (i.e., the effect size of a covariate) provide useful information for variable selection. The magnitude and correlation structure of these effect sizes are used by various probabilistic models and algorithms to select relevant covariates associated with the response. Classic variable selection methods, such as forward and stepwise selection (Roman and Speed 2002), use effect sizes to search for main interaction effects. Sparse regression models, both Bayesian (Park and Casella 2008) and frequentist (Tibshirani 1996; Efron et al. 2004), shrink small effect sizes to zero. Factor models use the covariance structure of the observed data to shrink effect sizes for variable selection (West 2003; Hahn, Carvalho, and Mukherjee 2013). Lastly, stochastic search variable selection (SSVS) uses Markov chain Monte Carlo (MCMC) procedures to search the space of all possible subsets of variables (George and McCulloch 1993). All of these methods, except SSVS, use the magnitude and correlation structure of the regression coefficients explicitly in variable selection; SSVS uses this information implicitly.

The main contribution of this article is a (Bayesian) nonlinear regression methodology that is computationally efficient, predicts accurately, and allows for variable selection. The main technical challenge in formulating this novel method is defining and efficiently computing an analog of effect sizes for kernel regression models, a popular class of nonlinear regression models. Kernel regression models have a long history in statistics and applied mathematics (Wahba 1990), and more recently in machine learning (Schölkopf and Smola 2002; Rasmussen and Williams 2006). There is also a large (partially overlapping) literature in Bayesian inference (Pillai et al. 2007; Zhang, Dai, and Jordan 2011; Chakraborty, Ghosh, and Mallick 2012).

The key idea we develop in this article is that for shift-invariant kernels in the $p \gg n$ regime (i.e., the number of variables $p$ is much larger than the number of observations $n$) there exists an accurate linear (in the covariates) approximation to the nonlinear kernel model that allows for the efficient computation of effect sizes. We specify a linear projection from the reproducing kernel Hilbert space (RKHS) to the space of the original covariates to implement the linear approximation. This linear transformation is based on the fact that shift-invariant kernels can be approximated by a linear expansion of random Fourier bases. The idea of random Fourier expansions was initially exploited to obtain kernel regression models with superior runtime properties in both training and testing (Rahimi and Recht 2007; Băzăvan, Li, and Sminchisescu 2012). In this article, we use the random Fourier bases to efficiently compute the analog of effect sizes for nonlinear kernel models.

The variable selection framework we develop is implemented as a Bayesian empirical factor model that scales to large datasets. Previous efforts to carry out variable selection in fully Bayesian kernel models have faced challenges when applied to large datasets due to either solving nonconvex optimization problems (Chapelle et al. 2002; Rakotomamonjy 2003; Rosasco et al. 2013) or due to sampling from Markov chains that typically mix poorly (Chakraborty et al. 2007; Chakraborty 2009). Indeed, there have been recent works that attempt to overcome this problem with various approaches (Snoek et al. 2015; Gray-Davies, Holmes, and Caron 2016; Sharp et al. 2016). The main utility of our approach is that variable selection for nonlinear functions reduces to a factor model coupled with a linear projection.

In Section 2, we introduce properties of RKHS models and detail some of the basic functional analysis tools that allow for mapping from the RKHS of nonlinear functions to functions linear in the covariates.
In Section 3, we specify the Bayesian approximate kernel regression (BAKR) model for nonlinear regression with variable selection. Here, we also define the posterior probability of association analog (PPAA), which provides marginal evidence for the relevance of each variable. In Section 4, we show the utility of our methodology on real and simulated data. Specifically, we focus on how our model addresses two important problems in statistical genetics: genomic selection and association mapping. Finally, we close with a discussion in Section 5.

2. Theoretical Overview

In this article, we focus on nonlinear regression functions that belong to an infinite-dimensional function space called a reproducing kernel Hilbert space (RKHS). The theory we develop in this section will help to formalize the following two observations in the $p \gg n$ setting: (i) the predictive accuracy of smooth nonlinear functions is typically greater than both linear functions and sharply varying nonlinear functions; (ii) in the high-dimensional setting, a smooth nonlinear function can be reasonably approximated by a linear function. In the remainder of this section, we develop a framework that we will use in Section 3 to define a linear projection from an RKHS onto the original covariates. This projection will serve as an analog for effect sizes in linear models. Thorough reviews of the utility and theory of RKHS can be found in other selected works (e.g., Pillai et al. 2007; Bach 2017).

2.1. Reproducing Kernel Hilbert Spaces

One can define an RKHS based on a positive definite kernel function, $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, or based on the eigenfunctions $\{\psi_i\}_{i=1}^{\infty}$ and eigenvalues $\{\lambda_i\}_{i=1}^{\infty}$ of the integral operator defined by the kernel function, $\lambda_i \psi_i(u) = \int_{\mathcal{X}} k(u, v)\, \psi_i(v)\, dv$.
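As a finite-sample illustration of the integral operator above, the following sketch (not taken from the article) discretizes it on a hypothetical random sample: the Gram matrix of a Gaussian kernel divided by $n$ stands in for the operator, and its eigenpairs stand in for $(\lambda_i, \psi_i)$ evaluated at the sample points. The bandwidth parameterization, the simulated data, and the function name `gaussian_gram` are assumptions made for illustration only.

```python
import numpy as np

def gaussian_gram(X, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # hypothetical sample: n = 50 observations, p = 3 covariates
n = X.shape[0]
K = gaussian_gram(X)

# Discretizing the integral operator on the sample gives K / n; its eigenpairs
# play the role of (lambda_i, psi_i) evaluated at the sample points.
eigvals, eigvecs = np.linalg.eigh(K / n)

# Rebuild the Gram matrix from the eigenpairs of K / n.
K_rebuilt = n * (eigvecs * eigvals) @ eigvecs.T

min_eig = eigvals.min()                # nonnegative up to round-off for a positive definite kernel
max_err = np.abs(K - K_rebuilt).max()  # reconstruction error of the eigen-expansion
```

Under these assumptions, the smallest eigenvalue should be nonnegative up to round-off, which reflects the positive definiteness of the kernel, and the Gram matrix is recovered from its eigenpairs.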

For a Mercer kernel (Mercer 1909) the following expansion holds: $k(u, v) = \sum_{i=1}^{\infty} \lambda_i \psi_i(u) \psi_i(v)$, and the RKHS can be alternatively defined as the closure of linear combinations of basis functions $\{\psi_i\}_{i=1}^{\infty}$,

$$\mathcal{H} = \Big\{ f \,\Big|\, f(x) = \psi(x)^{\top} c,\ \forall x \in \mathcal{X} \text{ and } \|f\|_K < \infty \Big\} \quad \text{with} \quad \|f\|_K^2 = \sum_{i=1}^{\infty} \frac{c_i^2}{\lambda_i^2}.$$

Here, $\|f\|_K$ is the RKHS norm, $\psi(x) = \{\sqrt{\lambda_i}\, \psi_i(x)\}_{i=1}^{\infty}$ is a vector space spanned by the bases, and $c = \{c_i\}_{i=1}^{\infty}$ are the corresponding coefficients. The above specification of an RKHS looks very much like a linear regression model, except the bases are $\psi(x)$ (rather than the unit basis), and the space can be infinite-dimensional.

Kernel regression models in machine learning are often defined by the following penalized loss function (Hastie, Tibshirani, and Friedman 2001, sec. 5.8)

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \lambda \|f\|_K^2, \qquad (1)$$

where $\{(x_i, y_i)\}_{i=1}^{n}$ represents $n$ observations of covariates $x_i \in \mathcal{X} \subseteq \mathbb{R}^p$ and responses $y_i \in \mathcal{Y} \subseteq \mathbb{R}$, $L$ is a loss function, and $\lambda > 0$ is a tuning parameter chosen to balance the trade-off between fitting errors and the smoothness of the function. Kernel models are popular because the minimizer of (1) is a linear combination of kernel functions $k(u, v)$ centered at the observed data (Schölkopf, Herbrich, and Smola 2001):

$$\hat{f}(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i), \qquad (2)$$

where $\alpha = \{\alpha_i\}_{i=1}^{n}$ are the corresponding kernel coefficients. The key point here is that the form of (2) turns an $\infty$-dimensional optimization problem into an optimization problem over $n$ parameters. We denote the subspace of the RKHS realized by the representer theorem as

$$\mathcal{H}_X = \Big\{ f \,\Big|\, f(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i),\ \alpha \in \mathbb{R}^n \text{ and } \|f\|_K^2 < \infty \Big\}.$$

We can also define the subspace $\mathcal{H}_X$ in terms of the operator $\Psi_X = [\psi(x_1), \ldots, \psi(x_n)]$ with

$$\mathcal{H}_X = \Big\{ f \,\Big|\, f(x) = \Psi_X^{\top} c \text{ and } \|f\|_K^2 < \infty \Big\}. \qquad (3)$$

To extract an analog of effect sizes from our Bayesian kernel model, we will use the equivalent representations (2) and (3). Indeed, one can verify that $c = \Psi_X \alpha$.

2.2. Variable Selection in Kernel Models

Variable selection in kernel models has often been formulated in terms of anisotropic functions

$$k_{\vartheta}(u, v) = k\big((u - v)^{\top} \mathrm{Diag}(\vartheta)(u - v)\big), \qquad \vartheta_j > 0,\ j = 1, \ldots, p.$$
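To make the role of the coordinate weights concrete, here is a minimal sketch, assuming the outer function is $k(z) = \exp(-z)$ so that the anisotropic function becomes a weighted Gaussian kernel. This choice, the weight values, and the name `anisotropic_kernel` are illustrative assumptions rather than the article's implementation. A coordinate with a weight near zero barely changes the kernel value, while a coordinate with a larger weight dominates it.

```python
import numpy as np

def anisotropic_kernel(u, v, theta):
    """k_theta(u, v) = exp(-(u - v)^T Diag(theta) (u - v)), assuming k(z) = exp(-z)."""
    u, v, theta = (np.asarray(a, dtype=float) for a in (u, v, theta))
    diff = u - v
    # (u - v)^T Diag(theta) (u - v) is a weighted sum of squared differences.
    return float(np.exp(-np.sum(theta * diff**2)))

# Hypothetical weights: the third coordinate is nearly irrelevant.
theta = np.array([1.0, 1.0, 1e-6])
u = np.zeros(3)

k_same = anisotropic_kernel(u, u, theta)                      # identical points
k_move_first = anisotropic_kernel(u, [2.0, 0.0, 0.0], theta)  # shift a heavily weighted coordinate
k_move_third = anisotropic_kernel(u, [0.0, 0.0, 2.0], theta)  # shift a lightly weighted coordinate
```

In this sketch, shifting the first coordinate drives the kernel value toward zero, whereas the same shift in the third coordinate leaves it close to one.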

Here, the vector $\vartheta$ represents the weight of each coordinate and is to be inferred from data. Optimization-based approaches (Chapelle et al. 2002; Rakotomamonjy 2003; Rosasco et al. 2013) implement variable selection by solving an optimization problem

$$\{\hat{f}, \hat{\vartheta}\} = \arg\min_{f \in \mathcal{H}_{\vartheta},\, \vartheta} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \lambda \|f\|_{K_{\vartheta}}^2,$$

where $\mathcal{H}_{\vartheta}$ is the RKHS induced by the kernel $k_{\vartheta}$, and the magnitude of $\vartheta$ is evidence of the relevance of each variable. The joint optimization over $(\mathcal{H}_{\vartheta}, \vartheta)$ is a nonconvex problem and does not scale well with respect to the number of variables or the number of observations. In the case of Bayesian algorithms, the idea is to sample or stochastically search over the posterior distribution

$$p(\vartheta, \alpha \mid \{y_i, x_i\}_{i=1}^{n}) \propto \exp\Big\{ -\sum_{i=1}^{n} L(f(x_i), y_i) \Big\}\, \pi(\vartheta, \alpha),$$

where $\pi(\vartheta, \alpha)$ is the prior distribution over the parameters and $\exp\{-\sum_{i=1}^{n} L(f(x_i), y_i)\}$ is the likelihood. Sampling over $\vartheta \in \mathbb{R}^p$ is challenging due to the complicated landscape, and Markov chains typically do not mix well in this setting (Chakraborty et al. 2007; Chakraborty 2009). We will propose a very different approach to variable selection by projecting the RKHS onto linear functions with little loss of information. This projection operator will be based on random Fourier features.

2.3. Random Fourier Features

In