Sequential Complexities And Uniform Martingale Laws Of .

5m ago
38 Views
0 Downloads
564.89 KB
37 Pages
Transcription

Noname manuscript No.(will be inserted by the editor)Sequential Complexities and Uniform MartingaleLaws of Large NumbersAlexander Rakhlin Karthik Sridharan Ambuj TewariReceived: date / Accepted: dateAbstract We establish necessary and sufficient conditions for a uniform martingale Law of Large Numbers. We extend the technique of symmetrization tothe case of dependent random variables and provide “sequential” (non-i.i.d.)analogues of various classical measures of complexity, such as covering numbers and combinatorial dimensions from empirical process theory. We establishrelationships between these various sequential complexity measures and showthat they provide a tight control on the uniform convergence rates for empirical processes with dependent data. As a direct application of our results, weprovide exponential inequalities for sums of martingale differences in Banachspaces.Keywords empirical processes dependent data uniform Glivenko-Cantelliclasses Rademacher averages sequential predictionMathematics Subject Classification (2000) 60E15 60C05 60F15 91A20Alexander RakhlinUniversity of Pennsylvania, 3730 Walnut St, Philadelphia, PA 19103E-mail: [email protected] SridharanUniversity of Pennsylvania, 3730 Walnut St, Philadelphia, PA 19103E-mail: [email protected] TewariUniversity of Michigan, 439 West Hall, 1085 South University, Ann Arbor, MI 48109E-mail: [email protected]

2Alexander Rakhlin et al.1 IntroductionLet (Ω, A, P) be an arbitrary complete probability space. Let Z be a separable metric space and F {f Z R} be a set of bounded real valued functions on Z. Consider independent and identically distributed random variablesZ1 , . . . , Zn , . . . in Z with the common distribution P. The empirical processindexed by f F is defined asf Gn (f ) 1 n (Ef (Z) f (Zt )) .n t 1The study of the behavior of the supremum of this process is a central topic inempirical process theory, and it is well known that this behavior depends onthe “richness” of F. Statements about convergence of the supremum to zeroare known as uniform Laws of Large Numbers (LLN). More precisely, a class Fis said to be (strong) Glivenko-Cantelli for the distribution P if the supremumof Gn (f ) converges to zero almost surely as n . Of particular interest areclasses for which this convergence happens uniformly for all distributions. Aclass F is said to be uniform Glivenko-Cantelli if δ 0, limsup P (sup sup Gn (f ) δ) 0′n Pn n′ f F(1)where P is the product measure P . As a classical example, consider i.i.d. realvalued random variables Z1 , . . . , Zn and a class F {z 1 {z θ} θ R},where 1 {} is the indicator function. For this class, (1) holds by the well knownresults of Glivenko and Cantelli: almost surely, the supremum of the differencebetween the cumulative distribution function and the empirical distributionfunction converges to zero. A number of necessary and sufficient conditions forthe Glivenko-Cantelli and the uniform Glivenko-Cantelli properties have beenderived over the past several decades [11].In this paper, we are interested in the martingale analogues of the uniformLLN, as well as in the analogues to the various notions of complexity thatappear in empirical process theory. Specifically, consider a sequence of randomvariables (Zt )t 1 adapted to a filtration (At )t 1 . We are interested in thefollowing process indexed by f F:f Mn (f ) 1 n (E[f (Zt ) At 1 ] f (Zt )) .n t 1The central object of study in this paper is the supremum of the processMn (f ), and in particular we address the question of whether a uniform convergence similar to (1) holds. Evidently, Mn (f ) coincides with Gn (f ) in thecase when Z1 , Z2 , . . . are i.i.d. random variables. More generally, for any fixedf F, the sequence (E [f (Zt ) At 1 ] f (Zt ))t 1 is a martingale differencesequence. Similar to the notion of uniform Glivenko-Cantelli class F, we candefine the notion of uniform convergence for dependent random variables overfunction class F as follows.

Sequential Complexities and Uniform Martingale Laws of Large Numbers3Definition 1 A function class F satisfies Sequential Uniform Convergence if, δ 0, limsup P (sup sup Mn (f ) δ) 0 ,′n Pn n′ f F(2)where the supremum is over all distributions P on the space (Ω, A).The gap between properties (1) and (2) is already witnessed by the exampleof the class F {z 1 {z θ} θ R} of functions on R, discussed earlier. Incontrast to the uniform Glivenko-Cantelli property, the martingale analogue(2) does not hold for this class. On the positive side, the necessary and sufficientconditions for a class F to satisfy sequential uniform convergence, as derivedin this paper, can be verified for a wide range of interesting classes.2 Summary of the ResultsOne of the main results in this paper is the following equivalence.Theorem 1 Let F be a class of [ 1, 1]-valued functions. Then the followingstatements are equivalent.1. F satisfies Sequential Uniform Convergence.2. For any α 0, the sequential fat-shattering dimension fatα (F) is finite.3. Sequential Rademacher complexity Rn (F) satisfies limn Rn (F) 0.Theorem 1 yields a characterization of the uniform convergence property interms of two quantities. The first one is a combinatorial “dimension” of theclass at scale α (Definition 7). The second is a measure of complexity of theclass through random averages (Definition 3). In addition to these quantities,we define sequential versions of covering numbers and the associated Dudleytype entropy integral. En route to proving Theorem 1, we obtain key relationships between the introduced covering numbers, the combinatorial dimensions,and random averages. These relationships constitute the bulk of the paper, andcan be considered as martingale extensions of the results in empirical processtheory. Specifically, we show– A relationship between the empirical process with dependent random variables and the sequential Rademacher complexity (Theorem 2), obtainedthrough sequential symmetrization.– An upper bound of sequential Rademacher complexity by a Dudley-typeentropy integral through the chaining technique (Theorem 3).– An upper bound on sequential covering numbers in terms of the combinatorial dimensions (Theorems 4 and 5, as well as Corollary 1). In particular,Theorem 5 is a sequential analogue of the celebrated Vapnik-ChervonenkisSauer-Shelah lemma.

4Alexander Rakhlin et al.– A relationship between the combinatorial dimension and sequential Rademacher complexity (Lemma 2) and, as a consequence, equivalence of manyof the introduced complexity notions up to a poly-logarithmic factor.– Properties of sequential Rademacher complexity and, in particular, thecontraction inequality (Lemma 7).– An extension of the above results to high-probability statements (Lemmas4, 5, and 6) and an application to concentration of martingales in Banachspaces (Corollary 2).This paper is organized as follows. In the next section we place the presentpaper in the context of previous work. In Sections 4-6 we introduce sequentialcomplexities. A characterization of sequential uniform convergence appears inSection 7. We conclude the paper with some structural results in Section 8 andan application to exponential inequalities for sums of martingale differencesequences in Banach spaces in Section 9. Most proofs are deferred to theappendix.3 Related LiteratureThe seminal work of Vapnik and Chervonenkis [37] provided the first necessary and sufficient conditions – via a notion of random VC entropy – for aclass F of binary valued functions to be a Glivenko-Cantelli (GC) class. Theseresults were strengthened by Steele [29], who showed almost sure convergence.A similar characterization of the GC property via a notion of a covering number in the case of uniformly bounded real-valued functions appears in [38].For the binary-valued case, a distribution-independent version of the VC entropy (termed the growth function) was shown by Vapnik and Chervonenkis[37] to yield a sufficient condition for the uniform GC property. The “necessary” direction was first shown (according to [11, p. 229]) in an unpublishedmanuscript of Assouad, 1982. For real-valued classes of functions, the necessaryand sufficient conditions for the uniform GC property were established in [12]through a notion of a covering number similar to the Koltchinskii-Pollard entropy. A characterization of GC classes for a fixed distribution were also givenby Talagrand [30, 31] through a notion of a “witness of irregularity”. Similarin spirit, the pseudo-dimension introduced in [24] was shown by Pollard to besufficient, though not necessary, for the uniform GC property. A scale-sensitiveversion of pseudo-dimension (termed the fat-shattering dimension by [5]) wasintroduced by Kearns and Schapire [16]. Finiteness of this dimension at allscales was shown in [3] to characterize the uniform GC classes. We refer thereader to [11, Chapter 6] and [33, 34] for a much more detailed account of theresults.The GC-type theorems have also been extended to the case of weakly dependent random variables. For instance, Yukich [40] relies on a φ-mixing assumption, while Nobel and Dembo [21] and Yu [39] consider β-mixing sequences.

Sequential Complexities and Uniform Martingale Laws of Large Numbers5For a countable class with a finite VC dimension, a GC theorem has beenrecently shown by Adams and Nobel [2] for ergodic sequences. We refer thereader to [2, 10] for a more comprehensive survey of results for non-i.i.d. data.Notably, the aforementioned papers prove a GC-type property under much thesame type of complexity measures as in the i.i.d. case. This is in contrast tothe present paper, where the classical notions do not provide answers to thequestions of convergence.In this paper, we do not make mixing or ergodicity assumptions on the sequence of random variables. However, the definition of Mn (f ) imposes a certain structure which is not present when an average is compared with a single expected value. Thus, our results yield an extension of the GC propertyto non-i.i.d. data in a direction that is different from the papers mentionedabove. Such an extension has already been considered in the literature: thequantity supf F Mn (f ) has been studied by S. van de Geer [34] (see Chapter 8.2). Dudley integral type upper bounds for a given distribution P wereprovided in terms of the so called generalized entropy with bracketing, corresponding to the particular distribution P. This is a sufficient condition forconvergence of the supremum of Mn (f ) for the given distribution. In this work,however, we are interested in providing necessary and sufficient conditions forthe uniform analogue of the GC property, as well as in extending the ideas ofsymmetrization, covering numbers, and scale-sensitive dimensions to the noni.i.d. case. Towards the end of Section 7, we discuss the relationship betweenthe generalized entropy with bracketing of [34] and the tools provided in thiswork.We also stress that this paper studies martingale uniform laws of large numbers rather than a convergence of nMn (f ), which only holds under stringentconditions; such a convergence for reverse martingaleshas been studied in [36]. The question of the limiting behavior of nMn (f ) (that is, the analogue ofthe Donsker property [11]) is also outside of the scope of this paper.The study of the supremum of the process Mn (f ) has many potential applications. For instance, in [35], the quantity supf F Mn (f ) is used to providebounds on estimation rates for autoregressive models. In [1, 25] connectionsbetween minimax rates of sequential prediction problems and the supremumof the process Mn (f ) over the associated class of predictors F are established.In Section 9 of this work, we show how the supremum of Mn (f ) over class oflinear functionals can be used to derive exponential inequalities for sums ofmartingale differences in general Banach spaces.4 Symmetrization and the Tree ProcessA key tool in deriving classical uniform convergence theorems (for i.i.d. random variables) is symmetrization. The main idea behind symmetrization is tocompare the empirical process Gn (f ) over a probability space (Ω, A, P) to a

6Alexander Rakhlin et al.symmetrized empirical process, called the Rademacher process, over the probability space (Ω , B, P ) where Ω { 1, 1}N , B the Borel σ-algebra and P the uniform probability measure. We use the notation E to represent expectation under the measure P , and (Bt )t 0 to denote the dyadic filtration on Ω given by Bt σ( 1 , . . . , t ), where t ’s are independent symmetric { 1}-valuedRademacher random variables and B0 {{}, Ω }.(z)Given z1 , . . . , zn Z, the Rademacher process Sn 1 n (f ) is defined1 as1 n )f S(z(f ) n1 n t f (zt ) .n t 1(3)It is well-known (e.g. [33]) that the behavior of the supremum of the sym(z )metrized process Sn 1 n (f ) is closely related to the behavior of the supremumof the empirical process asE sup Gn (f ) 2 supf Fz1 ,.,zn Z1 n )E sup S(z(f )nf F(4)and a similar high-probability statement can also be proved. Note that theRademacher process is defined on the probability space (Ω , B, P ), which ispotentially easier to handle than the original probability space for the empiricalprocess.In the non-i.i.d. case, however, a similar symmetrization argument requiressignificantly more care and relies on the notion of decoupled tangent sequences[9, Def. 6.1.4]. Fix a sequence of random variables (Zt )t 1 adapted to the filtration (At )t 1 . A sequence of random variables (Zt′ )t 1 is said to be a decoupledsequence tangent to (Zt )t 1 if for each t, conditioned on Z1 , . . . , Zt 1 , the random variables Zt and Zt′ are independent and identically distributed. Thus,the random variables (Zt′ )t 1 are conditionally independent given (Zt )t 1 . InTheorem 2 below, a sequential symmetrization argumen