Adaptive index models for marker-based risk stratification


Adaptive index models for marker-based risk stratification

Lu Tian (Dept. of Health, Research & Policy, Stanford Univ, Stanford CA 94305; [email protected])
Robert Tibshirani (Depts. of Health, Research & Policy, and Statistics, Stanford Univ, Stanford CA 94305; [email protected])

November 16, 2009

Abstract

We use the term index predictor to denote a score that consists of K binary rules such as "age ≥ 60" or "blood pressure ≥ 120 mm Hg". The index predictor is the sum of the scores, yielding a value from 0 to K. Such scores are often used in clinical studies to stratify population risk; they are usually derived from subject-area considerations. In this paper we propose a fast procedure for automatically constructing such indices from a training dataset, for linear regression, logistic regression and Cox survival models. We also extend the procedure to create indices for detecting treatment-marker interactions. The methods are illustrated on a study with protein biomarkers as well as two microarray gene expression studies.

1 Introduction

When predicting a phenotype such as clinical response or survival time from a set of biomarkers, an "index predictor" is sometimes used. This consists of a set of binary rules such as "marker x_k ≥ c_k" or "marker x_k ≤ c_k" for each of K markers. For each observation we add up the binary scores, yielding an index s taking values in {0, 1, ..., K}. This has the advantage of simplicity: it is easy to state and interpret, and can also capture situations where

prognostic effects are shared by multiple markers. A popular example is the International Prognostic Index (IPI) used for risk classification in non-Hodgkin's lymphoma (The International Non-Hodgkin's Lymphoma Prognostic Factors Project (1993)). The IPI assigns one point for each of:

- age greater than 60 years
- stage III or IV disease
- elevated serum LDH (ratio > 1)
- ECOG/Zubrod performance status of 2, 3, or 4
- more than 1 extranodal site of disease.

The resulting score lies between 0 and 5, with higher scores indicating greater risk. Sometimes the IPI score is further simplified into two or three categories, such as (low, high) or (low, medium, high), for risk stratification.

An example is shown in Figure 1. Shown are the survival curves from a set of patients with non-Hodgkin's lymphoma, for each of the levels of the IPI. There is clear separation among the groups.

In this paper we propose a method for adaptively constructing an index predictor from a set of training data. We also return to this example and demonstrate that our proposal can re-construct the IPI empirically from a set of training data.

This paper is organized as follows. In Section 2 we introduce the adaptive index model and our algorithm for its estimation, and discuss an example in which protein biomarkers are used to predict the presence of ovarian cancer. In Section 3 we discuss the AIM model for survival data, using Cox's proportional hazards model, and illustrate how the AIM procedure can re-discover the International Prognostic Index (IPI) discussed above. We extend the AIM procedure to look for interactions between markers and a binary treatment factor in Section 5; we also discuss the construction of surrogate markers. In Section 6 we investigate the performance of the AIM procedure with a large number of predictors, and propose the use of "pre-conditioning" to avoid overfitting. The degrees of freedom of the AIM procedure are studied both mathematically and numerically in Section 7. There are clear connections to other methods such as CART (Breiman et al.
(1984)), PRIM (Friedman & Fisher (1999)), their more recent refinements in LeBlanc et al. (2005) and LeBlanc et al. (2002), boosted trees (see e.g. Friedman et al. (2000), Friedman (2001)), and "logic regression" (Ruczinski et al. (2003)). We discuss these and make some concluding remarks in Section 8.

Figure 1: Survival curves from a set of patients with non-Hodgkin's lymphoma, for each of the levels of the International Prognostic Index (IPI).

2 Adaptive index models

Consider a supervised learning problem (regression or generalized regression) with data (x_i, y_i), i = 1, 2, ..., N. Here x_i is a p-vector of predictor variables and y_i is an outcome variable. The three major applications that we consider are the linear regression model, the logistic model for binary data where y_i ∈ {0, 1}, and Cox's proportional hazards model for survival data where y_i = (T_i, δ_i), with T_i a right-censored survival time and δ_i the censoring indicator. Denote the log-likelihood or log partial likelihood by ℓ(η; x, y), where η is the usual linear combination of predictors. For example, η is the linear predictor in a regression model, the log-odds in the logistic model and the log-hazard in the proportional hazards model. We consider an index model of the form

    η = β_0 + β · Σ_{k=1}^K I(x̃_k ≥ c_k)                    (1)

with K ≤ p. The predictors x̃_k are drawn from the set {±x_1, ±x_2, ..., ±x_p} and the corresponding cutpoints c_k are chosen in a forward stepwise manner to maximize the log-likelihood ℓ(η; x, y). The result is a simple "index" predictor s = Σ_{k=1}^K I(x̃_k ≥ c_k), which is just a count ranging from 0 to K. By allowing x̃_k to equal −x_j, we effectively allow cuts of the complementary form x_j ≤ c_j.

What makes our procedure attractive is the fact that as we change the cutpoint c_k, updating formulas can be derived for the score test for testing β = 0. Next we give the details of the updating scheme for the linear and logistic models. Our model is

    η_i = β_0 + β Σ_{j=1}^k I(x̃_ij ≥ c_j),

where η_i = E(Y_i | x_i) in the linear model, while η_i = log[p_i/(1 − p_i)], p_i = Prob(y = 1 | x_i), in the logistic model. Suppose that we have a score s_i = Σ_{j=1}^{k−1} I(x̃_ij ≥ c_j) and want to decide whether to add a term z = I(x ≤ c) (cuts of the form x ≥ c are handled by negating x). Hence we fit η = β_0 + β(s + z) and test β = 0 in the regression model. Letting μ̂_0 = ȳ, the average of {y_i, i = 1, ..., N}, and writing w_i = s_i + z_i, we have the score vector and information matrix

    U = (U1, U2) = ( Σ_{i=1}^N (y_i − μ̂_0),  Σ_{i=1}^N w_i (y_i − μ̂_0) ),

    I = v̂_0 ( N               Σ_{i=1}^N w_i
               Σ_{i=1}^N w_i   Σ_{i=1}^N w_i² ),

where v̂_0 is the empirical variance of {y_i} in both the linear and logistic models. The score test is U2 / V^{1/2}, where V^{−1} = I11 / (I11 I22 − I12 I21).

Without loss of generality, we assume that the observations are sorted according to x, the predictor of interest, i.e. x_(1) ≤ x_(2) ≤ ... ≤ x_(N). As the cutpoint c moves from x_(i−1) to x_(i), the indicator z_i flips from 0 to 1, and we have the updating formulas

    U2 ← U2 + (y_i − μ̂_0)
    I12 ← I12 + v̂_0
    I22 ← I22 + v̂_0 (1 + 2 s_i).

Thus we can scan through all possible cutpoints for a given predictor in just O(n) operations. We summarize the algorithm below, called "AIM" for "Adaptive Index Models".
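Before summarizing the full procedure, the cutpoint scan just derived can be sketched in Python. This is a minimal illustration for the linear model only, assuming distinct marker values; the function name and interface are ours, not the paper's.

```python
import numpy as np

def best_cutpoint_scan(x, y, s):
    """Scan every cutpoint c of one marker for adding z = I(x <= c) to the
    current score s, updating the score test in O(1) per cutpoint
    (linear-model sketch; cuts of the form x >= c are obtained by negating x)."""
    order = np.argsort(x)
    x, y, s = x[order], y[order], s[order]
    n = len(y)
    mu0, v0 = y.mean(), y.var()        # v0: empirical variance of y
    r = y - mu0
    # Start with c below every x, so z = 0 and w = s.
    U2 = np.sum(s * r)
    I11 = n * v0
    I12 = v0 * np.sum(s)
    I22 = v0 * np.sum(s ** 2)
    best_stat, best_cut = 0.0, None
    for i in range(n - 1):             # move c up past x_(i): z_i flips 0 -> 1
        U2 += r[i]
        I12 += v0
        I22 += v0 * (1.0 + 2.0 * s[i])
        V = (I11 * I22 - I12 ** 2) / I11   # Var(U2) after adjusting for U1
        if V > 0:
            stat = U2 / np.sqrt(V)
            if abs(stat) > abs(best_stat):
                best_stat, best_cut = stat, x[i]
    return best_stat, best_cut         # selected rule: z = I(x <= best_cut)
```

Each iteration costs O(1), so scanning one marker is O(n) after sorting, in line with the complexity claimed above.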

AIM procedure

1. Begin with k = 0 and s = 0.
2. For k = 1, 2, ..., K, update s ← s + I(x_j ≥ c_j) or s ← s + I(x_j ≤ c_j), where (j, c_j) maximizes the score test over the markers not yet entered.

In practice we set the maximum model size K to, say, 10 or 20, and estimate the best model size k by cross-validation.

Figure 2 shows the run time in seconds of the AIM procedure for logistic regression for various combinations of n, p and the maximum model size K. In the left and middle panels we run the algorithm until K = 20 terms have been added; in the right panel we have fixed p at 100. We see that the algorithm is remarkably fast, and scales roughly linearly in n, p and K.

Figure 2: Timings (sec) for the AIM procedure, for various values of n, p, and the number of terms K.
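The greedy loop in the two steps above can be sketched as follows (linear-model case). For brevity this illustration recomputes the score test from scratch at each candidate cutpoint instead of using the O(n) updating scheme; all names are ours, not the paper's.

```python
import numpy as np

def aim_fit(X, y, K=3):
    """Greedy AIM sketch: at each step add the binary rule I(x_j <= c) or
    I(x_j >= c) that maximizes the absolute score test for beta = 0,
    over markers not yet entered (brute-force cutpoint scan for brevity)."""
    n, p = X.shape
    s = np.zeros(n)
    mu0, v0 = y.mean(), y.var()
    used, rules = set(), []
    for _ in range(K):
        best_stat, best_rule = 0.0, None
        for j in range(p):
            if j in used:
                continue
            for c in np.unique(X[:, j]):
                for op in ("<=", ">="):     # both cut directions
                    z = X[:, j] <= c if op == "<=" else X[:, j] >= c
                    t = s + z
                    V = v0 * (np.sum(t ** 2) - np.sum(t) ** 2 / n)
                    if V <= 0:              # degenerate (constant) predictor
                        continue
                    stat = np.sum(t * (y - mu0)) / np.sqrt(V)
                    if abs(stat) > abs(best_stat):
                        best_stat, best_rule = stat, (j, op, c, z)
        if best_rule is None:
            break
        j, op, c, z = best_rule
        used.add(j)
        rules.append((j, op, float(c)))
        s = s + z
    return rules, s
```

In practice one would select the number of rules actually kept by cross-validation, as described above.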

In some cases it can be helpful to iteratively re-adjust the split points via a backfitting procedure after the AIM analysis, or to pre-process the data with a method such as supervised principal components before the AIM analysis. We illustrate this in Section 6. We also note that this model is related to boosted trees (see e.g. Friedman et al. (2000), Friedman (2001)) in the case where the trees are stumps (single-split trees); in the AIM model we further constrain all of the stumps to share the same multiplier.

2.1 Ovarian cancer data

The data for this example are taken from Fredriksson et al. (2008). They consist of 20 blood protein biomarkers measured in each of two groups, healthy subjects and patients with ovarian cancer, with 20 patients in each group. We applied the AIM procedure with a maximum of 10 biomarkers. Tenfold cross-validation was used to assess the model predictions, producing the curves in Figure 3. In the figure we have used two different methods for making predictions. Given the (integer) score s, the "logit" method fits a logistic regression in the training set to s and uses the resulting model to make predictions in the validation set. The "cutpoint" method finds the cutpoint c that produces the fewest errors in predicting y (or 1 − y) as I(s ≥ c) in the training data, and then uses this cutpoint to make predictions in the validation set. Both methods yield error rates of about 10-15% with 2-3 markers, and perform better than the nearest shrunken centroids procedure that was used in the original paper. Also shown are the error rates for standard forward stepwise logistic regression.

The AIM score s has the form I(x_7 ≥ 11.8) + I(x_16 ≥ 9.0) + I(x_9 ≥ 12.0). The optimal cutpoint is s > 2, so the prediction rule classifies a patient to the cancer group if all three biomarkers fall in their "red" regions. Figure 4 shows the 3 markers over the training set, with red points indicating that the corresponding condition (such as "x_7 ≥ 11.8") is satisfied.
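The "cutpoint" prediction method amounts to a tiny search over thresholds on the integer score. A minimal sketch, assuming 0/1 class labels (function names are ours, not the paper's):

```python
import numpy as np

def fit_score_cutpoint(s_train, y_train):
    """Pick the cut c (and orientation) so that predicting y as I(s >= c),
    or its complement 1 - I(s >= c), makes the fewest training errors."""
    best = (len(y_train) + 1, None, None)   # (errors, cutpoint, flip)
    for c in np.unique(s_train):
        for flip in (0, 1):                 # flip = 1 predicts 1 - y
            pred = (s_train >= c).astype(int) ^ flip
            err = int(np.sum(pred != y_train))
            if err < best[0]:
                best = (err, c, flip)
    return best

def predict_score_cutpoint(s, c, flip):
    """Apply the fitted cut/orientation to new scores s."""
    return (s >= c).astype(int) ^ flip
```

Since the score only takes values 0, 1, ..., K, the search is over at most K + 1 candidate cuts per orientation.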
Figure 5 shows a schematic of the final model.

Figure 6 shows the result of applying CART to these data, using a minimum node size of five on the left and three on the right. The cross-validated errors for the two trees were 10% and 35% respectively, with the first being about the same as that for the AIM fit with 3 markers. The values below each node are the numbers of observations in the training set in each of the two classes. We see that in each case the CART nodes are pure or almost pure. In contrast, the AIM procedure produces a model with a score equal to 0, 1, 2 or 3; the corresponding counts are (1, 0), (15, 0), (4, 3) and (0, 17).

Figure 3: Protein biomarkers data: cross-validated error curves for AIM and nearest shrunken centroids. The standard error of each curve is about 1%.

Hence AIM has (potentially) found an intermediate group, with a score of two and approximately equal numbers of disease and non-disease patients.

3 AIM for survival data

In the following section, we present the algorithm for scanning through all possible cutpoints for a given predictor in survival analysis. Here we have an outcome y_i = (T_i, δ_i) and predictors x_1, ..., x_p. Our model is the proportional hazards model

    h(t|x) = h_0(t) exp(η)                    (2)

Figure 4: Protein biomarkers data. Focussing on the model with 3 biomarkers, a red point indicates that the biomarker satisfies its corresponding split condition in that sample.

where h(t|x) is the hazard function and η = Σ_{j=1}^K I(x̃_j ≥ c_j) for x̃_j ∈ {±x_1, ..., ±x_p}.

We construct the score test as follows. Suppose that we want to decide whether to add a term z = I(x ≤ c) to the existing score s. Let w = s + z. The score test statistic is U/I^{1/2}, where

    U = Σ_{i=1}^N δ_i ( w_i − (1/n_i) Σ_{l∈R_i} w_l )

and

    I = Σ_{i=1}^N δ_i [ (1/n_i) Σ_{l∈R_i} w_l² − ( (1/n_i) Σ_{l∈R_i} w_l )² ],

where R_i is the risk set and n_i is the size of the risk set at time T_i. Without loss of generality, we assume that x_(1) ≤ x_(2) ≤ ... ≤ x_(N). Hence if we move c

Figure 5: Protein biomarkers data. Schematic of the three splits (marker 7 at 11.76, marker 16 at 8.99, marker 9 at 12.03), with the final model in the bottom panel. The numbers such as 20/26 indicate that there are 20 out of 26 patients in the training set with cancer in the corresponding stratum.

from x_(i−1) to x_(i), the indicator z_i flips from 0 to 1 and the test statistic can be updated as

    U ← U + δ_i − Σ_{T_k ≤ T_i} δ_k / n_k

and

    I ← I + (1 + 2 s_i) Σ_{T_k ≤ T_i} δ_k (n_k − 1) / n_k² − 2 Σ_{T_k ≤ T_i} (δ_k / n_k²) Σ_{j∈R_k, j≠i} { s_j + I(x_j ≤ x_(i−1)) },

with the initial values

    U = Σ_{k=1}^N δ_k ( s_k − (1/n_k) Σ_{l∈R_k} s_l )

Figure 6: Protein biomarkers data: CART trees with three terminal nodes (left) and four terminal nodes (right).

and

    I = Σ_{k=1}^N δ_k [ (1/n_k) Σ_{l∈R_k} s_l² − ( (1/n_k) Σ_{l∈R_k} s_l )² ].

3.1 Lenz data and IPI

Here we analyze data on non-Hodgkin's lymphoma from Lenz et al. (2008). There are 248 patients, with the outcome being overall survival time, together with a large set of gene expression measurements from microarrays. The patients received either CHOP or R-CHOP treatment. Later we analyze the interaction between gene expression and the efficacy of the treatment. Here we explore whether the AIM procedure can re-construct the widely used International Prognostic Index (IPI). The details of the IPI are given in the Introduction.

We divided the data into approximately equal-sized training and test sets, and input the five predictors (age, stage, LDH, ECOG status, number of extranodal sites) into the AIM procedure. We then computed the Cox score statistic

for the resulting index over the test set. This process was repeated 20 times, giving an average Cox score of 3.24 (.18); the actual IPI had an average score of 3.37 (.18). The cutpoint chosen for stage was ≤ 1 versus > 1 in 19 out of 20 repetitions, as opposed to the standard definition of (1, 2) versus (3, 4). The number of extranodal sites split at 1, 2 and 3 in 11, 6 and 3 repetitions respectively. ECOG status split as (0, 1) versus > 1 in 18 repetitions out of 20. The distributions of split points for age and LDH are shown in Figure 7, with the corresponding standard IPI split points shown by the red lines. The cutpoints for age are approximately centered at the standard cutpoint of 60, but those for LDH are considerably above the standard cutpoint of 1