================================================================

* BOOTSTRAP -- A METHOD FOR ESTIMATING SE/VARIANCE, BIAS, AND CIs

- Preliminaries: statistical functionals
  . Consider i.i.d. samples y1,...,yn ~ F.
    Form the empirical distribution of y1,...,yn:
      Fn = 1/n sum_i delta_yi
  . 'Statistical functionals':
      (0) T(y1,...,yn) = T(Fn)
      (1) T(Fn) --> T(F)  as n --> Inf
      (2) (T(Fn)-T(F))*sqrt(n) --> N(0,AV(F,T))  in distrib./weakly  as n --> Inf
  . Interpretations:
      (1) "a generalized law of large numbers",
          a limited form of continuity of T()
      (2) asymptotic normality, "a generalized CLT";
          AV(F,T) is called the asymptotic variance.
          [Notation: We may drop the argument T.]
  . Examples for property (1):
      + T(Fn) = mean;  T(F) = expectation
      + T(Fn) = empirical 90% quantile;  T(F) = distributional 90% quantile
        [question of uniqueness; may be declared not to exist for some F]
    Counter-examples:
      + T(Fn) = SSn/(n-1) is not a statistical functional (it depends on n,
        not just on Fn).  However, the biased estimate T(Fn) = SSn/n is a
        statistical functional.
      + T(Fn) = y{n/2} is not a statistical functional.
        However, the median is a statistical functional.
  . This framework gives justification to asymptotic CIs:
      CI = [T(Fn)-1.96*sqrt(AV(F)/n), T(Fn)+1.96*sqrt(AV(F)/n)]
      P(T(F) in CI) --> 0.95  as n --> Inf
  . Problems:
      + AV(F) is not known; it needs to be estimated.
      + Even though asymptotic theory says the bias of T(Fn) disappears
        faster than the variance as n --> Inf, the bias can still be
        substantial for n of interest.

- Bootstrap: Efron's 'trivial idea' (1979 -- 30 year anniversary in 2009)
  . Mother nature/deity presents us with data y1,...,yn i.i.d. F.
  . We do not know F, but we can estimate it.  Call the estimate 'Fhat'.
      + Fhat = Fn (the empirical distribution):
            ---------------------------
            | NONPARAMETRIC BOOTSTRAP |
            ---------------------------
      + Fhat = P(dy;theta.hat), plug-in w.r.t. a model P(dy;theta):
            ---------------------------
            |  PARAMETRIC  BOOTSTRAP  |
            ---------------------------
  . Bootstrap principle:
      + Act as if you were mother nature and Fhat were the truth.
        That is, sample artificial iid data from Fhat, repeatedly:
            (y1*1,...,yn*1)    # artificial dataset 1
            (y1*2,...,yn*2)    # artificial dataset 2
            (y1*3,...,yn*3)    # artificial dataset 3
            ...
        Repeat, for example, Nboot=1000 times.
      + Compute statistics T() from the observed dataset {y1,...,yn}
        and from all bootstrap datasets {y1*j,...,yn*j} (j=1...Nboot).
        ==> The "bootstrap values" of the statistic T()
              { T*1     = T(y1*1,...,yn*1),
                T*2     = T(y1*2,...,yn*2),
                ...
                T*Nboot = T(y1*Nboot,...,yn*Nboot) }
            can be used as an estimate of the sampling or
            "dataset-to-dataset" distribution of T().
  . Uses of the bootstrap values:
      + Their SD is an estimate of the SE of T(Fhat):
            ---------------------------------------
            | SE.boot = SD({T*1,T*2,...,T*Nboot}) |
            ---------------------------------------
      + Form bootstrap CIs:
            CI = [ T(Fhat)-2*SE, T(Fhat)+2*SE ]  =  '[ T(Fhat) +- 2*SE ]'
      + Caveat: This is the simplest form of bootstrap CIs.  It trusts that
        n is large enough to make asymptotic normality a good approximation.

- Comparison of the parametric and nonparametric bootstrap:
  . Parametric bootstrap may not give you the proper idea of
    dataset-to-dataset variability if the fitted model overfits.
    (Think about it!)
  . Nonparametric bootstrap is vastly more popular when it applies,
    but the datasets it creates have duplicates and do not use some cases:
      P[case 'i' not used in n draws] = (1-1/n)^n --> exp(-1) ~ 0.3678794 > 1/3
      ==> Approx fraction of cases used is 1-exp(-1) ~ 0.6321206,
          or just under 2/3.
    (After many bootstrap samples, one will have used all data, though.)
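
- A minimal R sketch of the nonparametric bootstrap principle for a single
  statistic (the median).  This is an illustration only, not part of the
  original notes: rexp(), n=100, and Nboot=1000 are arbitrary choices.

      set.seed(1)
      n      <- 100
      y      <- rexp(n)                 # 'observed' data; true median = log(2)
      T.hat  <- median(y)               # T(Fn)
      Nboot  <- 1000
      T.boot <- replicate(Nboot, median(sample(y, n, replace=TRUE)))
      SE.boot <- sd(T.boot)                      # SD of the bootstrap values
      c(T.hat - 2*SE.boot, T.hat + 2*SE.boot)    # simplest bootstrap CI
      # Fraction of distinct cases per resample, near 1-exp(-1) ~ 0.632:
      mean(replicate(200, length(unique(sample(n, replace=TRUE))) / n))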
- General remarks on bootstrap:
  . Nboot is not part of the concept really...
    For theory, Nboot=Inf, i.e., complete enumeration.
  . In order for CIs to be justified, one needs a sufficiently large
    sample size n so the sampling distribution is close to normal.
  . Resample size and sample size:
    + While it is true that our main interest is to estimate the sampling
      variation of estimates for samples of size n, we could also estimate
      it for samples of any other size m (m smaller OR larger than n).
      ==> Resample iid m values from {y1,...,yn} to form bootstrap
          resamples {y*1,...,y*m} for m != n.
    + Standard error estimates can always be translated to asymptotic
      standard error estimates by
          ------------------------------------------
          | SEn*sqrt(n) --> sqrt(AV)  as n --> Inf |
          ------------------------------------------
      Asymptotic standard error estimates can always be translated to
      finite-sample standard error estimates for arbitrary sample size m by
          SEm = ASD/sqrt(m),  where ASD = sqrt(AV).
    + It follows that we could use an arbitrary resample size m and
      translate back to sample size n by
          -----------------------
          | SEn ~ SEm*sqrt(m/n) |
          -----------------------
      This raises the question of choice of resample size.  There is some
      theory to suggest that small resample size could work even if the
      n-of-n bootstrap 'does not work'.  [see Bickel, Goetze, van Zwet (Sinica)]
      (A small m-of-n illustration follows the regression example below.)

# - Non-parametric bootstrap standard errors: linear regression
#   We will use the x-y or pairs bootstrap, not the residual bootstrap!
#   Linear OLS regression raises the question of when to use which version:
#   . If you trust the linear model to be first order correct
#     AND the errors to be iid, then use the residual bootstrap.
#   . If you wish to make no assumptions other than that the (x,y) tuples
#     are iid sampled, then use the x-y bootstrap.

# Data Example: the 'bodyfat' data
site <- "http://www-stat.wharton.upenn.edu/~buja/STAT-961/"
bodyfat <- read.table(paste(site,"bodyfat.dat",sep=""))
names(bodyfat) <- scan(paste(site,"bodyfat.col",sep=""), w="")
colnames(bodyfat)

# Regress 'PercBF' on all predictors:
bf.lm <- lm(PercBF ~ . - Density, data=bodyfat)
bf.lm.smry <- summary(bf.lm)
bf.lm.smry

# LS coefficients:
bf.b <- bf.lm.smry$coeff[,"Estimate"]
cbind(signif(bf.b,3))

# Standard errors computed the usual way:
bf.b.se <- bf.lm.smry$coeff[,"Std. Error"]
cbind(signif(bf.b.se,3))

# Bootstrap simulation: -------------------------------------
N <- nrow(bodyfat)
Nboot <- 10000
bf.b.bs <- matrix(NA, ncol=length(bf.b), nrow=Nboot)
colnames(bf.b.bs) <- names(bf.b.se)
for(iboot in 1:Nboot) {
  sel <- sample(N, replace=T)                 # check dups: sort(table(sel))
  bf.b.bs[iboot,] <- coef(lm(PercBF ~ . - Density, data=bodyfat[sel,]))
  #                                                 ^^^^^^^^^^^^^ BS dataset
  if(iboot%%200==0) print(iboot)
}

# One-liner, w/o intermediate output:
bf.b.bs <- t(replicate(Nboot, {
  coef(lm(PercBF ~ . - Density, data=bodyfat[sample(N,repl=T),]))
}))
dim(bf.b.bs)
dimnames(bf.b.bs)

# Bootstrap standard errors: --------------------------------------
bf.b.bs.se <- apply(bf.b.bs, 2, sd)
sqrt(diag(cov(bf.b.bs)))   # Same thing...

# Compare bootstrap with usual standard errors:
cbind(signif( cbind(boot=bf.b.bs.se, classic=bf.b.se), 2))
# Ratio comparison:
cbind(round( bf.b.bs.se / bf.b.se, 3))
# Mostly a non-event: Bootstrap SEs are very close to the usual SEs
# except for 'Height': classic=0.096, bootstrap=0.140, ratio=1.45.
# In other data we see greater disparities, sometimes a factor of 2.
# The ratios can also be < 1, but empirically this happens less often
# than > 1.
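
# Aside, tying back to the resample-size remark in the general remarks above:
# a minimal m-of-n sketch (not in the original notes; m=100 is an arbitrary
# choice).  SEs for resample size m are rescaled back to sample size N via
# SEn ~ SEm*sqrt(m/n); the rescaling is only asymptotically justified.
m <- 100
bf.b.bs.m <- t(replicate(Nboot, {
  coef(lm(PercBF ~ . - Density, data=bodyfat[sample(N, m, replace=TRUE),]))
}))
bf.b.bs.m.se <- apply(bf.b.bs.m, 2, sd)         # SEs for resample size m
cbind(signif( cbind(m.of.n = bf.b.bs.m.se * sqrt(m/N),   # translated to size N
                    n.of.n = bf.b.bs.se), 2))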
# - Bootstrap CIs from bootstrap SEs:
round( cbind( bf.b - 2*bf.b.bs.se, bf.b + 2*bf.b.bs.se), 3)   # Prefer 2 over 1.96

# - Bootstrap 2-sided p-value with asymptotic approximation:
cbind(round(2*pmin(pnorm( bf.b, m=0, s=bf.b.bs.se),
                   pnorm(-bf.b, m=0, s=bf.b.bs.se)), 7))
# [round() is used to avoid exponential/scientific form of the p-values.
#  Compare readability:]
cbind(2*pmin(pnorm( bf.b, m=0, s=bf.b.bs.se),
             pnorm(-bf.b, m=0, s=bf.b.bs.se)))

# - We applied the non-parametric bootstrap to regression,
#   thus capturing the full variability of (xi,yi) sampling,
#   whereas 'usual' inference CONDITIONS ON THE xi-VALUES
#   and considers variation in yi given xi only.
#   In this particular dataset, the discrepancies between
#   conventional and BS SEs are not a concern other than for
#   'Height', and a little for 'Weight' and 'Ankle'.

# - How meaningful are the bootstrap SEs?
#   Is the bootstrap distribution approximately normal?
#   Maybe this is a good place to CHECK NORMALITY OF THE ESTIMATES:
par(mfrow=c(4,4), mgp=c(1.5,.5,0), mar=c(1.5,1.5,2,1))
for(j in 1:ncol(bf.b.bs)) {
  qqnorm(bf.b.bs[,j], pch=16, cex=.3, main=names(bf.b)[j],
         xlab="", ylab="", cex.main=1)
  co <- "gray";  lw <- .5
  abline(v=0,       col=co, lwd=lw)   # vertical: 0 to mark the median
  abline(h=bf.b[j], col=co, lwd=lw)   # horizontal: b[j]
  # show the normal line implied by classic theory: N(b[j], SE(b[j])^2)
  abline(a=bf.b[j], b=bf.b.se[j], col=co, lwd=lw)   # slope: SE(b[j])
  abline(h=0, col="red", lwd=lw)      # horizontal: 0 to show significance
}
# However, these pictures are misleading: What really matters is not
# non-normality of the estimates, but of the estimates standardized by
# their standard error ESTIMATES.
# PS: If we had TRUE standard errors, the above pictures would be relevant,
#     but we do not have them; we only have estimates thereof.
# . Note:
#   + 'Height' has a problem with non-normality.
#   + 'Ankle' has a problem with a non-normal lower tail.
#   + 'Weight' has a minor problem in the upper tail.
#   + 'Chest' has a minor problem in the lower tail.
#   + Intercept looks nearly normal but wider than classical.
#   Most of these problems do not matter because of clear insignificance.
#   Issues arise only in the moderate tails where we seek a decision
#   about significance.
# . Reading statistical significance from these plots:
#   The red line is at zero, hence we have significance
#   if ...
# . BIAS: b[j] (=hor. line) and median(b.bs[,j]) are close,
#   hence there is little median bias in the coefficients.
#   'Little' means 'little compared to SE.BS(b[j])'.

# - BIAS ESTIMATION with the bootstrap:
#   . Estimate   bias     = E[estimates|truth] - truth
#     by         bias.hat = E[bootstrap estimates|estimate] - estimate
#                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#                           average of bootstrap estimates
#   . Example of bootstrap bias estimates: for all regr coeffs
bf.b.bs.bias <- apply(bf.b.bs, 2, mean) - bf.b
cbind(signif( bf.b.bs.bias, 3))
# Compare bootstrap bias estimate and bootstrap SE:
cbind(signif( bf.b.bs.bias / bf.b.bs.se, 3))
# Sizable biases: Weight (0.16), Chest (-0.19), Ankle (-0.15).
# Still, these biases are relatively small compared to the SE,
# which is natural because bias tends to disappear at the rate 1/n
# whereas the SE shrinks at the slower rate 1/sqrt(n).
# On the other hand, bias can be non-negligible for small n.
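
# Aside (simulated, not from the bodyfat data): the bootstrap bias estimate
# applied to a statistic with a known bias, the plug-in variance
# T(Fn) = SSn/n, which underestimates sigma^2 by sigma^2/n.  The choices
# n=15, sigma^2=1, and 2000 resamples are arbitrary, for illustration only.
set.seed(3)
n.small <- 15
y.small <- rnorm(n.small)                        # true variance = 1
T.hat   <- mean((y.small - mean(y.small))^2)     # plug-in variance SSn/n
T.boot  <- replicate(2000, { ys <- sample(y.small, n.small, replace=TRUE)
                             mean((ys - mean(ys))^2) })
bias.hat <- mean(T.boot) - T.hat                 # bootstrap bias estimate
c(T.hat=T.hat, bias.hat=bias.hat, approx.expected=-T.hat/n.small)
# Back to the bodyfat regression coefficients: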
# Median biases seem to be a touch smaller than mean biases:
bf.b.bs.bias.med <- apply(bf.b.bs, 2, median) - bf.b
cbind(signif( bf.b.bs.bias.med / bf.b.bs.se, 3))
# Bias should be compared to the standard error:
signif(cbind(meanBiasPerSE = bf.b.bs.bias     / bf.b.bs.se,
             medBiasPerSE  = bf.b.bs.bias.med / bf.b.bs.se), 3)

# - Bootstrap bias correction/reduction:
#       T(Fn) - BS.Bias.hat
#         = T(Fn) - ( ave[T(Fn*)] - T(Fn) )
#         = 2*T(Fn) - ave[T(Fn*)]
#   This is the typical form of bias correction: 2*estimate - ave.est

# - Further potential problem: non-normality of the T(Fn)-distribution,
#   skewness, for example.  Imagine T(Fn) is right skewed:
#
#           T(F)
#     ------|---------|-------------------------|---------
#           2.5%      50%                       97.5%
#           |-----------------------------------|
#              95% range of T(Fn)-distribution
#
#   Q: What interval should you put around T(Fn) to catch T(F)?
#   A: An asymmetric interval that reaches farther left than right
#      to compensate for the fact that T(Fn) tends to fall
#      farther to the right than to the left.

# - Idea: Construct CIs not symmetrically from T(Fn) +- 2*SE,
#   but use the direct 2.5% and 97.5% BS quantiles
#   to estimate lower and upper CI bounds.

# - Statement of the naive BS quantile (!) CI:
round(cbind( CI.BS.lo = apply(bf.b.bs, 2, quantile, .025),   # bootstrap quantile
             CI.BS.hi = apply(bf.b.bs, 2, quantile, .975),   #    "
             CI.lo    = bf.b + qt(.025,df=238)*bf.b.se,      # classical SE
             CI.hi    = bf.b + qt(.975,df=238)*bf.b.se), 3)  #    "
##                BS quantile CIs      Classical CIs
##              CI.BS.lo CI.BS.hi     CI.lo    CI.hi
## (Intercept)  -57.462   25.190    -52.365   15.988
## Age            0.003    0.124     -0.002    0.126
## Weight        -0.201    0.060     -0.194    0.017
## Height        -0.403    0.233     -0.259    0.120
## Neck          -0.928    0.007     -0.929   -0.013
## Chest         -0.262    0.144     -0.219    0.171
## Abdomen        0.795    1.112      0.784    1.125
## Hip           -0.497    0.090     -0.495    0.080
## Thigh         -0.034    0.493     -0.048    0.520
## Knee          -0.486    0.450     -0.461    0.492
## Ankle         -0.414    0.578     -0.262    0.610
## Biceps        -0.129    0.515     -0.156    0.519
## Forearm        0.063    0.910      0.060    0.844
## Wrist         -2.603   -0.562     -2.674   -0.567
# The naive quantile bootstrap is applicable only if
# . bias is negligible
# . the bootstrap distribution is approximately symmetric

- A bias & skewness-correcting quantile method for bootstrap CIs:
      ----------------------------------------------
      | Proper quantile inversion from RIs to CIs, |
      | not assuming symmetry and unbiasedness     |
      ----------------------------------------------
  . Approach: Glean from population calculations how to use bootstrap
    simulation to address both
      . bias of the estimate and
      . non-normality of the dataset-to-dataset ('sampling') distribution.
  . 'If we were nature, we would know the dataset-to-dataset distribution
    of b=T(Fn) and hence its upper/lower 2.5% percentiles (qb.lo, qb.hi):'
        P[ qb.lo < b < qb.hi | beta ] = 0.95
           ^^^^^       ^^^^^
           2.5%        97.5% quantiles of the dataset-to-dataset
                             distribution of 'b'
  . Invert an RI for 'b' to a CI for 'beta':
        P[ qb.lo - beta < b - beta < qb.hi - beta | beta ] = 0.95
    Re-express in terms of lower and upper cut-offs c.lo and c.hi:
        P[ -c.lo < b - beta < c.hi | beta ] = 0.95
    Invert the two inequalities:
        P[ b-c.hi < beta < b+c.lo | beta ] = 0.95
           !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    ==> Confidence interval:  CI = [ b - c.hi, b + c.lo ]
        where  c.hi = qb.hi - beta
               c.lo = beta  - qb.lo
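  . A toy check of this inversion (not in the original notes): we play
    nature with a right-skewed estimator, b = mean of n i.i.d. Exp(1)
    observations, beta = 1, n = 10 -- all arbitrary choices for
    illustration.  The quantiles qb.lo, qb.hi are obtained by simulation,
    and [ b - c.hi, b + c.lo ] should cover beta about 95% of the time:

      set.seed(2)
      n.toy <- 10;  beta <- 1
      b.sim <- replicate(20000, mean(rexp(n.toy, rate=1/beta)))  # dataset-to-dataset distribution
      qb    <- quantile(b.sim, c(.025, .975))                    # qb.lo, qb.hi, known to 'nature'
      c.lo  <- beta - qb[1];  c.hi <- qb[2] - beta
      b.new <- replicate(20000, mean(rexp(n.toy, rate=1/beta)))  # fresh datasets
      mean(b.new - c.hi < beta  &  beta < b.new + c.lo)          # coverage, ~0.95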
  . [Of course we don't know beta, so this is not actionable!
     Next we make this scheme actionable with bootstrap plug-in.]
  . Bootstrap plug-in estimation of c.hi and c.lo (replacing +-2*SE):
        beta  <--  b
        qb.hi <--  qb*.hi   (97.5% quantile of bootstrap values b*)
        qb.lo <--  qb*.lo   ( 2.5% quantile of bootstrap values b*)
        c.hi  <--  qb*.hi - b
        c.lo  <--  b - qb*.lo
        ^^^^^      ^^^^^^^^^^^
        population bootstrap
    ==> Bias-correcting and skewness-correcting BS quantile CI:
        ------------------------------------------------------------
        |                                                          |
        |  [ b - c.hi, b + c.lo ]  =  [ 2b - qb*.hi, 2b - qb*.lo ] |
        |                                                          |
        ------------------------------------------------------------
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
                                      Reminiscent of the bias-corrected estimate
  . The correctly inverted BS quantile CI:
      ---------------------------------------------------------------------------------
      |                                                                               |
      | [ 2*T(Fn) - quantile(T(Fn*),1-alpha/2),  2*T(Fn) - quantile(T(Fn*),alpha/2) ] |
      |                                                                               |
      ---------------------------------------------------------------------------------
    Compare with the naive BS quantile CI:
      -------------------------------------------------------------
      |                                                           |
      | [ quantile(T(Fn*),alpha/2),  quantile(T(Fn*),1-alpha/2) ] |
      |                                                           |
      -------------------------------------------------------------
    And compare with the classical BS CI based on the BS SE:
      ---------------------------------------------------------------------------------
      |                                                                               |
      | [ T(Fn) - qnorm(1-alpha/2)*SD(T(Fn*)),  T(Fn) + qnorm(1-alpha/2)*SD(T(Fn*)) ] |
      |                                                                               |
      ---------------------------------------------------------------------------------
    And finally the classical linear models CI based on the 'usual' SEhat:
      -------------------------------------------------------------------------
      |                                                                       |
      | [ T(Fn) - qt(1-alpha/2,dfs)*SEhat,  T(Fn) + qt(1-alpha/2,dfs)*SEhat ] |
      |                                                                       |
      -------------------------------------------------------------------------
    Try to memorize them and write them down from memory!

# . R Computations:
options(width=120)   # to print long lines
round(cbind(
  q.biasc.l = 2*bf.b - apply(bf.b.bs, 2, quantile, .975),  # lo, bias/skew-corr BS
  q.naive.l =          apply(bf.b.bs, 2, quantile, .025),  # lo, naive quantile BS
  normal.l  = bf.b + qt(.025,df=238)*bf.b.bs.se,           # lo, normal/t BS
  classic.l = bf.b + qt(.025,df=238)*bf.b.se,              # lo, classical t
  q.biasc.h = 2*bf.b - apply(bf.b.bs, 2, quantile, .025),  # hi, bias/skew-corr BS
  q.naive.h =          apply(bf.b.bs, 2, quantile, .975),  # hi, naive quantile BS
  normal.h  = bf.b + qt(.975,df=238)*bf.b.bs.se,           # hi, normal/t BS
  classic.h = bf.b + qt(.975,df=238)*bf.b.se),             # hi, classical t
  3)
#                    ^^^^^^^^^^^^^^^
#                    exact, instead of 2

# How to eye-ball the table below:
# Look for identical signs of the '.l' and '.h' limits because this
# means the CI does not contain 0, i.e., there is significance.
##             q.biasc.l q.naive.l normal.l classic.l q.biasc.h q.naive.h normal.h classic.h
## (Intercept)   -61.567   -57.462  -59.568   -52.365    21.085    25.190   23.191    15.988
## Age             0.000     0.003    0.002    -0.002     0.121     0.124    0.122     0.126
## Weight         -0.237    -0.201   -0.220    -0.194     0.024     0.060    0.044     0.017
## Height         -0.372    -0.403   -0.345    -0.259     0.264     0.233    0.206     0.120
## Neck           -0.948    -0.928   -0.935    -0.929    -0.014     0.007   -0.006    -0.013
## Chest          -0.192    -0.262   -0.237    -0.219     0.214     0.144    0.189     0.171
## Abdomen         0.798     0.795    0.784     0.784     1.115     1.112    1.125     1.125
## Hip            -0.505    -0.497   -0.490    -0.495     0.082     0.090    0.075     0.080
## Thigh          -0.020    -0.034   -0.029    -0.048     0.507     0.493    0.501     0.520
## Knee           -0.420    -0.486   -0.457    -0.461     0.516     0.450    0.487     0.492
## Ankle          -0.230    -0.414   -0.355    -0.262     0.762     0.578    0.703     0.610
## Biceps         -0.151    -0.129   -0.147    -0.156     0.493     0.515    0.510     0.519
## Wrist          -2.679    -2.603   -2.656    -2.674    -0.638    -0.562   -0.585    -0.567

We note:
. 'Age' is insignificant w.r.t. the classical CI,
  but borderline significant (+) w.r.t. all BS CIs.
. 'Neck' is insignificant w.r.t. the naive quantile BS CIs,
  but borderline significant (-) w.r.t. all others.
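. A compact way to internalize the four recipes: a small helper function
  (hypothetical, not part of the original notes; arguments: b = estimate,
  b.bs = its bootstrap values, se.cl = classical SE, dfs = residual df):

      boot.CIs <- function(b, b.bs, se.cl, dfs, alpha=0.05) {
        q  <- quantile(b.bs, c(alpha/2, 1-alpha/2))        # BS quantiles
        se <- sd(b.bs)                                     # BS SE
        out <- rbind(q.biasc = c(2*b - q[2], 2*b - q[1]),  # inverted quantile CI
                     q.naive = c(q[1], q[2]),              # naive quantile CI
                     normal  = b + c(-1,1)*qnorm(1-alpha/2)*se,  # BS-SE normal CI
                     classic = b + c(qt(alpha/2,dfs), qt(1-alpha/2,dfs))*se.cl)
        colnames(out) <- c("lo","hi");  out
      }
      # Example: the 'Height' coefficient (compare with the table above):
      round(boot.CIs(bf.b["Height"], bf.b.bs[,"Height"], bf.b.se["Height"], dfs=238), 3)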

* Issues:

- Peter Hall says to use the bootstrap distribution of pivotal quantities.
  I.e., in linear models, apply the BS to t-statistics to learn where to
  place the cut-offs.  Don't use the BS standard error estimate.

- Efron says it's not that simple: Much depends on whether T(F) is
  interval scale or ratio scale or anything else.
  ==> Invents intricate BS corrections.
  Hall's use of pivotal quantities is fine for interval scale
  statistical functionals.
  ==> Transform your parameters to interval scale if they are not.
      (E.g., sigma = SD(F)  ==>  use log(sigma) instead(?))

- Based on the 2013 qualifying exam, neither of the above intervals has
  a clear advantage in terms of more correct coverage!  This is a bit
  disappointing given the mental effort we put into correct inversion
  from RI to CI.  More needs to be understood.  Efron's correction
  methods seem promising, but the instructor has no experience with
  them yet.

- Dan McCarthy and Kai Zhang completed a paper that moves the inverted
  quantile method to a new level:
    ----------------------------------------------------------------------------
    | Use TWO-STAGE BOOTSTRAP to calibrate the BS inverted quantile intervals. |
    ----------------------------------------------------------------------------
  The idea is 'lack of calibration': If we thought we had constructed a
  95% CI, maybe it is a good interval, not for 95% but for 98%, for example.

* Lit. on bootstrap:
- Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap.
- Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and Their Application.
- AS papers by Efron and Hall.

================================================================