================================================================

* BOOTSTRAP -- A METHOD FOR ESTIMATING SE/VARIANCE, BIAS, AND CIs

- Preliminaries: statistical functionals
  . Consider i.i.d. samples y1,...,yn ~ F.
    Form the empirical distribution of y1,...,yn:
      Fn = 1/n sum_i delta_yi
  . 'Statistical functionals':
      (0) T(y1,...,yn) = T(Fn)
      (1) T(Fn) --> T(F)  as n --> Inf
      (2) (T(Fn)-T(F))*sqrt(n) --> N(0,AV(F,T))  in distrib./weakly  as n --> Inf
  . Interpretations:
      (1) "a generalized law of large numbers",
          a limited form of continuity of T()
      (2) asymptotic normality, "a generalized CLT";
          AV(F,T) is called the asymptotic variance.
          [Notation: We may drop the argument T.]
  . Examples for property (1):
      + T(Fn) = mean;  T(F) = expectation
      + T(Fn) = empirical 90% quantile;  T(F) = distributional 90% quantile
        [question of uniqueness; may be declared not to exist for some F]
    Counter-examples:
      + T(Fn) = SSn/(n-1) is not a statistical functional (it depends on n,
        not just on Fn).  However, the biased estimate T(Fn) = SSn/n is a
        statistical functional.
      + T(Fn) = y{n/2} is not a statistical functional.
        However, the median is a statistical functional.
  . This framework gives justification to asymptotic CIs:
      CI = [T(Fn)-1.96*sqrt(AV(F)/n), T(Fn)+1.96*sqrt(AV(F)/n)]
      P(T(F) in CI) --> 0.95  as n --> Inf
  . Problems:
      + AV(F) is not known; it needs to be estimated.
      + Even though asymptotic theory says the bias of T(Fn) disappears
        faster than the variance as n --> Inf, the bias can still be
        substantial for n of interest.

- Bootstrap: Efron's 'trivial idea' (1979 -- 30 year anniversary in 2009)
  . Mother nature/deity presents us with data y1,...,yn i.i.d. F.
  . We do not know F, but we can estimate it.  Call the estimate 'Fhat'.
      + Fhat = Fn (the empirical distribution):
            ---------------------------
            | NONPARAMETRIC BOOTSTRAP |
            ---------------------------
      + Fhat = P(dy;theta.hat), plug-in w.r.t. a model P(dy;theta):
            ---------------------------
            |  PARAMETRIC  BOOTSTRAP  |
            ---------------------------
  . Bootstrap principle:
      + Act as if you were mother nature and Fhat were the truth.
        That is, sample artificial iid data from Fhat, repeatedly:
            (y1*1,...,yn*1)    # artificial dataset 1
            (y1*2,...,yn*2)    # artificial dataset 2
            (y1*3,...,yn*3)    # artificial dataset 3
            ...
        Repeat, for example, Nboot=1000 times.
      + Compute statistics T() from the observed dataset {y1,...,yn}
        and from all bootstrap datasets {y1*j,...,yn*j} (j=1...Nboot).
        ==> The "bootstrap values" of the statistic T()
              { T*1     = T(y1*1,...,yn*1),
                T*2     = T(y1*2,...,yn*2),
                ...
                T*Nboot = T(y1*Nboot,...,yn*Nboot) }
            can be used as an estimate of the sampling or
            "dataset-to-dataset" distribution of T().
  . Uses of the bootstrap values:
      + Their SD is an estimate of the SE of T(Fhat):
            ---------------------------------------
            | SE.boot = SD({T*1,T*2,...,T*Nboot}) |
            ---------------------------------------
      + Form bootstrap CIs:
            CI = [ T(Fhat)-2*SE, T(Fhat)+2*SE ]  =  '[ T(Fhat) +- 2*SE ]'
      + Caveat: This is the simplest form of bootstrap CIs.  It trusts that
        n is large enough to make asymptotic normality a good approximation.

- Comparison of the parametric and nonparametric bootstrap:
  . Parametric bootstrap may not give you the proper idea of
    dataset-to-dataset variability if the fitted model overfits.
    (Think about it!)
  . Nonparametric bootstrap is vastly more popular when it applies,
    but the datasets it creates have duplicates and do not use some cases:
      P[case 'i' not used in n draws] = (1-1/n)^n --> exp(-1) ~ 0.3678794 > 1/3
      ==> Approx fraction of cases used is 1-exp(-1) ~ 0.6321206,
          or just under 2/3.
    (After many bootstrap samples, one will have used all data, though.)
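
- A minimal R sketch of the nonparametric bootstrap principle for a single
  statistic (the median).  This is an illustration only, not part of the
  original notes: rexp(), n=100, and Nboot=1000 are arbitrary choices.

      set.seed(1)
      n      <- 100
      y      <- rexp(n)                 # 'observed' data; true median = log(2)
      T.hat  <- median(y)               # T(Fn)
      Nboot  <- 1000
      T.boot <- replicate(Nboot, median(sample(y, n, replace=TRUE)))
      SE.boot <- sd(T.boot)                      # SD of the bootstrap values
      c(T.hat - 2*SE.boot, T.hat + 2*SE.boot)    # simplest bootstrap CI
      # Fraction of distinct cases per resample, near 1-exp(-1) ~ 0.632:
      mean(replicate(200, length(unique(sample(n, replace=TRUE))) / n))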
- General remarks on bootstrap:
  . Nboot is not part of the concept really...
    For theory, Nboot=Inf, i.e., complete enumeration.
  . In order for CIs to be justified, one needs a sufficiently large
    sample size n so the sampling distribution is close to normal.
  . Resample size and sample size:
    + While it is true that our main interest is to estimate the sampling
      variation of estimates for samples of size n, we could also estimate
      it for samples of any other size m (m smaller OR larger than n).
      ==> Resample iid m values from {y1,...,yn} to form bootstrap
          resamples {y*1,...,y*m} for m != n.
    + Standard error estimates can always be translated to asymptotic
      standard error estimates by
          ------------------------------------------
          | SEn*sqrt(n) --> sqrt(AV)  as n --> Inf |
          ------------------------------------------
      Asymptotic standard error estimates can always be translated to
      finite-sample standard error estimates for arbitrary sample size m by
          SEm = ASD/sqrt(m),  where ASD = sqrt(AV).
    + It follows that we could use an arbitrary resample size m and
      translate back to sample size n by
          -----------------------
          | SEn ~ SEm*sqrt(m/n) |
          -----------------------
      This raises the question of choice of resample size.  There is some
      theory to suggest that small resample size could work even if the
      n-of-n bootstrap 'does not work'.  [see Bickel, Goetze, van Zwet (Sinica)]
      (A small m-of-n illustration follows the regression example below.)

# - Non-parametric bootstrap standard errors: linear regression
#   We will use the x-y or pairs bootstrap, not the residual bootstrap!
#   Linear OLS regression raises the question of when to use which version:
#   . If you trust the linear model to be first order correct
#     AND the errors to be iid, then use the residual bootstrap.
#   . If you wish to make no assumptions other than that the (x,y) tuples
#     are iid sampled, then use the x-y bootstrap.

# Data Example: the 'bodyfat' data
site <- "http://www-stat.wharton.upenn.edu/~buja/STAT-961/"
bodyfat <- read.table(paste(site,"bodyfat.dat",sep=""))
names(bodyfat) <- scan(paste(site,"bodyfat.col",sep=""), w="")
colnames(bodyfat)

# Regress 'PercBF' on all predictors:
bf.lm <- lm(PercBF ~ . - Density, data=bodyfat)
bf.lm.smry <- summary(bf.lm)
bf.lm.smry

# LS coefficients:
bf.b <- bf.lm.smry$coeff[,"Estimate"]
cbind(signif(bf.b,3))

# Standard errors computed the usual way:
bf.b.se <- bf.lm.smry$coeff[,"Std. Error"]
cbind(signif(bf.b.se,3))

# Bootstrap simulation: -------------------------------------
N <- nrow(bodyfat)
Nboot <- 10000
bf.b.bs <- matrix(NA, ncol=length(bf.b), nrow=Nboot)
colnames(bf.b.bs) <- names(bf.b.se)
for(iboot in 1:Nboot) {
  sel <- sample(N, replace=T)                 # check dups: sort(table(sel))
  bf.b.bs[iboot,] <- coef(lm(PercBF ~ . - Density, data=bodyfat[sel,]))
  #                                                 ^^^^^^^^^^^^^ BS dataset
  if(iboot%%200==0) print(iboot)
}

# One-liner, w/o intermediate output:
bf.b.bs <- t(replicate(Nboot, {
  coef(lm(PercBF ~ . - Density, data=bodyfat[sample(N,repl=T),]))
}))
dim(bf.b.bs)
dimnames(bf.b.bs)

# Bootstrap standard errors: --------------------------------------
bf.b.bs.se <- apply(bf.b.bs, 2, sd)
sqrt(diag(cov(bf.b.bs)))   # Same thing...

# Compare bootstrap with usual standard errors:
cbind(signif( cbind(boot=bf.b.bs.se, classic=bf.b.se), 2))
# Ratio comparison:
cbind(round( bf.b.bs.se / bf.b.se, 3))
# Mostly a non-event: Bootstrap SEs are very close to the usual SEs
# except for 'Height': classic=0.096, bootstrap=0.140, ratio=1.45.
# In other data we see greater disparities, sometimes a factor of 2.
# The ratios can also be < 1, but empirically this happens less often
# than > 1.
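
# Aside, tying back to the resample-size remark in the general remarks above:
# a minimal m-of-n sketch (not in the original notes; m=100 is an arbitrary
# choice).  SEs for resample size m are rescaled back to sample size N via
# SEn ~ SEm*sqrt(m/n); the rescaling is only asymptotically justified.
m <- 100
bf.b.bs.m <- t(replicate(Nboot, {
  coef(lm(PercBF ~ . - Density, data=bodyfat[sample(N, m, replace=TRUE),]))
}))
bf.b.bs.m.se <- apply(bf.b.bs.m, 2, sd)         # SEs for resample size m
cbind(signif( cbind(m.of.n = bf.b.bs.m.se * sqrt(m/N),   # translated to size N
                    n.of.n = bf.b.bs.se), 2))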
# - Bootstrap CIs from bootstrap SEs:
round( cbind( bf.b - 2*bf.b.bs.se, bf.b + 2*bf.b.bs.se), 3)   # Prefer 2 over 1.96

# - Bootstrap 2-sided p-value with asymptotic approximation:
cbind(round(2*pmin(pnorm( bf.b, m=0, s=bf.b.bs.se),
                   pnorm(-bf.b, m=0, s=bf.b.bs.se)), 7))
# [round() is used to avoid exponential/scientific form of the p-values.
#  Compare readability:]
cbind(2*pmin(pnorm( bf.b, m=0, s=bf.b.bs.se),
             pnorm(-bf.b, m=0, s=bf.b.bs.se)))

# - We applied the non-parametric bootstrap to regression,
#   thus capturing the full variability of (xi,yi) sampling,
#   whereas 'usual' inference CONDITIONS ON THE xi-VALUES
#   and considers variation in yi given xi only.
#   In this particular dataset, the discrepancies between
#   conventional and BS SEs are not a concern other than for
#   'Height', and a little for 'Weight' and 'Ankle'.

# - How meaningful are the bootstrap SEs?
#   Is the bootstrap distribution approximately normal?
#   Maybe this is a good place to CHECK NORMALITY OF THE ESTIMATES:
par(mfrow=c(4,4), mgp=c(1.5,.5,0), mar=c(1.5,1.5,2,1))
for(j in 1:ncol(bf.b.bs)) {
  qqnorm(bf.b.bs[,j], pch=16, cex=.3, main=names(bf.b)[j],
         xlab="", ylab="", cex.main=1)
  co <- "gray";  lw <- .5
  abline(v=0,       col=co, lwd=lw)   # vertical: 0 to mark the median
  abline(h=bf.b[j], col=co, lwd=lw)   # horizontal: b[j]
  # show the normal line implied by classic theory: N(b[j], SE(b[j])^2)
  abline(a=bf.b[j], b=bf.b.se[j], col=co, lwd=lw)   # slope: SE(b[j])
  abline(h=0, col="red", lwd=lw)      # horizontal: 0 to show significance
}
# However, these pictures are misleading: What really matters is not
# non-normality of the estimates, but of the estimates standardized by
# their standard error ESTIMATES.
# PS: If we had TRUE standard errors, the above pictures would be relevant,
#     but we do not have them; we only have estimates thereof.
# . Note:
#   + 'Height' has a problem with non-normality.
#   + 'Ankle' has a problem with a non-normal lower tail.
#   + 'Weight' has a minor problem in the upper tail.
#   + 'Chest' has a minor problem in the lower tail.
#   + Intercept looks nearly normal but wider than classical.
#   Most of these problems do not matter because of clear insignificance.
#   Issues arise only in the moderate tails where we seek a decision
#   about significance.
# . Reading statistical significance from these plots:
#   The red line is at zero, hence we have significance
#   if ...
# . BIAS: b[j] (=hor. line) and median(b.bs[,j]) are close,
#   hence there is little median bias in the coefficients.
#   'Little' means 'little compared to SE.BS(b[j])'.

# - BIAS ESTIMATION with the bootstrap:
#   . Estimate   bias     = E[estimates|truth] - truth
#     by         bias.hat = E[bootstrap estimates|estimate] - estimate
#                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#                           average of bootstrap estimates
#   . Example of bootstrap bias estimates: for all regr coeffs
bf.b.bs.bias <- apply(bf.b.bs, 2, mean) - bf.b
cbind(signif( bf.b.bs.bias, 3))
# Compare bootstrap bias estimate and bootstrap SE:
cbind(signif( bf.b.bs.bias / bf.b.bs.se, 3))
# Sizable biases: Weight (0.16), Chest (-0.19), Ankle (-0.15).
# Still, these biases are relatively small compared to the SE,
# which is natural because bias tends to disappear at the rate 1/n
# whereas the SE shrinks at the slower rate 1/sqrt(n).
# On the other hand, bias can be non-negligible for small n.
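
# Aside (simulated, not from the bodyfat data): the bootstrap bias estimate
# applied to a statistic with a known bias, the plug-in variance
# T(Fn) = SSn/n, which underestimates sigma^2 by sigma^2/n.  The choices
# n=15, sigma^2=1, and 2000 resamples are arbitrary, for illustration only.
set.seed(3)
n.small <- 15
y.small <- rnorm(n.small)                        # true variance = 1
T.hat   <- mean((y.small - mean(y.small))^2)     # plug-in variance SSn/n
T.boot  <- replicate(2000, { ys <- sample(y.small, n.small, replace=TRUE)
                             mean((ys - mean(ys))^2) })
bias.hat <- mean(T.boot) - T.hat                 # bootstrap bias estimate
c(T.hat=T.hat, bias.hat=bias.hat, approx.expected=-T.hat/n.small)
# Back to the bodyfat regression coefficients: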
# Median biases seem to be a touch smaller than mean biases:
bf.b.bs.bias.med <- apply(bf.b.bs, 2, median) - bf.b
cbind(signif( bf.b.bs.bias.med / bf.b.bs.se, 3))
# Bias should be compared to the standard error:
signif(cbind(meanBiasPerSE = bf.b.bs.bias     / bf.b.bs.se,
             medBiasPerSE  = bf.b.bs.bias.med / bf.b.bs.se), 3)

# - Bootstrap bias correction/reduction:
#       T(Fn) - BS.Bias.hat
#         = T(Fn) - ( ave[T(Fn*)] - T(Fn) )
#         = 2*T(Fn) - ave[T(Fn*)]
#   This is the typical form of bias correction: 2*estimate - ave.est

# - Further potential problem: non-normality of the T(Fn)-distribution,
#   skewness, for example.  Imagine T(Fn) is right skewed:
#
#           T(F)
#     ------|---------|-------------------------|---------
#           2.5%      50%                       97.5%
#           |-----------------------------------|
#              95% range of T(Fn)-distribution
#
#   Q: What interval should you put around T(Fn) to catch T(F)?
#   A: An asymmetric interval that reaches farther left than right
#      to compensate for the fact that T(Fn) tends to fall
#      farther to the right than to the left.

# - Idea: Construct CIs not symmetrically from T(Fn) +- 2*SE,
#   but use the direct 2.5% and 97.5% BS quantiles
#   to estimate lower and upper CI bounds.

# - Statement of the naive BS quantile (!) CI:
round(cbind( CI.BS.lo = apply(bf.b.bs, 2, quantile, .025),   # bootstrap quantile
             CI.BS.hi = apply(bf.b.bs, 2, quantile, .975),   #    "
             CI.lo    = bf.b + qt(.025,df=238)*bf.b.se,      # classical SE
             CI.hi    = bf.b + qt(.975,df=238)*bf.b.se), 3)  #    "
##                BS quantile CIs      Classical CIs
##              CI.BS.lo CI.BS.hi     CI.lo    CI.hi
## (Intercept)  -57.462   25.190    -52.365   15.988
## Age            0.003    0.124     -0.002    0.126
## Weight        -0.201    0.060     -0.194    0.017
## Height        -0.403    0.233     -0.259    0.120
## Neck          -0.928    0.007     -0.929   -0.013
## Chest         -0.262    0.144     -0.219    0.171
## Abdomen        0.795    1.112      0.784    1.125
## Hip           -0.497    0.090     -0.495    0.080
## Thigh         -0.034    0.493     -0.048    0.520
## Knee          -0.486    0.450     -0.461    0.492
## Ankle         -0.414    0.578     -0.262    0.610
## Biceps        -0.129    0.515     -0.156    0.519
## Forearm        0.063    0.910      0.060    0.844
## Wrist         -2.603   -0.562     -2.674   -0.567
# The naive quantile bootstrap is applicable only if
# . bias is negligible
# . the bootstrap distribution is approximately symmetric

- A bias & skewness-correcting quantile method for bootstrap CIs:
      ----------------------------------------------
      | Proper quantile inversion from RIs to CIs, |
      | not assuming symmetry and unbiasedness     |
      ----------------------------------------------
  . Approach: Glean from population calculations how to use bootstrap
    simulation to address both
      . bias of the estimate and
      . non-normality of the dataset-to-dataset ('sampling') distribution.
  . 'If we were nature, we would know the dataset-to-dataset distribution
    of b=T(Fn) and hence its upper/lower 2.5% percentiles (qb.lo, qb.hi):'
        P[ qb.lo < b < qb.hi | beta ] = 0.95
           ^^^^^       ^^^^^
           2.5%        97.5% quantiles of the dataset-to-dataset
                             distribution of 'b'
  . Invert an RI for 'b' to a CI for 'beta':
        P[ qb.lo - beta < b - beta < qb.hi - beta | beta ] = 0.95
    Re-express in terms of lower and upper cut-offs c.lo and c.hi:
        P[ -c.lo < b - beta < c.hi | beta ] = 0.95
    Invert the two inequalities:
        P[ b-c.hi < beta < b+c.lo | beta ] = 0.95
           !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    ==> Confidence interval:  CI = [ b - c.hi, b + c.lo ]
        where  c.hi = qb.hi - beta
               c.lo = beta  - qb.lo
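  . A toy check of this inversion (not in the original notes): we play
    nature with a right-skewed estimator, b = mean of n i.i.d. Exp(1)
    observations, beta = 1, n = 10 -- all arbitrary choices for
    illustration.  The quantiles qb.lo, qb.hi are obtained by simulation,
    and [ b - c.hi, b + c.lo ] should cover beta about 95% of the time:

      set.seed(2)
      n.toy <- 10;  beta <- 1
      b.sim <- replicate(20000, mean(rexp(n.toy, rate=1/beta)))  # dataset-to-dataset distribution
      qb    <- quantile(b.sim, c(.025, .975))                    # qb.lo, qb.hi, known to 'nature'
      c.lo  <- beta - qb[1];  c.hi <- qb[2] - beta
      b.new <- replicate(20000, mean(rexp(n.toy, rate=1/beta)))  # fresh datasets
      mean(b.new - c.hi < beta  &  beta < b.new + c.lo)          # coverage, ~0.95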
  . [Of course we don't know beta, so this is not actionable!
     Next we make this scheme actionable with bootstrap plug-in.]
  . Bootstrap plug-in estimation of c.hi and c.lo (replacing +-2*SE):
        beta  <--  b
        qb.hi <--  qb*.hi   (97.5% quantile of bootstrap values b*)
        qb.lo <--  qb*.lo   ( 2.5% quantile of bootstrap values b*)
        c.hi  <--  qb*.hi - b
        c.lo  <--  b - qb*.lo
        ^^^^^      ^^^^^^^^^^^
        population bootstrap
    ==> Bias-correcting and skewness-correcting BS quantile CI:
        ------------------------------------------------------------
        |                                                          |
        |  [ b - c.hi, b + c.lo ]  =  [ 2b - qb*.hi, 2b - qb*.lo ] |
        |                                                          |
        ------------------------------------------------------------
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
                                      Reminiscent of the bias-corrected estimate
  . The correctly inverted BS quantile CI:
      ---------------------------------------------------------------------------------
      |                                                                               |
      | [ 2*T(Fn) - quantile(T(Fn*),1-alpha/2),  2*T(Fn) - quantile(T(Fn*),alpha/2) ] |
      |                                                                               |
      ---------------------------------------------------------------------------------
    Compare with the naive BS quantile CI:
      -------------------------------------------------------------
      |                                                           |
      | [ quantile(T(Fn*),alpha/2),  quantile(T(Fn*),1-alpha/2) ] |
      |                                                           |
      -------------------------------------------------------------
    And compare with the classical BS CI based on the BS SE:
      ---------------------------------------------------------------------------------
      |                                                                               |
      | [ T(Fn) - qnorm(1-alpha/2)*SD(T(Fn*)),  T(Fn) + qnorm(1-alpha/2)*SD(T(Fn*)) ] |
      |                                                                               |
      ---------------------------------------------------------------------------------
    And finally the classical linear models CI based on the 'usual' SEhat:
      -------------------------------------------------------------------------
      |                                                                       |
      | [ T(Fn) - qt(1-alpha/2,dfs)*SEhat,  T(Fn) + qt(1-alpha/2,dfs)*SEhat ] |
      |                                                                       |
      -------------------------------------------------------------------------
    Try to memorize them and write them down from memory!

# . R Computations:
options(width=120)   # to print long lines
round(cbind(
  q.biasc.l = 2*bf.b - apply(bf.b.bs, 2, quantile, .975),  # lo, bias/skew-corr BS
  q.naive.l =          apply(bf.b.bs, 2, quantile, .025),  # lo, naive quantile BS
  normal.l  = bf.b + qt(.025,df=238)*bf.b.bs.se,           # lo, normal/t BS
  classic.l = bf.b + qt(.025,df=238)*bf.b.se,              # lo, classical t
  q.biasc.h = 2*bf.b - apply(bf.b.bs, 2, quantile, .025),  # hi, bias/skew-corr BS
  q.naive.h =          apply(bf.b.bs, 2, quantile, .975),  # hi, naive quantile BS
  normal.h  = bf.b + qt(.975,df=238)*bf.b.bs.se,           # hi, normal/t BS
  classic.h = bf.b + qt(.975,df=238)*bf.b.se),             # hi, classical t
  3)
#                    ^^^^^^^^^^^^^^^
#                    exact, instead of 2

# How to eye-ball the table below:
# Look for identical signs of the '.l' and '.h' limits because this
# means the CI does not contain 0, i.e., there is significance.
##             q.biasc.l q.naive.l normal.l classic.l q.biasc.h q.naive.h normal.h classic.h
## (Intercept)   -61.567   -57.462  -59.568   -52.365    21.085    25.190   23.191    15.988
## Age             0.000     0.003    0.002    -0.002     0.121     0.124    0.122     0.126
## Weight         -0.237    -0.201   -0.220    -0.194     0.024     0.060    0.044     0.017
## Height         -0.372    -0.403   -0.345    -0.259     0.264     0.233    0.206     0.120
## Neck           -0.948    -0.928   -0.935    -0.929    -0.014     0.007   -0.006    -0.013
## Chest          -0.192    -0.262   -0.237    -0.219     0.214     0.144    0.189     0.171
## Abdomen         0.798     0.795    0.784     0.784     1.115     1.112    1.125     1.125
## Hip            -0.505    -0.497   -0.490    -0.495     0.082     0.090    0.075     0.080
## Thigh          -0.020    -0.034   -0.029    -0.048     0.507     0.493    0.501     0.520
## Knee           -0.420    -0.486   -0.457    -0.461     0.516     0.450    0.487     0.492
## Ankle          -0.230    -0.414   -0.355    -0.262     0.762     0.578    0.703     0.610
## Biceps         -0.151    -0.129   -0.147    -0.156     0.493     0.515    0.510     0.519
## Wrist          -2.679    -2.603   -2.656    -2.674    -0.638    -0.562   -0.585    -0.567

We note:
. 'Age' is insignificant w.r.t. the classical CI,
  but borderline significant (+) w.r.t. all BS CIs.
. 'Neck' is insignificant w.r.t. the naive quantile BS CIs,
  but borderline significant (-) w.r.t. all others.
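. A compact way to internalize the four recipes: a small helper function
  (hypothetical, not part of the original notes; arguments: b = estimate,
  b.bs = its bootstrap values, se.cl = classical SE, dfs = residual df):

      boot.CIs <- function(b, b.bs, se.cl, dfs, alpha=0.05) {
        q  <- quantile(b.bs, c(alpha/2, 1-alpha/2))        # BS quantiles
        se <- sd(b.bs)                                     # BS SE
        out <- rbind(q.biasc = c(2*b - q[2], 2*b - q[1]),  # inverted quantile CI
                     q.naive = c(q[1], q[2]),              # naive quantile CI
                     normal  = b + c(-1,1)*qnorm(1-alpha/2)*se,  # BS-SE normal CI
                     classic = b + c(qt(alpha/2,dfs), qt(1-alpha/2,dfs))*se.cl)
        colnames(out) <- c("lo","hi");  out
      }
      # Example: the 'Height' coefficient (compare with the table above):
      round(boot.CIs(bf.b["Height"], bf.b.bs[,"Height"], bf.b.se["Height"], dfs=238), 3)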

* Issues:

- Peter Hall says to use the bootstrap distribution of pivotal quantities.
  I.e., in linear models, apply the BS to t-statistics to learn where to
  place the cut-offs.  Don't use the BS standard error estimate.

- Efron says it's not that simple: Much depends on whether T(F) is
  interval scale or ratio scale or anything else.
  ==> Invents intricate BS corrections.
  Hall's use of pivotal quantities is fine for interval scale
  statistical functionals.
  ==> Transform your parameters to interval scale if they are not.
      (E.g., sigma = SD(F)  ==>  use log(sigma) instead(?))

- Based on the 2013 qualifying exam, neither of the above intervals has
  a clear advantage in terms of more correct coverage!  This is a bit
  disappointing given the mental effort we put into correct inversion
  from RI to CI.  More needs to be understood.  Efron's correction
  methods seem promising, but the instructor has no experience with
  them yet.

- Dan McCarthy and Kai Zhang completed a paper that moves the inverted
  quantile method to a new level:
    ----------------------------------------------------------------------------
    | Use TWO-STAGE BOOTSTRAP to calibrate the BS inverted quantile intervals. |
    ----------------------------------------------------------------------------
  The idea is 'lack of calibration': If we thought we had constructed a
  95% CI, maybe it is a good interval, not for 95% but for 98%, for example.

* Lit. on bootstrap:
- Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap.
- Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and Their Application.
- AS papers by Efron and Hall.

================================================================