================================================================


* DIAGNOSTICS FOR LACK OF FIT

  - Generalities: Three types of diagnostics:
    . lack of fit
    . case influence
    . non-identifiability (collinearity in linear models)
    This section is about

           -----------------
          |   LACK OF FIT   |
           -----------------

    Synonyms: model violation, misspecification, biased model

  - Potential lack-of-fit problems in linear models:
    (1) E[y] != X beta      for any beta       (1st order misspecification)
    (2) V[y] != sigma^2 I   for any sigma      (2nd order misspecification)
    (3) err  !~ N(0_N,sigma^2 I_N)             (distributional misspecification)
    => Need
       . Detection: diagnostics => graphical methods (plots),
           combined with inferential ideas
       . Fixes: better models  (may not be possible)

  - How to 'see' misspecification: diagnostics based on residual analysis

    . If (1), (2) and (3) were satisfied, and if the N >> p,
      then residuals would approximate the errors:  r_i ~=~ err_i

    . The errors err_i would have to be 'random' (centered at zero).
      ==> 'Random' would imply 'no association with the regressors.'
      ==> Any association (nonlinear, clustered, ...) of err_i with
          the regressors xx_i would amount to a model violation.

    . Further motivations: If we consider the regressor observations
      xx_i = X[i,] as iid random vectors instead of fixed numbers
      (i.e., if we avoid conditioning on the regressors),
      we would require independence of errors and regressors:
                 ---------------
                |  err _||_ xx  |
                 ---------------
      Independence is equivalent to
                 ----------------------------------------------
                |  cov(g(err),f(xx)) = 0  for all g() and f()  |
                 ----------------------------------------------

          'Proof': Apply this to indicator functions g(err)=1_{err in B}
           and f(xx)=1_{xx in A}, and you get
                   P[err in B & xx in A] - P[err in B]*P[xx in A] = 0
           For the reverse, approximate any g(err) and g(xx) by step functions

       Thus, if there exists any correlation between a function
       of the errors and a function of the regressors,
       it constitutes a violation of the linear models assumptions.

       ==> Diagnostics rely on finding associations between g(r_i) and f(xx_i).
           Model fixes rely on transforming the data to achieve the
           appearance of independence of residuals and regressors.


  - Common Fixes for Lack of Fit:

    . If one can assume that all relevant predictors are at hand,
      and yet the model does not fit,
      try TRANSFORMATIONS and additional FUNCTIONAL TERMS such as interactions.

    . Box-Cox family of power/log-transformation: assuming y>0
      Try power transformations, but Box & Cox proposed a power family
      that contains the logarithm as well:

      + R code:

        box.cox <- function(y, pow=1)
          if(pow != 0) return( (y^pow - 1)/pow ) else return( log(y) )

      + Illustration of the Box-Cox family of transformations:
          y <- seq(0,3,length=201)
          plot(x=range(y), y=c(-3,1.5), type="n", xlab="y", ylab="box.cox(y)")
          abline(h=0, col="gray90");  abline(v=1, col="gray90")
          for(pow in seq(-2,2,by=.2)) {
            lines(y, box.cox(y,pow), col="gray50")
            points(0, box.cox(0,pow), pch=16, cex=.3, col="gray50")
          }
          pow <--1;  lines(y, box.cox(y,pow), col="green",  lwd=2)
          pow <- 0;  lines(y, box.cox(y,pow), col="red",    lwd=2)
          pow <- 1;  lines(y, box.cox(y,pow), col="blue",   lwd=2)
          pow <- 2;  lines(y, box.cox(y,pow), col="orange", lwd=2)
          legend("bottomright", paste("pow = ",(-1):2,sep=""),
                 col=c("green","red","blue","orange"), lwd=2, merge=T)

      + Properties of the Box-Cox family of transformations:
        . ascending for ALL powers from -Inf to +Inf (even for powers < 0 !!)
        . convex for  power>1
        . concave for power<1
        . BC(1) = 0        for all powers
        . d/dx BC(1) = 1   for all powers


      + Data Example: Boston housing data (stay tuned for background)
             (Famous from Belseley, Kuh, Welsch, "Regression Diagnostics" 1980s)

                 site <- "http://www-stat.wharton.upenn.edu/~buja/STAT-961/"
                 boston <- read.table(paste(site,"boston.dat",sep=""), header=T)
                 rownames(boston) <- paste(scan(paste(site,"boston.geography",sep=""),
                                                w=""),   1:nrow(boston))
                 boston[1:5,]
                 str(boston)

       + Select the "best power" according to highest R2:

                 for(pow in seq(2,-2,by=-.1))
                   cat("power =",pow,":\n                R^2 =",
                       summary(lm(box.cox(MEDV, pow) ~ .,
                                  data=boston))$r.squared,"\n")

             Winner:  pow=...    (but by how much?)

             Graphical view:
                 x <- seq(2,-2,by=-.1)
                 y <- sapply(x, function(pow) summary(lm(box.cox(MEDV, pow) ~ ., data=boston))["r.squared"][[1]] )
                 plot(x, y, xlab="Power", ylab="RSquare", type="l")
                 ind.max <- which(y==max(y))
                 points(x[ind.max], y[ind.max], pch=16)
                 text(x[ind.max], y[ind.max], lab=round(y[ind.max],2), pos=1)
                 abline(h=y[ind.max]-0.01, col="gray80")

       + Caveat: Estimating the Box-Cox power is an iffy endeavor.
                 Controversy in the 1980s, Hinkley versus Bickel.

       + The approach could be used for estimating transformations of predictors.
                 This was done by Box-Tiao (19??).


    . The ultimate hammer for estimating transformations of Y and X''s:
        the ACE algorithm (Breiman-Friedman 1985, JASA) -- see Stat 926
        [see 'acepack' in R]

    . Non-linear model terms:
        When it is likely that  y = f(x1,x2,...) + eps  with f(...) nonlinear, try:

      + Interactions:  y ~ x1 * x2  =  y ~ x1 + x2 + x1:x2
        =>   b1*x1 + b2*x2 + b12*x1*x2
           = b1*x1 + (b2 + b12*x1)*x2
           = b2*x2 + (b1 + b12*x2)*x1
        the "effect" of x1 on y depends on level of x2

      + Polynomial terms:  y ~ x + I(x^2)
        amounts to a quadratic transformation of the predictor
        but with free parameters:  y = b0 + b1*x + b2*x^2 + r
        (b2 is the "curvature parameter" for the transformation of x.)


----------------


  - R diagnostics plots:

    . R Code: created by the 'plot.lm' method
              for the generic function 'plot'

          help(plot.lm)
          boston.lm <- lm(MEDV ~ ., data=boston)
          par(mfrow=c(2,2), mar=c(3,3,2,2), mgp=c(1.8,0.5,0.0))
          plot(boston.lm)

      Refined code:

          par(mfrow=c(2,2), mgp=c(1.8,.5,0), mar=c(3,3,2.5,2))
          plot(boston.lm, pch=16, cex=.5, cex.axis=.8, cex.lab=.8)

    . Types of plots:

      + Residuals vs fitted values:
          to see violations of ...

      + Normal q-q plot of stdized residuals:
          to see violations of ...

      + Scale-Location plot of sqrt(|stdized residuals|):
          to see violations of ...

      + Stdized residuals vs leverage values H_{ii}:
          to see ...


  - Missing diagnostic plots in R:

       ----------------------------
      |   RESIDUALS VERSUS ORDER   |
       ----------------------------

    . Purpose: to see "lurking" variables that are not formally part
               of the data but 'lurk' in the order of the data collection.

    . Examples of lurking variables: time order, spatial order

    . Data example: Boston housing data

      + R code:

           plot.order <- function(r) {
               plot(r, xlab="order", ylab="residuals", type="n",
                    pch=16, cex=.5, cex.axis=.8, cex.lab=.8);  abline(h=0, col="gray50")
               lines(r, col="gray");  points(r, cex=.5, pch=16) }
           windows()
           plot.order(resid(lm(MEDV ~ ., data=boston)))

      + Do you think the residuals are random in terms of order?
        What would a random order look like?

           plot.order(sample(resid(lm(MEDV ~ ., data=boston))))

        Escalate the idea as follows: the "lineup"

           r <- resid(lm(MEDV ~ ., data=boston))
           Nrep <- 10
           ihere <- sample(Nrep,size=1)
           par(mfrow=c(Nrep/2,2), mgp=c(1.5,.5,0), mar=c(2.5,2.5,0.5,0.5))
           for(i in seq(Nrep))
               if(i==ihere) plot.order(r) else  plot.order(sample(r))

      + What interpretation does order have in this case?
        Keep in mind this is geographic data.
        Do you think the order is random in terms of geography?
        Check this:

           rownames(boston)

        So??

        PS: One could apply spatial statistical modeling.
            Here is a thought: introduce long & lat as predictors.
            Would you be happy with beta_long and beta_lat ?


    . Alternative plot to detect lurking variables:

                Lag Plot:  r_{i} versus r_{i-1}.

        This is borrowed from time series analysis, AR(1) modeling.

      + R Code: plot x=r[-length(r)] and y=r[-1]
                which matches  xi=r_{i-1} with yi=r_{i]

           plot.lag <- function(r, cex=.6) {
               rg <- range(r)
               plot(r[-length(r)], r[-1], xlab="Lag", ylab="Actual", type="p",
                    pch=16, cex=cex, cex.axis=.8, cex.lab=.8, xlim=rg, ylim=rg)
               abline(lm(r[-1] ~ r[-length(r)]), col="red", lwd=2)
               }

           par(mfrow=c(1,1))
           plot.lag(resid(lm(MEDV ~ ., data=boston)))

      + Escalated to a lineup with permutations:

           windows(h=7.5,w=20)  # Make the plots roughly square
           r <- resid(lm(MEDV ~ ., data=boston))
           Nrep <- 10
           ihere <- sample(Nrep,size=1)
           par(mfrow=c(2,Nrep/2), mgp=c(1.5,.5,0), mar=c(2.5,2.5,0.5,0.5))
           for(i in seq(Nrep))
               if(i==ihere) plot.lag(r) else  plot.lag(sample(r))

      + Why does the lag plot work?
          Partial answer: If there are adjacent extreme values,
                          they will generate a positive slope.


  - Leverage plots: to see the influence of cases on ... (?)

            site <- "http://www-stat.wharton.upenn.edu/~buja/STAT-961/"
            source(paste(site,"leverage-plots.R",sep=""))
            leverage.plots(boston.lm, cex=.5)

       (Unlike JMP, this routine does not add the means back
        into the adjusted variables; this is why these plots
        are zero-centered.)

       Are these plots for lack-of-fit or something else?
       ... Mostly for influence, not lack-of-fit.


================================================================


RECAP -- Regression Diagnostics

- Three types of diagnostics?
    . lack of fit
    . case influence
    . non-identifiability

- What are the three model failures in Gaussian linear models?
    .
    .
    .

- What are some diagnostics we should always look at?
    .
    .
    .

- What are some remedies to improve models?
    .
    .
    .

- What is the cost of using too many modeling remedies?
    .

- Describe the of B-C transformations: idea, shapes
    .
    .
    .
    .
    .
    .


================================================================


* INFERENCE FOR DIAGNOSTICS PLOTS:

  - Motivation 1: You teach regression in Stat 101 and
    explain the normal quantile plot of residuals.
    Then a student shows you such a plot from real data
    and asks: "Professor, is this a good or a bad plot?"
    There should be systematic way of telling good from bad,
    not just hand-waving.

  - Motivation 2: Most diagnostics for lack of fit are informal graphics.
    Wouldn''t it be nice if we could quantify whether an apparent
    flaw detected in a diagnostic plot indicates a statistically
    significant model departure?  What does this even mean?

  - General idea:

        Q:  How can we get an idea whether the observed data
            look anything like data generated from the model?

        A:  Create DATASETS from the estimated model!

            This approach has at least two names:

                     ------------------------
                    | "parametric bootstrap" |
                     ------------------------

                          -----------
                         | "plug-in" |
                          -----------

        Caveat: We really want to diagnose 'errors', not residuals,
                and for the true model, not the estimated model.
                Plugin/parametric bootstrap, however, is the next best.


  - Parametric bootstrap for linear model: Generate response data from

                     -----------------------
                    |   y* ~ N(X b, s^2 I)  |
                     -----------------------

                     b   = LS estimate of beta
                     s   = RMSE estimate of sigma
                     X b = yhat

  - Interpretation:

    . We are converting diagnostics into a testing situation:
          H0: The full model
          H1: Any deviation from the full model.

    . Role of the full model parameters: They are ALL NUISANCE PARAMETERS.
          ==> The vector beta and the scalar sigma^2 are nuisance.

    . The parametric bootstrap is used to generate a null distribution
      for the datasets.
      [Keep in mind: All our theory and practice is conditional on X,
                     hence X is treated as fixed and only y is random.
                     Alternative datasets under H0 have the same X
                     but different y.]

    . The parametric bootstrap is really a hack:
        We don''t know beta and sigma^2, so we use their estimates
        as if they were the truth.
        This is asymptotically justifiable but requires large N/p.

    . We haven''t decided on test statistics yet, but if we manage
      to use pivotal test statistics, the choice of b and s for
      beta and sigma is irrelevant.


  - Execution:

        # Here are the observed quantities:
        boston.lm <- lm(MEDV ~ ., data=boston)
        X <- model.matrix(boston.lm)     # Fixed known numbers; same across simulations
        r <- resid(boston.lm)            # Actual residuals
        yhat <- fitted(boston.lm)        # = X b = estimate of mu=X beta
        s <- summary(boston.lm)$sig      # = estimate of sigma
        N <- nrow(X)                     # Sample size
        rlim <- c(-1,1)*max(abs(r))      # Fix residual range for plotting

        # Bootstrap simulation:
        y.boot <- rnorm(N, m=yhat, s=s)         # A simulated response vector with 'plug-in'
        boston.boot.lm <- lm(y.boot ~ X)
        r.boot <- resid(boston.boot.lm)         # A simulated residual vector
        yhat.boot <- fitted(boston.boot.lm)     # For usual residual plot
        s.boot <- summary(boston.boot.lm)$sig   # To standardize the residuals
        # Prepare a pair of side-by-side plots
        windows(wid=10, hei=5)
        par(mfrow=c(1,2), mgp=c(1.8,.5,0), mar=c(3,3,2,1))
        # Compare residual plots:
        plot(yhat,      r,      main="Actual Data",
             pch=16, cex=.5, ylim=rlim); abline(h=0)
        plot(yhat.boot, r.boot, main="Bootstrap Data",
             pch=16, cex=.5, ylim=rlim); abline(h=0)
        # Compare normal quantile plots of the residuals:
        qqnorm(r/s,           main="Actual Data",
               pch=16, cex=.5, ylim=c(-3,3)); abline(a=0,b=1)
        qqnorm(r.boot/s.boot, main="Bootstrap Data",
               pch=16, cex=.5, ylim=c(-3,3)); abline(a=0,b=1)
        # Compare plot of residuals versus order:
        plot(r,      main="Actual Data",    pch=16, cex=.5, ylim=rlim); abline(h=0)
        plot(r.boot, main="Bootstrap Data", pch=16, cex=.5, ylim=rlim); abline(h=0)
        # [Compare this comparison with the one we made earlier using random permutations.]

  - Standard R diagnostics for bootstrapped data:
        # BS comparison of standard R diagnostics:
        dev.new();  par(mfrow=c(2,2))
        plot(boston.lm, pch=16, cex=.6)       # Actual
        dev.new();  par(mfrow=c(2,2))
        y.boot <- rnorm(N, m=yhat, s=s)       # A simulated response vector
        boston.boot.lm <- lm(y.boot ~ X)
        plot(boston.boot.lm, pch=16, cex=.6)  # Bootstrapped/plugged-in
        # (Animate the above.)

  - Visual inference using the lineup method:

      boston.lm.boot <- function() {            # One parametric bootstrap draw
        y.boot <- rnorm(N, m=yhat, s=s)         # A simulated response vector
        boston.boot.lm <- lm(y.boot ~ X)
        r.boot <- resid(boston.boot.lm)         # A simulated residual vector
        yhat.boot <- fitted(boston.boot.lm)     # For usual residual plot
        s.boot <- summary(boston.boot.lm)$sig   # To standardize the residuals
        list("yhat"=yhat.boot, "r"=r.boot, "s"=s.boot)
      }
      boston.lm.obs <- list(yhat=yhat, r=r, s=s)  # Observed

      par(mfrow=c(5,4), mgp=c(1.5,0.25,0), mar=c(1.5,1.5,1,1))
      i.obs <- sample(20, size=1)
      for(i in 1:20){
        if(i==i.obs) { tmp <- boston.lm.obs }
        else { tmp <- boston.lm.boot() }
        plot(tmp$yhat, tmp$r, main="", xlab="", ylab="",
             pch=16, cex=.5, cex.axis=.8, ylim=rlim); abline(h=0)
      }


  - Numerical inference using diagnostic test statistics:

    . Examples of test statistics: invent your own!
               T(r) = max(r)                      ('worst' upper residual)
               T(r) = min(r)                      ('worst' lower residual)
               T(r) = 95%% percentile of residuals quantile(r,p=.95)
               T(r) = mean(r^3)/mean(r^2)^(3/2)   (~ skewness)
               T(r) = mean(r^4)/mean(r^2)^2       (~ kurtosis to measure tail weight)
               T(r) = mean(r*yhat^2)              (parabolic curvature in r vs yhat)
               T(r) = cor(r[-1],r[-length(r)])    (auto-correlation test vs order
               ...                                 ==> lurking variables test!)

      Bootstrap simulations provide an approximate null distribution for
        H0: The fitted model is true.
      for any of these test statistics T(r).

    . Code to plot a histogram of null values and the observed value:

        plot.null <- function(null.vals, obs.val, xlab="Test Statistic", ...) {
                stretch <- function(x, fac=1.1) { m <- mean(x);  (x - m)*fac + m }
                par(mfrow=c(1,1), mgp=c(1.8,.5,0), mar=c(3,3,1,1))
                hist(test.stat.null, col="gray", breaks=N.boot/25,
                     xlim=stretch(range(c(test.stat.null, test.stat.obs))), main="", xlab=xlab, ...)
                abline(h=0)
                abline(v=test.stat.obs, col="red", lwd=3)  # mark the observed value of test.stat()
                text(x=test.stat.obs, y=par()$usr[4], lab="Obs", adj=c(1.1,1.5), col="red")
                text(x=median(test.stat.null), y=par()$usr[4], lab="Null", adj=c(1.1,1.5), col="gray30")
            }

    . Code for test statistics:    [bad programming: functions use vectors 'r' and 'yhat']
        test.stat <- max                 #     ('worst' upper residual)
        test.stat <- min                 #     ('worst' lower residual)
        test.stat <- function(r) quantile(r, probs=.95)      # 95% percentile of residuals
        test.stat <- function(r) mean(r^3)/mean(r^2)^(3/2)   # skewness
        test.stat <- function(r) mean(r^4)/mean(r^2)^2       # kurtosis to measure tail weight
        test.stat <- function(r) summary(lm(r ~ yhat + I(yhat^2)))$r.squared^0.5 # 'Banana test': curv.
        test.stat <- function(r) summary(lm(abs(r) ~ yhat))$r.squared^0.5        # 'Wedge test': heterosk.
        test.stat <- function(r) cor(r[-1],r[-length(r)])    # auto-correlation test by order
        # PS: The two cases marked as 'Equiv. ...' are really more informative because they have a sign.

    . Code for bootstrap simulations and graphical rendering of test results:
          # Number of bootstrap draws/simulations:
          N.boot <- 999   # 1000 spacings with equal prob for obs. value under H0
          # Fill this vector with null values of the test statistic:
          test.stat.null <-
            replicate(N.boot,
                      test.stat(resid(lm(rnorm(N, m=yhat, s=s) ~ X))) )
          # The actual/observed value of the test statistic:
          test.stat.obs <- test.stat(r)
          # The p-value of the observed test statistic (one-sided upper):
          (1+sum(test.stat.null>=test.stat.obs))/(1+N.boot)
          # Comparison of a histogram of null values with the actual value:
          plot.null(test.stat.null, test.stat.obs)
          # ==> Most of the test statistics are egregiously significant,
          #     implying that the linear model is highly flawed.


  - Caveat: If the model overfits the data,
            there might still be model defects,
            but they might not be detectable.

            [==> Dangers of model selection!]

  - Computational remark on null simulations:
      To approximate very small p-values, very large simulations are needed.
      What is the lowest possible estimated p-value when Nsim=999?  ...
      However, histograms as shown above can reveal the extremity of p-values
      even for moderate Nsim.
      ==> There ought to exist a way to fit distributions to, say, 999 values
          and extrapolate them further out than 1/1000 quantiles.


  - Generality of parametric bootstrap sampling:  The idea can be used
      . for other types of lack-of-fit checks, and
      . for any type of 'generative model' (glm, gam, ...)

  - Which of the above test statistics are pivotal under H0?
    . What is the null distribution of the residual vector r?
        ...
    . What is the null distribution of the vector of fitted values yhat?
        ...
    Now answer the question of pivotality under H0 for each test statistic:
       T(r) = max(r)                      : ...
       T(r) = min(r)                      : ...
       T(r) = 95%% percentile of r        : ...
       T(r) = skewness of r               : ...
       T(r) = kurtosis of r               : ...
       T(r) = banana test (curv.) of r    : ...
       T(r) = wedge test (heterosk.) of r : ...
       T(r) = lag test wrt order          : ...

  - AYT?

    . What is the 'null hypothesis' in our diagnostics checks?
        ...
    . Is the parametric bootstrap an exact method for obtaining null distributions?
        ...
        ...
    . Is its justification finite-sample or asymptotic?
        ...

    Ignore the following if you don''t know regular bootstrap yet:

    . What is bootstrap (parametric or nonparametric) usually used for?
        ...
    . Could you use the parametric bootstrap for the same purpose here?
        ...


================================================================


* CONDITIONING ON A SUFFICIENT STATISTIC


  - DEEP FACT: (actually, a definition of sufficiency)

          ------------------------------------------------------------------
         |  The conditional distribution of y given a sufficient statistic  |
         |  is independent of the parameters.                               |
          -----------------------------------------------------------------

         It means: If we know the unconditional distribution of S(y),
                   we know the distribution of y.

         General fact of probability:

              P(dy)  =  Integral_t  P(dy|t) Q(dt)

         General fact of statistics:

              P(dy;theta)  =  Integral_t P(dy|t;theta) Q(dt;theta)
                   ^^^^^                        ^^^^^       ^^^^^

         Under sufficiency:

              P(dy;theta)  =  Integral_t P(dy|t) Q(dt;theta)
                                           ^^^^
                                         no theta!

         ==> If we have a draw 't' from  Q(dt;theta),
             we know how to generate draws of y
             not knowing theta!


  - Purpose: Generating a single null distribution
             for a composite H0 with nuisance parameters

              --------------------------------------
             |  ELIMINATION OF NUISANCE PARAMETERS  |
              --------------------------------------

  - Our goal is to apply a sufficiency argument to tests for model diagnostics.
    ==> For diagnostics, our nuisance parameters in linear regression are ...


  - Sufficiency Theory (see math stat for a full understanding)

    . Introductory quiz questions:
        What is the difference between probability theory and statistics?
          ...

    . Intuition: If we observe N data values,
                 but the model has only one or two parameters,
                 and the model is true,
                 do we really need to know all N data values?
                 I.e., shouldn''t there exist a reduction to
                 one or two summaries (functions of y=(y1,...,yN)^T)
                 that contain all the information about the parameters?

    . Principle:
              ---------------------------------------------------------------------
             |   If we assume a model  p(y|theta)  to be true,                     |
             |   statistical inference does not require knowledge of   p(y|theta); |
             |   it only requires knowledge of  p(y|theta1)/p(y|theta0) for all    |
             |   pairs of parameter settings  theta1  and  theta2.                 |
              ---------------------------------------------------------------------

                 Consequences: Terms that cancel out in the ratio are irrelevant.
                               We only need information about  y  that is sufficient
                               to reconstitute the ratios  p(y|theta1)/p(y|theta0).

                 This principle is related to the Neyman-Pearson lemma:
                 Inference to distinguish between  p(y|theta1)  and  p(y|theta0)
                 requires only the ratio   p(y|theta1)/p(y|theta0).


    . Factorization lemma / Neyman criterion:

               ---------------------------------------------------------------------
              |  p(y|theta) = f(S(y),theta)*h(y)*c(theta)  ==>  S(y) is sufficient  |
               ---------------------------------------------------------------------


    . Example 1a:  yi iid N(mu,1);  y=(y1,...,yN)^T
        p(y|mu) = exp(-sum((y-mu)^2)/2) / (2*pi)^(N/2)       # Valid R expression
                ~ exp(- sum(y^2)/2 + sum(y)*mu - N*mu^2/2)   # sum(y^2)  irrelevant
                                     ^^^^^^
        S(y) = sum(y) = "1-dim sufficient statistics"

    . Example 1b:  yi iid N(0,sigma^2);  y=(y1,...,yN)^T
        p(y|sigma) = exp(-sum(y^2)/(2*sigma^2)) / (2*pi)^(N/2)     # Valid R expression
                          ^^^^^^^^
        S(y) = sum(y^2) = "1-dim sufficient statistics"

    . Example 2:  yi iid N(mu,sig^2),  y=(y1,...,yN)^T
        p(y|mu,sig) = exp(- sum((y-mu)^2)/(2*sig^2)) / (2*pi*sig)^(N/2)  # Valid R
                    = exp(- sum(y^2) + 2*sum(y)*mu - N*mu^2)/(2*sig^2)
                     (sqrt(2*pi)*sig)^{-N}
        S(y) = c(sum(y), sum(y^2)) = "2-dim sufficient statistic"

    . Example 3:  y ~ N(X beta, sig^2 I),
        p(y|beta,sig) = exp(-sum((y - X%*% beta)^2)/(2*sig^2)) / (sqrt(2*pi)*sig)^N
                      = exp(-(sum(y^2) + 2*t(y)%*%X%*%beta - t(beta)%*%t(X)%*%X%*%beta) / (2*sig^2) ) /
                        (sqrt(2*pi)*sig)^{N
        S(y) = c(t(X)%*%y, sum(y^2)) = "(p+2)-dim sufficient statistic"


    . Comments on sufficiency:

      + Equivalence of sufficient statistics:
          If S(y) is sufficient, so is any G(S(y)) if G(s) is 1-1.
          Example 1a: sum(y)  or  mean(y)
          Example 2:  (sum(y), sum(y^2))  or
                      (mean(y), sd(y))
          Example 3:  (t(X)%*%y, sum(y^2))  or
                      (b, RMSE)  or
                      (yhat, RSS)
                assuming  X  is not perfectly collinear.

      + Maximal reduction by sufficiency: "minimal sufficient statistics"
          Example 1a: (sum(y[1:10]),sum(y[11:N]))  is also sufficient but not minimal.
          See Lehmann (book) for a sufficient condition: 'completeness'

      + Sufficiency is a difficult concept at the level of measure theory!
        . Reason: Conditional distributions are only defined ALMOST SURELY
                  wrt to the distribution.
                  In a parametric statistical model the sets
                  with probability zero may be different from
                  model distribution to model distribution.
        . Examples:  P(dy;theta) = Dirac at theta
                     P(dy;theta) = Uniform[0,theta]  (theta > 0)
        . Technical solution:
            For most probability spaces there is no problem
            because one can choose what is called a 'regular version'
            or 'Markov Kernel' or 'transition probability' (synonyms)
            that is a version of conditional distribution for ALL theta,
            i.e., all P(dy;theta).
            [Keywords: 'Polish spaces', 'Radon spaces']
            [Good article: Chang &Pollard (1997), Statistica Neerlandica,
                           ~ 'Conditioning As Disintegration']
        . Caution:
            Even if P(dy;theta) have densities wrt the same dominating measure,
            the conditional distribution P(dy|S(y);theta) does NOT need to have
            a density!!!  It can be 'degenerate', as we will see next.

    . Geometric intuitions:

      + Conditioning = "Holding the sufficient statistic S(y) fixed."

      + What datasets get mapped to this same value of S(y)?

              -------------------------------------
             |  Equiv(y) := { y* | S(y*) = S(y) }  |
              -------------------------------------

      + What is the conditional distribution on such a set under p(y|theta)?


    . Equiv(y): In each case, ask how to obtain an element in Equiv(y).

      + Example 1a: Equiv(y) = { y* | mean(y*)=mean(y) }
                             = hyperplane through mean(y)*e, orthogonal to e
                               (where  e=(1,...,1)^T)

      + Example 1b: Equiv(y) = { y* | sum(y*^2)=sum(y^2) }
                             = sphere at the origin with radius sqrt(sum(y^2))

      + Example 2: Equiv(y) = { y* | mean(y*)=mean(y) & sd(y*)=sd(y) }
                            = intersection of orth complement of 'e' through mean(y)*e
                              and the sphere centered at mean(y)*e with radius sd(y)*sqrt(n-1)

      + Example 3: Equiv(y) = { y* | yhat*=yhat and RSS(y*)=RSS(y) }
                            = { y* | y*=yhat+r* and |r*|^2 = |r|^2 }
                            = yhat + { r* | r*_|_ X and |r*|^2 = |r|^2 }
                            = yhat + sphere in residual space with radius=sqrt(RSS)

    . Conditional distributions: Operational descriptions
                                 We observe y and need to sample y*
                                 such that S(y*) = S(y) in such a way
                                 that y* and y are equally likely
                                 if the model is correct.

      + Example 1a:
                       ysim <- rnorm(N, m=mu, s=1)
                       ystar <- mean(y) + (ysim-mean(ysim))

                    It doesn''t matter what 'm' we picked, so we can use m=0.
                    ==> degenerate normal distribution in 'contrast space'

      + Example 1b:
                       ysim <- rnorm(N, m=0, s=1)
                       ystar <- ysim / sqrt(sum(ysim^2)) * sqrt(sum(y^2))

                    It doesn''t matter what 's' we picked, so we can use s=1.
                    ==> degenerate uniform distribution on a sphere

      + Example 2:
                       ysim <- rnorm(N, m=0, s=1)
                       ystar <- mean(y) + (ysim-mean(ysim))/sd(ysim)*sd(y)

                   Again, it doesn''t matter what 'm' and 's' we pick,
                   so we can use  m=0 and s=1.
                   ==> uniform distribution on sphere of radius sqrt(sum((y-mean(y))^2))
                       in contrast space

      + Example 3:
                       RMSE <- function(model) summary(model)$sigma
                       y <- boston[,14]      # MEDV (Median Housing Values)
                       X <- boston[,-14]     # Fixed across simulations
                       N <- nrow(X)
                       # Model, observed:
                       M.lm <- lm(y ~ ., data=X)
                       r <- resid(M.lm)            # Actual residuals
                       yhat <- fitted(M.lm)        # = X b = estimate of mu=X beta
                       s <- RMSE(M.lm)             # = estimate of sigma
                       # One draw from the conditional distribution given M.lm:
                       ysim <- rnorm(N, m=0, s=1)  # m and s are arbitrary and irrelevant
                       ysim.lm <- lm(ysim ~ ., data=X)
                       ystar = resid(ysim.lm)/RMSE(ysim.lm)*s + yhat
                       # Actually, for diagnostic purposes we only the the residual from ystar:
                       rstar <- resid(ysim.lm)/RMSE(ysim.lm)*s


  - Canned functions for comparison plots based on sufficiency theory:
      If the data followed the model, their residual plots would look like this...
        site <- "http://stat.wharton.upenn.edu/~buja/STAT-961/"
        source(paste(site,"residual-comparison.R",sep=""))
        res.vs.fits(boston.lm,  new=F)   # execute one statement at a time
        res.vs.qnorm(boston.lm, new=F)
        res.vs.order(boston.lm, new=F, f=.20)
        res.vs.lag(boston.lm, new=F)

        bodyfat.lm <- lm(PercBF ~ . - Density, data=bodyfat)
        res.vs.fits(bodyfat.lm,  new=F)   # execute one statement at a time
        res.vs.qnorm(bodyfat.lm, new=F)
        res.vs.order(bodyfat.lm, new=F, f=.20)
        res.vs.lag(bodyfat.lm, new=F)


  - Another form of visual inference:
      Turn the comparisons into 'retention bands' or 'null bands' by overplotting
      many 'null curves/lines' and the 'actual curve/line' in a 'spaghetti plot'.

    . Prepare:
          r <- resid(boston.lm)
          yhat <- fitted(boston.lm)
          X <- model.matrix(boston.lm)
          N <- nrow(X)

    . Normal quantile band:
          norm <- function(x) sqrt(sum(x^2))
          hor <- qqnorm(r, pch=16, cex=.5)$x  # Set up QQ norm plot and ask for hor coords.
          abline(lm(r ~ hor), col="gray")
          for(i in 1:100) {
            r.null <- resid(lm(rnorm(N) ~ X))
            r.null <- r.null / norm(r.null) * norm(r)
            points(x=sort(hor), y=sort(r.null), pch=16, cex=.3, col="gray")
          }
          points(x=sort(hor), y=sort(r), pch=16, cex=.5, col="red")

      ==> "NULL BAND" or "RETENTION BAND" for quantile plot
          assuming the model is correct.
      ==> NOT a confidence band!!!  Do not confuse!!!

    . Standard residual plot:
          bw <- 1/3
          plot(yhat, r, pch=16, cex=.5)  # set up the plot in proper dimensions
          for(i in 1:100) {
            r.null <- resid(lm(rnorm(N) ~ X))
            r.null <- r.null / norm(r.null) * norm(r)
            lines(lowess(r.null ~ yhat), col="gray", lwd=.5)
          }
          points(yhat, r, pch=16, cex=.5)
          lines(lowess(r ~ yhat), col="red", lwd=3)

    . Heteroscedasticity plot: abs(r) vs yhat
          bw <- 1/3
          plot(yhat, abs(r), pch=16, cex=.5)  # set up the plot in proper dimensions
          for(i in 1:100) {
            r.null <- resid(lm(rnorm(N) ~ X))
            r.null <- r.null / norm(r.null) * norm(r)
            lines(lowess(abs(r.null) ~ yhat, f=bw), col="gray", lwd=.5)
          }
          points(yhat, abs(r), pch=16, cex=.5)
          lines(lowess(abs(r) ~ yhat, f=bw), col="red", lwd=3)

    . Lurking variables plot:
      + Not successful because smoothing does not help;
        features are too rough to be captured by smooths.
      + Better idea, borrowed from time series: LAG PLOT
          rg <- range(r)
          N <- length(r)
          plot(x=r[-N], y=r[-1], xlab="Lag", ylab="Null Resid", xlim=rg, ylim=rg, type="n")
          for(i in 1:100) {
            r.null <- resid(lm(rnorm(N) ~ X))
            r.null <- r.null / norm(r.null) * norm(r)
            abline(lm(r.null[-1] ~ r.null[-N]),  col="gray", lwd=2)
          }
          points(x=r[-N], y=r[-1], pch=16, cex=.5, xlim=rg, ylim=rg)
          abline(lm(r[-1] ~ r[-length(r)]), col="red", lwd=2)


  - Redo the numeric diagnostic inferences with the conditional null distribution
    given the minimal null sufficient statistic:

    . Recall statistics (I added the last one since last time: auto correlation):

        test.stat <- max                                      # 'worst' upper residual
        test.stat <- min                                      # 'worst' lower residual
        test.stat <- function(r) quantile(r, probs=.05)       #  5% percentile of residuals
        test.stat <- function(r) quantile(r, probs=.95)       # 95% percentile of residuals
        test.stat <- function(r) mean(r^3)/mean(r^2)^(3/2)    # skewness
        test.stat <- function(r) mean(r^4)/mean(r^2)^2        # kurtosis to measure tail weight
        test.stat <- function(r) abs(cor(r,yhat^2))           # curvature as a fct of yhat
        test.stat <- function(r) abs(cor(abs(r),yhat))        # heteroskedasticity as a fct of yhat
        test.stat <- function(r) cor(r[-1],r[-length(r)])     # auto-correlation test against order
                                                              # ==> lurking variables test!
    . Simulation from the EXACT CONDITIONAL NULL DISTRIBUTION
      for any of the above test statistics:

        norm <- function(r) sqrt(sum(r^2))
        r.norm <- norm(r)                            # Incorrect s, but we only need proportionality
        stdize <- function(r, s) r/norm(r)*s         # For conditioning on norm(r)
        # Short version of simulation code, using sapply():
        test.stat.null <-
            sapply(seq(999),
                   function(i) { if(i%%100==0) cat(i," ")
                                 test.stat(stdize(resid(lm(rnorm(N, m=0, s=1) ~ X)), r.norm)) } )
                                 #         ^^^^^^                   ^^^^^^^^ difference to plug-in
        # Approximate p-value:
        test.stat.obs <- test.stat(r)
        (1+sum(test.stat.null>=test.stat.obs))/(1+N.null)
        # Graphical comparison:
        plot.null(test.stat.null, test.stat.obs)

        # (You can check the quality of an asymptotic normal approximation:)
        qqnorm(test.stat.null, pch=16);  abline(a=mean(test.stat.null), b=sd(test.stat.null), col="blue")

    . Subtle observation:

        In some of these test statistics 'yhat' is used.
        Strictly speaking, in the bootstrap/plug-in approach
        'yhat' should be from the simulated data,
        but we held it fixed at the observed 'yhat'.
        Now in the null conditional simulation we
        condition on 'yhat', hence holding it fixed
        is the right thing to do.

        [On a different note, of course, it''s still bad programming
         to pass anything into a function with a global variable!]


  - Summary of inference for diagnostics:

    . Diagnostics for lack of fit in linear models
      can be given a formal testing framework.

    . Approach: Consider the large model as the H0.

    . Method: Simulate 'null data' from H0, i.e., the full model.

    . Problem: Slopes and sigma of the full model are nuisance parameters.
               They need to be eliminated.

    . Approach 1 to nuisance parameter elimination:
        Estimate them, use them as the truth, simulate response vectors.
        This has asymptotic justification.
        ==> Parametric bootstrap or plug-in simulation

    . Approach 2 to nuisance parameter elimination:
        Remove them by conditioning on a minimal sufficient statistic;
        simulate response vectors from the conditional distribution.
        (Actually, it is sufficient to simulate residual vectors).

    . Inferential use 1:
        Apply any diagnostic plot to null residuals (par boot, or cond null).
        Compare with diagnostic plot of actual residuals.
        Make this scheme inferential with the lineup scheme:
          randomly insert the actual plot among 19 null plots
          and use independent judges.

    . Inferential use 2:
        Augment residual plots with traces (fitted lines or smooths).
        Overplot 100 traces from null residuals;
        overplot the actual trace from the real data
        (spaghetti plots)

    . Inferential use 3:
        Lineup method -- make diagnostic plots of 19 draws from the conditional null distribution;
        randomly mix in a plot of the actual data.
        ==> If the plot of the actual data looks different, we have some model violation, 5%%


  - Summary comments:

    . In all of the above we really mean "minimal sufficient statistics".
      In fact the theory is really meant for "complete sufficient statistics".
          See Lehman''s (& Romano''s) book on testing, chapter on similar and unbiased tests.

    . The scheme provides inference for the null hypothesis
               H0: 'the model is true.'
      ==> We obtain a conditional null distribution FOR THE DATA,
          hence a null distribution for any test statistic
          computed from the data.

    . We simulate from the CONDITIONAL distribution to generate
      response vectors  y*  that are in some sense

          AS LIKELY AS THE OBSERVED RESPONSE VECTOR  y
          IF THE LINEAR MODEL IS CORRECT.

      The qualification 'in some sense' refers to
      conditioning on the sufficient statistic:

      Conditional on the suff stat, we are able to
      generate y* that are equally likely as y,
      assuming that the model is true,
      NOT KNOWING THE TRUE PARAMETERS OF THE MODEL.

    . Datasets sampled from this conditional distribution
      will share exactly the value of the sufficient statistic
      with its observed values!
      'Proof by example':

              summary(lm(ystar ~ ., data=X))
              summary(lm(y     ~ ., data=X))

    . We need a name for the scheme.
      What about "null-sufficient sampling"?
      Any better proposals?


================================================================


2017/12/01

Recap:

        - Diagnostics, 3 types:
        ... lack of fit
        ... case influence
        ... identifiability

        - Which type did we deal with?
        ... lack of fit

        - three types of lack of fit:
        ... 1st order: mean is linear
        ... 2nd order: homoskedasticity
        ... normality

        - Two principles to endow ... (lack of fit) diagnostics with inference?
        ... parametric bootstrap
        ... null-sufficient sampling: need a complete sufficient statistic under the full model

        - Compare the two principles:
        . exact theory? ...
        . universal applicability?  ...

        - What is the null hypothesis for diagnostic inference?
        ... the WHOLE (FULL) model is H0

        - Try to apply the 2nd principle to testing H0: betaj=0:
        ...  (see Lehmann)


Roadmap:
- model selection
- permutation tests