================================================================
CHAPTER 3: STATISTICAL INFERENCE IN LINEAR MODELS

* MODEL ASSUMPTIONS FOR FINITE-SAMPLE INFERENCE:

      ------------------------
      |                      |
      | eps ~ N(0,sigma^2 I) |
      |                      |
      ------------------------

  (alternative: "no" assumptions but asymptotics)

  - This is the first time that we assume an actual distribution for the errors.
  - Can you give an equivalent assumption for y?
    ...  y ~ N(X beta, sigma^2 I)

* TESTING IN LINEAR MODELS:

  - NULL HYPOTHESES: assumptions about individual slopes,
    e.g., beta1 = beta1.null (some fixed assumed value); usually beta1 = 0.

        H0: beta1 = beta1.null        (usually: beta1.null = 0)

  - H0 is a COMPOSITE NULL HYPOTHESIS:

        y = beta0 + beta1.null*x1 + beta2*x2 + ... + betap*xp + eps,
            ^^^^^   (often = 0)    ^^^^^            ^^^^^
        E[eps] = 0,  V[eps] = sigma^2*I
                              ^^^^^

    beta0, beta2, ..., betap, sigma are parameters not restricted by H0:

      --------------------------------------------------------------------------
      |                                                                        |
      |  H0 = { (beta0, beta1, beta2, ..., betap, sigma) | beta1=beta1.null }  |
      |                                                                        |
      --------------------------------------------------------------------------

    ==> H0 is a (p+1)-dim set of parameter vectors -- it is COMPOSITE.

    Definition:
      ------------------------------------------------------------------------
      |  If beta1 is the parameter of interest in H0, then                    |
      |  beta0, beta2, ..., betap, sigma are 'NUISANCE PARAMETERS' wrt H0.    |
      ------------------------------------------------------------------------

    Important:
      . Nuisance parameters are defined wrt a null hypothesis!
      . If beta2 is being tested, then beta0, beta1, beta3, ... are the
        nuisance parameters.

  - The INFERENCE GUARANTEE for a statistical test:

    . The retention (non-rejection) rule is:

      ----------------------------------------------------------------------------------
      |  'Retain H0: beta1=beta1.null'  :<==>  'b1 is in retention interval RI=[lo,hi]' |
      ----------------------------------------------------------------------------------

    . We want a retention probability of 1-alpha under H0:

        P[ Retain H0 | H0 ] = P[ b1 in RI | H0 ] >= 1-alpha     (= 0.95, say)

      We call 'alpha' the 'significance level' or 'Type I error rate'.

    . Problem:  P[ Retain H0 | H0 ]  is really

        P[ Retain H0 | (beta0, beta1.null, beta2, ..., betap, sigma) ]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      For each choice of the nuisance parameters we get a different probability
      distribution.
      Q: What nuisance parameters should we choose to calculate this probability?

    . Solutions:
      (1) We use 'pivotal' quantities as test statistics, that is, quantities
          whose NULL DISTRIBUTION is INDEPENDENT OF THE NUISANCE PARAMETERS in H0.
          Examples: t and F statistics (as we will see)
      (2) In pretty models there exist 'minimal sufficient statistics' UNDER THE
          NULL HYPOTHESIS; conditioning on them produces exact conditional null
          distributions free of nuisance parameters.  (Stay tuned.)
      (3) Asymptotic theory: Often one has a CLT under H0 for a test statistic

              (b1 - beta1.null)/SE.est[b1]  -~->  N(0,1),

          so the asymptotic distribution N(0,1) is free of nuisance parameters.
          ==> Coverage probabilities for finite N are approximate.

    . Present Goal: For linear models, justify
        + tests based on  t = (bj - betaj.null)/SE.est[bj],
        + CIs of the form  bj +- const*SE.est[bj].
      No optimality theory (see the book by Lehmann & Romano),
      just correctness/validity of the alpha-coverage guarantee.

* STANDARD ERROR ESTIMATES:

  - Let  s = RMSE = sqrt( RSS / (N-p-1) )  be the conventional estimate of sigma.
    Then we define the obvious SE.est:

      ----------------------------------
      |                    s           |
      |  SE.est[bj]  =  ---------      |
      |                 |xj.adj|       |
      ----------------------------------
      |                                |
      |  V.est[b] = (X^T X)^{-1} s^2   |   diagonal: SE.est[bj]^2
      |                                |
      ----------------------------------

    We thereby estimate 'dataset-to-dataset variation' of 'bj' and 'b'
    BASED ON A SINGLE DATASET!!!

* Principles of statistical inference as applied to linear models:

  . The test statistic for betaj is

              bj - betaj.null
        tj = -----------------
                    sj

    where we abbreviate  sj = SE.est[bj] = s/|xj.adj|,
    and H0 means: "betaj.null is the hypothetically true value of the j'th slope."

    Note: tj = tj(bj, betaj.null)
             = directional distance between bj and betaj.null in multiples of sj.
          tj is not a function of the other betak (k!=j) and sigma^2,
          but its distribution might still depend on them.  Or does it?
          This is addressed next:

  . The statistic 'tj' has some special properties:

    + It is a 'pivotal' quantity, meaning its distribution does NOT depend on
      'nuisance parameters', that is, parameters that are not part of the
      null hypothesis (betak for k!=j, and sigma).
      'Pivotality' of 'tj' holds ...
      > ... exactly under normal parametric assumptions.
      > ... asymptotically under fewer assumptions on the errors.
    + 'tj' has a known finite-sample distribution for all N (in fact: Student's t)
      under normal parametric assumptions, enabling 'exact' finite-sample
      inference based on 'tj'.
    + 'tj' has a non-degenerate limiting distribution as N-->Inf (in fact: N(0,1))
      under minimal assumptions on the (homoskedastic) errors, enabling
      asymptotic inference based on 'tj'.

  . Here is the kind of probability statement that is needed for (two-sided)
    statistical inference:

      P[ -c < tj < +c | this value of betaj.null, any value for the other betak ]
           =   1-alpha                (for 'exact' inference)
           --> 1-alpha as N-->Inf     (for asymptotic inference)

    where c = c(alpha), independently of what the other betak and sigma^2 in H0 are.

  . USE FOR H0 TESTING:
    If H0: betaj=betaj.null is true, then for  tj = (bj-betaj.null)/sj  we have

        P[ -c < tj < +c | betaj.null ]   '=' or '-->'   1-alpha

    We consider it evidence against H0 at the significance level alpha
    if the observed value tj.obs falls outside (-c,+c):

        Reject H0  :<==>  observed 'tj' is not in (-c,+c)

    (This is the definition of a rejection rule.)
    The gambling guarantee is:

        P[ false rejection ] = P[ reject H0 | H0 true ]
                             = P[ tj not in (-c,+c) | H0 true ] = alpha

  . USE FOR CIs:
    A CI with confidence level 1-alpha is the collection of parameter values
    betaj.null which cannot be rejected when tested at the significance level alpha.
    To find CI(1-alpha), invert  -c < tj < +c  so it becomes a condition for betaj:

        -c < tj < +c   <==>   bj - c*sj < betaj < bj + c*sj
                       <==>   betaj in (bj-c*sj, bj+c*sj)

        ==>  CI(1-alpha) = (bj-c*sj, bj+c*sj)     (a random interval).

    The gambling guarantee is:
        The true value of betaj is in CI(1-alpha) with probability 1-alpha.

  . Comparison of CIs and acceptance/retention/non-rejection intervals:

        tj in (-c,+c)  <==>  betaj in (bj - c*sj, bj + c*sj)
                                      ^^^^^^^^^ CI ^^^^^^^^^   centered at bj
                       <==>  bj in (betaj - c*sj, betaj + c*sj)
                                   ^^^ Retention Interval ^^^^  centered at betaj from H0
                       <==>  bj and betaj are no more than c*sj apart.

    There is only one true value of betaj, but each dataset will produce its own
    value of bj.  The gambling guarantee is that the true betaj and the observed
    bj will be no more than c*sj apart for a fraction 1-alpha of datasets
    (exactly or approximately).
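  . R illustration (a minimal sketch with simulated data; the design, true betas,
    and sigma below are arbitrary choices): SE.est[bj] = s/|xj.adj| reproduces the
    standard error reported by summary(lm()), and bj +- c*sj reproduces confint().

        set.seed(1)
        N  <- 100
        x1 <- rnorm(N);  x2 <- rnorm(N);  x3 <- rnorm(N)
        y  <- 1 + 2*x1 - x3 + rnorm(N, sd=2)
        fit   <- lm(y ~ x1 + x2 + x3)
        s     <- summary(fit)$sigma                  # s = RMSE = sqrt(RSS/(N-p-1))
        x1adj <- resid(lm(x1 ~ x2 + x3))             # x1 adjusted for the other predictors
        se.b1 <- s / sqrt(sum(x1adj^2))              # SE.est[b1] = s/|x1.adj|
        c(byhand=se.b1, lm=summary(fit)$coefficients["x1","Std. Error"])
        q <- qt(.975, df=N-3-1)                      # c = c(alpha) for alpha = .05
        coef(fit)["x1"] + c(-1,1)*q*se.b1            # CI(0.95) = (b1 - c*s1, b1 + c*s1)
        confint(fit)["x1",]                          # same interval from R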
================================================================
* ASYMPTOTIC DISTRIBUTION THEORY FOR INFERENCE IN LINEAR MODELS:

  - The basis of asymptotic inference about slopes: a generalized CLT for bj

    . As N -> Inf, p (= number of predictors) is held fixed, and cases are
      i.i.d. sampled:

              bj - betaj     D
        tj = ------------  -->  N(0,1)     "in distribution", "weak convergence"
                  sj

      where betaj is the true-but-unknown or hypothesized parameter value.
      Meaning:
        P( tj in [a,b] )  -->  P( z in [a,b] )  as N-->Inf,  where z ~ N(0,1)

  - Comments:

    . Some strangeness: This CLT still conditions on the predictors, yet the
      number of predictor rows increases, N-->Inf.
    . The CLT holds even for non-sampled/truly fixed predictors as N-->Inf.
      One then needs a condition on the Hii values to make sure that they
      converge to zero, so that no case retains influence > eps > 0 as N-->Inf.
    . With asymptotic inference, one does not need to check assumptions such as
      normal errors, BUT one needs 'large N' and Hii -> 0.
    . One also needs correctness of the model:  y = X beta + eps,
      but the components of eps can have non-normal distributions
      (they must satisfy some mild moment conditions for the CLT).
    . 'N large enough' may mean, e.g., N > 10*(p+1), i.e., more than 10 obs.
      per parameter.
      (In general, how large is large enough depends on the problem.  Sometimes
      N=5 is large enough, and sometimes N=10^6 is not large enough.
      Simulations of special cases often give some idea.)
      It also requires that no Hii is dominant.
    . More powerful modern asymptotic theory exists, where 'p' can increase
      slowly with 'N'.

================================================================
* ROADMAP:
  - Multivariate normal distribution
  - Derivation of the t-distribution for t-statistics (testing one slope)
  - Derivation of the F-distribution for F-statistics (testing multiple slopes)
================================================================

* THE MULTIVARIATE NORMAL DISTRIBUTIONS: Definitions and Facts

  - This is in preparation for exact finite-sample inference under the
    assumption of iid normal errors.

  - Def.: An N-dimensional 'standard normal random vector' is given by

        z = (z1,z2,...,zN)^T    where the zn are i.i.d. N(0,1)

    We write  z ~ N(0_N, I_N)  where 0_N is in R^N and I_N is NxN.
    One draw from z, in R:    z <- rnorm(N)

  - Fact:
      -------------------------------------------------
      |                                               |
      | The multivariate standard normal distribution |
      | is rotation-invariant/symmetric!              |
      |                                               |
      -------------------------------------------------
    Meaning: If z ~ N(0,I), then Qz ~ N(0,I) for any NxN orthogonal matrix Q.
    Proof: Use that the density of 'z' depends only on |z|^2, and |det(Q)| = 1.
    One could prove this fact with probabilities, but it is more transparent
    with expectations:

        E[ f(Qz) ] = Integral f(Qz) p(z) dz        # zeta = Qz,  dzeta = |det(Q)| dz = dz
                   = Integral f(zeta) p(Q^{-1} zeta) dzeta
                   = Integral f(zeta) p(zeta) dzeta
                   = Integral f(z) p(z) dz
                   = E[ f(z) ]

  - Def.: A general N-dimensional 'normal random vector' with mean vector 'mu'
    and covariance matrix 'C' is given by  y = (y1,y2,...,yN)^T  where

      ------------------------------------------------------
      |                                                    |
      |  y = A^T z + mu,   C = A^T A,   z ~ N(0,I)         |     N(mu, C)
      |                                                    |
      ------------------------------------------------------
        A ~ N x N,   y ~ N x 1,   z ~ N x 1

    We write
      ---------------
      | y ~ N(mu,C) |    where mu is in R^N and C is NxN (sym., >=0).
      ---------------

    If B^T B = A^T A, do A^T z and B^T z have the same distribution?
    To be proven!       (Q A)^T (Q A) = A^T Q^T Q A = A^T A

  - Fact: The definition of N(mu,C) is independent of the representation C = A^T A.
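    A quick numerical sanity check before the 'proof' (an illustrative sketch;
    the positive definite matrix C below is an arbitrary example): two different
    square roots A and B of the same C yield draws with the same covariance.

        set.seed(2)
        C <- crossprod(matrix(rnorm(9),3,3))         # an arbitrary positive definite 3x3 C
        A <- chol(C)                                 # upper-triangular root: A^T A = C
        E <- eigen(C)                                # symmetric root B with B^T B = C
        B <- E$vectors %*% diag(sqrt(E$values)) %*% t(E$vectors)
        Z <- matrix(rnorm(3*1e5), 3)                 # 1e5 draws of z ~ N(0, I_3)
        cov(t(t(A) %*% Z))                           # sample cov of A^T z, approx. C
        cov(t(t(B) %*% Z))                           # sample cov of B^T z, approx. C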
    'Proof':  A^T A = B^T B  <==>  A = Q B  for some orthogonal Q (N x N)
              ('<==' is easy; '==>' less so)
    Then:  E[ f(A^T z) ] = E[ f(B^T Q^T z) ] = E[ f(B^T z) ]
    (We're not assuming that there is a density!)

  - Fact: Multivariate normal distributions are uniquely characterized by their
    expectation vectors and variance/covariance matrices, i.e., 1st & 2nd moments.

  - Fact: Any linear (matrix) transformation of a multivariate normal
    distribution yields again a normal distribution:

        y ~ N(mu, C)  ==>  Gy + c ~ N(G mu + c, G C G^T)
        [Note: G can be non-square.]

  - Fact: Two random vectors y1, y2 with a joint normal distribution are
    uncorrelated iff they are independent.

        V[y1,y2] = 0  <==>  y1, y2 are independent

    Reason: y = rbind(y1,y2)  (using R notation)
            V[y] = block-diagonal with the 1,2 block and the 2,1 block zeroed out
            ==> y1 = A1 z1, y2 = A2 z2  where z = rbind(z1,z2) ~ N(0,I)
            Because z1 and z2 are independent, y1 and y2 are independent.

  - Fact: If 'y' is multivariate normal and V[y] = sigma^2 I, when are
    A^T y and B^T y uncorrelated (and hence independent)?

        V[A^T y, B^T y] = A^T V[y] B = sigma^2 A^T B  ?=?  0

    ==> A^T y and B^T y are independent if the columns of A are orthogonal to
        the columns of B.
        (Column space of A is orthogonal to the column space of B.)

  - Note: The definition of general normal distributions includes "degenerate"
    or "singular" cases where V[y] = C is not of full rank.
    Examples of normally distributed random vectors with degenerate
    distributions in linear regression: ...

  - Note: Most of the above arguments work when the initial z is replaced with
    any circularly symmetric distribution in R^N.  (An example is the circularly
    symmetric t-distribution.)  That is, from a circularly symmetric distribution
    one can define an elliptic family of distributions.
    The only property that does not generalize is the equivalence of
    'uncorrelated' and 'independent'; this property is special to the
    multivariate normal distribution, and in fact it characterizes it.

  - R: One draw from y ~ N(mu, C) is obtained by

        N  <- 100
        A  <- cbind(rep(1,N), 1:N, (1:N)%%2)     # Just an example
        mu <- 1:ncol(A)
        y  <- t(A) %*% rnorm(N) + mu             # \
        y  <- crossprod(A, rnorm(N)) + mu        # / two versions of code doing the same

  - R: Given C, how do you find an A?

        C <- crossprod(matrix(rnorm(25),5,5))    # example of generating a pos. def. matrix C
        A <- chol(C)                             # Cholesky decomp: A^T A = C
        crossprod(A)                             # = A^T A = C
        t(A) %*% A                               # same

================================================================
* EXACT NULL DISTRIBUTION THEORY FOR LINEAR MODELS:

  - We will be examining the null distribution of the statistics t and F.
    They turn out to be pivotal.

  - Q: If y ~ N(X beta, sigma^2 I), what is the distribution of 'b' and 'bj'?
    A: For the whole vector of coefficient estimates:

         b = (X^T X)^{-1} X^T y  is a linear transformation of y
         E[b] = beta                       from the 1st order assumption
         V[b] = sigma^2 (X^T X)^{-1}       from the 2nd order assumption
         => b ~ N(beta, sigma^2 (X^T X)^{-1})        (multivariate normal)

       One coefficient estimate at a time:

         bj = < y, xj.adj/|xj.adj|^2 >  is a linear combination of y
         E[bj] = betaj                     (assuming the model is unbiased)
         V[bj] = sigma^2 / |xj.adj|^2      (assuming the variance properties)
         => bj ~ N(betaj, sigma^2 / |xj.adj|^2)      (univariate normal)

       If we know sigma, the natural test statistic is

               bj - betaj
         z = ----------------  ~  N(0,1)
              sigma/|xj.adj|
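    A small simulation sketch of these facts (the fixed design X, true beta, and
    sigma below are arbitrary illustrative choices): with X held fixed and normal
    errors, the simulated mean and covariance of b match beta and sigma^2 (X^T X)^{-1}.

        set.seed(3)
        N <- 50;  sigma <- 2
        X    <- cbind(1, rnorm(N), runif(N))     # fixed design: intercept + 2 predictors
        beta <- c(1, -2, 3)
        b.sim <- replicate(10000, {
          y <- drop(X %*% beta) + rnorm(N, sd=sigma)
          qr.coef(qr(X), y)                      # b = (X^T X)^{-1} X^T y
        })
        rowMeans(b.sim)                          # approx. beta (unbiasedness)
        round(cov(t(b.sim)), 3)                  # approx. sigma^2 (X^T X)^{-1}
        round(sigma^2 * solve(crossprod(X)), 3)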
  - Q: Why do we arrive at a t-distribution and not a normal distribution?
    A: We estimate sigma^2 with  s^2 = RSS/(N-p-1) = RMSE^2:

              bj - betaj       bj - betaj      (bj - betaj)*|xj.adj|
         t = ------------  =  ------------  =  ----------------------
              SE.est[bj]       s/|xj.adj|                s

  - Q: How is a t-distribution with k degrees of freedom defined?
    A: For any Z, Z1, ..., Zk that are i.i.d. N(0,sigma^2) define

                         Z
         t := ---------------------------
                   / Z1^2+...+Zk^2 \
              sqrt(-----------------)
                   \        k      /

       Facts:
         . This definition allows us to simulate t-distributions (not that we
           need to), because we know how to simulate iid N(0,sigma^2).
         . The distribution of t is independent of sigma as long as it is the
           same for Z, Z1, ..., Zk.
         . As k-->Inf we have t-->N(0,1).  Why? ...

  - Q: Why would the 't' derived from 'bj' be distributed like a t-distribution?
    A: Some pretty linear algebra and geometry!!!
       We will show that the former t can be written as the latter t.

         tj = (bj - betaj)*|xj.adj| / s                     (former t)
         Z := (bj - betaj)*|xj.adj| ~ N(0,sigma^2),   where betaj is the true slope.
         s^2 = |r|^2 / (N-p-1) = ?????

       To be shown: For some Z, Z1, Z2, ... iid N(0,sigma^2)
         1) s^2 = (Z1^2 + Z2^2 + ...) / (N-p-1)
         2) s^2 and Z (defined above) are independent.

       Let Q be a Nx(N-p-1) matrix whose columns form an
       ORTHONORMAL BASIS OF RESIDUAL SPACE:
         . Q^T X = 0     (The columns of Q are orthogonal to the columns of X.)
         . Q^T Q = I     (The columns of Q are orthonormal; size: (N-p-1)x(N-p-1).)
         . Q Q^T = I-H   (= the projection onto residual space; size: NxN.)
                         (Prove this equality.)

       Define the (N-p-1)-dimensional random vector

         z = (Z1,Z2,...)^T = Q^T y       [ y = X beta + eps,  Q^T y = Q^T eps ]

       Facts:
         . r = Q z
           (Proof: r = (I-H)y = Q Q^T y = Q z)
           (Interpretation: z contains the coordinates of r in the basis Q.)
         . z ~ N(0, sigma^2*I)  where 'I' is of size (N-p-1)x(N-p-1)
           (Proof: z = Q^T y; y is normal, hence z is normal.
                   E[z] = E[Q^T y] = Q^T X beta = 0
                   V[z] = Q^T V[y] Q = sigma^2*I of size (N-p-1)x(N-p-1).)
         . |r|^2 = |z|^2                                        QED 1) above
           (Proof: |r|^2 = |(I-H)y|^2 = y^T (I-H) y = y^T Q Q^T y = |Q^T y|^2 = |z|^2.)
         . z and bj are stochastically independent.             QED 2) above
           (Proof: bj ~ xj.adj^T y, and xj.adj is orthogonal to the columns of Q.)

  - Properties of the t-distribution with k degrees of freedom:

    . Bell-shaped:
        x <- seq(-4, 4, length=500)
        plot(x, dt(x, df=1), type="l", ylim=c(0,.45), col="blue",
             xlab="quantile", ylab="t-density")     # dt(.., df=1) == dcauchy(..)
        abline(h=0)
        text(x=0, y=.25, lab="Cauchy", col="blue")
        text(x=0, y=.23, lab="df=1",   col="blue")

    . As df increases, the t-distribution gets lighter in the tails:
        colrs <- gray.colors(100, 0, .8)
        for(df in 2:100) lines(x, dt(x, df), col=colrs[df], lwd=0.5)

    . As df -> Inf, the t-distribution approaches the N(0,1) distribution:
        lines(x, dnorm(x), col="red", lwd=2)
        text(x=0, y=.44, lab="Gauss", col="red")
        text(x=0, y=.42, lab=expression(paste("df=",infinity)), col="red")
        #                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ math symbols
        # For math symbols in R plots see:
        help(plotmath)
        demo(plotmath)
        # Actually, these may be obsolete!
        # Check out the R package 'latex2exp' (?):
        # it has an R function that allows you to use
        # LaTeX notation for special symbols in R plots.
        # (Pointer due to Alex Miller, 2016.)
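  - A simulation sketch of the result just derived (the fixed design, beta, and
    sigma below are arbitrary illustrative choices): the t-statistic for one slope
    follows a t-distribution with N-p-1 degrees of freedom.

        set.seed(4)
        N <- 30;  p <- 2
        X    <- cbind(1, matrix(rnorm(N*p), N, p))   # fixed design with intercept
        beta <- c(1, 0.5, -1);  sigma <- 1.5
        t.sim <- replicate(10000, {
          y   <- drop(X %*% beta) + rnorm(N, sd=sigma)
          fit <- lm(y ~ X - 1)
          (coef(fit)[2] - beta[2]) / summary(fit)$coefficients[2, "Std. Error"]
        })
        qqplot(qt(ppoints(10000), df=N-p-1), t.sim,
               xlab="t quantiles, df=N-p-1", ylab="simulated t-statistics")
        abline(0, 1, col="red")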
  - Another view of the tj statistic: express it in terms of y = X beta + eps.

    Write:   mu  = E[y] = X beta       assuming the full model is correct,
             eps = y - mu ~ N(0, sigma^2 I_N)

    Recall:  bj = < y, xj. > / |xj.|^2               # abbreviating  xj. = xj.adj

             betaj = E[bj] = E[ < y, xj. > / |xj.|^2 ] = < mu, xj. > / |xj.|^2

             bj - betaj = < y - mu, xj. > / |xj.|^2

             SE.est[bj] = s / |xj.|

        ==>  tj = (bj - betaj) / SE.est[bj] = (bj - betaj) / (s/|xj.|)
                = < y - mu, xj./|xj.| > / s

      --------------------------------------
      |                                    |
      |  tj = < y - mu, xj./|xj.| > / s    |
      |                                    |      For testing H0: betaj = 0
      |     = < eps, xj./|xj.| > / s       |
      |                                    |
      --------------------------------------

    tj is the 'projection' of the error/noise vector eps on the normalized
    adjusted regressor xj.

  - Why is tj a pivot for testing betaj = 0?

    . Definition: A 'PIVOT' is a test statistic whose null distribution does
      NOT depend on the nuisance parameters.

    . Q: What are the nuisance parameters when testing H0: betaj = 0 ?
      A: All betak (k =/= j) and sigma.

    . Is tj a pivot in this sense?  Its null distribution is ...

    . Counter-example: t-statistic of simple regression, unadjusted:

          tj = < y - mu, xj/|xj| > / s            not a pivot?

      This is the t-statistic for simple linear regression on xj, unadjusted.

          H0: betaj = 0  ==>  y = mu + eps,   mu = sum_{k != j} xk betak

      Two choices for the denominator of tj:
        * s from the multiple regression on x0,x1,...,xp, or
        * s from the simple regression on xj.

      If s is from the simple regression, but the full model is true, then

          E[s^2] = E[RSS.simple] / (N-1) = ( |(I-Hj)mu|^2 + (N-1)*sigma^2 ) / (N-1),

      where Hj = projection onto xj and mu = E[y] = sum_{k =/= j} xk betak.
      ==> The distribution of s depends on the betak (k =/= j), hence tj is not
          pivotal.

  - If s is from the multiple regression, then

        E[s^2] = E[RSS.multiple] / (N-p-1)
        RSS.multiple = |r.multiple|^2
        E[r.multiple] = 0,   V[r.multiple] = sigma^2 (I-H)
        r.multiple ~ N(0, sigma^2 (I-H))

    ==> The distribution of s does not depend on the betak (k =/= j), only on
        sigma, which should cancel with the numerator of tj.

  - Numerator of tj:  < y, xj >  (unadjusted)  instead of  < y, xj.adj > :

        E[< y, xj >] = E[< mu, xj >] + E[< eps, xj >] = sum_{k =/= j} betak < xk, xj >
                     = not zero, can be anything, depends on the betak (k =/= j)
        V[< y, xj >] = |xj|^2 sigma^2
        < y, xj > ~ N( sum_{k =/= j} betak < xk, xj >,  |xj|^2 sigma^2 )

    ==> The numerator distribution depends on the nuisance parameters betak (k =/= j).

    Conclusions:
      - Adjustment of xj is essential for testing H0: betaj = 0 in the multiple
        regression on x0,x1,...,xp.  It is needed to achieve pivotality of the
        numerator of tj wrt the nuisance slopes.
      - The denominator (i.e., s) should be based on the full model in order to
        achieve pivotality wrt the nuisance slopes.  This also achieves
        pivotality wrt sigma, which drops out of the tj ratio.

================================================================
* F-TESTS FOR COMPARING TWO ARBITRARY NESTED MODELS:

  - Purpose: testing groups of slopes jointly;
    in particular the overall F-test, where H0: y ~ N(mu*e, sigma^2*I).

  - Null hypotheses tested with F-ratios:

    Partition:  X = (X1, X2)                  (assume X is of full rank)
                beta = (beta1^T, beta2^T)^T
                b    = (b1^T, b2^T)^T
                p+1  = p1 + p2

      NULL HYPOTHESIS H_0:    ---------------
                              |  beta1 = 0  |
                              ---------------

    (Think: beta1 are of interest, educational measures, say;
     beta2 not of interest but a necessary adjustment, for demographics, ...)
    (The intercept is usually among the X2 columns.)

    Associated projections:
      H    = X (X^T X)^{-1} X^T
      H2   = X2 (X2^T X2)^{-1} X2^T
      H1.2 = H - H2 = orth. projection onto the orth. complement of cspan(X2)
                      in cspan(X), i.e., onto cspan(X) intersect cspan(X2)^perp
      X1.2 = (I-H2) X1 = (H-H2) X1 = columns of X1 adjusted for X2
      H1.2 = X1.2 (X1.2^T X1.2)^{-1} X1.2^T
      H    = H1.2 + H2

    Orthonormal bases:
      Q2    = o.n. basis of cspan(X2)                                       [N x p2]
      Q1.2  = o.n. basis of the orth. complement of cspan(X2) in cspan(X)   [N x p1]
      Q     = (Q1.2, Q2) = o.n. basis of cspan(X)                           [N x (p1+p2) = N x (p+1)]
      Q.res = orthonormal basis of cspan(X)^perp

    NOTE:  cspan(Q1.2) =/= cspan(X1) !!!
           cspan(Q1.2) == cspan(X1.2) = Range(H-H2) = Range(H1.2)

    Facts:  H2   = Q2 Q2^T
            H1.2 = Q1.2 Q1.2^T
            H    = Q Q^T
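    A small numerical illustration of these projection facts (the simulated X1
    and X2 below are arbitrary examples):

        set.seed(5)
        N  <- 20
        X2 <- cbind(1, rnorm(N))                     # X2 includes the intercept
        X1 <- cbind(rnorm(N), rnorm(N))
        X  <- cbind(X1, X2)
        proj <- function(A) A %*% solve(crossprod(A)) %*% t(A)   # projection onto cspan(A)
        H    <- proj(X);  H2 <- proj(X2);  H1.2 <- H - H2
        X1.2 <- (diag(N) - H2) %*% X1                # X1 adjusted for X2
        max(abs(H1.2 - proj(X1.2)))                  # ~ 0:  H - H2 projects onto cspan(X1.2)
        max(abs(H1.2 %*% H1.2 - H1.2))               # ~ 0:  H1.2 is itself a projection
        max(abs(crossprod(X1.2, X2)))                # ~ 0:  cspan(X1.2) is orthogonal to cspan(X2)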
      NULL HYPOTHESIS H_0:    ---------------
                              |  beta1 = 0  |
                              ---------------

    With H_0 we use a hypothetical assumption to answer the question whether X1
    matters: "Are the columns of X1 statistically significant above and beyond
    X2?  Do we need them?"

    H_0 in terms of mu := E[y]:      -------------------
                                     |  mu = X2 beta2  |
                                     -------------------

    Equivalent:                      -------------------
                                     |  H1.2 mu = 0    |
                                     -------------------

  - Estimate of H1.2 mu:  H1.2 y

    Under H_0:     -------------------------
                   |  H1.2 y = H1.2 eps    |      due to error alone, not mu
                   -------------------------

  - The F-statistic is based on:

      yhat - yhat2 = (H-H2) y = H1.2 y = H1.2 eps
                                         ^ under the null hypothesis H0
        H1.2 y = H1.2( X2 beta2 + eps ) = H1.2 eps
        dfs = p1 = dim(Range(H1.2))
        E[ |H1.2 y|^2 / p1 ] = sigma^2                 under null hyp. H0

      r = y - yhat = (I-H) y = (I-H) eps
                               ^ assuming 1st order correctness of the full model
        (I-H) y = (I-H)(X beta + eps) = (I-H) eps
        dfs = N-p-1
        E[ |(I-H) y|^2 / (N-p-1) ] = sigma^2           under 1st+2nd order correctness

  - Def. of F-distribution:   W1,...,Wq, Z1,...,Zr  i.i.d. N(0,sigma^2)
                                                               ^^^^^ irrelevant but same

              (W1^2+...+Wq^2)/q
        F  = -------------------     has a (central) F-distribution with q,r dfs
              (Z1^2+...+Zr^2)/r

    Btw:  W1^2+...+Wq^2 is also called a chi-square with q dfs;
          Z1^2+...+Zr^2 is hence a chi-square with r dfs.
    To define an F-distribution, we need two independent chi-square variables.

    [ Limiting cases of F:   q --> Inf, r fixed ...
                             r --> Inf, q fixed ...
                             q --> Inf, r --> Inf ... ]

  - The F-ratio has numerator and denominator as follows:

      Numerator   = | Q1.2 Q1.2^T y |^2 / p1
                  = | Q1.2^T eps |^2 / p1             (under null hyp. H0)
                  = mean of p1 iid normals^2:         Q1.2^T eps = (W1,...,Wq)^T,  q = p1

      Denominator = | Q.res Q.res^T y |^2 / (N-p-1)
                  = | Q.res^T eps |^2 / (N-p-1)       (if the model is 1st order correct)
                  = mean of (N-p-1) iid normals^2:    Q.res^T y = (Z1,...,Zr)^T,  r = N-p-1

    The iid normals of the numerator are independent of the iid normals of the
    denominator because cspan(Q.res) and cspan(Q1.2) = cspan(X1.2) are orthogonal.

  - Special case of a single tested predictor (p1=1):  F = t^2
    (recall: t = Z/sqrt((Z1^2+...+Zr^2)/r))

  - Pivotality of F: independence of its distribution under H0 from the
    nuisance parameters (AYT?)
    What are the nuisance parameters?
    Why do they drop out of the null distribution of F?

  - Special case where p1=p, X2 = the intercept column alone ==> overall F-test
      X1.2 = all real predictors centered (mean adjusted)
      H2   = 'mean projection'

  - Rules for 95% quantiles of F-distributions?

    . "2 for t": works for about dfs >= 25
        qt(.975, df=1:100)           # cross-over below 2 is near df=60

    . F-distribution:
        df2 <- 5:30
        plot(x=range(df2), y=c(0,10), type="n", xlab="df2", ylab="F")
        for(df1 in 1:10) {
          lines(x=df2, y=qf(.95, df1=df1, df2=df2))
          text(x=df2[1], y=qf(.95, df1=df1, df2=df2[1]), lab=df1, cex=.8, adj=1.2)
        }
      No apparent simple rule for F thresholds.
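    Sketch in R (simulated data; the variable names and sizes are arbitrary
    illustrative choices): the F-ratio built from the nested projections agrees
    with the nested-model F-test reported by anova().

        set.seed(6)
        N   <- 40
        x1a <- rnorm(N);  x1b <- rnorm(N);  x2 <- rnorm(N)
        y   <- 1 + 0.5*x2 + rnorm(N)             # data generated under H0: beta1 = 0
        fit.full <- lm(y ~ x1a + x1b + x2)       # X = (X1, X2) with X1 = (x1a, x1b)
        fit.red  <- lm(y ~ x2)                   # the X2-only model (intercept in X2)
        RSS.full <- sum(resid(fit.full)^2)       # |(I-H) y|^2
        RSS.red  <- sum(resid(fit.red)^2)        # |(I-H2) y|^2, so RSS.red - RSS.full = |H1.2 y|^2
        p1 <- 2
        F.hand <- ((RSS.red - RSS.full)/p1) / (RSS.full/df.residual(fit.full))
        F.hand                                   # |H1.2 y|^2/p1  over  |(I-H)y|^2/(N-p-1)
        anova(fit.red, fit.full)                 # same F statistic and its p-value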
  - Generalization: Construct the F-statistic for

      ---------------------------
      |  H_0: beta1 = beta1_0   |        as opposed to: beta1_0 = 0
      ---------------------------

    Hint:  Reduce to the case of testing beta1 = 0.
    Trick:  y              = X1 beta1_0 + X2 beta2 + eps
            y - X1 beta1_0 =              X2 beta2 + eps
    I.e., use a new response:  y - X1 beta1_0

              | H1.2 (y - X1 beta1_0) |^2 / p1
        F  = ----------------------------------
                 | (I-H) y |^2 / (N-p-1)

    ==> Can be inverted to a confidence region for beta1.
        Cut-off: qf(1-alpha, df1=..., df2=...)

      CR(b1) = { beta1 : | H1.2 y - X1.2 beta1 |^2                  <= p1 * s^2 * qf(1-alpha) }
             = { beta1 : | X1.2 b1 - X1.2 beta1 |^2                 <= p1 * s^2 * qf(1-alpha) }
             = { beta1 : | X1.2 (b1 - beta1) |^2                    <= p1 * s^2 * qf(1-alpha) }
             = { beta1 : (beta1 - b1)^T (X1.2^T X1.2) (beta1 - b1)  <= p1 * s^2 * qf(1-alpha) }

================================================================
* CIs AND PIs FOR THE RESPONSE:

  - Inference for the response vs. prediction of the response:  CIs vs PIs

    . Let xx be a row (!) from the design matrix X written as a column,
      or a new set of predictor values.
      Ex.: When modeling the car data with  MPG = b0 + b1*WEIGHT + b2*HP
           we have  xx = (1, WEIGHT, HP)^T.
           We might want to know something about y = MPG at xx = (1, 3000, 200)^T:
             yhat(xx) = xx^T b    where b = (b0,b1,b2)^T.
           You can judge assertions such as:
             "It is thought in the automotive industry that cars with a weight
              of 3000 lb and 200 HP get ON AVERAGE 27 MPG on the highway."
           This is a null hypothesis!

    . INFERENCE for the response is really inference for the EXPECTED VALUE E[y(xx)]:

          E[y(xx)] = mu(xx) = xx^T beta

      We look for an interval that contains the true mu(xx) = xx^T beta with
      prob 95%.  This interval is called a 'confidence interval' or 'CI' for
      mu(xx) = xx^T beta.  It is of the form  yhat(xx) +- halfwidth.

    . PREDICTION of the response is
        + either point prediction of y(xx), again using  yhat(xx) = xx^T b,
        + or interval prediction of the form  yhat(xx) +- halfwidth,
      but with the idea that the interval captures 95% of actual observations y
      at xx, if such could be observed.  In this case the interval is called a
      'prediction interval' or 'PI'.  Its gambling guarantee should be to catch
      future response values y at xx based on past training data with
      probability 95%.
      (Both future response values and past data are random!)
      You can judge assertions such as:
        "It is thought in the automotive industry that cars with a weight of
         3000 lb and 200 HP get between 24 and 30 MPG on the highway."
      This speaks about actual car models.

    . Q: Which is going to be wider, the CI or the PI?
      (See the R sketch below.)
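      In R, both intervals come from predict(); the formulas behind them are
      derived next.  (A sketch using R's built-in mtcars data as a stand-in for
      the car data; note mtcars$wt is in units of 1000 lb, so weight 3000 lb is
      wt = 3.)

          fit <- lm(mpg ~ wt + hp, data=mtcars)
          xx  <- data.frame(wt=3, hp=200)                   # new predictor values
          predict(fit, newdata=xx, interval="confidence")   # CI for E[y(xx)] = xx^T beta
          predict(fit, newdata=xx, interval="prediction")   # PI for a future y(xx): wider
          # By hand:  stderr.est(yhat(xx)) = s * sqrt(xx^T (X^T X)^{-1} xx)
          X     <- model.matrix(fit);  x0 <- c(1, 3, 200)
          s     <- summary(fit)$sigma
          yhat0 <- sum(x0 * coef(fit))
          se0   <- drop(s * sqrt(t(x0) %*% solve(crossprod(X)) %*% x0))
          yhat0 + c(-1,1) * qt(.975, df=df.residual(fit)) * se0   # matches the CI above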
  - Inference for E[y(xx)] = xx^T beta:

    . Develop a test statistic:

        yhat(xx) = xx^T b              ASSUMPTION: THE MODEL IS CORRECT !!!
        E[yhat(xx)]   = xx^T beta = E[y(xx)] = mu(xx)
        Var[yhat(xx)] = Var[xx^T b] = Var[xx^T (X^T X)^{-1} X^T y]
                      = sigma^2 (xx^T (X^T X)^{-1} xx)

        ==>  stderr.est(yhat(xx)) = s sqrt(xx^T (X^T X)^{-1} xx)
                                           ^^^^^^^^^^^^^^^^^^^^
                                           =: H(xx),   H(xx_i) = H_ii

               yhat(xx) - mu(xx)
        ==>  ----------------------   is t-distributed with df = N-p-1
             stderr.est(yhat(xx))

             Why? ...

        ==>  Can be inverted to CIs and retention regions:
               CI = 95% interval for mu(xx) = xx^T beta
               RI = 95% coverage interval for yhat(xx) under H_0: E[y(xx)] = mu0(xx)

    . Inference:

      + Note that for xx = xi (= i'th row of X), this is
          stderr.est(yhat(xx)) = s sqrt(Hii)

      + Inversion of tests to CIs: This is a general procedure that applies to
        any parameter test:

          P[ -q < t < +q | H_0 ] = .95     (e.g.)

          -q < t < +q
          -q*stderr.est(yhat(xx))  <  yhat(xx) - xx^T beta  <  +q*stderr.est(yhat(xx))

        Retention interval for H_0: xx^T beta = mu0(xx) (an assumed fixed value):

          mu0(xx) - q*stderr.est(yhat(xx))  <  yhat(xx)  <  mu0(xx) + q*stderr.est(yhat(xx))
          yhat(xx) in [ mu0(xx) +- q*stderr.est(yhat(xx)) ]

        Confidence interval for xx^T beta:

          yhat(xx) - q*stderr.est(yhat(xx))  <  xx^T beta  <  yhat(xx) + q*stderr.est(yhat(xx))

        Written more simply:
          xx^T beta in [ yhat(xx) +- q*stderr.est(yhat(xx)) ]

      + CI:  95%CI = [ yhat(xx) +- 2 stderr.est(yhat(xx)) ]
        This CI captures the true E[y(xx)] = mu(xx) = xx^T beta for 95% of datasets.
        Uses: Confidence bands around fitted lines in simple regression

      + Testing H_0, the above example: See earlier "It is thought...", first instance.

          H_0: E[MPG(WEIGHT=3000, HP=200)] = mu0(xx) = 27 mpg   =>  xx = (1,3000,200)^T
          t = (yhat(xx) - mu0(xx)) / stderr.est(yhat(xx))
          Reject at ~5% significance when |t| > 2.

      + '2' in CIs and retention intervals should really be:  qt(p=.975, df=N-p-1)
        More generally:  qt(p=1-alpha/2, df=N-p-1)  where alpha = 0.05, 0.01, 0.001, ...

      + Ex.: Simple linear regression with one predictor x:
        calculate (X^T X)^{-1} and plot CI(x) as a function of x.

  - Prediction intervals based on yhat(xx):

    . Let y(xx) be the random variable of FUTURE response values at xx,
      whereas yhat(xx) is obtained from PAST data.
    . Assume the model is correct, wrt first and second moments and normality,
      for past and future data.
    . Wanted: an interval around yhat(xx) that has a 95% chance of catching y(xx).
    . Approach: Find the distribution of y(xx) - yhat(xx).

        E[ y(xx) - yhat(xx) ] = 0
        V[ y(xx) - yhat(xx) ] = V[ y(xx) ] + V[ yhat(xx) ]                (*)
                              = sigma^2 + stderr(yhat(xx))^2
                              = sigma^2 + sigma^2*(xx^T (X^T X)^{-1} xx)
                              = sigma^2 * (1 + xx^T (X^T X)^{-1} xx)

      Why this nice Pythagoras in (*) above?
      Reason: Future and past data are independent (or so we assume).
              Future y(xx) is independent of yhat(xx) estimated from past data.

    . Assuming past and future response values are normally distributed:

                  y(xx) - yhat(xx)
        t = ----------------------------------   has a t-distribution with df = N-p-1
            s*sqrt(1 + xx^T (X^T X)^{-1} xx)

      Reasons: The Gaussian assumption AND the numerator is independent of the
      denominator because
        - s is computed from the residuals of past data y.
        - yhat(xx) is computed from the projection of past y onto the X-space.
        - y(xx) is future data at xx, independent of past data.
        => s is independent of yhat(xx) and y(xx)
           (recall: V[r,yhat] = 0 => under normality r and yhat are independent)

    . PI(xx) = [ yhat(xx) +- 2*s*sqrt( 1 + xx^T (X^T X)^{-1} xx ) ]
      (2 should be a 1-alpha/2 t-quantile.)

================================================================
end of class