================================================================
CHAPTER 3: STATISTICAL INFERENCE IN LINEAR MODELS

* MODEL ASSUMPTIONS FOR FINITE-SAMPLE INFERENCE:

      ------------------------
      |                      |
      | eps ~ N(0,sigma^2 I) |
      |                      |
      ------------------------

  (alternative: "no" assumptions but asymptotics)

  - This is the first time that we assume an actual distribution for the errors.
  - Can you give an equivalent assumption for y?
    ...  y ~ N(X beta, sigma^2 I)

* TESTING IN LINEAR MODELS:

  - NULL HYPOTHESES: assumptions about individual slopes,
    e.g., beta1 = beta1.null (some fixed assumed value); usually beta1 = 0.

        H0: beta1 = beta1.null        (usually: beta1.null = 0)

  - H0 is a COMPOSITE NULL HYPOTHESIS:

        y = beta0 + beta1.null*x1 + beta2*x2 + ... + betap*xp + eps,
            ^^^^^   (often = 0)    ^^^^^            ^^^^^
        E[eps] = 0,  V[eps] = sigma^2*I
                              ^^^^^

    beta0, beta2, ..., betap, sigma are parameters not restricted by H0:

      --------------------------------------------------------------------------
      |                                                                        |
      |  H0 = { (beta0, beta1, beta2, ..., betap, sigma) | beta1=beta1.null }  |
      |                                                                        |
      --------------------------------------------------------------------------

    ==> H0 is a (p+1)-dim set of parameter vectors -- it is COMPOSITE.

    Definition:
      ------------------------------------------------------------------------
      |  If beta1 is the parameter of interest in H0, then                    |
      |  beta0, beta2, ..., betap, sigma are 'NUISANCE PARAMETERS' wrt H0.    |
      ------------------------------------------------------------------------

    Important:
      . Nuisance parameters are defined wrt a null hypothesis!
      . If beta2 is being tested, then beta0, beta1, beta3, ... are the
        nuisance parameters.

  - The INFERENCE GUARANTEE for a statistical test:

    . The retention (non-rejection) rule is:

      ----------------------------------------------------------------------------------
      |  'Retain H0: beta1=beta1.null'  :<==>  'b1 is in retention interval RI=[lo,hi]' |
      ----------------------------------------------------------------------------------

    . We want a retention probability of 1-alpha under H0:

        P[ Retain H0 | H0 ] = P[ b1 in RI | H0 ] >= 1-alpha     (= 0.95, say)

      We call 'alpha' the 'significance level' or 'Type I error rate'.

    . Problem:  P[ Retain H0 | H0 ]  is really

        P[ Retain H0 | (beta0, beta1.null, beta2, ..., betap, sigma) ]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      For each choice of the nuisance parameters we get a different probability
      distribution.
      Q: What nuisance parameters should we choose to calculate this probability?

    . Solutions:
      (1) We use 'pivotal' quantities as test statistics, that is, quantities
          whose NULL DISTRIBUTION is INDEPENDENT OF THE NUISANCE PARAMETERS in H0.
          Examples: t and F statistics (as we will see)
      (2) In pretty models there exist 'minimal sufficient statistics' UNDER THE
          NULL HYPOTHESIS; conditioning on them produces exact conditional null
          distributions free of nuisance parameters.  (Stay tuned.)
      (3) Asymptotic theory: Often one has a CLT under H0 for a test statistic

              (b1 - beta1.null)/SE.est[b1]  -~->  N(0,1),

          so the asymptotic distribution N(0,1) is free of nuisance parameters.
          ==> Coverage probabilities for finite N are approximate.

    . Present Goal: For linear models, justify
        + tests based on  t = (bj - betaj.null)/SE.est[bj],
        + CIs of the form  bj +- const*SE.est[bj].
      No optimality theory (see the book by Lehmann & Romano),
      just correctness/validity of the alpha-coverage guarantee.

* STANDARD ERROR ESTIMATES:

  - Let  s = RMSE = sqrt( RSS / (N-p-1) )  be the conventional estimate of sigma.
    Then we define the obvious SE.est:

      ----------------------------------
      |                    s           |
      |  SE.est[bj]  =  ---------      |
      |                 |xj.adj|       |
      ----------------------------------
      |                                |
      |  V.est[b] = (X^T X)^{-1} s^2   |   diagonal: SE.est[bj]^2
      |                                |
      ----------------------------------

    We thereby estimate 'dataset-to-dataset variation' of 'bj' and 'b'
    BASED ON A SINGLE DATASET!!!

* Principles of statistical inference as applied to linear models:

  . The test statistic for betaj is

              bj - betaj.null
        tj = -----------------
                    sj

    where we abbreviate  sj = SE.est[bj] = s/|xj.adj|,
    and H0 means: "betaj.null is the hypothetically true value of the j'th slope."

    Note: tj = tj(bj, betaj.null)
             = directional distance between bj and betaj.null in multiples of sj.
          tj is not a function of the other betak (k!=j) and sigma^2,
          but its distribution might still depend on them.  Or does it?
          This is addressed next:

  . The statistic 'tj' has some special properties:

    + It is a 'pivotal' quantity, meaning its distribution does NOT depend on
      'nuisance parameters', that is, parameters that are not part of the
      null hypothesis (betak for k!=j, and sigma).
      'Pivotality' of 'tj' holds ...
      > ... exactly under normal parametric assumptions.
      > ... asymptotically under fewer assumptions on the errors.
    + 'tj' has a known finite-sample distribution for all N (in fact: Student's t)
      under normal parametric assumptions, enabling 'exact' finite-sample
      inference based on 'tj'.
    + 'tj' has a non-degenerate limiting distribution as N-->Inf (in fact: N(0,1))
      under minimal assumptions on the (homoskedastic) errors, enabling
      asymptotic inference based on 'tj'.

  . Here is the kind of probability statement that is needed for (two-sided)
    statistical inference:

      P[ -c < tj < +c | this value of betaj.null, any value for the other betak ]
           =   1-alpha                (for 'exact' inference)
           --> 1-alpha as N-->Inf     (for asymptotic inference)

    where c = c(alpha), independently of what the other betak and sigma^2 in H0 are.

  . USE FOR H0 TESTING:
    If H0: betaj=betaj.null is true, then for  tj = (bj-betaj.null)/sj  we have

        P[ -c < tj < +c | betaj.null ]   '=' or '-->'   1-alpha

    We consider it evidence against H0 at the significance level alpha
    if the observed value tj.obs falls outside (-c,+c):

        Reject H0  :<==>  observed 'tj' is not in (-c,+c)

    (This is the definition of a rejection rule.)
    The gambling guarantee is:

        P[ false rejection ] = P[ reject H0 | H0 true ]
                             = P[ tj not in (-c,+c) | H0 true ] = alpha

  . USE FOR CIs:
    A CI with confidence level 1-alpha is the collection of parameter values
    betaj.null which cannot be rejected when tested at the significance level alpha.
    To find CI(1-alpha), invert  -c < tj < +c  so it becomes a condition for betaj:

        -c < tj < +c   <==>   bj - c*sj < betaj < bj + c*sj
                       <==>   betaj in (bj-c*sj, bj+c*sj)

        ==>  CI(1-alpha) = (bj-c*sj, bj+c*sj)     (a random interval).

    The gambling guarantee is:
        The true value of betaj is in CI(1-alpha) with probability 1-alpha.

  . Comparison of CIs and acceptance/retention/non-rejection intervals:

        tj in (-c,+c)  <==>  betaj in (bj - c*sj, bj + c*sj)
                                      ^^^^^^^^^ CI ^^^^^^^^^   centered at bj
                       <==>  bj in (betaj - c*sj, betaj + c*sj)
                                   ^^^ Retention Interval ^^^^  centered at betaj from H0
                       <==>  bj and betaj are no more than c*sj apart.

    There is only one true value of betaj, but each dataset will produce its own
    value of bj.  The gambling guarantee is that the true betaj and the observed
    bj will be no more than c*sj apart for a fraction 1-alpha of datasets
    (exactly or approximately).
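  . R illustration (a minimal sketch with simulated data; the design, true betas,
    and sigma below are arbitrary choices): SE.est[bj] = s/|xj.adj| reproduces the
    standard error reported by summary(lm()), and bj +- c*sj reproduces confint().

        set.seed(1)
        N  <- 100
        x1 <- rnorm(N);  x2 <- rnorm(N);  x3 <- rnorm(N)
        y  <- 1 + 2*x1 - x3 + rnorm(N, sd=2)
        fit   <- lm(y ~ x1 + x2 + x3)
        s     <- summary(fit)$sigma                  # s = RMSE = sqrt(RSS/(N-p-1))
        x1adj <- resid(lm(x1 ~ x2 + x3))             # x1 adjusted for the other predictors
        se.b1 <- s / sqrt(sum(x1adj^2))              # SE.est[b1] = s/|x1.adj|
        c(byhand=se.b1, lm=summary(fit)$coefficients["x1","Std. Error"])
        q <- qt(.975, df=N-3-1)                      # c = c(alpha) for alpha = .05
        coef(fit)["x1"] + c(-1,1)*q*se.b1            # CI(0.95) = (b1 - c*s1, b1 + c*s1)
        confint(fit)["x1",]                          # same interval from R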
================================================================
* ASYMPTOTIC DISTRIBUTION THEORY FOR INFERENCE IN LINEAR MODELS:

  - The basis of asymptotic inference about slopes: a generalized CLT for bj

    . As N -> Inf, p (= number of predictors) is held fixed, and cases are
      i.i.d. sampled:

              bj - betaj     D
        tj = ------------  -->  N(0,1)     "in distribution", "weak convergence"
                  sj

      where betaj is the true-but-unknown or hypothesized parameter value.
      Meaning:
        P( tj in [a,b] )  -->  P( z in [a,b] )  as N-->Inf,  where z ~ N(0,1)

  - Comments:

    . Some strangeness: This CLT still conditions on the predictors, yet the
      number of predictor rows increases, N-->Inf.
    . The CLT holds even for non-sampled/truly fixed predictors as N-->Inf.
      One then needs a condition on the Hii values to make sure that they
      converge to zero, so that no case retains influence > eps > 0 as N-->Inf.
    . With asymptotic inference, one does not need to check assumptions such as
      normal errors, BUT one needs 'large N' and Hii -> 0.
    . One also needs correctness of the model:  y = X beta + eps,
      but the components of eps can have non-normal distributions
      (they must satisfy some mild moment conditions for the CLT).
    . 'N large enough' may mean, e.g., N > 10*(p+1), i.e., more than 10 obs.
      per parameter.
      (In general, how large is large enough depends on the problem.  Sometimes
      N=5 is large enough, and sometimes N=10^6 is not large enough.
      Simulations of special cases often give some idea.)
      It also requires that no Hii is dominant.
    . More powerful modern asymptotic theory exists, where 'p' can increase
      slowly with 'N'.

================================================================
* ROADMAP:
  - Multivariate normal distribution
  - Derivation of the t-distribution for t-statistics (testing one slope)
  - Derivation of the F-distribution for F-statistics (testing multiple slopes)
================================================================

* THE MULTIVARIATE NORMAL DISTRIBUTIONS: Definitions and Facts

  - This is in preparation for exact finite-sample inference under the
    assumption of iid normal errors.

  - Def.: An N-dimensional 'standard normal random vector' is given by

        z = (z1,z2,...,zN)^T    where the zn are i.i.d. N(0,1)

    We write  z ~ N(0_N, I_N)  where 0_N is in R^N and I_N is NxN.
    One draw from z, in R:    z <- rnorm(N)

  - Fact:
      -------------------------------------------------
      |                                               |
      | The multivariate standard normal distribution |
      | is rotation-invariant/symmetric!              |
      |                                               |
      -------------------------------------------------
    Meaning: If z ~ N(0,I), then Qz ~ N(0,I) for any NxN orthogonal matrix Q.
    Proof: Use that the density of 'z' depends only on |z|^2, and |det(Q)| = 1.
    One could prove this fact with probabilities, but it is more transparent
    with expectations:

        E[ f(Qz) ] = Integral f(Qz) p(z) dz        # zeta = Qz,  dzeta = |det(Q)| dz = dz
                   = Integral f(zeta) p(Q^{-1} zeta) dzeta
                   = Integral f(zeta) p(zeta) dzeta
                   = Integral f(z) p(z) dz
                   = E[ f(z) ]

  - Def.: A general N-dimensional 'normal random vector' with mean vector 'mu'
    and covariance matrix 'C' is given by  y = (y1,y2,...,yN)^T  where

      ------------------------------------------------------
      |                                                    |
      |  y = A^T z + mu,   C = A^T A,   z ~ N(0,I)         |     N(mu, C)
      |                                                    |
      ------------------------------------------------------
        A ~ N x N,   y ~ N x 1,   z ~ N x 1

    We write
      ---------------
      | y ~ N(mu,C) |    where mu is in R^N and C is NxN (sym., >=0).
      ---------------

    If B^T B = A^T A, do A^T z and B^T z have the same distribution?
    To be proven!       (Q A)^T (Q A) = A^T Q^T Q A = A^T A

  - Fact: The definition of N(mu,C) is independent of the representation C = A^T A.
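    A quick numerical sanity check before the 'proof' (an illustrative sketch;
    the positive definite matrix C below is an arbitrary example): two different
    square roots A and B of the same C yield draws with the same covariance.

        set.seed(2)
        C <- crossprod(matrix(rnorm(9),3,3))         # an arbitrary positive definite 3x3 C
        A <- chol(C)                                 # upper-triangular root: A^T A = C
        E <- eigen(C)                                # symmetric root B with B^T B = C
        B <- E$vectors %*% diag(sqrt(E$values)) %*% t(E$vectors)
        Z <- matrix(rnorm(3*1e5), 3)                 # 1e5 draws of z ~ N(0, I_3)
        cov(t(t(A) %*% Z))                           # sample cov of A^T z, approx. C
        cov(t(t(B) %*% Z))                           # sample cov of B^T z, approx. C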
    'Proof':  A^T A = B^T B  <==>  A = Q B  for some orthogonal Q (N x N)
              ('<==' is easy; '==>' less so)
    Then:  E[ f(A^T z) ] = E[ f(B^T Q^T z) ] = E[ f(B^T z) ]
    (We're not assuming that there is a density!)

  - Fact: Multivariate normal distributions are uniquely characterized by their
    expectation vectors and variance/covariance matrices, i.e., 1st & 2nd moments.

  - Fact: Any linear (matrix) transformation of a multivariate normal
    distribution yields again a normal distribution:

        y ~ N(mu, C)  ==>  Gy + c ~ N(G mu + c, G C G^T)
        [Note: G can be non-square.]

  - Fact: Two random vectors y1, y2 with a joint normal distribution are
    uncorrelated iff they are independent.

        V[y1,y2] = 0  <==>  y1, y2 are independent

    Reason: y = rbind(y1,y2)  (using R notation)
            V[y] = block-diagonal with the 1,2 block and the 2,1 block zeroed out
            ==> y1 = A1 z1, y2 = A2 z2  where z = rbind(z1,z2) ~ N(0,I)
            Because z1 and z2 are independent, y1 and y2 are independent.

  - Fact: If 'y' is multivariate normal and V[y] = sigma^2 I, when are
    A^T y and B^T y uncorrelated (and hence independent)?

        V[A^T y, B^T y] = A^T V[y] B = sigma^2 A^T B  ?=?  0

    ==> A^T y and B^T y are independent if the columns of A are orthogonal to
        the columns of B.
        (Column space of A is orthogonal to the column space of B.)

  - Note: The definition of general normal distributions includes "degenerate"
    or "singular" cases where V[y] = C is not of full rank.
    Examples of normally distributed random vectors with degenerate
    distributions in linear regression: ...

  - Note: Most of the above arguments work when the initial z is replaced with
    any circularly symmetric distribution in R^N.  (An example is the circularly
    symmetric t-distribution.)  That is, from a circularly symmetric distribution
    one can define an elliptic family of distributions.
    The only property that does not generalize is the equivalence of
    'uncorrelated' and 'independent'; this property is special to the
    multivariate normal distribution, and in fact it characterizes it.

  - R: One draw from y ~ N(mu, C) is obtained by

        N  <- 100
        A  <- cbind(rep(1,N), 1:N, (1:N)%%2)     # Just an example
        mu <- 1:ncol(A)
        y  <- t(A) %*% rnorm(N) + mu             # \
        y  <- crossprod(A, rnorm(N)) + mu        # / two versions of code doing the same

  - R: Given C, how do you find an A?

        C <- crossprod(matrix(rnorm(25),5,5))    # example of generating a pos. def. matrix C
        A <- chol(C)                             # Cholesky decomp: A^T A = C
        crossprod(A)                             # = A^T A = C
        t(A) %*% A                               # same

================================================================
* EXACT NULL DISTRIBUTION THEORY FOR LINEAR MODELS:

  - We will be examining the null distribution of the statistics t and F.
    They turn out to be pivotal.

  - Q: If y ~ N(X beta, sigma^2 I), what is the distribution of 'b' and 'bj'?
    A: For the whole vector of coefficient estimates:

         b = (X^T X)^{-1} X^T y  is a linear transformation of y
         E[b] = beta                       from the 1st order assumption
         V[b] = sigma^2 (X^T X)^{-1}       from the 2nd order assumption
         => b ~ N(beta, sigma^2 (X^T X)^{-1})        (multivariate normal)

       One coefficient estimate at a time:

         bj = < y, xj.adj/|xj.adj|^2 >  is a linear combination of y
         E[bj] = betaj                     (assuming the model is unbiased)
         V[bj] = sigma^2 / |xj.adj|^2      (assuming the variance properties)
         => bj ~ N(betaj, sigma^2 / |xj.adj|^2)      (univariate normal)

       If we know sigma, the natural test statistic is

               bj - betaj
         z = ----------------  ~  N(0,1)
              sigma/|xj.adj|
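    A small simulation sketch of these facts (the fixed design X, true beta, and
    sigma below are arbitrary illustrative choices): with X held fixed and normal
    errors, the simulated mean and covariance of b match beta and sigma^2 (X^T X)^{-1}.

        set.seed(3)
        N <- 50;  sigma <- 2
        X    <- cbind(1, rnorm(N), runif(N))     # fixed design: intercept + 2 predictors
        beta <- c(1, -2, 3)
        b.sim <- replicate(10000, {
          y <- drop(X %*% beta) + rnorm(N, sd=sigma)
          qr.coef(qr(X), y)                      # b = (X^T X)^{-1} X^T y
        })
        rowMeans(b.sim)                          # approx. beta (unbiasedness)
        round(cov(t(b.sim)), 3)                  # approx. sigma^2 (X^T X)^{-1}
        round(sigma^2 * solve(crossprod(X)), 3)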
  - Q: Why do we arrive at a t-distribution and not a normal distribution?
    A: We estimate sigma^2 with  s^2 = RSS/(N-p-1) = RMSE^2:

              bj - betaj       bj - betaj      (bj - betaj)*|xj.adj|
         t = ------------  =  ------------  =  ----------------------
              SE.est[bj]       s/|xj.adj|                s

  - Q: How is a t-distribution with k degrees of freedom defined?
    A: For any Z, Z1, ..., Zk that are i.i.d. N(0,sigma^2) define

                         Z
         t := ---------------------------
                   / Z1^2+...+Zk^2 \
              sqrt(-----------------)
                   \        k      /

       Facts:
         . This definition allows us to simulate t-distributions (not that we
           need to), because we know how to simulate iid N(0,sigma^2).
         . The distribution of t is independent of sigma as long as it is the
           same for Z, Z1, ..., Zk.
         . As k-->Inf we have t-->N(0,1).  Why? ...

  - Q: Why would the 't' derived from 'bj' be distributed like a t-distribution?
    A: Some pretty linear algebra and geometry!!!
       We will show that the former t can be written as the latter t.

         tj = (bj - betaj)*|xj.adj| / s                     (former t)
         Z := (bj - betaj)*|xj.adj| ~ N(0,sigma^2),   where betaj is the true slope.
         s^2 = |r|^2 / (N-p-1) = ?????

       To be shown: For some Z, Z1, Z2, ... iid N(0,sigma^2)
         1) s^2 = (Z1^2 + Z2^2 + ...) / (N-p-1)
         2) s^2 and Z (defined above) are independent.

       Let Q be a Nx(N-p-1) matrix whose columns form an
       ORTHONORMAL BASIS OF RESIDUAL SPACE:
         . Q^T X = 0     (The columns of Q are orthogonal to the columns of X.)
         . Q^T Q = I     (The columns of Q are orthonormal; size: (N-p-1)x(N-p-1).)
         . Q Q^T = I-H   (= the projection onto residual space; size: NxN.)
                         (Prove this equality.)

       Define the (N-p-1)-dimensional random vector

         z = (Z1,Z2,...)^T = Q^T y       [ y = X beta + eps,  Q^T y = Q^T eps ]

       Facts:
         . r = Q z
           (Proof: r = (I-H)y = Q Q^T y = Q z)
           (Interpretation: z contains the coordinates of r in the basis Q.)
         . z ~ N(0, sigma^2*I)  where 'I' is of size (N-p-1)x(N-p-1)
           (Proof: z = Q^T y; y is normal, hence z is normal.
                   E[z] = E[Q^T y] = Q^T X beta = 0
                   V[z] = Q^T V[y] Q = sigma^2*I of size (N-p-1)x(N-p-1).)
         . |r|^2 = |z|^2                                        QED 1) above
           (Proof: |r|^2 = |(I-H)y|^2 = y^T (I-H) y = y^T Q Q^T y = |Q^T y|^2 = |z|^2.)
         . z and bj are stochastically independent.             QED 2) above
           (Proof: bj ~ xj.adj^T y, and xj.adj is orthogonal to the columns of Q.)

  - Properties of the t-distribution with k degrees of freedom:

    . Bell-shaped:
        x <- seq(-4, 4, length=500)
        plot(x, dt(x, df=1), type="l", ylim=c(0,.45), col="blue",
             xlab="quantile", ylab="t-density")     # dt(.., df=1) == dcauchy(..)
        abline(h=0)
        text(x=0, y=.25, lab="Cauchy", col="blue")
        text(x=0, y=.23, lab="df=1",   col="blue")

    . As df increases, the t-distribution gets lighter in the tails:
        colrs <- gray.colors(100, 0, .8)
        for(df in 2:100) lines(x, dt(x, df), col=colrs[df], lwd=0.5)

    . As df -> Inf, the t-distribution approaches the N(0,1) distribution:
        lines(x, dnorm(x), col="red", lwd=2)
        text(x=0, y=.44, lab="Gauss", col="red")
        text(x=0, y=.42, lab=expression(paste("df=",infinity)), col="red")
        #                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ math symbols
        # For math symbols in R plots see:
        help(plotmath)
        demo(plotmath)
        # Actually, these may be obsolete!
        # Check out the R package 'latex2exp' (?):
        # it has an R function that allows you to use
        # LaTeX notation for special symbols in R plots.
        # (Pointer due to Alex Miller, 2016.)
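  - A simulation sketch of the result just derived (the fixed design, beta, and
    sigma below are arbitrary illustrative choices): the t-statistic for one slope
    follows a t-distribution with N-p-1 degrees of freedom.

        set.seed(4)
        N <- 30;  p <- 2
        X    <- cbind(1, matrix(rnorm(N*p), N, p))   # fixed design with intercept
        beta <- c(1, 0.5, -1);  sigma <- 1.5
        t.sim <- replicate(10000, {
          y   <- drop(X %*% beta) + rnorm(N, sd=sigma)
          fit <- lm(y ~ X - 1)
          (coef(fit)[2] - beta[2]) / summary(fit)$coefficients[2, "Std. Error"]
        })
        qqplot(qt(ppoints(10000), df=N-p-1), t.sim,
               xlab="t quantiles, df=N-p-1", ylab="simulated t-statistics")
        abline(0, 1, col="red")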
  - Another view of the tj statistic: express it in terms of y = X beta + eps.

    Write:   mu  = E[y] = X beta       assuming the full model is correct,
             eps = y - mu ~ N(0, sigma^2 I_N)

    Recall:  bj = < y, xj. > / |xj.|^2               # abbreviating  xj. = xj.adj

             betaj = E[bj] = E[ < y, xj. > / |xj.|^2 ] = < mu, xj. > / |xj.|^2

             bj - betaj = < y - mu, xj. > / |xj.|^2

             SE.est[bj] = s / |xj.|

        ==>  tj = (bj - betaj) / SE.est[bj] = (bj - betaj) / (s/|xj.|)
                = < y - mu, xj./|xj.| > / s

      --------------------------------------
      |                                    |
      |  tj = < y - mu, xj./|xj.| > / s    |
      |                                    |      For testing H0: betaj = 0
      |     = < eps, xj./|xj.| > / s       |
      |                                    |
      --------------------------------------

    tj is the 'projection' of the error/noise vector eps on the normalized
    adjusted regressor xj.

  - Why is tj a pivot for testing betaj = 0?

    . Definition: A 'PIVOT' is a test statistic whose null distribution does
      NOT depend on the nuisance parameters.

    . Q: What are the nuisance parameters when testing H0: betaj = 0 ?
      A: All betak (k =/= j) and sigma.

    . Is tj a pivot in this sense?  Its null distribution is ...

    . Counter-example: t-statistic of simple regression, unadjusted:

          tj = < y - mu, xj/|xj| > / s            not a pivot?

      This is the t-statistic for simple linear regression on xj, unadjusted.

          H0: betaj = 0  ==>  y = mu + eps,   mu = sum_{k != j} xk betak

      Two choices for the denominator of tj:
        * s from the multiple regression on x0,x1,...,xp, or
        * s from the simple regression on xj.

      If s is from the simple regression, but the full model is true, then

          E[s^2] = E[RSS.simple] / (N-1) = ( |(I-Hj)mu|^2 + (N-1)*sigma^2 ) / (N-1),

      where Hj = projection onto xj and mu = E[y] = sum_{k =/= j} xk betak.
      ==> The distribution of s depends on the betak (k =/= j), hence tj is not
          pivotal.

  - If s is from the multiple regression, then

        E[s^2] = E[RSS.multiple] / (N-p-1)
        RSS.multiple = |r.multiple|^2
        E[r.multiple] = 0,   V[r.multiple] = sigma^2 (I-H)
        r.multiple ~ N(0, sigma^2 (I-H))

    ==> The distribution of s does not depend on the betak (k =/= j), only on
        sigma, which should cancel with the numerator of tj.

  - Numerator of tj:  < y, xj >  (unadjusted)  instead of  < y, xj.adj > :

        E[< y, xj >] = E[< mu, xj >] + E[< eps, xj >] = sum_{k =/= j} betak < xk, xj >
                     = not zero, can be anything, depends on the betak (k =/= j)
        V[< y, xj >] = |xj|^2 sigma^2
        < y, xj > ~ N( sum_{k =/= j} betak < xk, xj >,  |xj|^2 sigma^2 )

    ==> The numerator distribution depends on the nuisance parameters betak (k =/= j).

    Conclusions:
      - Adjustment of xj is essential for testing H0: betaj = 0 in the multiple
        regression on x0,x1,...,xp.  It is needed to achieve pivotality of the
        numerator of tj wrt the nuisance slopes.
      - The denominator (i.e., s) should be based on the full model in order to
        achieve pivotality wrt the nuisance slopes.  This also achieves
        pivotality wrt sigma, which drops out of the tj ratio.

================================================================
* F-TESTS FOR COMPARING TWO ARBITRARY NESTED MODELS:

  - Purpose: testing groups of slopes jointly;
    in particular the overall F-test, where H0: y ~ N(mu*e, sigma^2*I).

  - Null hypotheses tested with F-ratios:

    Partition:  X = (X1, X2)                  (assume X is of full rank)
                beta = (beta1^T, beta2^T)^T
                b    = (b1^T, b2^T)^T
                p+1  = p1 + p2

      NULL HYPOTHESIS H_0:    ---------------
                              |  beta1 = 0  |
                              ---------------

    (Think: beta1 are of interest, educational measures, say;
     beta2 not of interest but a necessary adjustment, for demographics, ...)
    (The intercept is usually among the X2 columns.)

    Associated projections:
      H    = X (X^T X)^{-1} X^T
      H2   = X2 (X2^T X2)^{-1} X2^T
      H1.2 = H - H2 = orth. projection onto the orth. complement of cspan(X2)
                      in cspan(X), i.e., onto cspan(X) intersect cspan(X2)^perp
      X1.2 = (I-H2) X1 = (H-H2) X1 = columns of X1 adjusted for X2
      H1.2 = X1.2 (X1.2^T X1.2)^{-1} X1.2^T
      H    = H1.2 + H2

    Orthonormal bases:
      Q2    = o.n. basis of cspan(X2)                                       [N x p2]
      Q1.2  = o.n. basis of the orth. complement of cspan(X2) in cspan(X)   [N x p1]
      Q     = (Q1.2, Q2) = o.n. basis of cspan(X)                           [N x (p1+p2) = N x (p+1)]
      Q.res = orthonormal basis of cspan(X)^perp

    NOTE:  cspan(Q1.2) =/= cspan(X1) !!!
           cspan(Q1.2) == cspan(X1.2) = Range(H-H2) = Range(H1.2)

    Facts:  H2   = Q2 Q2^T
            H1.2 = Q1.2 Q1.2^T
            H    = Q Q^T
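    A small numerical illustration of these projection facts (the simulated X1
    and X2 below are arbitrary examples):

        set.seed(5)
        N  <- 20
        X2 <- cbind(1, rnorm(N))                     # X2 includes the intercept
        X1 <- cbind(rnorm(N), rnorm(N))
        X  <- cbind(X1, X2)
        proj <- function(A) A %*% solve(crossprod(A)) %*% t(A)   # projection onto cspan(A)
        H    <- proj(X);  H2 <- proj(X2);  H1.2 <- H - H2
        X1.2 <- (diag(N) - H2) %*% X1                # X1 adjusted for X2
        max(abs(H1.2 - proj(X1.2)))                  # ~ 0:  H - H2 projects onto cspan(X1.2)
        max(abs(H1.2 %*% H1.2 - H1.2))               # ~ 0:  H1.2 is itself a projection
        max(abs(crossprod(X1.2, X2)))                # ~ 0:  cspan(X1.2) is orthogonal to cspan(X2)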
      NULL HYPOTHESIS H_0:    ---------------
                              |  beta1 = 0  |
                              ---------------

    With H_0 we use a hypothetical assumption to answer the question whether X1
    matters: "Are the columns of X1 statistically significant above and beyond
    X2?  Do we need them?"

    H_0 in terms of mu := E[y]:      -------------------
                                     |  mu = X2 beta2  |
                                     -------------------

    Equivalent:                      -------------------
                                     |  H1.2 mu = 0    |
                                     -------------------

  - Estimate of H1.2 mu:  H1.2 y

    Under H_0:     -------------------------
                   |  H1.2 y = H1.2 eps    |      due to error alone, not mu
                   -------------------------

  - The F-statistic is based on:

      yhat - yhat2 = (H-H2) y = H1.2 y = H1.2 eps
                                         ^ under the null hypothesis H0
        H1.2 y = H1.2( X2 beta2 + eps ) = H1.2 eps
        dfs = p1 = dim(Range(H1.2))
        E[ |H1.2 y|^2 / p1 ] = sigma^2                 under null hyp. H0

      r = y - yhat = (I-H) y = (I-H) eps
                               ^ assuming 1st order correctness of the full model
        (I-H) y = (I-H)(X beta + eps) = (I-H) eps
        dfs = N-p-1
        E[ |(I-H) y|^2 / (N-p-1) ] = sigma^2           under 1st+2nd order correctness

  - Def. of F-distribution:   W1,...,Wq, Z1,...,Zr  i.i.d. N(0,sigma^2)
                                                               ^^^^^ irrelevant but same

              (W1^2+...+Wq^2)/q
        F  = -------------------     has a (central) F-distribution with q,r dfs
              (Z1^2+...+Zr^2)/r

    Btw:  W1^2+...+Wq^2 is also called a chi-square with q dfs;
          Z1^2+...+Zr^2 is hence a chi-square with r dfs.
    To define an F-distribution, we need two independent chi-square variables.

    [ Limiting cases of F:   q --> Inf, r fixed ...
                             r --> Inf, q fixed ...
                             q --> Inf, r --> Inf ... ]

  - The F-ratio has numerator and denominator as follows:

      Numerator   = | Q1.2 Q1.2^T y |^2 / p1
                  = | Q1.2^T eps |^2 / p1             (under null hyp. H0)
                  = mean of p1 iid normals^2:         Q1.2^T eps = (W1,...,Wq)^T,  q = p1

      Denominator = | Q.res Q.res^T y |^2 / (N-p-1)
                  = | Q.res^T eps |^2 / (N-p-1)       (if the model is 1st order correct)
                  = mean of (N-p-1) iid normals^2:    Q.res^T y = (Z1,...,Zr)^T,  r = N-p-1

    The iid normals of the numerator are independent of the iid normals of the
    denominator because cspan(Q.res) and cspan(Q1.2) = cspan(X1.2) are orthogonal.

  - Special case of a single tested predictor (p1=1):  F = t^2
    (recall: t = Z/sqrt((Z1^2+...+Zr^2)/r))

  - Pivotality of F: independence of its distribution under H0 from the
    nuisance parameters (AYT?)
    What are the nuisance parameters?
    Why do they drop out of the null distribution of F?

  - Special case where p1=p, X2 = the intercept column alone ==> overall F-test
      X1.2 = all real predictors centered (mean adjusted)
      H2   = 'mean projection'

  - Rules for 95% quantiles of F-distributions?

    . "2 for t": works for about dfs >= 25
        qt(.975, df=1:100)           # cross-over below 2 is near df=60

    . F-distribution:
        df2 <- 5:30
        plot(x=range(df2), y=c(0,10), type="n", xlab="df2", ylab="F")
        for(df1 in 1:10) {
          lines(x=df2, y=qf(.95, df1=df1, df2=df2))
          text(x=df2[1], y=qf(.95, df1=df1, df2=df2[1]), lab=df1, cex=.8, adj=1.2)
        }
      No apparent simple rule for F thresholds.
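    Sketch in R (simulated data; the variable names and sizes are arbitrary
    illustrative choices): the F-ratio built from the nested projections agrees
    with the nested-model F-test reported by anova().

        set.seed(6)
        N   <- 40
        x1a <- rnorm(N);  x1b <- rnorm(N);  x2 <- rnorm(N)
        y   <- 1 + 0.5*x2 + rnorm(N)             # data generated under H0: beta1 = 0
        fit.full <- lm(y ~ x1a + x1b + x2)       # X = (X1, X2) with X1 = (x1a, x1b)
        fit.red  <- lm(y ~ x2)                   # the X2-only model (intercept in X2)
        RSS.full <- sum(resid(fit.full)^2)       # |(I-H) y|^2
        RSS.red  <- sum(resid(fit.red)^2)        # |(I-H2) y|^2, so RSS.red - RSS.full = |H1.2 y|^2
        p1 <- 2
        F.hand <- ((RSS.red - RSS.full)/p1) / (RSS.full/df.residual(fit.full))
        F.hand                                   # |H1.2 y|^2/p1  over  |(I-H)y|^2/(N-p-1)
        anova(fit.red, fit.full)                 # same F statistic and its p-value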
  - Generalization: Construct the F-statistic for

      ---------------------------
      |  H_0: beta1 = beta1_0   |        as opposed to: beta1_0 = 0
      ---------------------------

    Hint:  Reduce to the case of testing beta1 = 0.
    Trick:  y              = X1 beta1_0 + X2 beta2 + eps
            y - X1 beta1_0 =              X2 beta2 + eps
    I.e., use a new response:  y - X1 beta1_0

              | H1.2 (y - X1 beta1_0) |^2 / p1
        F  = ----------------------------------
                 | (I-H) y |^2 / (N-p-1)

    ==> Can be inverted to a confidence region for beta1.
        Cut-off: qf(1-alpha, df1=..., df2=...)

      CR(b1) = { beta1 : | H1.2 y - X1.2 beta1 |^2                  <= p1 * s^2 * qf(1-alpha) }
             = { beta1 : | X1.2 b1 - X1.2 beta1 |^2                 <= p1 * s^2 * qf(1-alpha) }
             = { beta1 : | X1.2 (b1 - beta1) |^2                    <= p1 * s^2 * qf(1-alpha) }
             = { beta1 : (beta1 - b1)^T (X1.2^T X1.2) (beta1 - b1)  <= p1 * s^2 * qf(1-alpha) }

================================================================
* CIs AND PIs FOR THE RESPONSE:

  - Inference for the response vs. prediction of the response:  CIs vs PIs

    . Let xx be a row (!) from the design matrix X written as a column,
      or a new set of predictor values.
      Ex.: When modeling the car data with  MPG = b0 + b1*WEIGHT + b2*HP
           we have  xx = (1, WEIGHT, HP)^T.
           We might want to know something about y = MPG at xx = (1, 3000, 200)^T:
             yhat(xx) = xx^T b    where b = (b0,b1,b2)^T.
           You can judge assertions such as:
             "It is thought in the automotive industry that cars with a weight
              of 3000 lb and 200 HP get ON AVERAGE 27 MPG on the highway."
           This is a null hypothesis!

    . INFERENCE for the response is really inference for the EXPECTED VALUE E[y(xx)]:

          E[y(xx)] = mu(xx) = xx^T beta

      We look for an interval that contains the true mu(xx) = xx^T beta with
      prob 95%.  This interval is called a 'confidence interval' or 'CI' for
      mu(xx) = xx^T beta.  It is of the form  yhat(xx) +- halfwidth.

    . PREDICTION of the response is
        + either point prediction of y(xx), again using  yhat(xx) = xx^T b,
        + or interval prediction of the form  yhat(xx) +- halfwidth,
      but with the idea that the interval captures 95% of actual observations y
      at xx, if such could be observed.  In this case the interval is called a
      'prediction interval' or 'PI'.  Its gambling guarantee should be to catch
      future response values y at xx based on past training data with
      probability 95%.
      (Both future response values and past data are random!)
      You can judge assertions such as:
        "It is thought in the automotive industry that cars with a weight of
         3000 lb and 200 HP get between 24 and 30 MPG on the highway."
      This speaks about actual car models.

    . Q: Which is going to be wider, the CI or the PI?
      (See the R sketch below.)
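      In R, both intervals come from predict(); the formulas behind them are
      derived next.  (A sketch using R's built-in mtcars data as a stand-in for
      the car data; note mtcars$wt is in units of 1000 lb, so weight 3000 lb is
      wt = 3.)

          fit <- lm(mpg ~ wt + hp, data=mtcars)
          xx  <- data.frame(wt=3, hp=200)                   # new predictor values
          predict(fit, newdata=xx, interval="confidence")   # CI for E[y(xx)] = xx^T beta
          predict(fit, newdata=xx, interval="prediction")   # PI for a future y(xx): wider
          # By hand:  stderr.est(yhat(xx)) = s * sqrt(xx^T (X^T X)^{-1} xx)
          X     <- model.matrix(fit);  x0 <- c(1, 3, 200)
          s     <- summary(fit)$sigma
          yhat0 <- sum(x0 * coef(fit))
          se0   <- drop(s * sqrt(t(x0) %*% solve(crossprod(X)) %*% x0))
          yhat0 + c(-1,1) * qt(.975, df=df.residual(fit)) * se0   # matches the CI above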
  - Inference for E[y(xx)] = xx^T beta:

    . Develop a test statistic:

        yhat(xx) = xx^T b              ASSUMPTION: THE MODEL IS CORRECT !!!
        E[yhat(xx)]   = xx^T beta = E[y(xx)] = mu(xx)
        Var[yhat(xx)] = Var[xx^T b] = Var[xx^T (X^T X)^{-1} X^T y]
                      = sigma^2 (xx^T (X^T X)^{-1} xx)

        ==>  stderr.est(yhat(xx)) = s sqrt(xx^T (X^T X)^{-1} xx)
                                           ^^^^^^^^^^^^^^^^^^^^
                                           =: H(xx),   H(xx_i) = H_ii

               yhat(xx) - mu(xx)
        ==>  ----------------------   is t-distributed with df = N-p-1
             stderr.est(yhat(xx))

             Why? ...

        ==>  Can be inverted to CIs and retention regions:
               CI = 95% interval for mu(xx) = xx^T beta
               RI = 95% coverage interval for yhat(xx) under H_0: E[y(xx)] = mu0(xx)

    . Inference:

      + Note that for xx = xi (= i'th row of X), this is
          stderr.est(yhat(xx)) = s sqrt(Hii)

      + Inversion of tests to CIs: This is a general procedure that applies to
        any parameter test:

          P[ -q < t < +q | H_0 ] = .95     (e.g.)

          -q < t < +q
          -q*stderr.est(yhat(xx))  <  yhat(xx) - xx^T beta  <  +q*stderr.est(yhat(xx))

        Retention interval for H_0: xx^T beta = mu0(xx) (an assumed fixed value):

          mu0(xx) - q*stderr.est(yhat(xx))  <  yhat(xx)  <  mu0(xx) + q*stderr.est(yhat(xx))
          yhat(xx) in [ mu0(xx) +- q*stderr.est(yhat(xx)) ]

        Confidence interval for xx^T beta:

          yhat(xx) - q*stderr.est(yhat(xx))  <  xx^T beta  <  yhat(xx) + q*stderr.est(yhat(xx))

        Written more simply:
          xx^T beta in [ yhat(xx) +- q*stderr.est(yhat(xx)) ]

      + CI:  95%CI = [ yhat(xx) +- 2 stderr.est(yhat(xx)) ]
        This CI captures the true E[y(xx)] = mu(xx) = xx^T beta for 95% of datasets.
        Uses: Confidence bands around fitted lines in simple regression

      + Testing H_0, the above example: See earlier "It is thought...", first instance.

          H_0: E[MPG(WEIGHT=3000, HP=200)] = mu0(xx) = 27 mpg   =>  xx = (1,3000,200)^T
          t = (yhat(xx) - mu0(xx)) / stderr.est(yhat(xx))
          Reject at ~5% significance when |t| > 2.

      + '2' in CIs and retention intervals should really be:  qt(p=.975, df=N-p-1)
        More generally:  qt(p=1-alpha/2, df=N-p-1)  where alpha = 0.05, 0.01, 0.001, ...

      + Ex.: Simple linear regression with one predictor x:
        calculate (X^T X)^{-1} and plot CI(x) as a function of x.

  - Prediction intervals based on yhat(xx):

    . Let y(xx) be the random variable of FUTURE response values at xx,
      whereas yhat(xx) is obtained from PAST data.
    . Assume the model is correct, wrt first and second moments and normality,
      for past and future data.
    . Wanted: an interval around yhat(xx) that has a 95% chance of catching y(xx).
    . Approach: Find the distribution of y(xx) - yhat(xx).

        E[ y(xx) - yhat(xx) ] = 0
        V[ y(xx) - yhat(xx) ] = V[ y(xx) ] + V[ yhat(xx) ]                (*)
                              = sigma^2 + stderr(yhat(xx))^2
                              = sigma^2 + sigma^2*(xx^T (X^T X)^{-1} xx)
                              = sigma^2 * (1 + xx^T (X^T X)^{-1} xx)

      Why this nice Pythagoras in (*) above?
      Reason: Future and past data are independent (or so we assume).
              Future y(xx) is independent of yhat(xx) estimated from past data.

    . Assuming past and future response values are normally distributed:

                  y(xx) - yhat(xx)
        t = ----------------------------------   has a t-distribution with df = N-p-1
            s*sqrt(1 + xx^T (X^T X)^{-1} xx)

      Reasons: The Gaussian assumption AND the numerator is independent of the
      denominator because
        - s is computed from the residuals of past data y.
        - yhat(xx) is computed from the projection of past y onto the X-space.
        - y(xx) is future data at xx, independent of past data.
        => s is independent of yhat(xx) and y(xx)
           (recall: V[r,yhat] = 0 => under normality r and yhat are independent)

    . PI(xx) = [ yhat(xx) +- 2*s*sqrt( 1 + xx^T (X^T X)^{-1} xx ) ]
      (2 should be a 1-alpha/2 t-quantile.)

================================================================
end of class