================================================================

CHAPTER 4: PRACTICAL ASPECTS OF LINEAR MODELS -- COMPUTATIONS, INTERPRETATIONS

* R: Formula language for modeling, with excursions

- Create some toy 'data':

  N   <- 1000                    # (Programming: Do not hardcode constants such as this one!)
  x1  <- round(runif(N)*100)     # Continuous predictor
  x2  <- round(rnorm(N,m=71,s=3), dig=.5)                             # Continuous predictor
  x3  <- sample(c("m","f","c"), size=N, replace=T, prob=c(.6,.3,.1))  # Categorical predictor
  eps <- rnorm(N,s=1)            # 'Error'
  myd <- data.frame(x1=x1, x2=x2, x3=x3,
                    y = round(10 + .1*x1 + 0*x2 + c("m"=0,"f"=-6,"c"=-10)[x3] + eps, dig=.01))

- Excursion: Basic steps for getting acquainted with a new dataframe !!!!!!!!!!!!!!!

  # Dataframe or matrix or what?
  class(myd)
  # How large is it?
  dim(myd)
  # What are the variable names?
  names(myd)
  # Look at the first 6 rows:
  head(myd)
  # Look at the last 6 rows:
  tail(myd)
  # General purpose:
  summary(myd)

- Formulas to express linear models:

  # Use 'lm()' to fit a linear model:
  lm(y ~ x1 + x2 + x3, data=myd)
  # What is this, though, !%@^&*%?
  summary(lm(y ~ x1 + x2 + x3, data=myd))   # Much better!

  !!! Everything is a first-class citizen: MODEL FORMULAS, MODELS, SUMMARIES.
      They "exist" because they are objects that can be assigned to variables.
  Examples:
  myf <- y ~ x1 + x2 + x3;   class(myf)
  mym <- lm(myf, data=myd);  class(mym)
  mym                        # uninformative
  mym.smry <- summary(mym);  class(mym.smry)
  mym.smry                   # Nice printing, but what's really in it?
  is.list(mym.smry)          # indeed, a list, hence ask for ..
  names(mym.smry)
  mym.smry$coeff
  class(mym.smry$coeff)
  dimnames(mym.smry$coeff)
  # Pieces we need for inference of coefficients:
  mym.smry$coeff[,"Estimate"]
  mym.smry$coeff[,"Std. Error"]
  mym.smry$coeff[,"t value"]
  mym.smry$coeff[,"Pr(>|t|)"]

  Ways to explore a model object:
  class(mym)
  is.list(mym)
  names(mym)          # Components of the list.
  sapply(mym, class)

  Similar for summary:
  class(mym.smry)
  is.list(mym.smry)
  names(mym.smry)     # Components of the list.
  sapply(mym.smry, class)

  What about formulas?
  is.list(myf)
  is.vector(myf)
  is.character(myf)
  is.expression(myf)
  as.character(myf)   # coercion
  length(myf)
  myf[1]
  myf[2]
  myf[3]
  # Important: formulas are not character strings !!!!!!!!
  # They have an internal representation that involves 3 parts,
  # but all we need to know is that formulas feed into
  # modeling (and some plotting) functions.

- "Operator overloading": new meanings for +, -, 1, *, :, /, ^, . in model formulas
    '+' means 'include'
    '*' means 'include with interactions'
    ':' means 'include these specific interactions'
    '/' means 'include nested terms' (a/b expands to a + a:b)
    '^' means 'include interactions up to the given order', e.g., (x1+x2+x3)^2
    '.' means 'include all predictors' (all variables other than the response)
    '-' means 'exclude the following predictor'
    '1' means 'intercept'

- To prevent overloading: I(...), e.g., + I((x1+x2)^2)

- Linear models with categorical predictors:

  summary(lm(y ~ x3, data=myd))

  . Observation: The complete collection of dummies of a categorical variable
    adds up to the intercept vector, hence we'd have complete collinearity.
    (See the model.matrix() sketch after this list.)
  . Standard remedy: Drop one dummy.  It becomes the 'reference category'.
  . Which category gets dropped by R?   A: The first one.
  . Which category is the first one?    A: The first one listed in 'levels()':
      levels(myd[,3])
    Oh, this column is not a factor!
      class(myd[,3])
    But 'lm()' coerces 'myd[,3]' into a 'factor', and it sorts the levels alphabetically:
      levels(factor(myd[,3]))
    ==> 'c' gets dropped.
  . How can we control which factor level is the first one?
    A: We make the variable a factor and set the levels in the order we want:
      myd[,3] <- factor(myd[,3], levels=c("m","f","c"))   #!!!!!!!
      table(myd[,3])
      summary(lm(y ~ x3, data=myd))
    ==> Dropped 'm', which is the first category.
  . WARNING!!! Do NOT set levels as follows when myd[,3] already is a factor:
      levels(myd[,3]) <- c("c","f","m")   #!!!!!!!!
      levels(myd[,3])
      table(myd[,3])   # What a mess!
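  A minimal sketch of the dummy coding, assuming the toy data 'myd' from above
  (re-run the factor() assignment first if you executed the WARNING example):

  # Design matrix with the default treatment coding (first level dropped):
  X <- model.matrix(y ~ x3, data=myd)
  head(X)
  # Full set of dummies, no intercept: the dummy columns sum to 1 in every row,
  # i.e., they add up to the intercept column -- the collinearity noted above.
  X.all <- model.matrix(y ~ x3 - 1, data=myd)
  head(X.all)
  all(rowSums(X.all) == 1)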
- Linear models with interactions:

  # y = beta0 + beta1*x1 + beta2*x2 + beta12*x1*x2 + eps
  summary(lm(y ~ x1 * x2, data=myd))
  summary(lm(y ~ x1 + x2 + x1:x2, data=myd))        # ditto
  summary(lm(y ~ x1 + x2 + I(x1 * x2), data=myd))   # ditto
  summary(lm(y ~ x1 * x2 * x3, data=myd))           # that's different

- Interaction of continuous and categorical predictors:

  summary(lm(y ~ x1 * x3, data=myd))

================================================================

* INTERPRETATION OF ESTIMATES AND TESTS:

- Unlike courses taught by Paul Rosenbaum and Dylan Small, we in Stat 961
  are NOT into causality.  All associations are just that, NOT causations!

- Traps:
  . Causality:                     "When x1 is increased by one unit, ..."
  . Observations versus averages:  "... y increases by ..."
  . Estimates vs parameters: averages
    ==> observational data, where x1 cannot be manipulated and where y is noisy,
        hence there are only averages, but even these averages are not "known",
        only estimated...

- Interpretations of regression estimates:

  . Slope interpretation: the issue of ...
        ----------------------
        | "ceteris paribus"  |
        ----------------------
    (English speakers pronounce this Latin expression with "k" for "c"
     and stress on the first syllables.)
  . "bj = est. ave. difference in y for pairs of objects with a unit difference
     in xj when all other predictors are at the same levels."
  . "bj = est. ave. difference in y for pairs of cases that ONLY differ by a unit in xj."
  . "Unit change in xj" and "holding all other predictors fixed" are formulations
    often used but misleading because of their implication of active experimentation.

- Interpretation of other regression estimates:

  . Interpretation of b0?          "est. ave. y at x=0"
  . Interpretation of stderr(b1)?  !!!!!!!!!!!!!!!!!!!!!!!!!!!
       ("est. stdev of b1 varying from dataset to dataset")
       (caveat: conditioning on x, same X, random y)
  . Interpretation of t1?
       ("t1 = b1/stderr(b1) = dist from 0 in multiples of stderrs")
       (see the sketch after this list)
  . Interpretation of p-value(b1)?
       ("prob of observing a t1 more extreme than the observed value,
         dataset to dataset, under the assumption beta1=0")
       (Warning! This is not a parameter but a random variable.  Why?)
  . Interpretation of s=RMSE?  (R's "Resid. stderr" is a bad term)
       ("est. stdev of errors/noise/epsilons")
  . Interpretation of R Square?
       ("fraction of variation in y accounted for by the model")
       (avoid: "explained by")
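  A minimal sketch of how these quantities fit together, recomputed by hand from
  the coefficient table (assuming 'mym' and 'mym.smry' from the excursion above):

  b    <- mym.smry$coeff[,"Estimate"]
  se   <- mym.smry$coeff[,"Std. Error"]
  tval <- b / se                                   # t = b/stderr(b)
  pval <- 2 * pt(-abs(tval), df=mym$df.residual)   # two-sided p-values
  cbind(tval, mym.smry$coeff[,"t value"])          # should agree
  cbind(pval, mym.smry$coeff[,"Pr(>|t|)"])         # should agree
  cbind(lower = b - 2*se, upper = b + 2*se)        # rough 95% ranges; compare confint(mym)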
- More background on slope interpretations:

  . Example: Car models, x1 = Weight, x2 = HP, y = MPG.
    The slope of x1 estimates the average difference in MPG for pairs of
    car models that differ by one Weight unit but have the same HP.
  . Fine point: For different pairs, it doesn't have to be the same HP.
    It's just within a pair that HP has to be the same.
  . The previous point is due to linearity of the model:
    You can think of the slopes as partial derivatives.
    For linear functions f(x) = b0+b1*x1+...+bp*xp the partial derivative
    wrt xj is bj, and it is the same everywhere.
    For nonlinear models the situation would be more difficult
    because partial derivatives would not be constant.
  . The association reflected by a slope is as follows:
      delta xj |--> est. ave. delta y = bj * delta xj
    at fixed levels of all other xk.
  . If xj is a dummy variable, "delta xj = 1" is a two-group mean comparison:
      bj = est. ave. difference between group 1 and group 0
           when all other predictors are the same.
  . Use differences (delta xj) other than a unit difference in the following cases:
    + The unit difference is too small to matter.
      Example: xj = weight of a car in lb.
      Better use delta xj = 100 or 1000 lb.
    + The unit difference is too large and implies extrapolation.
      Example: xj = weight of small diamonds in carats.
      If the range of the weights is 1/10 to 1/3 carat, then a 1 carat difference
      has not been observed and implies extrapolation.
      Use delta xj = 1/10 carat instead.
  . Intercepts are often extrapolations as well.  Examples:
    + cars with zero HP and zero weight
    + diamonds with zero weight

- Linear models and nonlinear associations:

  . Many types of nonlinearities can be modeled with variable transformations
    such as logarithms and roots and with interaction terms.  We consider:
    + Additive nonlinearity
    + Interactions
    They are modeled by forming new predictors.

  . Additive nonlinearity:   xj |--> mu(...,xj,...) = ... + bj*f(xj) + ...
    + Frequent cases:
      o f(xj) = log(xj)
      o f(xj) = sqrt(xj)
      o f(xj) ~ xj + xj^2
      ==> The first two only for predictors whose values are >0 and >=0, resp.
    + You will have to interpret what the meaning of bj is:
      In both cases you should not interpret additive differences in xj
      but multiplicative differences instead.
      E.g., if two cases differ in xj by 10%, i.e., by a factor 1.1,
      the estimated average difference in y is
      o bj*(log(xj*1.1) - log(xj)) = bj*log(1.1) ~ bj*0.1
      o bj*(sqrt(xj*1.1) - sqrt(xj)) = bj*sqrt(xj)*(sqrt(1.1)-sqrt(1)) ~ bj*sqrt(xj)*0.05
                                          ^^^^^^^^^^^
                                          bj*sqrt(xj) is the term in the fitted equation
                                          for the case with the lower xj value.
    + Note that the ceteris paribus clause holds like in linear models:
      We compare pairs of cases with a 10% difference in xj but the same values
      on all other xk.  Again these 'same values' within a pair can differ
      between pairs.  Why?  Because even in additive models the partial derivative
      wrt xj does not depend on the other variables:
        (d/d xj) (f1(x1)+...+fj(xj)+...+fp(xp)) = (d/d xj) fj(xj)

  . Interaction nonlinearities:
    + The simplest form of interaction between two predictors is multiplicative,
      i.e., one introduces new predictors which are the products of original predictors.
    + Consider a 'first order interaction', i.e., an interaction in two predictors
      xj and xk:
        E[y(x)] = mu(...,xj,...,xk,...) = ... + bj*xj + bk*xk + bjk*xj*xk + ...
    + Note that we now have p+2 predictors:
        x0, x1, ..., xj, ..., xk, ..., xp, xj*xk
                                           ^^^^^
                                           the (p+2)nd predictor
    + IMPORTANT IMPLICATION OF INTERACTIONS:
      'ceteris paribus' is no longer possible for the interacting predictors:
        ---------------------------------------------------
        | It is NOT possible to ask that xj*xk be the same |
        | when there is a difference in xj.                |
        ---------------------------------------------------
      Q: How are we going to interpret the slopes bj, bk and bjk?
    + If xj and xk are both quantitative, then the estimated interaction bjk
      is interpreted as a correction to the slopes bj and bk:
      o bj*xj + bk*xk + bjk*xj*xk = (bj+bjk*xk)*xj + bk*xk
        ==> The slope of xj depends on xk through bj(xk) := (bj+bjk*xk)
      o bj*xj + bk*xk + bjk*xj*xk = bj*xj + (bk+bjk*xj)*xk
        ==> The slope of xk depends on xj through bk(xj) := (bk+bjk*xj)
    + If xj is quantitative and xk is a dummy (i.e., 0,1 valued), we have the
      following two interpretations (a small simulated illustration in R follows
      at the end of this section):
      o bj(xk) = (bj+bjk*xk) = bj       when xk=0
        bj(xk) = (bj+bjk*xk) = bj+bjk   when xk=1
        Thus the interaction term allows for different xj slopes in the two xk groups:
          bjk = slope difference for xj between the xk groups
        ==> We can test whether the two groups require different slopes
            by testing the null hypothesis H0: betajk = 0.
      o bk(xj) = (bk+bjk*xj) is the estimated mean difference between group xk=1
        and group xk=0.  This estimated mean difference is a function of xj.
        ==> We can test whether the mean difference between the xk groups depends
            on xj by testing the same null hypothesis H0: betajk = 0.
    + Both xj and xk are dummies:
        bj  = estimated average difference between groups xj=1 and xj=0
              in the group xk=0.
        bk  = estimated average difference between groups xk=1 and xk=0
              in the group xj=0.
        bjk = estimated correction when comparing groups (xj=0;xk=0) and (xj=1;xk=1).
              Additively the est. ave. diff. should be bj+bk, but if the two groups
              'interact' (have some sort of 'synergy'), this needs to be corrected
              to bj+bk+bjk.
      Example: xj ~ drug A administered (1) or not (0)
               xk ~ drug B administered (1) or not (0)
        ==> bjk ~ synergy between drugs A and B
            bjk > 0: positive synergy (they strengthen each other's effects)
            bjk < 0: negative synergy (they attenuate each other's effects)
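  A minimal illustration of the quantitative-by-dummy case, on simulated data
  (hypothetical, not part of the toy data above): the interaction coefficient
  is the slope difference between the two groups.

  set.seed(1)
  n <- 200
  x <- runif(n, 0, 10)
  g <- rbinom(n, 1, .5)                        # dummy predictor
  y <- 2 + 1.0*x + 3*g + 0.5*g*x + rnorm(n)    # true slopes: 1.0 in group 0, 1.5 in group 1
  fit <- lm(y ~ x * g)
  coef(fit)                     # coef of x  ~ slope in group g=0;
                                # coef of x:g ~ slope difference (group 1 minus group 0)
  summary(fit)$coeff["x:g",]    # test of H0: interaction = 0
  # Group-specific slopes recovered explicitly:
  coef(lm(y ~ x, subset = (g==0)))["x"]
  coef(lm(y ~ x, subset = (g==1)))["x"]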
* 'MBA' STATISTICS:

- The ranges of random variables and observations are often of one of the following types:

  . If a variable (x or y)
    + takes on only positive values, and
    + changes/differences are expressed in percentages/multiples
      as opposed to numeric amounts,
    ==> "RATIO SCALE" variable
  . If a variable
    + can take on positive and negative values, and
    + changes/differences are expressed in numeric amounts,
    ==> "INTERVAL SCALE"
  . If a variable
    + is confined to [0,1] (or [0,100] if expressed as %),
    ==> "PROBABILITY/PROPORTION SCALE"
  . If a variable
    + is confined to the non-negative integers,
    ==> "COUNTING SCALE"
  . If a variable
    + is confined to {0,1},
    ==> "BINARY VARIABLE" for 2-group comparisons

- Examples of each?  (Anything not covered?)
  ... ratio:             salaries, other compensations such as bonuses, investments, ...
      interval:          financial log-returns, coordinate axes for physical space
      probability scale: relative frequencies as prob estimates, election returns
      counting scale:    number of ordered items

- Notes on ratio scales:

  . Changes/gains/losses/differences are expressed as
    + factor changes:  'doubling', 'tripling', 'halving', ...
                       "production is up by a factor 1.3"
    + percent changes: 'salary is up 5%', 'sales declined 20%', ...
                       "production is up by 30%"
    Percentages are equivalent ways to express multiplicative differences.
  . Questions/practice: Translate back and forth between factor changes and percent changes:
    + 'an increase of 100%' means what factor change? ...
    + 'tripling income' means what percent change? ...
    + 'the internet grew 140%' means what factor change? ...
    + 'stocks lost 3%' means what factor change? ...
    + Interpret the following as investment returns over 6 periods:
        (1) +10%, +10%, +10%, -10%, -10%, -10%
        (2) +10%, -10%, +10%, -10%, +10%, -10%
      Questions:
        o Who is better off?
        o Do you end up at 100% or less or more?
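  A quick check of the investment-returns practice question, assuming the
  returns compound multiplicatively:

  r1 <- c(+.10, +.10, +.10, -.10, -.10, -.10)
  r2 <- c(+.10, -.10, +.10, -.10, +.10, -.10)
  prod(1 + r1)   # the order does not matter, ...
  prod(1 + r2)   # ... multiplication is commutative
  # Both end at (1.1*0.9)^3 = 0.99^3 ~ 0.97 of the starting value, i.e., below 100%.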
- Logarithms:

  . Their principal role is to map RATIO SCALES to INTERVAL SCALES:
    they map multiplicative factor changes/differences
    to additive amount changes/differences.
  . Linear Approximation of Logarithms: the "small change/difference approximation"
      log(1+z) ~ z   for small z
    (convention: small means |z|<.1, up to +-10%) and NATURAL log !!!
    Consequence:
      log(y*(1+z)) = log(y) + log(1+z) ~ log(y) + z
      ------------------------------------------
      | RULE: A difference in log(y) by .01    |
      |       corresponds approximately to     |
      |       a difference in y by 1%          |
      ------------------------------------------
    - Scale this rule from -10% to +10%.
    - This holds only for the natural log!!!!  Why? ...
  . Caveats about logarithmic transforms of y:
    Means of logs do not correspond to means of unlogged values:
      mean(log(y)) != log(mean(y))     [Is there an inequality here?]
    However:
      median(log(y)) == log(median(y))
    ==> In general, if "ave. est." is interpreted as "median est.",
        one can arbitrarily transform with monotone functions.
    (For the normal and any symmetric distribution we have 'mean=median',
     but not for skewed distributions.)

#================================================================
#
#
# - MODELS OF EXPONENTIAL GROWTH/DECAY/APPRECIATION/DEPRECIATION:
#
#   . Example from the crazy dotcom years: growth of the number of web servers
URL  <- "http://www-stat.wharton.upenn.edu/~buja/STAT-961/web-servers.txt"
webs <- read.table(URL, header=T)
colnames(webs)
# [1] "Web.Servers"        "Log.Web.Servers"    "Yrs.since.Jan.97"
# [4] "Time"               "Year"               "Month"
# [7] "Time.of.Exp.Growth"
plot(webs[,c("Time","Web.Servers")], type="l", col="blue")
points(webs[,c("Time","Web.Servers")], pch=16, cex=.5)
# This is not all the way exponential, but it might be piecewise.
# Extract the span in which exponential growth was supposed
# to have happened, 1997 - end of 2000:
webs.red <- webs[as.logical(webs[,"Time.of.Exp.Growth"]),
                 c("Yrs.since.Jan.97","Web.Servers")]
colnames(webs.red) <- c("year","nserv")
plot(webs.red, type="b", pch=16, cex=.5)   # Zero is Jan 1997.
# Exponential growth calls for logging the response:
plot(webs.red, type="b", log="y", pch=16, cex=.5)
# Or:
plot(webs.red[,1], log10(webs.red[,2]), ylab="Log10", type="b", xlab="yr", pch=16, cex=.5)
# So we model log(y) ~ b0 + b1 * x + error
abline(lm(log10(nserv) ~ year, data=webs.red), lwd=2)
#
#   . Meaning: Multiplicative Model
summary(lm(log(nserv) ~ year, data=webs.red))
## ----
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.408146   0.021395   626.7   <2e-16 ***
## year         0.900178   0.008834   101.9   <2e-16 ***
## ----
# Translation of an additive model on the log-scale
# to a multiplicative model on the original scale:
#   log(nserv) ~ 13.4 + 0.9*year + r
#   nserv      ~ exp( 13.4 ) * exp( 0.9*year )  * exp( r )
#              =  exp( 13.4 ) * exp( 0.9 )^year * exp( r )
#              ~  660000      * 2.46^year       * exp( r )
plot(webs.red, pch=16, xlab="yr", ylab="nserv")
lines(x=webs.red[,1], y=660000*2.46^webs.red[,1])
# Not perfect but adequate for now because we are more
# interested in interpretation.
#
#   . Interpretation: We start with 660,000 servers in Jan 1997.
#     From there on we have a yearly increase by a factor of 2.46,
#     i.e., an increase of 146% per year.
#
#   . Multiplicative errors: Make sense when the error
#     is proportional to the size of y(x) (or E[y(x)] = mu(x), really).
#     (The size of mice varies differently from the size of elephants.)
#     [Another way to accommodate size-dependent error would be
#      with direct modeling of the heteroscedasticity:
#        y = b0 + b1*x + eps,  where eps ~ N(0, x^2*sigma^2)  (e.g.)
#      But this does not limit y to positive values,
#      as do multiplicative errors.]
#
#   . Using linear approximation for log(y)-models:
#       log(yhat) = b0 + b1*x
#       difference in x by dx
#       => difference in log(yhat) by b1*dx
#       => "difference" in yhat by b1*dx*100%   if |b1*dx| < 0.1
#
#     Web server data:
#
#     b1 = .9 = (est. ave.) log-increase in webservers per year
#     ==> dx = 1yr is much too large for linear approximation.
#
#     Try dx = 1 month: b1*dx = b1/12 = 0.075
#     ==> Small enough for linear approximation:
#         b1/12*100 = 7.5% = (est. ave.) percentage increase per month
#     [Compounded this should amount to a 146% yearly increase,
#      but here we see the approximation error:
#      this monthly rate compounds to 1.075^12
##     [1] 2.381780
#      instead of 2.46.  The more exact rate would be 2.46^(1/12)
##     [1] 1.077899
#      or closer to a 7.8% increase per month.]
#     Still, linear approximations for log-differences |z|<.1 are commonly used.
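#
#   . A small sketch extracting these numbers from the fit itself
#     (assuming 'webs.red' from above):
fit.web <- lm(log(nserv) ~ year, data=webs.red)
b.web   <- coef(fit.web)
exp(b.web[1])          # baseline number of servers in Jan 1997, ~660,000
exp(b.web[2])          # yearly growth factor, ~2.46 (i.e., +146% per year)
exp(b.web[2])^(1/12)   # exact monthly growth factor, ~1.078 (+7.8% per month)
b.web[2]/12            # linear approximation: ~0.075, i.e., ~7.5% per month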
#
#
#================================================================
#
#
# - MODELS OF LOGARITHMICALLY DIMINISHING RETURNS
#
#   Example: Wine display space; the more space, the more sales,
#   but: going from 1 ft to 2 ft has probably a greater effect
#   than going from 9 ft to 10 ft.
URL  <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/wine.dat"
wine <- read.table(URL, header=T)
colnames(wine)
# [1] "display.feet" "sales"
# Parallel sort according to ft for plotting curves in order of the predictor:
wine <- wine[order(wine[,1]),]
# Response: number of bottles sold
summary(wine)
#
plot(wine[,1], wine[,2], pch=16, xlab="Display ft", ylab="Wine Sales")
## Idea: There may not be a constant increase in sales per unit increase
#  but per percentage increase in display space.
#  ==> take log(x).
plot(log(wine[,1]), wine[,2], pch=16, xlab="log Display ft", ylab="Wine Sales")
# Looks nearly OK.  OK enough to try a linear model:
wine.lm <- lm(sales ~ log(display.feet), data=wine)
abline(wine.lm)
#
# Meaning of the line:
summary(wine.lm)
# ----
# Coefficients:
#                   Estimate Std. Error t value Pr(>|t|)
# (Intercept)         83.560     14.413   5.797 6.24e-07 ***
# log(display.feet)  138.621      9.834  14.096  < 2e-16 ***
# ----
#
# Fitted equation:  (est. ave.) sales ~ 140 * log(display.feet) + 84
#
plot(wine, pch=16, xlab="Display ft", ylab="Wine Sales")
lines(x=wine[,1], y=84+139*log(wine[,1]))
#
# Magic of logs again:
#   . 10% difference in display.feet
#     => (est. ave.) difference in sales is about 140 * 0.1 = 14 bottles
#   . Doubling display feet: log(2) ~ 0.7
#     => (est. ave.) difference in sales is about 140 * 0.7 ~ 100 bottles
#
# General: fit y ~ b0 + b1*log(x)
#   100% difference in x  =>  ~b1*.7  difference in yhat.  (est. ave.)
#    10% difference in x  =>  ~b1/10  difference in yhat.  (est. ave.)
#     1% difference in x  =>  ~b1/100 difference in yhat.  (  "  )
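#
# Checking these rules of thumb against the actual fit, a quick sketch
# (assuming 'wine.lm' from above):
b1.wine <- coef(wine.lm)["log(display.feet)"]
b1.wine * log(1.1)   # ~13 bottles for a 10% difference in display space
b1.wine * log(2)     # ~96 bottles for doubling the display space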
#
#
#================================================================
#
#
# - MODELS OF CONSTANT ELASTICITY:
#
#   Example: FedEx has (had) a delivery service called "CourierPak".
#   At different prices, different numbers of CourierPak deliveries
#   were requested by customers.
URL     <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/Courier.dat"
courier <- read.table(URL, header=T)
colnames(courier)
## [1] "Price"  "Volume"
plot(courier, pch=16)
# One could fit a linear model for starters:
courier.lm <- lm(Volume ~ Price, courier)
summary(courier.lm)
abline(courier.lm)
# Looks quite good, but one doesn't like straight lines for a couple
# of reasons:
# 1) Lines cross into negatives, which does not happen (Vol>0).
#    The minimum volume is zero.
# 2) More importantly, by economic theory it is not a unit increase
#    in Price that is associated with a constant decrease in Volume.
#    It is a PERCENTAGE increase in Price that causes a PERCENTAGE
#    decrease in Volume.
#
courier.log.lm <- lm(log(Volume) ~ log(Price), courier)
summary(courier.log.lm)
# ----
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  8.04930    0.20145  39.957  < 2e-16 ***
# log(Price)  -0.32732    0.07961  -4.112 0.000397 ***
# ----
# The raw Price -> Volume fit and the log(Price) -> log(Volume) fit perform
# about the same, but the world prefers the latter.  Interpretation:
#
#   log(Volume) ~ 8 - 1/3*log(Price) + r
#
#   . Constant elasticity as a power law:
#       Volume = exp(8 - 1/3*log(Price) + r)
#              = exp(8) * Price^{-1/3} * exp(r)
#              = 2981   * Price^{-1/3} * exp(r)
#     In general:  log(y) = b0 + b1*log(x) + r
#                  y      = exp(b0) * x^b1 * exp(r)
#                  exp(b0) = est. median baseline at x=1
#
#   . Small change approximation:
#     Price difference of 1% ~ Volume difference of ... ?
#     Price difference of 1%
#       => log(Price) difference of about 0.01
#       => log(Volume) difference of about (-1/3)*0.01   (|.|<0.1)
#       => Volume difference of about -1/3 %
#
#     In general: a 1% difference in x ~ a difference in yhat of b1%.
#     This is called "constant price elasticity".
#
#     (The approximation requires |b1*0.01| < 0.1.  If b1 is too large for that,
#      choose a smaller relative difference in x:
#      a 0.1% difference in Price => about a b1*0.1% difference in Volume.)
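#
#   . A small sketch of the implied power law on the original scale
#     (assuming 'courier' and 'courier.log.lm' from above):
b.cour <- coef(courier.log.lm)
exp(b.cour[1])                               # estimated median Volume at Price=1
plot(courier, pch=16)
pgrid <- seq(min(courier$Price), max(courier$Price), length=100)
lines(pgrid, exp(b.cour[1]) * pgrid^b.cour[2])   # fitted curve Volume ~ exp(b0)*Price^b1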
#
#
#================================================================

SKIP THE FOLLOWING

* FIXED COSTS/VARIABLE COSTS:

- Example: A manufacturer of 'blocks' receives orders of various quantities
  and records the production costs.  It is clear that total production cost
  is mostly driven by quantity.  But the manufacturer would like to know what
  other factors make some orders more and others less expensive.

- Distinguish:
  . FC: fixed cost (often setup costs)
  . VC: variable cost (cost per additional unit)

- TC = Total Cost model:  Denote the quantity (number of units ordered) by 'Q'.

    TC = beta0 + beta1*Q + beta2*x2 + ... + eps
         ^^^^^   ^^^^^^^
          ??       ??         (?? are for you to fill in: FC, VC)

  lm(Tot.Cost ~ Quantity + x2 + ...)

- Unrealistic error structure:
  . Do you think 'eps' is homoskedastic, with the same variation
    for small and large orders?
  . (The other lack of realism, eps unbounded below, will not be
     resolved in the following proceedings.)

- Predictor issue:
  . Some predictors affect setup costs, hence contribute to TC at the
    per-order level, as the predictor x2 above.
  . HOWEVER, many predictors affect cost on a per-unit basis:
    + amount of material per block,
    + difficulty of material,
    + extra features ordered for the blocks, ...
    ==> Their contribution to total cost should be beta3*Q*x3:

    TC = beta0 + beta1*Q + beta2*x2 + beta3*Q*x3 + ... + eps
                           ^^^^^^^^   ^^^^^^^^^^
                           FC-pred    VC-pred

- New concept: AC = Average Cost := TC/Q
  . Average cost model:
      AC = beta0/Q + beta1 + beta2*x2/Q + beta3*x3 + ... + eps
  . If the AC model is assumed homoskedastic, then the TC model is
    heteroskedastic with larger sigma for larger orders.
  . In AC cost models, the fixed cost is prorated from the quantity Q
    to a single unit.
  . AC and TC models both have FC and VC, but
    -----------------------------------------------------------------------
    | + in the TC model, FC is the intercept and VC is the slope for Q;   |
    | + in the AC model, FC is the slope for 1/Q and VC is the intercept. |
    -----------------------------------------------------------------------
    ==> Both models have interpretations of beta0 and beta1 as fixed and
        variable costs, in reverse order.
    But they differ in their error structure, and in how other factors
    affect the Total Cost of the order.

- A data example:  Cost is given as Ave.Cost.

  blocks <- read.table("http://stat.wharton.upenn.edu/~buja/STAT-961/blocks_red.dat",
                       header=T)
  colnames(blocks)
  # Here is the plot of TC vs Q, in $1000:
  # Certainly doesn't look like a constant error variance!
  plot(Ave.Cost*Units/1000 ~ Units, data=blocks, pch=16)
  #
  # AC:
  plot(Ave.Cost ~ Units, data=blocks, pch=16)
  # Maybe there is a slightly negative slope: lower Ave.Cost
  # for more Quantity?  Economies of scale?
  #
  # There is a problem, though: an extremely small order,
  # visible when plotting against 1/Quantity:
  plot(x=1/blocks[,"Units"], y=blocks[,"Ave.Cost"], pch=16)   # Units = Quantity
  # Remove:
  sel <- 1/blocks[,"Units"] < .5
  plot(x=1/blocks[sel,"Units"], y=blocks[sel,"Ave.Cost"], pch=16)
  #
  # Naive model for Tot.Cost = Ave.Cost*Units:
  summary(lm(Ave.Cost*Units ~ Units, blocks[sel,]))
  ##              Estimate Std. Error t value Pr(>|t|)
  ## (Intercept)  4146.470    865.000   4.794 3.22e-06 ***
  ## Units          22.149      1.658  13.356  < 2e-16 ***
  ## Multiple R-squared:  0.4752,  Adjusted R-squared:  0.4725
  # Model for Ave.Cost:
  summary(lm(Ave.Cost ~ I(1/Units), blocks[sel,]))
  ##              Estimate Std. Error t value Pr(>|t|)
  ## (Intercept)    31.082      2.273  13.673  < 2e-16 ***
  ## I(1/Units)   1502.211    416.821   3.604 0.000397 ***
  ## Multiple R-squared:  0.06185,  Adjusted R-squared:  0.05709
  # The estimates of fixed and variable costs
  # differ wildly between the two approaches.
  # Which one would you believe?
  # Typically Tot.Cost models have much higher R^2
  # than Ave.Cost models, but this is irrelevant:
  # The high R^2 stems from the trivial proportionality with Units,
  # which the Ave.Cost model factors out.
  # Often, Ave.Cost models are more plausible and
  # yield comparable predictions when multiplied up to Tot.Cost.
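  # A quick sketch of that last claim (assuming 'blocks' and 'sel' from above):
  # refit both models and compare their implied Tot.Cost predictions.
  tc.lm <- lm(Ave.Cost*Units ~ Units, data=blocks[sel,])
  ac.lm <- lm(Ave.Cost ~ I(1/Units), data=blocks[sel,])
  tc.hat.direct  <- predict(tc.lm, newdata=blocks[sel,])                       # TC model
  tc.hat.from.ac <- predict(ac.lm, newdata=blocks[sel,]) * blocks[sel,"Units"] # AC scaled up
  plot(tc.hat.direct, tc.hat.from.ac, pch=16); abline(0, 1)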
  #
  #================================================================

* A PROTOCOL FOR REPORTING REGRESSION FINDINGS:

- Target audience: non-statisticians (your manager...)
  . Principles: avoid technical terms, report relevant subject matter,
    keep stats and its qualifications at low volume,
    except when bearing bad news ... (which happens often in practice).

- Example:

  URL  <- "http://stat.wharton.upenn.edu/~buja/STAT-541/CarModels2003-4.TXT"
  cars <- read.table(URL, header=T, sep=",", na=".")
  my.cars <- cars[,c("Weight.lbs","Horsepower","MPG.Highway")]

- Explain the purpose and the data:

  "The goal is to analyze highway gas mileage of 2003/2004 car models
   and describe to what extent it is correlated with two major model
   characteristics: weight and horsepower."

- Summarize the predictors and response with round, memorable numbers:

  summary(my.cars)

  "Current car models average about 3,500 lb, a little over 200 HP,
   and about 26 MPG on the highway.  Some models are as light as 1,800 lb
   and as heavy as 6,400 lb.  The engines range from about 70 to 660 HP,
   and gas mileage from 12 to 68 MPG."
  [Provide a table of these values for easy parsing.]

- Pick an imagined case with heavily rounded mean or median values as predictors.
  Example:

  sapply(my.cars, mean, na.rm=T)

  ==> "For reference, consider hypothetical car models roughly in the middle
       of the pack, with weight 3,500 lb and 200 HP."

- Fit a model (offline, not of interest to your audience)

  my.cars.lm <- lm(MPG.Highway ~ Weight.lbs + Horsepower, data=my.cars)

  and give a reference range (PI) of the response values for the reference models:

  my.cars.refy <- predict(my.cars.lm,
                          newdata=data.frame(Weight.lbs=3500, Horsepower=200,
                                             MPG.Highway=NA))   # Silly,...
  my.cars.refy
  # Closer to 27 than 26 because our reference values are low-balling the means.
  # Standard Gaussian PI:
  my.cars.refy + c(-2,2) * summary(my.cars.lm)$sigma
  # Quantile-based non-parametric PI:
  my.cars.refy + quantile(resid(my.cars.lm), prob=c(.025,.975))
  # (Better: use standardized residuals)

  ==> "Such reference car models are predicted to range in mileage from as low
       as 20 MPG to as high as 32 MPG.  Thus there is considerable variation in
       highway mileage even among models with the same weight and the same
       horsepower."

- Continue with general comments on the quality of the data and results:

  summary(my.cars.lm)

  ==> "This is compatible with the fact that weight and horsepower account for
       about two thirds (64%) of the variation in MPG.  (Weight and horsepower
       are highly statistically significant as predictors of MPG, but, as is
       often the case, this does not translate to an equal degree of practical
       significance.)"

- Translate the slopes as meaningfully as possible:

  "If we compare models that differ by 500 lb in weight but have the same
   horsepower, the estimated average difference in mileage is about -2 MPG,
   quantifying the obvious fact that heavier cars have lower mileage."

  "If, on the other hand, we compare models that differ by 10 HP but have the
   same weight, the estimated average difference in mileage is about -0.3 MPG,
   again quantifying the obvious fact that stronger cars have lower mileage."

- Maybe mention parenthetically the uncertainty in the slopes:

  "(The 95% uncertainty about these numbers can be described by a range of
   -2 +- 1/3 MPG for the 500 lb weight difference and -.3 +- .06 MPG for the
   10 HP difference.)"

- Complications:

  + Predictors can be ratio scale, hence you may have used log(x),
    hence you need to consider a percent difference in x.
  + Any other kind of transformation also causes problems.
    E.g., physics says to model "I(1/MPG.Highway)", which would require
    additional non-trivial translations of statements about differences,
    possibly replacing means with medians, and translating endpoints of ranges.
  + Nonlinear effects are often intuitively expressed by comparing two specific
    x-configurations:
    "Compare, for example, cars with weights 3,500 and 4,000 pounds, and 200 HP."
    Especially when there are interaction terms you might have to choose
    specific values for several predictor variables and make comparisons
    across these predictor values.

================================================================