================================================================

CHAPTER 4: PRACTICAL ASPECTS OF LINEAR MODELS -- COMPUTATIONS, INTERPRETATIONS

* R: Formula language for modeling, with excursions

- Create some toy 'data':

  N   <- 1000                    # (Programming: Do not hardcode constants such as this one!)
  x1  <- round(runif(N)*100)     # Continuous predictor
  x2  <- round(rnorm(N,m=71,s=3), dig=.5)                             # Continuous predictor
  x3  <- sample(c("m","f","c"), size=N, replace=T, prob=c(.6,.3,.1))  # Categorical predictor
  eps <- rnorm(N,s=1)            # 'Error'
  myd <- data.frame(x1=x1, x2=x2, x3=x3,
                    y = round(10 + .1*x1 + 0*x2 + c("m"=0,"f"=-6,"c"=-10)[x3] + eps, dig=.01))

- Excursion: Basic steps for getting acquainted with a new dataframe !!!!!!!!!!!!!!!

  # Dataframe or matrix or what?
  class(myd)
  # How large is it?
  dim(myd)
  # What are the variable names?
  names(myd)
  # Look at the first 6 rows:
  head(myd)
  # Look at the last 6 rows:
  tail(myd)
  # General purpose:
  summary(myd)

- Formulas to express linear models:

  # Use 'lm()' to fit a linear model:
  lm(y ~ x1 + x2 + x3, data=myd)
  # What is this, though, !%@^&*%?
  summary(lm(y ~ x1 + x2 + x3, data=myd))   # Much better!

  !!! Everything is a first-class citizen: MODEL FORMULAS, MODELS, SUMMARIES.
      They "exist" because they are objects that can be assigned to variables.
  Examples:
  myf <- y ~ x1 + x2 + x3;   class(myf)
  mym <- lm(myf, data=myd);  class(mym)
  mym                        # uninformative
  mym.smry <- summary(mym);  class(mym.smry)
  mym.smry                   # Nice printing, but what's really in it?
  is.list(mym.smry)          # indeed, a list, hence ask for ..
  names(mym.smry)
  mym.smry$coeff
  class(mym.smry$coeff)
  dimnames(mym.smry$coeff)
  # Pieces we need for inference of coefficients:
  mym.smry$coeff[,"Estimate"]
  mym.smry$coeff[,"Std. Error"]
  mym.smry$coeff[,"t value"]
  mym.smry$coeff[,"Pr(>|t|)"]

  Ways to explore a model object:
  class(mym)
  is.list(mym)
  names(mym)          # Components of the list.
  sapply(mym, class)

  Similar for summary:
  class(mym.smry)
  is.list(mym.smry)
  names(mym.smry)     # Components of the list.
  sapply(mym.smry, class)

  What about formulas?
  is.list(myf)
  is.vector(myf)
  is.character(myf)
  is.expression(myf)
  as.character(myf)   # coercion
  length(myf)
  myf[1]
  myf[2]
  myf[3]
  # Important: formulas are not character strings !!!!!!!!
  # They have an internal representation that involves 3 parts,
  # but all we need to know is that formulas feed into
  # modeling (and some plotting) functions.

- "Operator overloading": new meanings for +, -, 1, *, :, /, ^, . in model formulas
    '+' means 'include'
    '*' means 'include with interactions'
    ':' means 'include these specific interactions'
    '/' means 'include nested terms' (a/b expands to a + a:b)
    '^' means 'include interactions up to the given order', e.g., (x1+x2+x3)^2
    '.' means 'include all predictors' (all variables other than the response)
    '-' means 'exclude the following predictor'
    '1' means 'intercept'

- To prevent overloading: I(...), e.g., + I((x1+x2)^2)

- Linear models with categorical predictors:

  summary(lm(y ~ x3, data=myd))

  . Observation: The complete collection of dummies of a categorical variable
    adds up to the intercept vector, hence we'd have complete collinearity.
    (See the model.matrix() sketch after this list.)
  . Standard remedy: Drop one dummy.  It becomes the 'reference category'.
  . Which category gets dropped by R?   A: The first one.
  . Which category is the first one?    A: The first one listed in 'levels()':
      levels(myd[,3])
    Oh, this column is not a factor!
      class(myd[,3])
    But 'lm()' coerces 'myd[,3]' into a 'factor', and it sorts the levels alphabetically:
      levels(factor(myd[,3]))
    ==> 'c' gets dropped.
  . How can we control which factor level is the first one?
    A: We make the variable a factor and set the levels in the order we want:
      myd[,3] <- factor(myd[,3], levels=c("m","f","c"))   #!!!!!!!
      table(myd[,3])
      summary(lm(y ~ x3, data=myd))
    ==> Dropped 'm', which is the first category.
  . WARNING!!! Do NOT set levels as follows when myd[,3] already is a factor:
      levels(myd[,3]) <- c("c","f","m")   #!!!!!!!!
      levels(myd[,3])
      table(myd[,3])   # What a mess!
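  A minimal sketch of the dummy coding, assuming the toy data 'myd' from above
  (re-run the factor() assignment first if you executed the WARNING example):

  # Design matrix with the default treatment coding (first level dropped):
  X <- model.matrix(y ~ x3, data=myd)
  head(X)
  # Full set of dummies, no intercept: the dummy columns sum to 1 in every row,
  # i.e., they add up to the intercept column -- the collinearity noted above.
  X.all <- model.matrix(y ~ x3 - 1, data=myd)
  head(X.all)
  all(rowSums(X.all) == 1)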
- Linear models with interactions:

  # y = beta0 + beta1*x1 + beta2*x2 + beta12*x1*x2 + eps
  summary(lm(y ~ x1 * x2, data=myd))
  summary(lm(y ~ x1 + x2 + x1:x2, data=myd))        # ditto
  summary(lm(y ~ x1 + x2 + I(x1 * x2), data=myd))   # ditto
  summary(lm(y ~ x1 * x2 * x3, data=myd))           # that's different

- Interaction of continuous and categorical predictors:

  summary(lm(y ~ x1 * x3, data=myd))

================================================================

* INTERPRETATION OF ESTIMATES AND TESTS:

- Unlike courses taught by Paul Rosenbaum and Dylan Small, we in Stat 961
  are NOT into causality.  All associations are just that, NOT causations!

- Traps:
  . Causality:                     "When x1 is increased by one unit, ..."
  . Observations versus averages:  "... y increases by ..."
  . Estimates vs parameters: averages
    ==> observational data, where x1 cannot be manipulated and where y is noisy,
        hence there are only averages, but even these averages are not "known",
        only estimated...

- Interpretations of regression estimates:

  . Slope interpretation: the issue of ...
        ----------------------
        | "ceteris paribus"  |
        ----------------------
    (English speakers pronounce this Latin expression with "k" for "c"
     and stress on the first syllables.)
  . "bj = est. ave. difference in y for pairs of objects with a unit difference
     in xj when all other predictors are at the same levels."
  . "bj = est. ave. difference in y for pairs of cases that ONLY differ by a unit in xj."
  . "Unit change in xj" and "holding all other predictors fixed" are formulations
    often used but misleading because of their implication of active experimentation.

- Interpretation of other regression estimates:

  . Interpretation of b0?          "est. ave. y at x=0"
  . Interpretation of stderr(b1)?  !!!!!!!!!!!!!!!!!!!!!!!!!!!
       ("est. stdev of b1 varying from dataset to dataset")
       (caveat: conditioning on x, same X, random y)
  . Interpretation of t1?
       ("t1 = b1/stderr(b1) = dist from 0 in multiples of stderrs")
       (see the sketch after this list)
  . Interpretation of p-value(b1)?
       ("prob of observing a t1 more extreme than the observed value,
         dataset to dataset, under the assumption beta1=0")
       (Warning! This is not a parameter but a random variable.  Why?)
  . Interpretation of s=RMSE?  (R's "Resid. stderr" is a bad term)
       ("est. stdev of errors/noise/epsilons")
  . Interpretation of R Square?
       ("fraction of variation in y accounted for by the model")
       (avoid: "explained by")
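  A minimal sketch of how these quantities fit together, recomputed by hand from
  the coefficient table (assuming 'mym' and 'mym.smry' from the excursion above):

  b    <- mym.smry$coeff[,"Estimate"]
  se   <- mym.smry$coeff[,"Std. Error"]
  tval <- b / se                                   # t = b/stderr(b)
  pval <- 2 * pt(-abs(tval), df=mym$df.residual)   # two-sided p-values
  cbind(tval, mym.smry$coeff[,"t value"])          # should agree
  cbind(pval, mym.smry$coeff[,"Pr(>|t|)"])         # should agree
  cbind(lower = b - 2*se, upper = b + 2*se)        # rough 95% ranges; compare confint(mym)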
- More background on slope interpretations:

  . Example: Car models, x1 = Weight, x2 = HP, y = MPG.
    The slope of x1 estimates the average difference in MPG for pairs of
    car models that differ by one Weight unit but have the same HP.
  . Fine point: For different pairs, it doesn't have to be the same HP.
    It's just within a pair that HP has to be the same.
  . The previous point is due to linearity of the model:
    You can think of the slopes as partial derivatives.
    For linear functions f(x) = b0+b1*x1+...+bp*xp the partial derivative
    wrt xj is bj, and it is the same everywhere.
    For nonlinear models the situation would be more difficult
    because partial derivatives would not be constant.
  . The association reflected by a slope is as follows:
      delta xj |--> est. ave. delta y = bj * delta xj
    at fixed levels of all other xk.
  . If xj is a dummy variable, "delta xj = 1" is a two-group mean comparison:
      bj = est. ave. difference between group 1 and group 0
           when all other predictors are the same.
  . Use differences (delta xj) other than a unit difference in the following cases:
    + The unit difference is too small to matter.
      Example: xj = weight of a car in lb.
      Better use delta xj = 100 or 1000 lb.
    + The unit difference is too large and implies extrapolation.
      Example: xj = weight of small diamonds in carats.
      If the range of the weights is 1/10 to 1/3 carat, then a 1 carat difference
      has not been observed and implies extrapolation.
      Use delta xj = 1/10 carat instead.
  . Intercepts are often extrapolations as well.  Examples:
    + cars with zero HP and zero weight
    + diamonds with zero weight

- Linear models and nonlinear associations:

  . Many types of nonlinearities can be modeled with variable transformations
    such as logarithms and roots and with interaction terms.  We consider:
    + Additive nonlinearity
    + Interactions
    They are modeled by forming new predictors.

  . Additive nonlinearity:   xj |--> mu(...,xj,...) = ... + bj*f(xj) + ...
    + Frequent cases:
      o f(xj) = log(xj)
      o f(xj) = sqrt(xj)
      o f(xj) ~ xj + xj^2
      ==> The first two only for predictors whose values are >0 and >=0, resp.
    + You will have to interpret what the meaning of bj is:
      In both cases you should not interpret additive differences in xj
      but multiplicative differences instead.
      E.g., if two cases differ in xj by 10%, i.e., by a factor 1.1,
      the estimated average difference in y is
      o bj*(log(xj*1.1) - log(xj)) = bj*log(1.1) ~ bj*0.1
      o bj*(sqrt(xj*1.1) - sqrt(xj)) = bj*sqrt(xj)*(sqrt(1.1)-sqrt(1)) ~ bj*sqrt(xj)*0.05
                                          ^^^^^^^^^^^
                                          bj*sqrt(xj) is the term in the fitted equation
                                          for the case with the lower xj value.
    + Note that the ceteris paribus clause holds like in linear models:
      We compare pairs of cases with a 10% difference in xj but the same values
      on all other xk.  Again these 'same values' within a pair can differ
      between pairs.  Why?  Because even in additive models the partial derivative
      wrt xj does not depend on the other variables:
        (d/d xj) (f1(x1)+...+fj(xj)+...+fp(xp)) = (d/d xj) fj(xj)

  . Interaction nonlinearities:
    + The simplest form of interaction between two predictors is multiplicative,
      i.e., one introduces new predictors which are the products of original predictors.
    + Consider a 'first order interaction', i.e., an interaction in two predictors
      xj and xk:
        E[y(x)] = mu(...,xj,...,xk,...) = ... + bj*xj + bk*xk + bjk*xj*xk + ...
    + Note that we now have p+2 predictors:
        x0, x1, ..., xj, ..., xk, ..., xp, xj*xk
                                           ^^^^^
                                           the (p+2)nd predictor
    + IMPORTANT IMPLICATION OF INTERACTIONS:
      'ceteris paribus' is no longer possible for the interacting predictors:
        ---------------------------------------------------
        | It is NOT possible to ask that xj*xk be the same |
        | when there is a difference in xj.                |
        ---------------------------------------------------
      Q: How are we going to interpret the slopes bj, bk and bjk?
    + If xj and xk are both quantitative, then the estimated interaction bjk
      is interpreted as a correction to the slopes bj and bk:
      o bj*xj + bk*xk + bjk*xj*xk = (bj+bjk*xk)*xj + bk*xk
        ==> The slope of xj depends on xk through bj(xk) := (bj+bjk*xk)
      o bj*xj + bk*xk + bjk*xj*xk = bj*xj + (bk+bjk*xj)*xk
        ==> The slope of xk depends on xj through bk(xj) := (bk+bjk*xj)
    + If xj is quantitative and xk is a dummy (i.e., 0,1 valued), we have the
      following two interpretations (a small simulated illustration in R follows
      at the end of this section):
      o bj(xk) = (bj+bjk*xk) = bj       when xk=0
        bj(xk) = (bj+bjk*xk) = bj+bjk   when xk=1
        Thus the interaction term allows for different xj slopes in the two xk groups:
          bjk = slope difference for xj between the xk groups
        ==> We can test whether the two groups require different slopes
            by testing the null hypothesis H0: betajk = 0.
      o bk(xj) = (bk+bjk*xj) is the estimated mean difference between group xk=1
        and group xk=0.  This estimated mean difference is a function of xj.
        ==> We can test whether the mean difference between the xk groups depends
            on xj by testing the same null hypothesis H0: betajk = 0.
    + Both xj and xk are dummies:
        bj  = estimated average difference between groups xj=1 and xj=0
              in the group xk=0.
        bk  = estimated average difference between groups xk=1 and xk=0
              in the group xj=0.
        bjk = estimated correction when comparing groups (xj=0;xk=0) and (xj=1;xk=1).
              Additively the est. ave. diff. should be bj+bk, but if the two groups
              'interact' (have some sort of 'synergy'), this needs to be corrected
              to bj+bk+bjk.
      Example: xj ~ drug A administered (1) or not (0)
               xk ~ drug B administered (1) or not (0)
        ==> bjk ~ synergy between drugs A and B
            bjk > 0: positive synergy (they strengthen each other's effects)
            bjk < 0: negative synergy (they attenuate each other's effects)
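  A minimal illustration of the quantitative-by-dummy case, on simulated data
  (hypothetical, not part of the toy data above): the interaction coefficient
  is the slope difference between the two groups.

  set.seed(1)
  n <- 200
  x <- runif(n, 0, 10)
  g <- rbinom(n, 1, .5)                        # dummy predictor
  y <- 2 + 1.0*x + 3*g + 0.5*g*x + rnorm(n)    # true slopes: 1.0 in group 0, 1.5 in group 1
  fit <- lm(y ~ x * g)
  coef(fit)                     # coef of x  ~ slope in group g=0;
                                # coef of x:g ~ slope difference (group 1 minus group 0)
  summary(fit)$coeff["x:g",]    # test of H0: interaction = 0
  # Group-specific slopes recovered explicitly:
  coef(lm(y ~ x, subset = (g==0)))["x"]
  coef(lm(y ~ x, subset = (g==1)))["x"]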
* 'MBA' STATISTICS:

- The ranges of random variables and observations are often of one of the following types:

  . If a variable (x or y)
    + takes on only positive values, and
    + changes/differences are expressed in percentages/multiples
      as opposed to numeric amounts,
    ==> "RATIO SCALE" variable
  . If a variable
    + can take on positive and negative values, and
    + changes/differences are expressed in numeric amounts,
    ==> "INTERVAL SCALE"
  . If a variable
    + is confined to [0,1] (or [0,100] if expressed as %),
    ==> "PROBABILITY/PROPORTION SCALE"
  . If a variable
    + is confined to the non-negative integers,
    ==> "COUNTING SCALE"
  . If a variable
    + is confined to {0,1},
    ==> "BINARY VARIABLE" for 2-group comparisons

- Examples of each?  (Anything not covered?)
  ... ratio:             salaries, other compensations such as bonuses, investments, ...
      interval:          financial log-returns, coordinate axes for physical space
      probability scale: relative frequencies as prob estimates, election returns
      counting scale:    number of ordered items

- Notes on ratio scales:

  . Changes/gains/losses/differences are expressed as
    + factor changes:  'doubling', 'tripling', 'halving', ...
                       "production is up by a factor 1.3"
    + percent changes: 'salary is up 5%', 'sales declined 20%', ...
                       "production is up by 30%"
    Percentages are equivalent ways to express multiplicative differences.
  . Questions/practice: Translate back and forth between factor changes and percent changes:
    + 'an increase of 100%' means what factor change? ...
    + 'tripling income' means what percent change? ...
    + 'the internet grew 140%' means what factor change? ...
    + 'stocks lost 3%' means what factor change? ...
    + Interpret the following as investment returns over 6 periods:
        (1) +10%, +10%, +10%, -10%, -10%, -10%
        (2) +10%, -10%, +10%, -10%, +10%, -10%
      Questions:
        o Who is better off?
        o Do you end up at 100% or less or more?
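  A quick check of the investment-returns practice question, assuming the
  returns compound multiplicatively:

  r1 <- c(+.10, +.10, +.10, -.10, -.10, -.10)
  r2 <- c(+.10, -.10, +.10, -.10, +.10, -.10)
  prod(1 + r1)   # the order does not matter, ...
  prod(1 + r2)   # ... multiplication is commutative
  # Both end at (1.1*0.9)^3 = 0.99^3 ~ 0.97 of the starting value, i.e., below 100%.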
- Logarithms:

  . Their principal role is to map RATIO SCALES to INTERVAL SCALES:
    they map multiplicative factor changes/differences
    to additive amount changes/differences.
  . Linear Approximation of Logarithms: the "small change/difference approximation"
      log(1+z) ~ z   for small z
    (convention: small means |z|<.1, up to +-10%) and NATURAL log !!!
    Consequence:
      log(y*(1+z)) = log(y) + log(1+z) ~ log(y) + z
      ------------------------------------------
      | RULE: A difference in log(y) by .01    |
      |       corresponds approximately to     |
      |       a difference in y by 1%          |
      ------------------------------------------
    - Scale this rule from -10% to +10%.
    - This holds only for the natural log!!!!  Why? ...
  . Caveats about logarithmic transforms of y:
    Means of logs do not correspond to means of unlogged values:
      mean(log(y)) != log(mean(y))     [Is there an inequality here?]
    However:
      median(log(y)) == log(median(y))
    ==> In general, if "ave. est." is interpreted as "median est.",
        one can arbitrarily transform with monotone functions.
    (For the normal and any symmetric distribution we have 'mean=median',
     but not for skewed distributions.)

#================================================================
#
#
# - MODELS OF EXPONENTIAL GROWTH/DECAY/APPRECIATION/DEPRECIATION:
#
#   . Example from the crazy dotcom years: growth of the number of web servers
URL  <- "http://www-stat.wharton.upenn.edu/~buja/STAT-961/web-servers.txt"
webs <- read.table(URL, header=T)
colnames(webs)
# [1] "Web.Servers"        "Log.Web.Servers"    "Yrs.since.Jan.97"
# [4] "Time"               "Year"               "Month"
# [7] "Time.of.Exp.Growth"
plot(webs[,c("Time","Web.Servers")], type="l", col="blue")
points(webs[,c("Time","Web.Servers")], pch=16, cex=.5)
# This is not all the way exponential, but it might be piecewise.
# Extract the span in which exponential growth was supposed
# to have happened, 1997 - end of 2000:
webs.red <- webs[as.logical(webs[,"Time.of.Exp.Growth"]),
                 c("Yrs.since.Jan.97","Web.Servers")]
colnames(webs.red) <- c("year","nserv")
plot(webs.red, type="b", pch=16, cex=.5)   # Zero is Jan 1997.
# Exponential growth calls for logging the response:
plot(webs.red, type="b", log="y", pch=16, cex=.5)
# Or:
plot(webs.red[,1], log10(webs.red[,2]), ylab="Log10", type="b", xlab="yr", pch=16, cex=.5)
# So we model log(y) ~ b0 + b1 * x + error
abline(lm(log10(nserv) ~ year, data=webs.red), lwd=2)
#
#   . Meaning: Multiplicative Model
summary(lm(log(nserv) ~ year, data=webs.red))
## ----
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.408146   0.021395   626.7   <2e-16 ***
## year         0.900178   0.008834   101.9   <2e-16 ***
## ----
# Translation of an additive model on the log-scale
# to a multiplicative model on the original scale:
#   log(nserv) ~ 13.4 + 0.9*year + r
#   nserv      ~ exp( 13.4 ) * exp( 0.9*year )  * exp( r )
#              =  exp( 13.4 ) * exp( 0.9 )^year * exp( r )
#              ~  660000      * 2.46^year       * exp( r )
plot(webs.red, pch=16, xlab="yr", ylab="nserv")
lines(x=webs.red[,1], y=660000*2.46^webs.red[,1])
# Not perfect but adequate for now because we are more
# interested in interpretation.
#
#   . Interpretation: We start with 660,000 servers in Jan 1997.
#     From there on we have a yearly increase by a factor of 2.46,
#     i.e., an increase of 146% per year.
#
#   . Multiplicative errors: Make sense when the error
#     is proportional to the size of y(x) (or E[y(x)] = mu(x), really).
#     (The size of mice varies differently from the size of elephants.)
#     [Another way to accommodate size-dependent error would be
#      with direct modeling of the heteroscedasticity:
#        y = b0 + b1*x + eps,  where eps ~ N(0, x^2*sigma^2)  (e.g.)
#      But this does not limit y to positive values,
#      as do multiplicative errors.]
#
#   . Using linear approximation for log(y)-models:
#       log(yhat) = b0 + b1*x
#       difference in x by dx
#       => difference in log(yhat) by b1*dx
#       => "difference" in yhat by b1*dx*100%   if |b1*dx| < 0.1
#
#     Web server data:
#
#     b1 = .9 = (est. ave.) log-increase in webservers per year
#     ==> dx = 1yr is much too large for linear approximation.
#
#     Try dx = 1 month: b1*dx = b1/12 = 0.075
#     ==> Small enough for linear approximation:
#         b1/12*100 = 7.5% = (est. ave.) percentage increase per month
#     [Compounded this should amount to a 146% yearly increase,
#      but here we see the approximation error:
#      this monthly rate compounds to 1.075^12
##     [1] 2.381780
#      instead of 2.46.  The more exact rate would be 2.46^(1/12)
##     [1] 1.077899
#      or closer to a 7.8% increase per month.]
#     Still, linear approximations for log-differences |z|<.1 are commonly used.
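#
#   . A small sketch extracting these numbers from the fit itself
#     (assuming 'webs.red' from above):
fit.web <- lm(log(nserv) ~ year, data=webs.red)
b.web   <- coef(fit.web)
exp(b.web[1])          # baseline number of servers in Jan 1997, ~660,000
exp(b.web[2])          # yearly growth factor, ~2.46 (i.e., +146% per year)
exp(b.web[2])^(1/12)   # exact monthly growth factor, ~1.078 (+7.8% per month)
b.web[2]/12            # linear approximation: ~0.075, i.e., ~7.5% per month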
#
#
#================================================================
#
#
# - MODELS OF LOGARITHMICALLY DIMINISHING RETURNS
#
#   Example: Wine display space; the more space, the more sales,
#   but: going from 1 ft to 2 ft has probably a greater effect
#   than going from 9 ft to 10 ft.
URL  <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/wine.dat"
wine <- read.table(URL, header=T)
colnames(wine)
# [1] "display.feet" "sales"
# Parallel sort according to ft for plotting curves in order of the predictor:
wine <- wine[order(wine[,1]),]
# Response: number of bottles sold
summary(wine)
#
plot(wine[,1], wine[,2], pch=16, xlab="Display ft", ylab="Wine Sales")
## Idea: There may not be a constant increase in sales per unit increase
#  but per percentage increase in display space.
#  ==> take log(x).
plot(log(wine[,1]), wine[,2], pch=16, xlab="log Display ft", ylab="Wine Sales")
# Looks nearly OK.  OK enough to try a linear model:
wine.lm <- lm(sales ~ log(display.feet), data=wine)
abline(wine.lm)
#
# Meaning of the line:
summary(wine.lm)
# ----
# Coefficients:
#                   Estimate Std. Error t value Pr(>|t|)
# (Intercept)         83.560     14.413   5.797 6.24e-07 ***
# log(display.feet)  138.621      9.834  14.096  < 2e-16 ***
# ----
#
# Fitted equation:  (est. ave.) sales ~ 140 * log(display.feet) + 84
#
plot(wine, pch=16, xlab="Display ft", ylab="Wine Sales")
lines(x=wine[,1], y=84+139*log(wine[,1]))
#
# Magic of logs again:
#   . 10% difference in display.feet
#     => (est. ave.) difference in sales is about 140 * 0.1 = 14 bottles
#   . Doubling display feet: log(2) ~ 0.7
#     => (est. ave.) difference in sales is about 140 * 0.7 ~ 100 bottles
#
# General: fit y ~ b0 + b1*log(x)
#   100% difference in x  =>  ~b1*.7  difference in yhat.  (est. ave.)
#    10% difference in x  =>  ~b1/10  difference in yhat.  (est. ave.)
#     1% difference in x  =>  ~b1/100 difference in yhat.  (  "  )
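#
# Checking these rules of thumb against the actual fit, a quick sketch
# (assuming 'wine.lm' from above):
b1.wine <- coef(wine.lm)["log(display.feet)"]
b1.wine * log(1.1)   # ~13 bottles for a 10% difference in display space
b1.wine * log(2)     # ~96 bottles for doubling the display space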
#
#
#================================================================
#
#
# - MODELS OF CONSTANT ELASTICITY:
#
#   Example: FedEx has (had) a delivery service called "CourierPak".
#   At different prices, different numbers of CourierPak deliveries
#   were requested by customers.
URL     <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/Courier.dat"
courier <- read.table(URL, header=T)
colnames(courier)
## [1] "Price"  "Volume"
plot(courier, pch=16)
# One could fit a linear model for starters:
courier.lm <- lm(Volume ~ Price, courier)
summary(courier.lm)
abline(courier.lm)
# Looks quite good, but one doesn't like straight lines for a couple
# of reasons:
# 1) Lines cross into negatives, which does not happen (Vol>0).
#    The minimum volume is zero.
# 2) More importantly, by economic theory it is not a unit increase
#    in Price that is associated with a constant decrease in Volume.
#    It is a PERCENTAGE increase in Price that causes a PERCENTAGE
#    decrease in Volume.
#
courier.log.lm <- lm(log(Volume) ~ log(Price), courier)
summary(courier.log.lm)
# ----
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  8.04930    0.20145  39.957  < 2e-16 ***
# log(Price)  -0.32732    0.07961  -4.112 0.000397 ***
# ----
# The raw Price -> Volume fit and the log(Price) -> log(Volume) fit perform
# about the same, but the world prefers the latter.  Interpretation:
#
#   log(Volume) ~ 8 - 1/3*log(Price) + r
#
#   . Constant elasticity as a power law:
#       Volume = exp(8 - 1/3*log(Price) + r)
#              = exp(8) * Price^{-1/3} * exp(r)
#              = 2981   * Price^{-1/3} * exp(r)
#     In general:  log(y) = b0 + b1*log(x) + r
#                  y      = exp(b0) * x^b1 * exp(r)
#                  exp(b0) = est. median baseline at x=1
#
#   . Small change approximation:
#     Price difference of 1% ~ Volume difference of ... ?
#     Price difference of 1%
#       => log(Price) difference of about 0.01
#       => log(Volume) difference of about (-1/3)*0.01   (|.|<0.1)
#       => Volume difference of about -1/3 %
#
#     In general: a 1% difference in x ~ a difference in yhat of b1%.
#     This is called "constant price elasticity".
#
#     (The approximation requires |b1*0.01| < 0.1.  If b1 is too large for that,
#      choose a smaller relative difference in x:
#      a 0.1% difference in Price => about a b1*0.1% difference in Volume.)
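#
#   . A small sketch of the implied power law on the original scale
#     (assuming 'courier' and 'courier.log.lm' from above):
b.cour <- coef(courier.log.lm)
exp(b.cour[1])                               # estimated median Volume at Price=1
plot(courier, pch=16)
pgrid <- seq(min(courier$Price), max(courier$Price), length=100)
lines(pgrid, exp(b.cour[1]) * pgrid^b.cour[2])   # fitted curve Volume ~ exp(b0)*Price^b1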
#
#
#================================================================

SKIP THE FOLLOWING

* FIXED COSTS/VARIABLE COSTS:

- Example: A manufacturer of 'blocks' receives orders of various quantities
  and records the production costs.  It is clear that total production cost
  is mostly driven by quantity.  But the manufacturer would like to know what
  other factors make some orders more and others less expensive.

- Distinguish:
  . FC: fixed cost (often setup costs)
  . VC: variable cost (cost per additional unit)

- TC = Total Cost model:  Denote the quantity (number of units ordered) by 'Q'.

    TC = beta0 + beta1*Q + beta2*x2 + ... + eps
         ^^^^^   ^^^^^^^
          ??       ??         (?? are for you to fill in: FC, VC)

  lm(Tot.Cost ~ Quantity + x2 + ...)

- Unrealistic error structure:
  . Do you think 'eps' is homoskedastic, with the same variation
    for small and large orders?
  . (The other lack of realism, eps unbounded below, will not be
     resolved in the following proceedings.)

- Predictor issue:
  . Some predictors affect setup costs, hence contribute to TC at the
    per-order level, as the predictor x2 above.
  . HOWEVER, many predictors affect cost on a per-unit basis:
    + amount of material per block,
    + difficulty of material,
    + extra features ordered for the blocks, ...
    ==> Their contribution to total cost should be beta3*Q*x3:

    TC = beta0 + beta1*Q + beta2*x2 + beta3*Q*x3 + ... + eps
                           ^^^^^^^^   ^^^^^^^^^^
                           FC-pred    VC-pred

- New concept: AC = Average Cost := TC/Q
  . Average cost model:
      AC = beta0/Q + beta1 + beta2*x2/Q + beta3*x3 + ... + eps
  . If the AC model is assumed homoskedastic, then the TC model is
    heteroskedastic with larger sigma for larger orders.
  . In AC cost models, the fixed cost is prorated from the quantity Q
    to a single unit.
  . AC and TC models both have FC and VC, but
    -----------------------------------------------------------------------
    | + in the TC model, FC is the intercept and VC is the slope for Q;   |
    | + in the AC model, FC is the slope for 1/Q and VC is the intercept. |
    -----------------------------------------------------------------------
    ==> Both models have interpretations of beta0 and beta1 as fixed and
        variable costs, in reverse order.
    But they differ in their error structure, and in how other factors
    affect the Total Cost of the order.

- A data example:  Cost is given as Ave.Cost.

  blocks <- read.table("http://stat.wharton.upenn.edu/~buja/STAT-961/blocks_red.dat",
                       header=T)
  colnames(blocks)
  # Here is the plot of TC vs Q, in $1000:
  # Certainly doesn't look like a constant error variance!
  plot(Ave.Cost*Units/1000 ~ Units, data=blocks, pch=16)
  #
  # AC:
  plot(Ave.Cost ~ Units, data=blocks, pch=16)
  # Maybe there is a slightly negative slope: lower Ave.Cost
  # for more Quantity?  Economies of scale?
  #
  # There is a problem, though: an extremely small order,
  # visible when plotting against 1/Quantity:
  plot(x=1/blocks[,"Units"], y=blocks[,"Ave.Cost"], pch=16)   # Units = Quantity
  # Remove:
  sel <- 1/blocks[,"Units"] < .5
  plot(x=1/blocks[sel,"Units"], y=blocks[sel,"Ave.Cost"], pch=16)
  #
  # Naive model for Tot.Cost = Ave.Cost*Units:
  summary(lm(Ave.Cost*Units ~ Units, blocks[sel,]))
  ##              Estimate Std. Error t value Pr(>|t|)
  ## (Intercept)  4146.470    865.000   4.794 3.22e-06 ***
  ## Units          22.149      1.658  13.356  < 2e-16 ***
  ## Multiple R-squared:  0.4752,  Adjusted R-squared:  0.4725
  # Model for Ave.Cost:
  summary(lm(Ave.Cost ~ I(1/Units), blocks[sel,]))
  ##              Estimate Std. Error t value Pr(>|t|)
  ## (Intercept)    31.082      2.273  13.673  < 2e-16 ***
  ## I(1/Units)   1502.211    416.821   3.604 0.000397 ***
  ## Multiple R-squared:  0.06185,  Adjusted R-squared:  0.05709
  # The estimates of fixed and variable costs
  # differ wildly between the two approaches.
  # Which one would you believe?
  # Typically Tot.Cost models have much higher R^2
  # than Ave.Cost models, but this is irrelevant:
  # The high R^2 stems from the trivial proportionality with Units,
  # which the Ave.Cost model factors out.
  # Often, Ave.Cost models are more plausible and
  # yield comparable predictions when multiplied up to Tot.Cost.
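  # A quick sketch of that last claim (assuming 'blocks' and 'sel' from above):
  # refit both models and compare their implied Tot.Cost predictions.
  tc.lm <- lm(Ave.Cost*Units ~ Units, data=blocks[sel,])
  ac.lm <- lm(Ave.Cost ~ I(1/Units), data=blocks[sel,])
  tc.hat.direct  <- predict(tc.lm, newdata=blocks[sel,])                       # TC model
  tc.hat.from.ac <- predict(ac.lm, newdata=blocks[sel,]) * blocks[sel,"Units"] # AC scaled up
  plot(tc.hat.direct, tc.hat.from.ac, pch=16); abline(0, 1)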
  #
  #================================================================

* A PROTOCOL FOR REPORTING REGRESSION FINDINGS:

- Target audience: non-statisticians (your manager...)
  . Principles: avoid technical terms, report relevant subject matter,
    keep stats and its qualifications at low volume,
    except when bearing bad news ... (which happens often in practice).

- Example:

  URL  <- "http://stat.wharton.upenn.edu/~buja/STAT-541/CarModels2003-4.TXT"
  cars <- read.table(URL, header=T, sep=",", na=".")
  my.cars <- cars[,c("Weight.lbs","Horsepower","MPG.Highway")]

- Explain the purpose and the data:

  "The goal is to analyze highway gas mileage of 2003/2004 car models
   and describe to what extent it is correlated with two major model
   characteristics: weight and horsepower."

- Summarize the predictors and response with round, memorable numbers:

  summary(my.cars)

  "Current car models average about 3,500 lb, a little over 200 HP,
   and about 26 MPG on the highway.  Some models are as light as 1,800 lb
   and as heavy as 6,400 lb.  The engines range from about 70 to 660 HP,
   and gas mileage from 12 to 68 MPG."
  [Provide a table of these values for easy parsing.]

- Pick an imagined case with heavily rounded mean or median values as predictors.
  Example:

  sapply(my.cars, mean, na.rm=T)

  ==> "For reference, consider hypothetical car models roughly in the middle
       of the pack, with weight 3,500 lb and 200 HP."

- Fit a model (offline, not of interest to your audience)

  my.cars.lm <- lm(MPG.Highway ~ Weight.lbs + Horsepower, data=my.cars)

  and give a reference range (PI) of the response values for the reference models:

  my.cars.refy <- predict(my.cars.lm,
                          newdata=data.frame(Weight.lbs=3500, Horsepower=200,
                                             MPG.Highway=NA))   # Silly,...
  my.cars.refy
  # Closer to 27 than 26 because our reference values are low-balling the means.
  # Standard Gaussian PI:
  my.cars.refy + c(-2,2) * summary(my.cars.lm)$sigma
  # Quantile-based non-parametric PI:
  my.cars.refy + quantile(resid(my.cars.lm), prob=c(.025,.975))
  # (Better: use standardized residuals)

  ==> "Such reference car models are predicted to range in mileage from as low
       as 20 MPG to as high as 32 MPG.  Thus there is considerable variation in
       highway mileage even among models with the same weight and the same
       horsepower."

- Continue with general comments on the quality of the data and results:

  summary(my.cars.lm)

  ==> "This is compatible with the fact that weight and horsepower account for
       about two thirds (64%) of the variation in MPG.  (Weight and horsepower
       are highly statistically significant as predictors of MPG, but, as is
       often the case, this does not translate to an equal degree of practical
       significance.)"

- Translate the slopes as meaningfully as possible:

  "If we compare models that differ by 500 lb in weight but have the same
   horsepower, the estimated average difference in mileage is about -2 MPG,
   quantifying the obvious fact that heavier cars have lower mileage."

  "If, on the other hand, we compare models that differ by 10 HP but have the
   same weight, the estimated average difference in mileage is about -0.3 MPG,
   again quantifying the obvious fact that stronger cars have lower mileage."

- Maybe mention parenthetically the uncertainty in the slopes:

  "(The 95% uncertainty about these numbers can be described by a range of
   -2 +- 1/3 MPG for the 500 lb weight difference and -.3 +- .06 MPG for the
   10 HP difference.)"

- Complications:

  + Predictors can be ratio scale, hence you may have used log(x),
    hence you need to consider a percent difference in x.
  + Any other kind of transformation also causes problems.
    E.g., physics says to model "I(1/MPG.Highway)", which would require
    additional non-trivial translations of statements about differences,
    possibly replacing means with medians, and translating endpoints of ranges.
  + Nonlinear effects are often intuitively expressed by comparing two specific
    x-configurations:
    "Compare, for example, cars with weights 3,500 and 4,000 pounds, and 200 HP."
    Especially when there are interaction terms you might have to choose
    specific values for several predictor variables and make comparisons
    across these predictor values.

================================================================