================================================================


* CURVE ESTIMATION / SMOOTHING / BANDWIDTH SELECTION / CROSS-VALIDATION


  - Idea: given x-y data (quantitative-quantitative)
    . Drop assumption of linearity:  y = (a + bx) + eps
    . Assume smoothness only:        y =   f(x)   + eps,   f() smooth
    . Smooth => locally linear       f(x+h) ~=~ f(x) + Df(x)*h + O(h^2)
    . If we wish to infer y=f(x) from data (xi,yi) (i=1,...,n),
        we have to infer an estimate  f(x)  from (xi,yi) with xi near x
    . First try:
                   fhat(x) = ave(yi; |xi-x|<delta)
                                            ^^^^^ 'bandwidth'

    . R code: running mean, evaluated on a grid of 101 locations covering range(x)
        smooth1 <- function(xy, nloc=101, bw=diff(range(xy[,1]))/5, func=mean) {
          x <- xy[,1];  y <- xy[,2]              # unpack x and y
          xs <- seq(min(x), max(x), length=nloc) # set up the equispaced grid
          fx <- rep(NA, nloc)                    # allocate space for fit
          for(i in 1:nloc) {                     # for all grid points...
            sel <- xs[i]-bw < x & x < xs[i]+bw   # neighborhood of grid point
            fx[i] <- func(y[sel])                # "ave" of  y's in neighborhood
          }
          return(cbind(x=xs, y=fx))
        }
      Note: The smooth function is given by its values on an equispaced grid of x values (xs)

    . Apply to x=LSTAT and y=MEDV in the Boston housing data, varying the bandwidth:
        xy <- boston[,c("LSTAT","MEDV")]
        plot(xy, pch=16, cex=.5)
        lines(smooth1(xy), col="red", lwd=3) # smooth1() can feed plot(), lines(),...
        diff(range(xy[,1]))/5                # We used the default bw.
        lines(smooth1(xy, bw=3), col="green", lwd=3)  # Shrink the bw.
        lines(smooth1(xy, bw=.5), col="blue", lwd=3)  # Shrink even more.
      Problems?
        Metric bandwidth: can create empty windows -> fhat(x)=NA
        Smooth is really not smooth

    . Variations: running median and quantiles for 50%, 90% prediction bands!
        plot(xy, pch=16, cex=.5)
        lines(smooth1(xy, func=median), col="black", lwd=3)
        lines(smooth1(xy, func=function(y) quantile(y,.25)), col="gray50", lwd=3)
        lines(smooth1(xy, func=function(y) quantile(y,.75)), col="gray50", lwd=3)
        lines(smooth1(xy, func=function(y) quantile(y,.05)), col="gray80", lwd=3)
        lines(smooth1(xy, func=function(y) quantile(y,.95)), col="gray80", lwd=3)

    . 'Real' smoothness: kernel smoothing
        smooth <- function(xy, bw=1/5, nloc=51) {
          x <- xy[,1];  y <- xy[,2]
          xmin <- min(x);   xmax <- max(x)
          xs <- seq(xmin, xmax, length=nloc)
          fx <- rep(NA, nloc)
          act.bw <- diff(range(x))*bw         # actual bandwidth
          for(i in 1:nloc) {
            w <- dnorm(x=(x-xs[i])/act.bw) # same as dnorm(x, m=xs[i], s=act.bw)
            w <- w/sum(w)                  # so we have: sum(w)==1
            fx[i] <- sum(y*w)
          }
          return(cbind(x=xs, y=fx))
        }

    . Apply to same with 4 bandwidths:
        bws  <- c(2,    .2,      .03,   .005)
        cols <- c("red","orange","green","blue")
        plot(xy, pch=16, cex=.5)
        for(i in 1:4)
          lines(smooth(xy, bw=bws[i]), col=cols[i], lwd=3)
      Q: What would be a principled way to select the bandwidth? (stay tuned...)

    . Simulation to show variability of smoothers: y = 4*x^2 + eps
        bws  <- c(2,    .2,     .03,   .005)
        N <- 100
        x <- sort(runif(N))
        Nsim <- 200
        par(mfrow=c(2,2), mgp=c(1.5,.5,0), mar=c(3,3,1,1))
        for(ibw in 1:4) {
          plot(x=c(0,1), y=c(-2,6), type="n", pch=16, cex=.5, xlab="", ylab="")
          xy <- cbind(x=x, y=4*x^2 + rnorm(N))
          points(xy, pch=16, cex=.5, col="gray30")
          for(isim in 1:Nsim) {
            xy <- cbind(x=x, y=4*x^2 + rnorm(N))
            lines(smooth(xy, bw=bws[ibw]), col=cols[ibw], lwd=1)
          }
          lines(x, 4*x^2, lwd=2)
        }
      You should be looking at the widths of the color bands (variance!)
      and how much the bands systematically deviate from the target (bias!).
      The wiggliness of the functions is a side-effect of variance.

    . Lesson: BIAS-VARIANCE trade-off !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

              Really: bias-SE trade-off       (Variance is dataset-to-dataset.)

              small bandwidth                     large bandwidth
                  ==>                                 ==>
              large variance (-)                  small variance (+)
              small bias     (+)                  large bias     (-)

    . Mean squared error and bias-variance decomposition:

        Model:     yi  =  f(xi) + epsi            E[epsi] = 0

        BIAS:      Bias(fhat(x)) = E[fhat(x)] - f(x)

        VARIANCE:  V(fhat(x))    = E[ (fhat(x) - E[fhat(x)])^2 ]

        MSE:       MSE(fhat(x))  = E[ (fhat(x) - f(x))^2 ]

      Bias-variance decomposition of the MSE:

         ---------------------------------------------------
        |  MSE(fhat(x))  =   V(fhat(x))  +  Bias(fhat(x))^2 |
         ---------------------------------------------------

        Proof:  E[ (Z - c)^2 ] = V[Z] + ( E[Z] - c )^2

      ==> MSE is an overall measure that trades off bias and variance
          but... MSE is not known


    . PMSE or 'predictive mean squared error':

        PMSE(fhat(x))  =  E[ (Y - fhat(X))^2 | X=x ]

        (Y,X) = future data
        fhat() = estimate based on past i.i.d. data (xi,yi)
        future and past data are assumed independent of each other

      Decomposition, pointwise at one x:

        PMSE(fhat(x))  =  V[Y|X=x] + MSE(fhat(x))
                       =  V[Y|X=x] + V(fhat(x)) + Bias(fhat(x))^2
                          ^^^^^^^^
                          variation
                          around f(x)

    . Expectation across x-values:

          E[PMSE(fhat(X))]  =  E[V[Y|X]]  +  E[MSE(fhat(X))]
                               ^^^^^^^^^     ^^^^^^^^^^^^^^^
                               independent   criterion one
                               of bandwidth  wants to minimize
                                             over bandwidths

      => Minimize expected PMSE instead of MSE

      ??? how do we estimate the PMSE ???

    . CROSS-VALIDATION:
      + Split data into training and holdout data (9/10 and 1/10, or 2/3 and 1/3).
      + Fit fhat(x) to training data.
      + Use fhat(x) to predict y on holdout data.
      + Estimate of E[PMSE]:  ave((yi-fhat(xi))^2 | i in holdout data)
      + Repeat and average again.

      Machine Learning convention:
          Randomly partition data into 10 disjoint tenths;
          use each tenth as holdout;
          average over the ten runs.

      There is evidence that 1/10 is too little for holdout,
      and 10 repetitions are too few for stable averaging.

      Hence a probably better rule:
          Split data randomly into 2/3 and 1/3;
          use 1/3 as holdout;
          repeat 100 to 500 times and average.


----------------


* Cross-validated bandwidth choice:

    . R code for cross-validation of 'smooth()':
        # Function to perform one holdout operation and return one CV value for PMSE
        cv.once <- function(bw, xy, holdout, smooth) {  # 'holdout' = logical vector
          train <- !holdout           # assumed logical, not enumeration
          xy.train <- xy[train,]      # unpacking
          x.holdout <- xy[holdout,1]  #    "
          y.houldout <- xy[holdout,2] #    "
          sm <- smooth(xy.train, bw=bw)                       # fhat on training
          f.houldout <- approx(sm, xout=x.holdout, rule=2)$y  # fhat on holdout
          return(mean((y.houldout-f.houldout)^2))             # PMSE^ from holdout
        }
        # Small difficulty: PMSE requires extrapolation
        # from "training set" to holdout set.
        # This is achieved with the R function 'approx()'; see help(approx).

        # Try on Boston housing data:
        xy <- as.matrix(boston[,c("LSTAT","MEDV")])
        cv.once(.1, xy, c(rep(T,100),rep(F,406)), smooth=smooth)

        # Function to perform 100 random holdouts and return their PMSE values:
        cv <- function(bw, xy, holdout=1/3, times=100, verbose=50, smooth) {
          pmses <- rep(NA, times)
          n <- nrow(xy)
          nholdout <- round(n*holdout)
          holdout.template <- c(rep(T,nholdout), rep(F,n-nholdout))
          for(icv in 1:times) {
            pmses[icv] <- cv.once(bw, xy, holdout=sample(holdout.template), smooth)
            if(verbose>0) if(icv%%verbose==0) cat(icv,"")
          }
          return(pmses)
        }
        cv(bw=.1, xy, smooth=smooth)
        pmses1 <- cv(bw=.1, xy, smooth=smooth)
        hist(pmses1, col="gray")
        # Obtain CV estimates of the PMSE for a large number of bandwidths:
        # Exponential ladder of bandwidths, factor=1.25
        bws <- 1.25^((-25):(+5))
        pmse.bws <- rep(NA, length(bws))
        for(i in 1:length(bws)) {
          pmse.bws[i] <- mean(cv(bw=bws[i], xy, holdout=1/3, times=400, verbose=0, smooth=smooth))
#          pmse.bws[i] <- median(cv(bw=bws[i], xy, holdout=1/3, times=400, verbose=0, smooth=smooth))
          cat("----- done with bandwidth",bws[i],"(",i,")--------------\n")
        }

      # Plot PMSEs as a function of bandwidth:
        par(mfrow=c(1,1))
        plot(bws, pmse.bws, type="h");  lines(bws, pmse.bws)
      # Need to take the log of the bandwidths or, equivalently, the order:
        plot(pmse.bws, type="h", xlab="log bw");  lines(pmse.bws)
      # => In prediction, bias can kill, variance cannot
      # The PMSE profile indicates that the 4th or 6th bandwidth was best:
        plot(xy, pch=16, cex=.5)
        lines(smooth(xy, bw=bws[9]), col="red", lwd=3)
        lines(smooth(xy, bw=bws[11]), col="blue", lwd=3)
      # Seems, though, that more smoothness would please the eyes:
        lines(smooth(xy, bw=bws[14]), col="black", lwd=3)
        lines(smooth(xy, bw=bws[30]), col="black", lwd=3)
        lines(smooth(xy, bw=bws[1]), col="black", lwd=3)
      # => The bias-variance trade-off generated by PMSE minimization with CV
      #    is not visually optimal; it follows data too closely.
      # => Good MSE asks for less bias/more variance than our visual judgment.

      General conclusions:

       --------------------------------------------------
      |    Good performance in prediction (MSE) requires |
      |    following the data more closely               |
      |    i.e., with less bias, more variance           |
      | whereas                                          |
      |    good interpretability requires                |
      |    greater simplicity                            |
      |    i.e., more bias, less variance                |
       --------------------------------------------------

      Which is worse wrt potentially catastrophic performance?

       -------------------------------------------------
      |  Variance is more benign than bias because      |
      |  bias can be unlimited, whereas variance        |
      |  is limited to V[error] by following the data.  |
       -------------------------------------------------

      CV-based choice can be transferred to regression and model
      selection because CV and PMSE minimization can be applied to
      model search in general.

      General conclusion transferred to regression:

      - for prediction:       when in doubt, keep it in
      - for interpretability: when in doubt, throw it out



* LOCAL LINEAR SMOOTHING:

  - Problem with naive local ('locally constant') averaging: boundary bias

  - Solution: Fit local lines in windows, evaluate at points of interest
              (Call the above smooth1() and smooth() 'local constant smoothers'.)

  - Needed: Weighted straight line fits

  - Formulas for weighted line fit given weights  wi >= 0, sum wi = 1:
      Numerically stable version based on centered x and y.
        xbar = sum(w*x)                 # Weighted mean of x values
        ybar = sum(w*y)                 # Weighted mean of y values
        x.y  = sum(w*(x-xbar)*(y-ybar)) # Weighted inner product
        x.x  = sum(w*(x-xbar)^2)        # Weighted squared norm
        b    = x.y/x.x                  # Weighted LS slope
        yhat = ybar + b*(x-xbar)        # Weighted straight line fit
      This solves:  sum_i  wi*(yi - a - b*xi)^2 = min_{a,b}

  - R code:
      smooth.lin <- function(xy, bw=1/5, nloc=51) {
        x <- xy[,1];  y <- xy[,2]          # unpack the data
        xmin <- min(x);   xmax <- max(x)   # needed for the bandwidth
        xs <- seq(xmin, xmax, length=nloc) # the grid points
        fx <- rep(NA, nloc)                # allocate space for the fits
        act.bw <- bw * (xmax-xmin)         # the actual bandwidth
        for(i in 1:nloc) {                 # for all grid points ...
          w <- dnorm(x=x-xs[i],s=act.bw)   # normal or Gaussian kernel weights
          w <- w/sum(w)                    # force sum(w)=1
          xbar <- sum(w*x)                 # see the formulas above...
          ybar <- sum(w*y)
          x.y  <- sum(w*(x-xbar)*(y-ybar))
          x.x  <- sum(w*(x-xbar)*(x-xbar))
          fx[i] <- x.y/x.x*(xs[i]-xbar) + ybar # evaluate the line at the grid point
        }
        xf <- cbind(xs,fx)     # Return the grid and its fitted values
        if(!is.null(colnames(xy))) colnames(xf) <- colnames(xy)
        return(xf)
      }

  - Applied to Boston''s MEDV and LSTAT:

      xy <- boston[,c("LSTAT","MEDV")]
      plot(xy, pch=16, cex=.5)
      lines(smooth.lin(xy, bw=1/5),  col="blue",   lwd=4)
      lines(smooth.lin(xy, bw=1/10), col="green",  lwd=4)
      lines(smooth.lin(xy, bw=1/15), col="brown",  lwd=4)
      lines(smooth(xy, bw=1/5),      col="orange", lwd=4, lty=3)
      lines(smooth(xy, bw=1/10),     col="red",    lwd=4, lty=3)
      lines(smooth(xy, bw=1/20),     col="purple", lwd=4, lty=3)

    ==> Local linear fit with bw=1/5 may have similar bias
        as the local constant fit with bw=1/20,
        but the latter has more variance (wiggles).

  - CV with smooth.lin(): Use larger bandwidths, shift the ladder by +5
        bws <- 1.25^((-20):(+15))
        pmse.bws <- rep(NA, length(bws))
        for(i in 1:length(bws)) {
          pmse.bws[i] <- mean(cv(bw=bws[i], xy, holdout=1/3, times=100, verbose=0, smooth=smooth.lin))
          cat("----- done with bandwidth",bws[i],"(",i,")--------------\n")
        }
      # Plot PMSEs as a function of bandwidth:
        par(mfrow=c(1,1))
        plot(pmse.bws, type="h", xlab="log bw");  lines(pmse.bws)
      # Even for large bandwidths the bias is smaller,
      # about 40 compared to about 80 for the running mean 'smooth()'.
      # What do the smooths near the PMSE bottom look like?
        plot(xy, pch=16, cex=.5)
        lines(smooth.lin(xy, bw=bws[6]), col="blue", lwd=2)


----------------


* 2D-SMOOTHING:

  - Two predictors x1 and x2.
  - Local averaging:
    . Define 'windows' in the x1-x2 plane
        Dilemma: Not only what size, but what shape of window?
                 ==> We use Euclidean neighborhoods after standardizing x1 and x2.
                 ==> Can be done by using bandwidths prop. to sd(x1) and sd(x2).
    . Need double-grid of points for evaluation.
    . R code:
        smooth2d <- function(xxy, bw1=1/3, bw2=bw1, nloc1=21, nloc2=nloc1) {
          x1 <- xxy[,1];  x2 <- xxy[,2];  y <- xxy[,3]  # unpack
          x1min <- min(x1);   x1max <- max(x1)          # for grid
          x2min <- min(x2);   x2max <- max(x2)          # dito
          x1s <- seq(x1min, x1max, length=nloc1)        # x1 grid
          x2s <- seq(x2min, x2max, length=nloc2)        # x2 grid
          fxx <- matrix(NA, nrow=nloc1, ncol=nloc2)     # fits: nloc1*nloc2
          act.bw1 <- sd(x1) * bw1                       # x1 bandwidth
          act.bw2 <- sd(x2) * bw2                       # x2 bandwidth
          for(i in 1:nloc1) {                           # loop over all combinations..
            for(j in 1:nloc2) {                         # of x1 and x2 grid values
              w <- dnorm(x=x1-x1s[i],sd=act.bw1) *      # weights: product of
                   dnorm(x=x2-x2s[j],sd=act.bw2)        # Gaussian kernels
              fxx[i,j] <- sum(y*w) / sum(w)
            }
          }
          return(list(x=x1s, y=x2s, z=fxx))  # return the two grids and the fits
        }                                    # note: length(z)=length(x)*length(y)

        xxy <- boston[,c("RM","LSTAT","MEDV")]
        sm2d <- smooth2d(xxy)
      Perspective plots, first with hidden lines shown: not very good...
        persp(sm2d)
      With hidden lines removed and the surface colored and shaded:
        persp(sm2d, ltheta=0, lphi=45, shade=0.9, col="orange")
      Animate to search for informative views:
        for(iplt in 0:10000) {
          persp(sm2d, theta=0.1*iplt, phi=15+iplt*0.02, r=sqrt(3),
                shade=0.5, col="yellow",
                main=iplt, xlab="RM", ylab="LSTAT", zlab="MEDV") -> res
          points(trans3d(xxy[,1], xxy[,2], xxy[,3], pmat = res),
                 col=2, pch=16, cex=.5)
        }
      We added the 3D points, not very successfully, because we can''t tell
      which are above and which below the surface.

  - [Fun project: Make this movie interactive with 'getGraphicsEvent()'
     by controlling 'theta' and 'phi' with 'onMouseDown=...'
     and maybe 'r' with 'onKeybd=...' to move in and out.]

  - Lessons:
    . 2D smoothing is somewhat less successful than 1D smoothing.
    . Beware of extrapolation in 2D: Lot''s of empty space
    . 3D smoothing: multiply the problems...


----------------



* OTHER SMOOTHING PRINCIPLES:

  - Basis expansions: expand a predictor x to a basis p1(x),...,pk(x)
    and perform multiple regression onto X <- cbind(p1(x),...,pk(x))
    Examples:

    . Global polynomials (-):
            |y - (b0 + b1*x + b2*x^2 + ... + bp*x^p)|^2 = min_b
          => multiple linear regression with monomial basis
          X <- cbind(1, x, x^2, ..., x^k)
          # Not numerically stable!  Use orthogonal polynomials instead!
          xy <- boston[,c("MEDV","LSTAT")]
          xy <- cbind(xy, xy[,"LSTAT"]^2, xy[,"LSTAT"]^3, xy[,"LSTAT"]^4, xy[,"LSTAT"]^5, xy[,"LSTAT"]^6)
          ord <- order(xy[,"LSTAT"])
          xy <- xy[ord,]
          plot(xy[,c("LSTAT","MEDV")], pch=16, cex=.5)
          with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:3])), lwd=2, col="red"))
          with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:4])), lwd=2, col="red"))
          with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:5])), lwd=2, col="red"))
          with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:6])), lwd=2, col="red"))
          with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:7])), lwd=2, col="red"))
      # ==> Surprisingly well-behaved.  For higher degrees, don't expect this.

    . Piecewise polynomials: REGRESSION SPLINES
          => multiple linear regression with piecewise polynomial basis
      + 1st 0rder:  X <- cbind(1, x, pmax(0,x-t1), pmax(0,x-t2), ..., pmax(0,x-tk))
      + 2nd 0rder:  X <- cbind(1, x, x^2, pmax(0,x-t1)^2, pmax(0,x-t2)^2, ..., pmax(0,x-tk)^2)
      + 3rd 0rder:  X <- cbind(1, x, x^2, x^3, pmax(0,x-t1)^3, pmax(0,x-t2)^3, ..., pmax(0,x-tk)^3)
        ...

  - Penalization: smoothing splines/kernelizing
                  based on generalized Ridge regression

    . General:

          fhat  =  argmin_f  sum((y - f)^2) + lambda * t(f) %*% Q %*% f

              where  Q  is symmetric and +semi-definite.

          ==> Generalized Ridge Regression with design matrix  X = diag(n).

          "Normal Equations":      - y + f + lambda Q %*% f  =  0

               --------------------------------------------
              |                                            |
              |   fhat = solve(diag(n) + lambda*Q) %*% y   |
              |                                            |
               --------------------------------------------


    . Two intuitions for generating penalties:

      + Penalize generalized differences of nearby locations

          Examples, assuming x sorted and unique
            x <- boston[,"LSTAT"]
            y <- boston[,"MEDV"]
            n <- length(x)
            # parallel sort of (x,y) according to x:
            ord <- order(x)
            x <- x[ord]
            y <- y[ord]
            # make unique x values by averaging x-ties:
            xu <- tapply(x, x, mean)   # same as unique(x):   max(abs(xu - unique(x)))
            yu <- tapply(y, x, mean)
            nu <- length(xu)
            # Dirty: We'll ignore the increased precision of yu-values at ties...
            #        We'd really have to use weighted penalized LS...

          Example 1: Penalize first order differences

            D1 <- cbind(diag(nu-1),0) - cbind(0,diag(nu-1));   D1[1:3,1:4];  dim(D1)
            # Question: What is  D1 %*% f ?
            Q <- t(D1) %*% D1
            # Question: What is  t(f) %*% Q %*% f ?
            # Now construct smooths according to the above formula:
            lambda <- 1000
            S <- solve(diag(nu) + lambda*Q)
            fu.hat <- S %*% yu
            plot(xu, yu, pch=16)
            lines(xu, fu.hat, lwd=2, col="red")
            # Try larger lambdas and repeat...

          Example 2: Penalize second order differences

            D2 <- cbind(0,D1[-1,-1]) - cbind(D1[-1,-1],0);   D2[1:3,1:4];  dim(D2)
            # Question: What is  D2 %*% f ?
            Q <- t(D2) %*% D2
            # Question: What is  t(f) %*% Q %*% f ?
            # Now construct smooths according to the above formula:
            lambda <- 100000
            S <- solve(diag(nu) + lambda*Q)
            fu.hat <- S %*% yu
            plot(xu, yu, pch=16)
            lines(xu, fu.hat, lwd=2, col="red")
            # Try other lambdas and repeat...

         Criticisms:

           (1) We penalized small and large spacings equally.
               This is not right.
               ==> Penalize smaller spacings more by weighting with 1/space^2.
                   I.e., penalize difference quotients, not differences.
                                            Q = D1^T %*% diag(1/diff(xu)^2) %*% D1

           (2) If the responses  y  are thought of as  y = f(x) + eps
               with independent errors eps, then differences are not
               independent:  y[2]-y[1] = f(x[2]) - f(x[1]) + eps[2] - eps[1]
                             y[3]-y[2] = f(x[3]) - f(x[2]) + eps[3] - eps[2]
               are correlated:  cor(eps[2]-eps[1], eps[3] - eps[2]) = -0.5
               ==> Decorrelate the difference quotients before penalizing
                   their sum of squares:    Q = D1^T %*% C^{-1} %*% D1

           Both criticisms are taken care of by solving the spline approach:

               f  =  argmin_f  sum (yi - f(xi))^2 + lambda * Integral (f''(x))^2 dx

           ==> Smoothing splines in Sobolev function spaces.


      + Kernelizing in ML (Machine Learning): Kernels as similarity functions

         Principle: For objects/cases i and j devise a similarity measure K[i,j]
                    such that  K  is symmetric and +semi-definite.
                    Then use  Q = K^{-1}  as the penalty matrix and  X = diag(n).

         Example of Gaussian Kernel:
                    x, y as above (LSTAT, MEDV from Boston housing data)
                    Define a similarity measure by
                    K[i,j] ~ exp(-(x[i]-x[j])^2/(2*sig^2))
                    K[i,i] = max K;  K[i,j] ~ 0 if x[i] and x[j] are far apart.
            # Code:
            s <- sd(x)*1                              # Kernel width: a smoothing parameter
            K <- dnorm( outer(x, x, FUN="-"), s=s)    # The kernel matrix
            # Visualize the kernel as a function on the x-x plane:
            xu <- unique(x)
            nu <- length(xu)
            sel <- match(xu,x)
            Ku <- K[sel,sel]
            plot(outer(xu, rep(1,nu)), outer(rep(1,nu),xu), xlab="x", ylab="x",
                 col=heat.colors(100)[1+Ku/max(Ku)*99], pch=16, cex=.3)
            # Ok, with s = sd(x) we think we have a reasonable kernel width.
            # We would like to form K^{-1}, but this would be catastrophic:
            K.eig <- eigen(K);  K.eig$val[1:20]
            par(mfrow=c(4,3), mar=rep(1,4))
            for(j in 1:12) {
                plot(x, K.eig$vec[,j], type="b", pch=16, cex=.5, xaxt="n", yaxt="n", xlab="", ylab="")
                text(par()$usr[1], par()$usr[3], lab=round(K.eig$val[j],6), adj=c(-1,-1)) }
            # Select eigenvalues > 10^{-6} of total (= trace) and preproject onto this subspace:
            sel <- (K.eig$val > 1E-8 * sum(diag(K)));  sum(sel)
            lambda <- .0001
            S <- K.eig$vec[,sel] %*% diag(1/(1 + lambda/K.eig$val[sel])) %*% t(K.eig$vec[,sel])
            f <- S %*% y
            par(mfrow=c(1,1))
#---------------------------------------------------------------- end of class; didn't work last time:
            plot(x, y, pch=16, ylim=c(0,max(y)))
            lines(x, f, lwd=2, col="red")
            # ==> Results quite satisfactory, but require
            #     considerable tuning of two parameters:
            #     kernel width s and penalty multiplier lambda.


         Remaining question:
           Who says that Gaussian kernels are non-negative semi-definite?
           They are, but why?
           ==> Kernelizing algebra !

    . Degrees of freedom for smoother:

      + The following produces a more intuitive way of describing the
        flexibility of the smoother

      + Observation:
          For fixed bandwidth, many smoothers are linear in the sense that
             fhat = S %*% y
          for some nxn smoother matrix S.
            Projection smoothers:  S = t(X) %*% solve(t(X)%*%X) %*% X
                                       where X is a basis expansion of x
            Shrinking smoothers:   S = solve(I + lambda*Q)
                                       where Q = difference penalty
                                                 or inverse of a ML kernel

      + Generalized degrees of freedom for linear smoothers:

             df  :=  trace(S)

        Example: Obtain df(S) above for various values of lambda
                 and show the corresponding dfs --- but we can get the trace
                 cheaply by observing:  eval(S) = 1/(1+lambda/eval(K))
                               Reason:       S  = (I + lambda*K^{-1})^{-1}
                 [Recall a homework with a section on "Spectral Theory".]

          sapply(c(0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10),     # grid of lambdas
                 function(lambda) { sum(diag(1/(1 + lambda/K.eig$val[sel]))) } )
          # E'vals of Q(lambda):        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          ##     [1] 11.050651  9.866161  8.549137  7.057659  5.338530  3.333074
          # lambda = 0.00001    0.0001    0.001     0.01      0.1       1.0


  - 'Trend Filtering': This is a recent development in smoothing.
    Ryan Tibshirani (son of Rob Tibs) has excellent work on this.
    It competes with the most advanced smoother we know of:
    Sarah van de Geer''s and Enno Mammen''s "Locally Adaptive Regression Splines" (1997)
    which is computationally of order n^3.


  - Some distinctions and confusions:

    . Parzen kernels =/= RKHS/ML kernels

      Our initial smoothers are "Parzen kernel smoother".
      They are not to be confused with kernels of RKHS theory!

      + Parzen kernels regularize by mimicking convolution.
        Their principle is local averaging.

      + RKHS methods regularize through generalized Ridge regression.
        Their principle is penalizing rough functions.

    . Regression splines  =/= smoothing spline

      + Regression splines: a basis expansion approach
          Estimation: LS
          Smoothing parameter: number of basis vectors

      + Smoothing/kernelizing splines: a penalty approach
          Estimation: penalized LS
          Smoothing parameter: penalty multiplier

      + The two approaches can be combined, as proposed by Brian Marx'':
          Do not saturate the basis expansion (number of basis vectors < n)
          but still penalize higher-order basis vectors


================================================================