================================================================

* CURVE ESTIMATION / SMOOTHING / BANDWIDTH SELECTION / CROSS-VALIDATION

- Idea: given x-y data (quantitative-quantitative)
  . Drop the assumption of linearity:  y = (a + bx) + eps
  . Assume smoothness only:            y = f(x) + eps,  f() smooth
  . Smooth => locally linear:          f(x+h) ~=~ f(x) + Df(x)*h + O(h^2)
  . If we wish to infer y=f(x) from data (xi,yi) (i=1,...,n), we have to
    form an estimate fhat(x) from the (xi,yi) with xi near x.
  . First try, a running mean:  fhat(x) = ave(yi; |xi-x| < h);
    if no xi falls within h of x, set fhat(x) = NA.
    'Smooth' is really not smooth: the running mean jumps as points
    enter and leave the window.
  . Variations: running median and quantiles for 50%, 90% prediction bands!
      plot(xy, pch=16, cex=.5)
      lines(smooth1(xy, func=median),                      col="black",  lwd=3)
      lines(smooth1(xy, func=function(y) quantile(y,.25)), col="gray50", lwd=3)
      lines(smooth1(xy, func=function(y) quantile(y,.75)), col="gray50", lwd=3)
      lines(smooth1(xy, func=function(y) quantile(y,.05)), col="gray80", lwd=3)
      lines(smooth1(xy, func=function(y) quantile(y,.95)), col="gray80", lwd=3)
  . 'Real' smoothness: kernel smoothing
      smooth <- function(xy, bw=1/5, nloc=51) {
        x <- xy[,1];  y <- xy[,2]
        xmin <- min(x);  xmax <- max(x)
        xs <- seq(xmin, xmax, length=nloc)
        fx <- rep(NA, nloc)
        act.bw <- diff(range(x))*bw         # actual bandwidth
        for(i in 1:nloc) {
          w <- dnorm(x=(x-xs[i])/act.bw)    # proportional to dnorm(x, mean=xs[i], sd=act.bw)
          w <- w/sum(w)                     # so we have: sum(w)==1
          fx[i] <- sum(y*w)
        }
        return(cbind(x=xs, y=fx))
      }
  . Apply to the same data with 4 bandwidths:
      bws  <- c(2, .2, .03, .005)
      cols <- c("red","orange","green","blue")
      plot(xy, pch=16, cex=.5)
      for(i in 1:4)  lines(smooth(xy, bw=bws[i]), col=cols[i], lwd=3)
    Q: What would be a principled way to select the bandwidth?  (stay tuned...)
  . Simulation to show the variability of smoothers:  y = 4*x^2 + eps
      bws <- c(2, .2, .03, .005)
      N <- 100
      x <- sort(runif(N))
      Nsim <- 200
      par(mfrow=c(2,2), mgp=c(1.5,.5,0), mar=c(3,3,1,1))
      for(ibw in 1:4) {
        plot(x=c(0,1), y=c(-2,6), type="n", pch=16, cex=.5, xlab="", ylab="")
        xy <- cbind(x=x, y=4*x^2 + rnorm(N))
        points(xy, pch=16, cex=.5, col="gray30")
        for(isim in 1:Nsim) {
          xy <- cbind(x=x, y=4*x^2 + rnorm(N))
          lines(smooth(xy, bw=bws[ibw]), col=cols[ibw], lwd=1)
        }
        lines(x, 4*x^2, lwd=2)
      }
    You should be looking at the widths of the color bands (variance!) and at
    how much the bands systematically deviate from the target (bias!).
    The wiggliness of the functions is a side effect of variance.
  . Lesson: BIAS-VARIANCE trade-off !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Really: bias-SE trade-off  (variance is dataset-to-dataset).
        small bandwidth           large bandwidth
             ==>                       ==>
        large variance (-)        small variance (+)
        small bias     (+)        large bias     (-)
  . Mean squared error and bias-variance decomposition:
      Model:     yi = f(xi) + epsi,   E[epsi] = 0
      BIAS:      Bias(fhat(x)) = E[fhat(x)] - f(x)
      VARIANCE:  V(fhat(x))    = E[ (fhat(x) - E[fhat(x)])^2 ]
      MSE:       MSE(fhat(x))  = E[ (fhat(x) - f(x))^2 ]
    Bias-variance decomposition of the MSE:
      ---------------------------------------------------
      |  MSE(fhat(x)) = V(fhat(x)) + Bias(fhat(x))^2    |
      ---------------------------------------------------
    Proof:  E[ (Z - c)^2 ] = V[Z] + ( E[Z] - c )^2,
            applied with Z = fhat(x) and c = f(x).
    ==> The MSE is an overall measure that trades off bias and variance,
        but... the MSE is not known.
  . PMSE or 'predictive mean squared error':
      PMSE(fhat(x)) = E[ (Y - fhat(X))^2 | X=x ]
      (Y,X)  = future data
      fhat() = estimate based on past i.i.d. data (xi,yi)
      future and past data are assumed independent of each other
    Decomposition, pointwise at one x:
      PMSE(fhat(x)) = V[Y|X=x] + MSE(fhat(x))
                    = V[Y|X=x] + V(fhat(x)) + Bias(fhat(x))^2
                      ^^^^^^^^
                      variation around f(x)
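    A small simulation sketch (not in the original notes) that checks the
    bias-variance decomposition of the MSE numerically for the kernel smoother
    'smooth()' above, at one grid point, with true f(x) = 4*x^2 as in the
    simulation:
      set.seed(1)
      N <- 100;  Nsim <- 2000;  bw <- .2
      x  <- sort(runif(N))
      xs <- seq(min(x), max(x), length=51)          # same grid as smooth() uses
      i0 <- 26                                      # look at the middle grid point
      fhat0 <- replicate(Nsim, smooth(cbind(x=x, y=4*x^2 + rnorm(N)), bw=bw)[i0,2])
      bias0 <- mean(fhat0) - 4*xs[i0]^2             # empirical bias at xs[i0]
      var0  <- mean((fhat0 - mean(fhat0))^2)        # empirical variance
      mse0  <- mean((fhat0 - 4*xs[i0]^2)^2)         # empirical MSE
      c(mse=mse0, var.plus.bias2=var0 + bias0^2)    # the two should agree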
  . Expectation across x-values:
      E[PMSE(fhat(X))] = E[V[Y|X]]  +  E[MSE(fhat(X))]
                         ^^^^^^^^^     ^^^^^^^^^^^^^^^
                         independent   criterion one
                         of bandwidth  wants to minimize
                                       over bandwidths
    => Minimize the expected PMSE instead of the MSE.
    ???  How do we estimate the PMSE  ???

  . CROSS-VALIDATION:
    + Split the data into training and holdout data (9/10 and 1/10, or 2/3 and 1/3).
    + Fit fhat(x) to the training data.
    + Use fhat(x) to predict y on the holdout data.
    + Estimate of E[PMSE]:  ave( (yi-fhat(xi))^2 | i in holdout data )
    + Repeat and average again.
    Machine Learning convention: Randomly partition the data into 10 disjoint
    tenths; use each tenth as holdout; average over the ten runs.
    There is evidence that 1/10 is too little for the holdout and that 10
    repetitions are too few for stable averaging.  Hence a probably better rule:
    Split the data randomly into 2/3 and 1/3; use the 1/3 as holdout;
    repeat 100 to 500 times and average.

----------------

* Cross-validated bandwidth choice:

. R code for cross-validation of 'smooth()':

  # Function to perform one holdout operation and return one CV value for the PMSE:
  cv.once <- function(bw, xy, holdout, smooth) {        # 'holdout' = logical vector
    train <- !holdout                                   # assumed logical, not enumeration
    xy.train  <- xy[train,]                             # unpacking
    x.holdout <- xy[holdout,1]                          #   "
    y.holdout <- xy[holdout,2]                          #   "
    sm <- smooth(xy.train, bw=bw)                       # fhat on training
    f.holdout <- approx(sm, xout=x.holdout, rule=2)$y   # fhat on holdout
    return(mean((y.holdout-f.holdout)^2))               # PMSE^ from holdout
  }
  # Small difficulty: the PMSE estimate requires evaluating fhat at the holdout
  # x-values, i.e., interpolating (and, with rule=2, extrapolating at the ends)
  # from the training-set grid.
  # This is achieved with the R function 'approx()'; see help(approx).

  # Try on the Boston housing data:
  xy <- as.matrix(boston[,c("LSTAT","MEDV")])
  cv.once(.1, xy, c(rep(T,100),rep(F,406)), smooth=smooth)

  # Function to perform 100 random holdouts and return their PMSE values:
  cv <- function(bw, xy, holdout=1/3, times=100, verbose=50, smooth) {
    pmses <- rep(NA, times)
    n <- nrow(xy)
    nholdout <- round(n*holdout)
    holdout.template <- c(rep(T,nholdout), rep(F,n-nholdout))
    for(icv in 1:times) {
      pmses[icv] <- cv.once(bw, xy, holdout=sample(holdout.template), smooth)
      if(verbose>0)  if(icv%%verbose==0)  cat(icv,"")
    }
    return(pmses)
  }
  cv(bw=.1, xy, smooth=smooth)
  pmses1 <- cv(bw=.1, xy, smooth=smooth)
  hist(pmses1, col="gray")

  # Obtain CV estimates of the PMSE for a large number of bandwidths:
  # exponential ladder of bandwidths, factor = 1.25
  bws <- 1.25^((-25):(+5))
  pmse.bws <- rep(NA, length(bws))
  for(i in 1:length(bws)) {
    pmse.bws[i] <- mean(cv(bw=bws[i], xy, holdout=1/3, times=400, verbose=0, smooth=smooth))
    # pmse.bws[i] <- median(cv(bw=bws[i], xy, holdout=1/3, times=400, verbose=0, smooth=smooth))
    cat("----- done with bandwidth",bws[i],"(",i,")--------------\n")
  }

  # Plot the PMSEs as a function of bandwidth:
  par(mfrow=c(1,1))
  plot(bws, pmse.bws, type="h");  lines(bws, pmse.bws)
  # Need to plot against the log of the bandwidths or, equivalently, their order:
  plot(pmse.bws, type="h", xlab="log bw");  lines(pmse.bws)
  # => In prediction, bias can kill, variance cannot.

  # The PMSE profile indicates that the 4th or 6th bandwidth was best:
  plot(xy, pch=16, cex=.5)
  lines(smooth(xy, bw=bws[9]),  col="red",  lwd=3)
  lines(smooth(xy, bw=bws[11]), col="blue", lwd=3)
  # Seems, though, that more smoothness would please the eyes:
  lines(smooth(xy, bw=bws[14]), col="black", lwd=3)
  lines(smooth(xy, bw=bws[30]), col="black", lwd=3)
  lines(smooth(xy, bw=bws[1]),  col="black", lwd=3)
  # => The bias-variance trade-off generated by PMSE minimization with CV
  #    is not visually optimal; it follows the data too closely.
  # => Good MSE asks for less bias/more variance than our visual judgment.
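  As a possible way to read off the CV choice programmatically (a sketch, not
  in the original notes; it reuses 'bws', 'pmse.bws', and 'cv()' from above):
    i.best <- which.min(pmse.bws)     # index of the CV-minimizing bandwidth
    bws[i.best]                       # the selected bandwidth
    # Monte Carlo standard error of the CV estimate at the minimum, to judge
    # how flat the PMSE profile is near its bottom:
    pmses.best <- cv(bw=bws[i.best], xy, holdout=1/3, times=400, verbose=0, smooth=smooth)
    c(mean=mean(pmses.best), se=sd(pmses.best)/sqrt(length(pmses.best)))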
  General conclusions:
    --------------------------------------------------
    |  Good performance in prediction (MSE) requires |
    |  following the data more closely,              |
    |  i.e., with less bias, more variance,          |
    |  whereas                                       |
    |  good interpretability requires                |
    |  greater simplicity,                           |
    |  i.e., more bias, less variance.               |
    --------------------------------------------------
  Which is worse with respect to potentially catastrophic performance?
    -------------------------------------------------
    |  Variance is more benign than bias because    |
    |  bias can be unlimited, whereas variance      |
    |  is limited to V[error] by following the data.|
    -------------------------------------------------
  CV-based choice can be transferred to regression and model selection
  because CV and PMSE minimization can be applied to model search in general.
  General conclusion transferred to regression:
    - for prediction:       when in doubt, keep it in
    - for interpretability: when in doubt, throw it out

----------------

* LOCAL LINEAR SMOOTHING:

- Problem with naive local ('locally constant') averaging: boundary bias.
- Solution: Fit local lines in windows, evaluate them at the points of interest.
  (Call the above smooth1() and smooth() 'local constant smoothers'.)
- Needed: weighted straight-line fits.
- Formulas for a weighted line fit, given weights wi >= 0 with sum wi = 1:
  numerically stable version based on centered x and y
  (a quick numerical check against lm() follows the Boston example below).
    xbar = sum(w*x)                  # weighted mean of x values
    ybar = sum(w*y)                  # weighted mean of y values
    x.y  = sum(w*(x-xbar)*(y-ybar))  # weighted inner product
    x.x  = sum(w*(x-xbar)^2)         # weighted squared norm
    b    = x.y/x.x                   # weighted LS slope
    yhat = ybar + b*(x-xbar)         # weighted straight-line fit
  This solves:  sum_i wi*(yi - a - b*xi)^2 = min_{a,b}

- R code:
    smooth.lin <- function(xy, bw=1/5, nloc=51) {
      x <- xy[,1];  y <- xy[,2]               # unpack the data
      xmin <- min(x);  xmax <- max(x)         # needed for the bandwidth
      xs <- seq(xmin, xmax, length=nloc)      # the grid points
      fx <- rep(NA, nloc)                     # allocate space for the fits
      act.bw <- bw * (xmax-xmin)              # the actual bandwidth
      for(i in 1:nloc) {                      # for all grid points ...
        w <- dnorm(x=x-xs[i], sd=act.bw)      # normal or Gaussian kernel weights
        w <- w/sum(w)                         # force sum(w)=1
        xbar <- sum(w*x)                      # see the formulas above...
        ybar <- sum(w*y)
        x.y  <- sum(w*(x-xbar)*(y-ybar))
        x.x  <- sum(w*(x-xbar)*(x-xbar))
        fx[i] <- x.y/x.x*(xs[i]-xbar) + ybar  # evaluate the line at the grid point
      }
      xf <- cbind(xs,fx)                      # return the grid and its fitted values
      if(!is.null(colnames(xy)))  colnames(xf) <- colnames(xy)
      return(xf)
    }

- Applied to Boston's MEDV and LSTAT:
    xy <- boston[,c("LSTAT","MEDV")]
    plot(xy, pch=16, cex=.5)
    lines(smooth.lin(xy, bw=1/5),  col="blue",  lwd=4)
    lines(smooth.lin(xy, bw=1/10), col="green", lwd=4)
    lines(smooth.lin(xy, bw=1/15), col="brown", lwd=4)
    lines(smooth(xy, bw=1/5),  col="orange", lwd=4, lty=3)
    lines(smooth(xy, bw=1/10), col="red",    lwd=4, lty=3)
    lines(smooth(xy, bw=1/20), col="purple", lwd=4, lty=3)
  ==> The local linear fit with bw=1/5 may have similar bias as the local
      constant fit with bw=1/20, but the latter has more variance (wiggles).
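- A small sanity check of the weighted line-fit formulas (a sketch, not in the
  original notes): for an arbitrary set of normalized weights on the Boston
  data, the centered formulas should reproduce lm() with 'weights=':
    x <- boston[,"LSTAT"];  y <- boston[,"MEDV"]
    w <- dnorm(x - 10, sd=3);  w <- w/sum(w)        # some normalized kernel weights
    xbar <- sum(w*x);  ybar <- sum(w*y)
    b <- sum(w*(x-xbar)*(y-ybar)) / sum(w*(x-xbar)^2)
    a <- ybar - b*xbar
    c(a, b)
    coef(lm(y ~ x, weights=w))                      # should give the same intercept and slope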
- CV with smooth.lin(): use larger bandwidths, shift the ladder by +5
    bws <- 1.25^((-20):(+15))
    pmse.bws <- rep(NA, length(bws))
    for(i in 1:length(bws)) {
      pmse.bws[i] <- mean(cv(bw=bws[i], xy, holdout=1/3, times=100, verbose=0, smooth=smooth.lin))
      cat("----- done with bandwidth",bws[i],"(",i,")--------------\n")
    }
    # Plot the PMSEs as a function of bandwidth:
    par(mfrow=c(1,1))
    plot(pmse.bws, type="h", xlab="log bw");  lines(pmse.bws)
    # Even for large bandwidths the bias is smaller,
    # about 40 compared to about 80 for the running mean 'smooth()'.
    # What do the smooths near the PMSE bottom look like?
    plot(xy, pch=16, cex=.5)
    lines(smooth.lin(xy, bw=bws[6]), col="blue", lwd=2)

----------------

* 2D-SMOOTHING:

- Two predictors x1 and x2.
- Local averaging:
  . Define 'windows' in the x1-x2 plane.
    Dilemma: not only what size, but what shape of window?
    ==> We use Euclidean neighborhoods after standardizing x1 and x2.
    ==> Can be done by using bandwidths proportional to sd(x1) and sd(x2).
  . Need a double grid of points for evaluation.
  . R code:
      smooth2d <- function(xxy, bw1=1/3, bw2=bw1, nloc1=21, nloc2=nloc1) {
        x1 <- xxy[,1];  x2 <- xxy[,2];  y <- xxy[,3]  # unpack
        x1min <- min(x1);  x1max <- max(x1)           # for the x1 grid
        x2min <- min(x2);  x2max <- max(x2)           # ditto for x2
        x1s <- seq(x1min, x1max, length=nloc1)        # x1 grid
        x2s <- seq(x2min, x2max, length=nloc2)        # x2 grid
        fxx <- matrix(NA, nrow=nloc1, ncol=nloc2)     # fits: nloc1*nloc2
        act.bw1 <- sd(x1) * bw1                       # x1 bandwidth
        act.bw2 <- sd(x2) * bw2                       # x2 bandwidth
        for(i in 1:nloc1) {                           # loop over all combinations
          for(j in 1:nloc2) {                         #   of x1 and x2 grid values
            w <- dnorm(x=x1-x1s[i], sd=act.bw1) *     # weights: product of
                 dnorm(x=x2-x2s[j], sd=act.bw2)       #   Gaussian kernels
            fxx[i,j] <- sum(y*w) / sum(w)
          }
        }
        return(list(x=x1s, y=x2s, z=fxx))   # return the two grids and the fits
      }                                     # note: length(z)=length(x)*length(y)

      xxy <- boston[,c("RM","LSTAT","MEDV")]
      sm2d <- smooth2d(xxy)

    Perspective plots, first with hidden lines shown: not very good...
      persp(sm2d)
    With hidden lines removed and the surface colored and shaded:
      persp(sm2d, ltheta=0, lphi=45, shade=0.9, col="orange")
    Animate to search for informative views:
      for(iplt in 0:10000) {
        persp(sm2d, theta=0.1*iplt, phi=15+iplt*0.02, r=sqrt(3), shade=0.5,
              col="yellow", main=iplt, xlab="RM", ylab="LSTAT", zlab="MEDV") -> res
        points(trans3d(xxy[,1], xxy[,2], xxy[,3], pmat=res), col=2, pch=16, cex=.5)
      }
    We added the 3D points, not very successfully, because we can't tell
    which are above and which are below the surface.

- [Fun project: Make this movie interactive with 'getGraphicsEvent()' by
   controlling 'theta' and 'phi' with 'onMouseDown=...' and maybe 'r' with
   'onKeybd=...' to move in and out.]

- Lessons:
  . 2D smoothing is somewhat less successful than 1D smoothing.
  . Beware of extrapolation in 2D: lots of empty space.
  . 3D smoothing: multiply the problems...
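- An alternative display that is sometimes easier to read than a rotating
  perspective plot (a sketch, not in the original notes; it reuses 'sm2d' and
  'xxy' from above), and which makes the empty-space/extrapolation problem visible:
      image(sm2d, col=heat.colors(50), xlab="RM", ylab="LSTAT")  # flat color map of fitted MEDV
      contour(sm2d, add=TRUE)                                    # iso-lines of the fitted surface
      points(xxy[,1:2], pch=16, cex=.3)                          # the data: shows where we extrapolate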
xy <- boston[,c("MEDV","LSTAT")] xy <- cbind(xy, xy[,"LSTAT"]^2, xy[,"LSTAT"]^3, xy[,"LSTAT"]^4, xy[,"LSTAT"]^5, xy[,"LSTAT"]^6) ord <- order(xy[,"LSTAT"]) xy <- xy[ord,] plot(xy[,c("LSTAT","MEDV")], pch=16, cex=.5) with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:3])), lwd=2, col="red")) with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:4])), lwd=2, col="red")) with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:5])), lwd=2, col="red")) with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:6])), lwd=2, col="red")) with(xy, lines(LSTAT, fitted(lm(MEDV ~ ., data=xy[,1:7])), lwd=2, col="red")) # ==> Surprisingly well-behaved. For higher degrees, don't expect this. . Piecewise polynomials: REGRESSION SPLINES => multiple linear regression with piecewise polynomial basis + 1st 0rder: X <- cbind(1, x, pmax(0,x-t1), pmax(0,x-t2), ..., pmax(0,x-tk)) + 2nd 0rder: X <- cbind(1, x, x^2, pmax(0,x-t1)^2, pmax(0,x-t2)^2, ..., pmax(0,x-tk)^2) + 3rd 0rder: X <- cbind(1, x, x^2, x^3, pmax(0,x-t1)^3, pmax(0,x-t2)^3, ..., pmax(0,x-tk)^3) ... - Penalization: smoothing splines/kernelizing based on generalized Ridge regression . General: fhat = argmin_f sum((y - f)^2) + lambda * t(f) %*% Q %*% f where Q is symmetric and +semi-definite. ==> Generalized Ridge Regression with design matrix X = diag(n). "Normal Equations": - y + f + lambda Q %*% f = 0 -------------------------------------------- | | | fhat = solve(diag(n) + lambda*Q) %*% y | | | -------------------------------------------- . Two intuitions for generating penalties: + Penalize generalized differences of nearby locations Examples, assuming x sorted and unique x <- boston[,"LSTAT"] y <- boston[,"MEDV"] n <- length(x) # parallel sort of (x,y) according to x: ord <- order(x) x <- x[ord] y <- y[ord] # make unique x values by averaging x-ties: xu <- tapply(x, x, mean) # same as unique(x): max(abs(xu - unique(x))) yu <- tapply(y, x, mean) nu <- length(xu) # Dirty: We'll ignore the increased precision of yu-values at ties... # We'd really have to use weighted penalized LS... Example 1: Penalize first order differences D1 <- cbind(diag(nu-1),0) - cbind(0,diag(nu-1)); D1[1:3,1:4]; dim(D1) # Question: What is D1 %*% f ? Q <- t(D1) %*% D1 # Question: What is t(f) %*% Q %*% f ? # Now construct smooths according to the above formula: lambda <- 1000 S <- solve(diag(nu) + lambda*Q) fu.hat <- S %*% yu plot(xu, yu, pch=16) lines(xu, fu.hat, lwd=2, col="red") # Try larger lambdas and repeat... Example 2: Penalize second order differences D2 <- cbind(0,D1[-1,-1]) - cbind(D1[-1,-1],0); D2[1:3,1:4]; dim(D2) # Question: What is D2 %*% f ? Q <- t(D2) %*% D2 # Question: What is t(f) %*% Q %*% f ? # Now construct smooths according to the above formula: lambda <- 100000 S <- solve(diag(nu) + lambda*Q) fu.hat <- S %*% yu plot(xu, yu, pch=16) lines(xu, fu.hat, lwd=2, col="red") # Try other lambdas and repeat... Criticisms: (1) We penalized small and large spacings equally. This is not right. ==> Penalize smaller spacings more by weighting with 1/space^2. I.e., penalize difference quotients, not differences. 
      (2) If the responses y are thought of as y = f(x) + eps with
          independent errors eps, then the differences are not independent:
            y[2]-y[1] = f(x[2]) - f(x[1]) + eps[2] - eps[1]
            y[3]-y[2] = f(x[3]) - f(x[2]) + eps[3] - eps[2]
          are correlated:  cor(eps[2]-eps[1], eps[3]-eps[2]) = -0.5
          ==> Decorrelate the difference quotients before penalizing their
              sum of squares:
                Q = D1^T %*% C^{-1} %*% D1
              (C = covariance matrix of the differenced errors)

      Both criticisms are taken care of by solving the spline approach:
        f = argmin_f  sum_i (yi - f(xi))^2 + lambda * Integral (f''(x))^2 dx
      ==> Smoothing splines in Sobolev function spaces.

    + Kernelizing in ML (Machine Learning): kernels as similarity functions
      Principle: For objects/cases i and j devise a similarity measure K[i,j]
      such that K is symmetric and positive semi-definite.
      Then use Q = K^{-1} as the penalty matrix and X = diag(n).

      Example of a Gaussian kernel:
        x, y as above (LSTAT, MEDV from the Boston housing data)
        Define a similarity measure by
          K[i,j] ~ exp(-(x[i]-x[j])^2/(2*sig^2))
          K[i,i] = max K;   K[i,j] ~ 0 if x[i] and x[j] are far apart.
        # Code:
        s <- sd(x)*1                              # kernel width: a smoothing parameter
        K <- dnorm(outer(x, x, FUN="-"), sd=s)    # the kernel matrix
        # Visualize the kernel as a function on the x-x plane:
        xu <- unique(x)
        nu <- length(xu)
        sel <- match(xu, x)
        Ku <- K[sel, sel]
        plot(outer(xu, rep(1,nu)), outer(rep(1,nu), xu), xlab="x", ylab="x",
             col=heat.colors(100)[1+Ku/max(Ku)*99], pch=16, cex=.3)
        # Ok, with s = sd(x) we think we have a reasonable kernel width.

        # We would like to form K^{-1}, but this would be numerically catastrophic:
        K.eig <- eigen(K);  K.eig$val[1:20]
        par(mfrow=c(4,3), mar=rep(1,4))
        for(j in 1:12) {
          plot(x, K.eig$vec[,j], type="b", pch=16, cex=.5,
               xaxt="n", yaxt="n", xlab="", ylab="")
          text(par()$usr[1], par()$usr[3], lab=round(K.eig$val[j],6), adj=c(-1,-1))
        }
        # Select eigenvalues > 1E-8 of the total (= trace) and project onto this subspace:
        sel <- (K.eig$val > 1E-8 * sum(diag(K)));  sum(sel)
        lambda <- .0001
        S <- K.eig$vec[,sel] %*% diag(1/(1 + lambda/K.eig$val[sel])) %*% t(K.eig$vec[,sel])
        f <- S %*% y
        par(mfrow=c(1,1))
        #------------------------------------------------ end of class; didn't work last time:
        plot(x, y, pch=16, ylim=c(0,max(y)))
        lines(x, f, lwd=2, col="red")
        # ==> Results quite satisfactory, but they require considerable tuning of
        #     two parameters: kernel width s and penalty multiplier lambda.

      Remaining question: Who says that Gaussian kernels are positive semi-definite?
      They are, but why?  ==> Kernelizing algebra!

  . Degrees of freedom for smoothers:
    + The following produces a more intuitive way of describing
      the flexibility of a smoother.
    + Observation: For fixed bandwidth, many smoothers are linear in the sense that
        fhat = S %*% y
      for some nxn smoother matrix S.
        Projection smoothers: S = X %*% solve(t(X)%*%X) %*% t(X)
                              where X is a basis expansion of x
        Shrinking smoothers:  S = solve(I + lambda*Q)
                              where Q = difference penalty or inverse of an ML kernel
    + Generalized degrees of freedom for linear smoothers:  df := trace(S)
      Example: Obtain df(S) above for various values of lambda and show the
      corresponding dfs --- but we can get the trace cheaply by observing:
        eval(S) = 1/(1 + lambda/eval(K))
      Reason:  S = (I + lambda*K^{-1})^{-1}
      [Recall a homework with a section on "Spectral Theory".]
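      A quick numerical check of this shortcut (a sketch, not in the original
      notes), using the S and lambda = .0001 computed above: the brute-force
      trace should match the eigenvalue formula.
        c(brute    = sum(diag(S)),
          shortcut = sum(1/(1 + lambda/K.eig$val[sel])))   # should agree up to roundoff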
        sapply(c(0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10),     # grid of lambdas
               function(lambda) { sum(diag(1/(1 + lambda/K.eig$val[sel]))) } )
               # e'values of S(lambda): ^^^^^^^^^^^^^^^^^^^^^^^^^^
        ## [1] 11.050651  9.866161  8.549137  7.057659  5.338530  3.333074
        ## lambda: 0.00001    0.0001    0.001     0.01      0.1       1.0

- 'Trend Filtering': This is a recent development in smoothing.
  Ryan Tibshirani (son of Rob Tibs) has excellent work on this.
  It competes with the most advanced smoother we know of:
  Sara van de Geer's and Enno Mammen's "Locally Adaptive Regression Splines" (1997),
  which is computationally of order n^3.

- Some distinctions and confusions:
  . Parzen kernels  =/=  RKHS/ML kernels
    Our initial smoothers are "Parzen kernel smoothers".
    They are not to be confused with the kernels of RKHS theory!
    + Parzen kernels regularize by mimicking convolution.
      Their principle is local averaging.
    + RKHS methods regularize through generalized Ridge regression.
      Their principle is penalizing rough functions.
  . Regression splines  =/=  smoothing splines
    + Regression splines: a basis expansion approach
        Estimation:          LS
        Smoothing parameter: number of basis vectors
    + Smoothing/kernelizing splines: a penalty approach
        Estimation:          penalized LS
        Smoothing parameter: penalty multiplier
    + The two approaches can be combined, as proposed by Brian Marx:
      Do not saturate the basis expansion (number of basis vectors < n)
      but still penalize higher-order basis vectors.

================================================================