================================================================
CHAPTER 1 -- INTRO: FURIOUS OVERVIEW OF R, BABY STATISTICAL INFERENCE

- Welcome to STAT 961: STATISTICAL METHODOLOGY
  . Intended for 1st Year PhD students in STATISTICS
  . Not recommended for students in areas other than Stats, Biostats

- Not an applied stats course!
  . This is a thinking person's class about foundational concepts.
  . Used extensively: linear algebra and intuitive probabilistic thinking

- Not an R course, but
  . R is our computational tool.
  . There will be R code in class and in homeworks.
  . If you need to learn R bottom up, audit Stat 405.

- Recommended applied courses
  . Stat 500, Applied Regression and Analysis of Variance (Paul Rosenbaum)
  . Stat 571, Modern Datamining (Linda Zhao)
  . Stat 520, Applied Econometrics I (Paul Shaman)

- Prerequisites:
  . Strong conceptual thinking
  . An understanding of
    * statistical inference
    * elements of probability theory
  . Linear algebra, solid foundation
  . R working knowledge
  . LaTeX typesetting

- First steps into R: see the files 'Stat-961-R-intro.R' and 'Stat-961-R-crib.R'

- A first linear algebra assignment will be posted shortly.

----------------------------------------------------------------

- REVIEW OF R:

* What are the basic data types?
  ... laundry list: integer, boolean/logical, factors, dataframes, vectors,
      character, double, function, list, matrix, array, date
  basic/fundamental: character, numeric, logical (and: complex)
  composite:         vector, matrix, array, list, dataframes
  also: NA, NaN, Inf, -Inf
  note:
      typeof(NA)
      typeof(NaN)

- BASIC TYPES:

  . numeric:            pi; 3.14159; 2
  . character (string): "Someone"; 'Someone'; "Someone's Course"; ""
  . logical (boolean):  FALSE; TRUE; F; T
  . complex:            2+3i

  . For those who know C or Java:
    + In R strings are basic values of type 'character'.
    + Unlike C and Java, there is no distinction between type 'character' and a string type.
    + A character such as "X" is simply a string with just one character.
    + There is no distinction between "X" and 'X':
        "X" == 'X'   # Execute this: TRUE

  . Auxiliary 'basic types': missing and special values
      NA
      NaN; Inf; -Inf   # Special values of type 'numeric'.
      1/0
      -1/0
      0/0
      1+NA
      1+NaN

  . Checking basic types: return values of type logical
      is.numeric(3)
      is.numeric("Someone's name")
      is.character(pi)
      is.character("Someone's name")
      is.logical(0)
      is.logical(F)
      is.logical(NA)     # big surprise!
      is.logical(FALSE)
      is.complex(2+3i)
      is.numeric(2+3i)
      is.na(NaN)
      is.na(Inf)

  . Explicit coercion to another basic type:
      as.character(pi)
      as.character(F)
      as.character(NaN)
      as.character(Inf)
      as.numeric("3.14159")   # very useful!
      as.numeric("three point one four one five nine")
      as.complex(3)
      as.numeric(3+2i)
      as.logical(1)
      as.logical(0)
      as.logical(pi)   # Any non-zero value gets coerced to TRUE
      as.logical(NA)   # NA !!!

  . Default coercion rules: figure them out by experimenting.
      T + 3
      3 | F
      -3 | F
      "a" | T
      1L + pi + 1i
      1L + 2 + 0i
      paste(1,2, sep="")
      paste(1,2)
      paste(T,2, sep="")
      "1" + "2"   # Python programmers, beware!

- COMPOSITE TYPES:

  . Distinguish between composite types containing elements that are of
    + uniform basic types: vectors, matrices, arrays
    + arbitrary types, including composite types: lists, data frames
    ==> Remember these types by remembering what they are good for!
        lists:       e.g., return results from fitting models.
        data frames: statistical data tables with heterogeneous variable types

  . Vectors:
      1:5
      (-1):5
      (-1):(-5)
      seq(-1,-5)
      seq(0,20,by=.5)
      c(3,Inf,NaN,-10)
      c("A","Buja")
      letters
      LETTERS
      c(T,F,F,T)
      What about this:
      c("A",1,FALSE)   ## !!!!!!!
      c("A",NaN)
      c(FALSE,NaN)
      c(FALSE,NA)
      c(1,NA)
      ==> Coercion to a unique basic type according to the default coercion rules

  . Matrices:
      mat1 <- matrix(NA, nrow=3, ncol=2);  mat1
      # column- vs row-major storage --- which does R use?
      mat2 <- matrix(1:6, nrow=3);  mat2   # Answer: ...
      # force row-major:
      mat2x <- matrix(1:6, nrow=3, byrow=TRUE);  mat2x
      mat3 <- cbind(1:3,4:6);  mat3
      mat4 <- rbind(c(1,4),c(2,5),c(3,6));  mat4
      cbind(1:3, 5)
      cbind(1:5, 1:2)

  . Arrays:
      arr1 <- array(1:12, dim=c(2,3,4));  arr1
      arr2 <- array(rnorm(20), dim=c(2,5,2));  arr2

  . Lists:
      list(1:5, matrix(runif(10),5,2), letters, list(NaN,1:3,"A"))
      lst <- list(1:5, matrix(runif(10),5,2), letters, list(NaN,1:3,"A"))
      lst[1]
      ==> Unconstrained content.
          Useful, e.g., for returning output from model fitting.

  . Data frames:
      df <- data.frame("Var1"=1:3, "Var2"=c("A","B","C"), "Var3"=c(T,TRUE,F))
      df
    Purpose: statistical data tables with heterogeneous variable types.
    data frame = a constrained list, such that all elements are vectors
    (statistical variables) of the same length but arbitrary type:
      is.data.frame(df)
      is.list(df)
      is.matrix(df)
    Coercion to matrix:
      as.matrix(df)

* What are the ways of indexing a vector (and matrix and array)?

  - Positive numeric indexing for selection:
      vec <- 1:10
      vec
      vec[4]
      vec[3:5]
      letters[c(1,2,1,2,1,2)]

  - Negative numeric indexing for deselection:
      vec[-c(1,5,7)]
      vec[-c(1,5,1,7,7)]

  - Logical selection:
      vec[c(F,F,F,F,T,T,T,F,F,F)]
      vec[c(F,T)]
      vec[c(F,T,F)]
      vec[T]
      vec[F]
      vec[vec>5]
      mat[c(T,T,F,T,F),]
      mat[c(T,T,F,T,F),c(T,F)]
      mat[c(T,T,F,T,F),-2]

  - Selection by name for named components:
      vec <- c("jim"=53000, "bill"=89000, "ed"=101000)
      vec[c("jim","ed")]
      mat <- cbind("income"=c(10000,20000), "tax"=c(1000,2000))
      mat[,"income"]
      mat[2,"tax"]
      arr <- array(1:12, dim=c(2,3,2),   # Most general mechanism
                   dimnames=list(c("Y","N"), c("Low","Med","High"), c("Past","Fut")))
      arr["Y","Med","Fut"]
      arr["Y","Med",]
      arr[,"Med",]
      ==> Extremely powerful mechanism!!!!!!!!
          ('associative arrays', 'hash tables' in CS)

    # Review the four modes of bracket selection/indexing!
    # They are a powerful feature of R!

    # Exercise:
    x <- sample(letters, size=10000, replace=TRUE, prob=1:length(letters))
    # Form vector y as long as x, which has the letter replaced
    # by its frequency in x.  (A sketch solution appears a bit further below.)
    ...

* Asking R questions:
    a <- LETTERS[11:20]
    str(a)       # What structure?
    class(a)     # What class, type?
    typeof(a)    # Subtle difference to 'class'
    is.numeric(a)
    is.logical(a)
    is.character(a)
    apropos("plot")        # difficulty: knowing what to ask for
    apropos("inver")       # looking for matrix inversion?  doesn't work
    apropos("data.frame")
    apropos("^is[.]")      # What object names start with 'is.'?
                           # ==> regular expressions, see below
    apropos("is.")         # Not what you wanted!  Learn regular expressions!

* Questions:

  - Implicit coercion: What do the following lines do?
      sum(c(T,F,T))
      mean(c(T,F,T))
      x <- sample(c(T,F), size=1000, replace=T)
      sum(x)    # meaning?
      mean(x)   # meaning?
      x <- runif(1000)
      sum(x < 0.7)
      mean(x < 0.7)
    This is an important idiom!!!  Remember it.
    It allows you to count how many times a condition is met.

  - What do you obtain if you extract a row from a matrix?
      mat <- cbind(1:5,11:15)
      mat[2,]
      class(mat[2,])
      mat[2,,drop=F]          #!!!!!!!!!!!!!!!!!!!!!!!!!!
      class(mat[2,,drop=F])
    The argument 'drop=FALSE' may spare you endless debugging.
    The decision of Chambers et al. to coerce px1 and 1xp matrices to vectors
    by default was very controversial at the time.
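
  [Added sketch, not in the original notes]  One possible solution to the
  letter-frequency exercise above, combining table() with indexing by name
  (the 'associative array' mechanism just reviewed):
    x <- sample(letters, size=10000, replace=TRUE, prob=1:length(letters))
    freq <- table(x)          # named counts; the names are the letters
    y <- as.vector(freq[x])   # lookup by name: each letter replaced by its count in x
    head(cbind(x, y))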
  - What do you obtain if you extract a row from a data frame?
      df <- data.frame(1:5,letters[1:5])
      df[2,]
      class(df[2,])
    Why is this so?  ...

  - What manipulations can be applied to matrices and dataframes alike?
      df <- data.frame(a=1:3, b=letters[1:3], c=LETTERS[1:3])
      mx <- cbind(x=letters[1:4], y=LETTERS[1:4])
      class(df);  class(mx)
      df[1:2,]
      mx[1:2,]
      df[,1]
      mx[,1]
      ncol(df);  ncol(mx)
      nrow(df);  nrow(mx)
      colnames(df);  colnames(mx)
      rownames(df) <- paste("Rowname",1:nrow(df),sep="_")   # Assignment!
      rownames(df)
      rownames(mx)
      dimnames(df)
      dimnames(mx)

  - How do matrices and dataframes differ in terms of manipulations
    (1) that you can do on one but not the other, or
    (2) that have different effects?

    . Matrix multiplication when numeric:
        # Lin alg for matrices and vectors:
        mat1 <- cbind(1:3,11:13);  vec <- rep(1,2)
        mat1 %*% vec
        # Lin alg does not work for data frames!

    . What happens here?
        length(mx)
        length(df)

    . What happens here?
        df <- data.frame(a=1:4, b=letters[1:4], c=LETTERS[1:4])   # reminder
        mx <- cbind(x=letters[1:4], y=LETTERS[1:4])               # reminder
        mx[2]       # matrix as vector
        mx[5]       # matrix as vector
        class(mx[2])
        df[[2]]     # same as df[,2]
        df[1:2]     # same as df[,1:2] (see below)
        df[2]       # same as df[,2,drop=FALSE] (see below)
        class(df[2])
        mx[4]
        df[4]

    . And here?
        df <- data.frame(a=1:3, b=letters[1:3], c=LETTERS[1:3])
        t(df)

* Lists, one more time:

  - Accessing list elements versus sublists:
      lst <- list(1:3, letters[1:4], TRUE)
      lst[[2]]   # This is a ...
      lst[2]     # This is a ...
      lst[1:2]   # This is a ...
      # The following behaviors are new to the instructor:
      lst[[c(2,4)]]   # "d"
      lst[[1:2]]      # 2
      lst[[c(3,1)]]   # TRUE
      # This mechanism apparently allows descent in the tree:
      # access an element of an element.
      # [Please report to the instructor if you don't see the above returned values.]
      # Experiment: Construct a more deeply nested list (list in list in list ...)
      # and find out whether tree descent works to deeper levels than 2.

  - Named list elements:
      lst <- list(a=1:3, bb=letters[1:4], ccc=TRUE)
      lst$a
      lst$bb
      lst$"a"
      lst$'bb'
      lst[["a"]]
      lst[c("a","bb")]
      lst["a"]   # This is a ..., whereas the previous were ...
    The dollar syntax, $, is only for selection of single elements,
    as is the double-bracket syntax, [[]].
    The only syntax for selecting sublists is single brackets, [].

  - Dataframes are implemented as constrained lists:
      df
      class(df)
      is.list(df)
    Is it therefore a surprise that the following works?
      df[[2]]          # what is this equivalent to?  ...
      df[2:3]          # "
      class(df[2:3])   # surprised?  explain: ...
      class(df[1])     # "

* Categorical variables:

  - What is their purpose?
    Storing label values such as names, genders, ..., any discrete-valued variables.

  - How are they generated?
      f1 <- factor(c("a","b","a","a","c","c"))
      f1
      class(f1)
      plot(f1)
      f2 <- factor(c("a","b","a","a","c","c"), levels=letters)
      f2
      plot(f2)
    Explain the difference between plot(f1) and plot(f2)!
    This is the main reason for the existence of the factor data structure.
    (An older and obsolete reason was storage efficiency.)

  - What would you expect for the following comparison?
      f1==f2   # ???...

  - Under the hood:
      attributes(f1)
      attributes(f2)
      levels(f1)
      levels(f2)
      class(f1)
      class(f2)
      c(f1)            # Coercion to numeric vectors, losing levels
                       # (in R >= 4.1.0, c() on factors returns a factor instead)
      as.numeric(f1)   # "
      as.integer(f1)   # "
      as.vector(f1)    # Coercion to character vectors, losing levels
      c(f2)
    Which of the following comparisons would you prefer?  Why?  ...
      c(f1)==c(f2)
      as.vector(f1)==as.vector(f2)
    Factors can be columns in dataframes!
      dfr <- data.frame(f1,f2)
      lapply(dfr, levels)   # (We haven't seen apply functions yet; stay tuned.)
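
  [Added illustration, not in the original notes]  Another way to see why the
  declared levels matter: tabulation, like plotting, keeps empty levels.
    f1 <- factor(c("a","b","a","a","c","c"))
    f2 <- factor(c("a","b","a","a","c","c"), levels=letters)
    table(f1)   # three levels: a, b, c
    table(f2)   # all 26 letters as levels, most with count 0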
* Distributions:

  - Uniform distribution:
      runif(100)
      punif(.5)
      qunif(.3)
      dunif(c(.2,-.2))

  - Other distributions: each has an 'r', 'p', 'q', 'd' version
      unif, norm, lnorm, chisq, f, t, cauchy, beta, gamma, geom, binom, pois, hyper
      rnorm(100)
      pnorm(.5)
      qnorm(.3)
      dnorm(c(.2,-.2))

  - Generate a Bernoulli with p=.25:
      # outcomes 1 or 0:
      rbinom(10, 1, 0.25)
      as.numeric(runif(10) < 0.25)
      # outcomes TRUE or FALSE:
      runif(10) < 0.25
      table(runif(1000) < 0.25)
      # outcomes "a" or "b":
      ifelse(runif(10) < 0.25, "a", "b")
      c("b","a")[(runif(10) < 0.25) + 1]   # brain teaser, but bad programming

  - Sampling from a finite set:
      sample(letters, 10)
      sample(letters[1:4], 10)
      sample(letters[1:4], 10, replace=TRUE)
      sample(5)   # What's this?
      sample(5, replace=TRUE)
      sample(letters, size=500, replace=TRUE)
      sample(list("a",1:2,c(T,F)), size=10, replace=TRUE)
      sample(list("a",1:2,c(T,F)), size=10, prob=c(0.6,0.3,0.1), replace=T)
      sample(letters, size=100, prob=length(letters):1, replace=T)
      table(sample(letters, size=10000, prob=length(letters):1, replace=T))
      # no need to normalize 'prob' to sum to 1

* Searching for patterns in string data:
  --- REGULAR EXPRESSION SYNTAX inside strings ---
      x <- paste0(sample(letters, size=1000, repl=T),
                  sample(letters, size=1000, repl=T))
      x
      # string search function:
      grep("a",x)           # integer positions where "a" occurs
      grep("a",x,value=T)   # vector of strings that contain "a"
      x[grep("a",x)]        # same
      grep("aa",x, value=T)
      # Examples of regular expressions:
      # pattern at the beginning:
      grep("^a", x, value=TRUE)
      # pattern at the end:
      grep("a$", x, value=TRUE)   # ending with "a"
      # Look for "a" or "b" or "c" followed by "x" or "y" or "z":
      grep("[abc][xyz]",x)
      grep("[abc][xyz]",x,value=T)
      # 'pattern1|pattern2' means 'match pattern1 or pattern2':
      grep("[abc][abc]|[xyz][xyz]",x,value=T)
      # matching any (really any) character: "."
      grep("a.", x, value=TRUE)   # same effect as grep("^a", x, value=TRUE)
      grep(".a", x, value=TRUE)   # same effect as grep("a$", x, value=TRUE): ending with "a"
      # matching a literal ".":
      grep("[.]", c("","abc","xyz."), value=TRUE)
      grep(".", c("","abc","xyz."), value=TRUE)   # wrong: "." matches any character
      # Exercise: construct a regular expression that matches identical character pairs
      pat <- ...
      grep(pat, x, value=TRUE)
      # Documentation for regular expressions:
      help(regexp)
      # More advanced and better structured:
      # Hadley Wickham's package 'stringr'

* Loops: AVOID THEM!!!!!!!!!!!!

  - DO NOT EVER DO WHAT IS DONE IN C AND JAVA:
      x <- runif(1000);  y <- 1:1000               # don't ever do this!
      tmp <- 0                                     # don't ever do this!
      for(i in 1:1000) tmp <- tmp + x[i]*y[i]      # don't ever do this!
    Instead:
      sum(x*y)   # Vectorized
    [Apart from looping, the three lines of code above are bad programming
     practice from another angle as well.  Namely?  Think about the
     consequences of changing 1000.]

  - Get used to VECTORIZATION, i.e., high-level functions that have implicit
    loops implemented efficiently in C:
      x <- runif(1000)
      sum(x)
      # Try this: x[1] <- NA; sum(x)
      mean(x)
      var(x)
      sd(x)
      min(x)
      max(x)
      range(x)
      mat <- cbind(1:3,11:13)
      rowSums(mat)
      colSums(mat)
      y <- 1:100
      cumsum(y)
      cumprod(y)
      cummin(x)
      cummax(x)
      diff(y)
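
  - [Added sketch, not in the original notes]  A quick timing comparison
    backing up the advice above; exact numbers are machine-dependent:
      x <- runif(1e6);  y <- seq_len(1e6)
      system.time({ tmp <- 0; for(i in seq_along(x)) tmp <- tmp + x[i]*y[i] })   # explicit loop
      system.time(sum(x*y))                                                      # vectorized; typically much faster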
  - Some legit uses of loops:

    . Animations:
        for(i in 1:300) plot(rnorm(1000),rnorm(1000))
        # Slow things down by sleeping:
        for(i in 1:50) { plot(rnorm(1000),rnorm(1000));  Sys.sleep(.25) }
        # Refine the plot to fix lims:
        lims <- 4*c(-1,1)   # Fix lims
        for(i in 1:50) {
          plot(rnorm(1000),rnorm(1000), xlim=lims, ylim=lims, pch=16)
          Sys.sleep(.5)
        }

    . Multiple plots:
        par(mfrow=c(2,2))   # Prepare 4 plotting frames, 2x2
        for(i in 1:4) plot(rnorm(100),rnorm(100))
        # Refine: lims, points, labels, reduce wasted margin space
        par(mfrow=c(2,2), mar=c(3,3,1,1), mgp=c(1.8,.5,0))
        N <- 100;  lims <- 4*c(-1,1)
        for(i in 1:4) plot(rnorm(N),rnorm(N), xlim=lims, ylim=lims,
                           xlab="x", ylab="y", pch=16)

    . Conditional execution involving 'side effects' such as printing:
        # Being lucky with probability 0.1 in 100 attempts:
        for(i in 1:100) if(runif(1)>.9) cat("You lucky guy!\n") else cat("Bummer!\n")
        # Waiting to be lucky with probability 0.1:
        while(runif(1)<.9) { cat("Not yet... \n") };  cat("Finally!\n")

  - 'Avoid Loops' practice: How would you compute the running maximum
    (the highest value seen to date) of a simulated discrete Brownian motion?
      rbinom(100,1,.5)                       # Build it up: 0/1 with prob=0.5
      2*rbinom(100,1,.5)-1                   # +-1 with prob=0.5
      cumsum(2*rbinom(100,1,.5)-1)           # Discrete Brownian motion process
      cummax(cumsum(2*rbinom(100,1,.5)-1))   # Cumulative max process
    Better to plot:
      par(mfrow=c(2,2), mgp=c(1.8,.5,0), mar=c(3,3,1,1))
      for(i in 1:4) {
        tmp.process <- cumsum(2*rbinom(10000,1,.5)-1)
        plot(tmp.process, pch=16, type="l", xlab="Time", ylab="Process")
        lines(cummax(tmp.process), lwd=3, col="blue")
      }

  - 'apply' functions are generally preferred for shorter code,
    but they are not faster than loops:
      apply(), tapply(), sapply(), lapply(), ...
      apropos("apply")   # find all apply functions
    A function the instructor uses a lot: sapply()
      df <- data.frame("a"=1:5, "b"=letters[1:5], "c"=LETTERS[1:5], "d"=c(T,F,T,F,T))
      sapply(df, class)
      sapply(df, is.numeric)
      sapply(df, function(x) {
        if(is.logical(x) | is.numeric(x)) { mean(x) } else { table(x) }
      })

  - Vectorization of matrix operations:

    . Matrices and arrays are really vectors with a dimension attribute:
        mat <- matrix(1:12, nrow=4)
        mat
        dim(mat)
        attributes(mat)
      # ==> The vector 1:12 was filled into the 4x3 matrix column by column.
      # ==> Matrices are stored in 'column-major order'.
      #     This is inherited from Fortran and convenient for statistics.
      #     The C language stores matrices in 'row-major order': mat[i][j]

    . Treating matrices as vectors:
        mat * rep(1:3,each=4)
        mat * rep(1:4,3)
        mat + c(0,100)   # What does this do?

    . Time comparisons for different vectorizations:
        # Task: Given a large matrix, multiply a constant into every row or column
        N <- 3000;  mx <- matrix(rnorm(N^2), N, N);  fac <- 1:N
        options(digits.secs = 6)   # Increase the precision of time reporting.
        Sys.time();  dim(diag(fac) %*% mx);   Sys.time()   # multiply every row with a constant
        Sys.time();  dim(fac * mx);           Sys.time()   # same but faster
        Sys.time();  dim(mx %*% diag(fac));   Sys.time()   # multiply every column with a constant
        Sys.time();  dim(t(fac * t(mx)));     Sys.time()   # same but faster; t() can be slow
        Sys.time();  dim(mx * rep(fac,each=N));  Sys.time()   # same but faster

* Plotting:

  - Many plotting functions in Base R to build up complex plots:
    'plot()', 'text()', 'lines()', 'points()', 'abline()', 'polygon()',
    'rect()', 'legend()', 'grid()', ...
    These are extremely efficient.  E.g., one can plot 10,000 rectangles
    in one single call to rect().

  - Principle:
    -----------------------------------------------
    | iterative refinement and debugging of plots |
    -----------------------------------------------

  - Example:

    . Fine tuning a single plot:
        x <- runif(1000);  y <- x^2 + rnorm(1000, s=.1)
        plot(x, y)
        # add pch, cex, main

    . Multiple plots:
        par(mfrow=c(2,3))
        for(i in 1:6) {
          x <- runif(1000);  y <- x^2 + rnorm(1000, s=.1)
          plot(x, y, pch=16, cex=.5, main=paste("Plot",i))
        }
    . New window of given size, for reproducibility:
        windows(width=8, height=6)   # Windows-only; on macOS use quartz(), on Linux X11(), or dev.new()
        par(mfrow=c(2,3), mar=c(3,3,3,1), mgp=c(1.8,.5,0))   # Plot parameters
        for(i in 1:6) {
          x <- runif(1000);  y <- x^2 + rnorm(1000, s=.1)
          plot(x, y, pch=16, cex=.5, main=paste("Plot",i), ylim=c(-.3,1.3))
        }

    . PDF device driver for inclusion in papers:
        pdf(file="My First R Plot.pdf", width=6, height=4.5)
        par(mfrow=c(2,3), mar=c(3,3,3,1), mgp=c(1.8,.5,0))
        for(i in 1:6) {
          x <- runif(1000);  y <- x^2 + rnorm(1000, s=.1)
          plot(x, y, pch=16, cex=.5, main=paste("Plot",i))
        }
        dev.off()

  - In standard R, plots are not objects.  For plots as objects, check
    . Hadley Wickham's ggplot2 package,
    . Paul Murrell's grid package,
    . Deepayan Sarkar's lattice package.
    All these authors wrote books.

* MANDATORY RULES for statistical plots/graphics/visualizations:

  - Do not ever submit a plot generated without fine-tuning!
    Plots require time and effort and thought.
    Example: Behold the following disastrous series of plots of response
    versus predictors in the Boston Housing data, using R defaults!
      site <- "http://stat.wharton.upenn.edu/~buja/STAT-961"
      boston <- read.table(paste(site,"boston.dat",sep="/"), header=T)   # data
      summary(boston)
      par(mfrow=c(3,5))
      for(j in 1:13) plot(boston[,c(j,14)])

  - Avoid postage stamps!
    . Use the available space left to right (6.5 inches in US letter size).
    . Make margins between multiple plots reasonably narrow.
    . Move axis labelings closer to the axis than is default.
    Example:
      windows(h=9, w=15)
      par(mfrow=c(3,5))
      for(j in 1:13) plot(boston[,c(j,14)])   # postage stamps
    Now refine:
      par(mfrow=c(3,5), mgp=c(1.8, 0.5, 0.0), mar=c(3,3,1,1))   #!!!!!!!!!!!!!!!!
      for(j in 1:13) plot(boston[,c(j,14)], pch=16, cex=.9)
    The par() arguments have the following meanings:
      -----------------------------------------------------------------
      | mgp: Move the axis labels and numbers closer to the axis.      |
      | mar: Narrow the margins (num lines: bottom, left, top, right). |
      -----------------------------------------------------------------

  - What is still wrong with the plot default of R: insufficient space around the plot!
      r.stretch <- function(x, fac=1.2) { z <- range(x,na.rm=TRUE);  m <- mean(z);  m + (z-m)*fac }
      par(mfrow=c(3,5), mgp=c(1.8, 0.5, 0.0), mar=c(3,3,1,1))   #!!!!!!!!!!!!!!!!
      for(j in 1:13) {
        plot(boston[,c(j,14)], xlim=r.stretch(boston[,j]),
             ylim=r.stretch(boston[,14]), pch=16, cex=.9)
      }

  - Plot contents: point scatters, time series, ...:

    . Learn the available glyphs in R; the default circles are often suboptimal.
      Aesthetically, they look moth-eaten and often have low visual impact.
        plot(runif(200))                    # Bad: Plot content has no visual impact.
        plot(runif(200), pch=16)            # Better.
        plot(runif(200), pch=16, cex=1.2)   # Even better.
        plot(1:25, pch=1:25, cex=2)         # Overview of all symbols/glyphs
        plot(1:25, pch=letters[1:25], cex=2)
        plot(runif(100000))                 # versus:
        plot(runif(100000), pch=".")

    . Two important rules:
      (1) Force more margin around the content than is default.
      (2) Force a meaningful aspect ratio on plots.

      ~ Bad display of plot contents:
          par(mfrow=c(3,4), mgp=c(1.8, 0.5, 0.0), mar=c(3,3,1,1))
          for(j in 2:13) plot(boston[,c(j,14)])
        ==> Extreme points fall too close to the borders.
            The aspect ratio improperly reflects the simplest structure, here: correlation.
        ==> Create more space around the point scatters.
            Create an approximately square aspect ratio on the plotting regions.
      ~ Example:
          stretch <- function(x, fac=1.1) { m <- mean(x);  (x - m)*fac + m }
          windows(h=9, w=15)   # height:width ~ mfrow parameters
          ## for publication:
          # pdf("My second R plot.pdf", h=9, w=15)   # height:width ~ mfrow parameters
          y <- boston[,"MEDV"]
          ylim <- stretch(range(y, na.rm=T), fac=1.2)
          par(mfrow=c(3,5), mgp=c(1.8,.5,0), mar=c(3,3,1,1), oma=c(0,0,3,0))
          #                                  ^^^ inner margins ^^^ outer margin for title
          for(j in 1:13) {
            x <- boston[,j]
            xlim <- stretch(range(x, na.rm=T), fac=1.2)
            plot(boston[,c(j,14)], pch=16, cex=.8, xlim=xlim, ylim=ylim)
          }
          title("Boston Housing Data: Response versus Predictors", outer=T, cex.main=2)
        ==> Publication quality figure when rendered in PDF.

    . "Banking rule" when slope matters, according to Bill Cleveland:
      The important slope should be about 45 degrees steep.
      This applies in particular to quantile plots.

#---------------------------------------------------------------- end of class 7

* R Packages: Software contributed by the R community, deposited at the CRAN site

  - List of all packages: requires your OS (linux, macos/macosx, windows, windows64)
    and R version
      pckgs.all <- available.packages("http://cran.r-project.org/bin/windows/contrib/3.4")
      str(pckgs.all)
      # number of packages in various years:
      # 2012/09/11:  3956
      # 2013/09/04:  4723
      # 2017/09/08: 10708
      # 2019/09/23: 13866
      colnames(pckgs.all)   # Only the 1st column is needed
      rownames(pckgs.all)
      # Examples of searching for package names:
      grep("ace", rownames(pckgs.all), val=T)    # the ACE package
      grep("^ace", rownames(pckgs.all), val=T)   # the ACE package
      grep("sand", rownames(pckgs.all), val=T)   # a sandwich package

  - Install a particular package, e.g., 'leaps' for all-subsets regression
    and 'acepack' for ACE regression.
    Can be done with point-and-click in RStudio, or in Rterm:
      repos <- "http://cran.r-project.org/"   # For installing packages
      install.packages("leaps", repos=repos)
      install.packages("acepack", repos=repos)
    [Say 'yes' to the 'personal library' question;
     choose a site at CMU in PA (e.g.) if asked.]

#================================================================

- BABY STATISTICAL INFERENCE --- A THEORY OF PARALLEL UNIVERSES

* How do we arrive at stderrs, CIs, statistical tests in the simplest case?
  Most fundamentally, we make stochastic assumptions/models; e.g., i.i.d. sampling.
  Would this assumption apply to a time series of stock quotes?  Their financial returns?
  Would it apply to the Boston Housing data?  Cases = census tracts?

* Imagine observing N objects (e.g., sample students and measure height)
    X1,...,XN i.i.d. with existing expectation mu and variance sigma^2
    => E[Xi] = mu,  V[Xi] = sigma^2,  C[Xi,Xj] = 0  (i != j)

* Realize: If you repeated the data collection, you would get different values X1,...,XN.

* Summary statistic: the mean
    Xbar = mean(X1,...,XN)
  For a given dataset, you obtain one mean value.
  Simulation of a 'dataset':
    N <- 100
    X <- rnorm(N, m=69, s=5)
    mean(X)
    sd(X)

* But the mean would have been different for another dataset.
  => The mean is a random variable, even though observed only once.
  The point is: We could at least in principle repeat the data collection
  to obtain another value of the observed mean.
  => For illustration, run the above code multiple times and observe how
     mean(X) and sd(X) change from 'dataset' to 'dataset'.
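
  [Added sketch, not in the original notes]  A quick way to 'rerun' the data
  collection a few times and watch mean(X) and sd(X) change (a legitimate use
  of a loop: printing side effects):
    N <- 100
    for(r in 1:5) {
      X <- rnorm(N, m=69, s=5)   # a fresh 'dataset' each time
      cat("dataset", r, ": mean =", round(mean(X), 2), " sd =", round(sd(X), 2), "\n")
    }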
* Question: How much does the mean vary from DATASET TO DATASET?
  => Variance expansion:
       V(Xbar) = ( sum_i V(Xi) + sum_{i!=j} C(Xi,Xj) ) / N^2 = sigma^2 / N
       (the covariance terms vanish under i.i.d. sampling)
  => sigma(Xbar) = sigma/sqrt(N)
     true standard error: the dispersion of a statistic across datasets

  If we did not know this formula, what R code could we use to inform us
  about the variability of mean(X1,...,XN)?
    N <- 100
    sim <- replicate(10000, { X <- rnorm(N, m=69, s=5);  c(m=mean(X), s=sd(X)) })
    dim(sim)
    apply(sim, 1, sd)

* Statistical question: Can we estimate sigma(Xbar) from ONE dataset?
  YES, because
  . sigma(Xbar) = sigma/sqrt(N), and
  . we can estimate sigma with s:
      s = sqrt( sum_i (Xi - Xbar)^2 / (N-1) ) = sd(X1,...,XN)
  => Estimate of sigma(Xbar) is s/sqrt(N)

* Statistical inference consists of approximate probability guarantees about
  estimates, where the probabilities are
    --------------------------------------------------
    | LIMITING RELATIVE FREQUENCIES ACROSS DATASETS. |
    --------------------------------------------------
  The two components that typically make such statements possible are
  . standard error estimates and
  . either of the following:
    + exact distributional assumptions that are analytically tractable, or
    + the central limit theorems and laws of large numbers.

* Central limit theorem:
    (Xbar - mu)/sigma(Xbar) gets ever more N(0,1) distributed as N --> Inf
  => mu   +- 2*sigma(Xbar) contains Xbar for ~95% of datasets
     Xbar +- 2*sigma(Xbar) contains mu   for ~95% of datasets   !!!!!!

** Strangeness of the CLT: In most cases it speaks about a "fraction of
   worlds/datasets" because we usually see only one mean of one dataset.
   (Exception: quality control)

* Distinguish 'confidence intervals' from 'retention intervals'!!
  . Confidence intervals (CI): estimate sigma(Xbar) with stderr = s/sqrt(N)
      Xbar +- 2*s/sqrt(N) contains the true mu for ~95% of datasets
  . Testing: Is the assumption mu=mu0 reasonable in light of the data X1,...,XN?
    Retention intervals (RI):
      mu0 +- 2*stderr contains Xbar for ~95% of datasets
    If Xbar is outside, reject mu0 because it makes the observed Xbar too unlikely.

* What are the probability guarantees for RIs and CIs?
  . RI: P[ Xbar not in RI(mu0) ; mu0 true ] ~ 0.05   # rejection of mu0: Xbar too far from mu0
  . CI: P[ mu in CI(Xbar) ; mu is true ] ~ 0.95      # for any mu
  (A simulation sketch checking the CI coverage appears at the end of this summary, below.)

* Generalization to statistics other than means and mu:
  . Standard errors and their estimates can be obtained analytically or
    computationally for most statistics.
  . A CLT holds ALMOST invariably for well-behaved statistics;
    this is called "asymptotic normality".
  => Statistical inference with CIs and tests can be performed:
       theta.est +- 2*stderr(theta.est) contains the true theta for ~95% of datasets.

** Standard error ESTIMATES are obtained from a single dataset.
   => We think we can say something about the variation of Xbar
      - variation of Xbar from dataset to dataset -
      BASED ON A SINGLE DATASET!
      [What assumption(s) are we leveraging?]

* History: Neyman, Fisher, ... a huge 20th century achievement
  . Fisher got CIs wrong with his "fiducial intervals."
    [However, there is a thriving line of research giving meaning to "fiducial inference".]
  . Bayesians today compete with their 'posterior credible intervals.'
    Critique of Bayesian approaches:
    ~ They tend to be "model-trusting."
    ~ They would like to be "calibrated" for good frequentist performance.
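
  [Added sketch, not in the original notes]  Checking the ~95% coverage claim
  for Xbar +- 2*s/sqrt(N) from the CI discussion above, by simulating many
  datasets from a known mu:
    N <- 100;  mu <- 69;  sigma <- 5
    covered <- replicate(10000, {
      X <- rnorm(N, mean=mu, sd=sigma)
      half <- 2*sd(X)/sqrt(N)                       # 2 * estimated standard error
      (mean(X) - half < mu) & (mu < mean(X) + half)
    })
    mean(covered)   # fraction of 'datasets' whose CI contains mu; should be close to 0.95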
Hence "variability across datasets" means datasets that have the same x-values but different y-values. (In a simulation, keep x values fixed, simulate only y values.) . There exists a math-stat argument that says conditional inference is valid unconditionally (across datasets with varying x). Stats students: Look out for the notion of "ancillarity" in math stat to justify conditionality on x. . Hitch: The ancillarity argument assumes the model to be correct. ==> The fixed-x linear models theory we will examine next is also model-trusting, like Bayesian parametric approaches. ==> Importance of regression diagnostics! ** Strangeness of Time series and spatial data: - Examples: . Facebook stock price time series, US GDP time series . Boston Housing data about median housing prices in 500+ census tracts HOW WOULD ONE OBTAIN REPLICATIONS OF THESE DATA SETS? WHERE IS THE RANDOMNESS? Unlike data from experiments, clinical trials and survey sampling, we can''t obtain other datasets, not even in principle. - Proposal of frequentist interpretation: "POSSIBLE-WORLD SEMANTICS" Imagine a different Facebook stock price or GDP trajectory. Imagine a different Boston with 500+ census tracts. ==> Frequentism is purely imaginary. In our mind we are "replaying the movie of the world over and over, but every time it's different." Like in the movie "Groundhog Day." Homework: Watch it! ==> Frequentism seems less satisfactory. - Then again, the idea of good performance of data analytic procedures forces us to imagine "situations just like the actually observed one," even in "spatio-temporal" data. If someone in class finds a better way of formalizing statistical uncertainty and statistical performance, please, let the instructor know!!! It might revolutionize the foundations of statistics. ================================================================