################################################################ # # # INTRO TO R # # The R website: # # http://cran.r-project.org # # (Google "R" -> one of the first entries) # # Downloading R: # # -> Sidebar "Download, Packages": CRAN # -> any site in the US # -> Windows # -> base # -> "Download R-2.... for Windows (32 bit build)" # -> installation dialog in setup wizzard # # The setup program should self-install and create an icon on your desktop. # Clicking the icon should open up an R interpreter window ("R Gui"). # # The base is really just the base. There are many contributed # library packages whose binaries can be downloaded from # # -> Packages # # You will not have to download them explicitly, though; # there are R functions that allow you to get them while running R. # In the R Gui you can also go through the "Packages" item in the toolbar. # ################ # # OPERATION OF R: # # * For non-Emacs users: # # 0) go to the class web page and download this file: # www-stat.wharton.upenn.edu/~buja/STAT-541/Stat-541-R-intro.R # Open an editor (Word, Wordpad,...) on this file (change .R to .txt) # AND open up an R GUI window by clicking on the R icon. # Reduce the size of both windows so both are accessible. # # 1) Copy R code from this file into the R interpreter window. # Use shortcuts: In the editor highlight lines, hit -C, # then move to the R window and hit -V. # Examples: 1+2 1:10 2^(1:20) runif(10) rnorm(10) 1:10 + rnorm(10) # # 2) Experiment with R code # by editing THIS file in the editor window, or # by editing the command line in the R window (if it's one-liners). # # Commands for line editing in the R interpreter window: # Note: "^p" means you hold down the modifier key and hit "p", # just like the modifier key used for capitalization # ^p get back the previous command line for editing and executing # repeating ^p goes further back in the command history # ^b step back one character in the command line # ^f step forward one character in the command line # ^a move to the beginning of the line # ^e move to the end of the line # ^d or to delete the character under the cursor # ^h or to delete the previous character # ^k kill the rest of the line starting from the cursor # otherwise: you are in insert mode # (These editing commands are borrowed from the Emacs editor.) # # # * For Emacs users: # # download the ESS macros ("Emacs Speaks Statistics") from r-project.org: # -> R GUIs -> Emacs (ESS) -> Downloads # Download the latest zip or tar.gz file. # Unpack and install; ESS should work right away. Skip to "Operation:" and try. # If it doesn't work right away, # you may have to put these lines in your .emacs file: ## (setq inferior-R-program-name "c:/Program Files/R/R-2.7.1pat/bin/Rterm.exe") # ^^^^ path to your R executable ^^^^ ## (load-file "c:/EMACS/ESS/ess-5.3.0/lisp/ess-site.el") # ^^^path to the file "ess-site.el"^^^ # # Operation: # - Split the Emacs window into two windows: # ^x 2 # - Edit THIS file in the upper window. # ^x ^f filepath # - Start R in the lower window: # ^x o (move the cursor to the lower window) # -x R (start R inside Emacs) # - If you like to shrink one of the windows, put this line in your .emacs: # (global-set-key "\M-o" (lambda () (interactive) (shrink-window 1))) # Then -o will shrink the present window and expand # the other window by one line. # - There are macros to copy and execute lines, functions, and regions # from the upper buffer into the lower buffer: # ^c ^j (execute current line and leave the cursor in place) # ^c ^n (execute current line and move to next line of R code) # ^c ^f (execute function, assuming the cursor is # inside its body) # ^c ^r (execute region) # - A small nuisance is that the lower (R) window does not move # the bottom line to the center after executing an expression. # This can be fixed by putting the following in your .emacs: # (global-set-key "\M-s" "\C-xo\M->\C-l\C-xo") # Then -s will move the bottom line to the center. # # ################## # # # * R is an INTERPRETED LANGUAGE: # Users type expressions and see results immediately. # Example: for(i in 1:10) { if(i%%2==0) print(i) } # As opposed to: # - ... languages (C, Fortran) # - ... software (such as SAS' JMP) # # # * R is HIGH-LEVEL: # It operates on complex data structures such as # vectors, matrices, arrays, lists, dataframes, # as opposed to C and Fortran that operate on individual numbers only. # (This requires some getting used to for C programmers.) # # # * PRIMARY BEHAVIOR: Whatever is typed, print the results. 2 print(2) # same "a" print("a") # same # (Q: Why is there '[1]' preceding the results? A: ...) # Vector of length greater than 1: 1:3 print(1:3) # same # # # * SYNTAX: # - Largely scientific/math notation; base 10. # - A wealth of functions. # - Comments run from a "#" to the end of the line; no multiline comments. # - Spaces are irrelevant, except inside strings: 2+3; 2 + 3; "ab"; "a b" # - Statements can run over multiple lines: 2 + 3 + # \ 4 # / One statement # But if a statement is syntactically complete at # the end of the line, it won't continue: 2 + 3 # \ + 4 # / Two statements # - Statements can be separated by ";". 2; 3^3; sqrt(9) # #--- # # * BASIC DATA TYPES: # # # - NUMERIC: double precision by default (How many bytes?) # Integers are represented as doubles, although the print function # shows them as integer: -2.000 1E5 2E-3 # The usual unary and binary operations and analytic functions: # +, -, *, /, %%, %/%, ^, log, sqrt, sin, acos... 2+3 # Add. 5.3*1E10 # Multiply. 10%%3 # Modulo. exp(1) # Exponentiation. log(2.718282) # Log of the number 'e'; 'log' is e-based. log10(10) # 10-based log pi # That's the number 'greek pi', 3.14159 sin(pi/2) # Angles are to be given in arcs, not degrees. sin(pi) # Dito. acos(0) # This is the inverse of cos, arccos, hence pi/2. pi/2 # This is the only hardwired constant: 3.14159... # # # - STRINGS: can be single or double quoted, but the print function # uses double quotes. 'a'; "a"; 'abc'; "abc" # (In C and Python strings are character vectors. # In R strings are basic types; there is no single character type. # Characters are just strings of length 1. # There is no indexed access to individual characters and # substrings in R; one uses the "substring" function instead: substring("abcxyz",4,6) # Other basic string manipulations: paste("my","word") nchar("Supercalifragilistikexpialidocious") # There are two hardwired character vectors that contain the lower and # upper case letters: letters LETTERS # # # - LOGICAL values: have two names each, but the print function # always uses the longer. TRUE; FALSE; T; F # They are implemented as the values 1 and 0 for T and F, respectively. # They are the result of the usual comparisons: <, >, <=, >=, ==, != 1<2; 1>2; "ab" <= "abcd" "ab" > "ac"; "ab" != "AB" "ab" != 2; 0==F; 1==T # # # - MISSING values NA, Inf, -Inf: NA; NaN; Inf; -Inf; 1/0; Inf==1/0; 0/0 # Watch out: the following does not give T!!! NA==1 # If you want to test for NA, you must use the function is.na(): is.na(NA) # # # - FUNCTIONS: # * R is a FUNCTIONAL LANGUAGE: # Functions return values that in turn can be arguments to functions. # Expressions evaluate inside out, e.g., log(2*2.5))^3: 2.5; 2*2.5; log(2*2.5); log(2*2.5)^3 # # # * STATEMENTS/EXPRESSIONS: # There are two types of expressions: assignments and side effects. # 1) Assignments allocate data structures and # make variables point to them. x <- 1:3 # Allocate a vector 1,2,3 and make 'x' point to it. # 2) Side effects are essentially display operations # such as printing, plotting, showing help; unlike assignments, # they don't change the computational state of the R system. x print(x) plot(x) help("+") # Show the help page of addition. help(sqrt) # Show the help page of the square root function. help("sqrt") # Dito. # 3) Composite Statements: {print(1); plot(1:10)} # Will be needed in loops and functions. # # # - Assignments to variables come in four equivalent syntactic forms: # x <- ... # x = ... # ... -> x # assign("x",...) # Examples: x <- sqrt(2 * 3) # Root of product of 2 and 3 x = sqrt(2 * 3) # Both can be used: '=' and '<-' sqrt(2 * 3) -> x # This can be used, too, if you must... y <- c(1,3,10,1,1,1111,0,1,1) # combine 1,3,10... into a vector 'y' z <- 1:3 # Assign the vector containing 1,2,3 to a 'z'. assign("x", sqrt(2*4)); print(x) # Note that variables jump into existence upon assignment. # Unlike C and Fortran, there is no need to declare variables. # The variables are not 'typed', that is, any variable can # point to data of any type, such as numbers, logicals, vectors,... # #--- # # * HELP: help(function) or help("function") shows function documentation. help(sqrt) # (Emacs users: call help.start() before using help.) # In the output of this function, check out the section # with header "See Also:". It will tell you that you # can find related functions by calling help("Arithmetic") help("log") help("sin") help("Special") # help(c) help("c") # Same as help(c) help("*") # help(*) does not work # # * APROPOS: apropos("char") lists all functions whose name contains # the string "char". apropos("char") # This is often useful for finding related functions. # Apropos combined with the section "See Also:" in the output # of help() is a powerful tool for searching functions. # There are about 1,700 built-in functions, and more if you # download special-purpose packages from the R website. # # * Printing a function: allows you to see the arguments in a simple way runif # same as: print(runif) rnorm # (functions are "first class citizens", like numbers, vectors,...) # #--- # # * MANAGEMENT OF DATA AND FUNCTIONS: # - Listing R objects, both data and functions: either of ls(); objects() # This lists all data structures and functions that YOU defined. # - Removing data and functions: x <- 1:10 rm(x) x # - Looking for partly remembered data and functions: # In case you remember only part of a name, you can look it up # with a partial pattern: xx <- 10 ls(pattern="x") # This will list any dataset and function whose name contains "x" # such as 'last.warning'. # - List all functions that come with the base package of R: ls("package:base") # Over 1,100 functions... # - About packages: # . Packages are namespaces for data AND functions. # (You can't have a dataset 'x' and a function 'x' at the same time.) # . You can list the packages in your environment: search() # . When you use a name, R goes through the search list # shown by 'search()', package by package, stopping when # it finds the name. This implies that the same name can appear # in multiple packages, but only the one in the first package # will be found. ls <- 2:5 # mask 'ls' in "package:base" with user data ls rm(ls) # remove user data, unmasking the function 'ls()' ls # #--- # # * QUITTING: q() # R asks whether you want to save the workspace; # usually you say "yes". Splus simply quits. # #--- # # * SEMANTICS: # Every assignment creates a copy of the assigned object. # Assignment is by value, not by reference (unlike Python and C). a <- c(1,3,5) # a b <- a # 'b' gets its own copy. b # We couldn't tell from this, though. a[1] <- 2 # Assign the value 2 to the first element of 'a'. # This yields a test of whether 'a' and 'b' point to the same object. # If they did, then 'b' would also have 2 in the first position. a # We know this. b # Uhu! 'b' was not changed by 'a[1] <- 2'. # Therefore, 'b' has its own copy. # #--- 2008/09/04 # * SYNTAX: see help(Syntax) # # * VECTORS: # A vector is a sequence of items of the SAME basic data type. c(1,3,4.5) # Collect three values in a vector. c("a","ab") # Collect two strings in a vector. c(T,F,T) # Collect three logical values in a vector. c(2.1,T) # Not an error. Coercion of T to 1. c(2,"a",T) # Not an error. Coercion of 1 and T to strings. # If the items are not of the same type, they are coerced: # string <-- numeric <-- logical # (If the items are of variable types and should not be coerced, # use lists instead of vectors. See below.) # #--- # # * INDEXING AND SUBSELECTING VECTORS: # # R/S, among all languages, has probably the most powerful set of # tools for getting at elements of vectors: # * selection/mapping with positive integer indeces # * delection with negative integer indeces # * selection with logical indeces # * selection by name when vector entries are named ("associative array") # - Numeric indexes: ONE-based (unlike C, but like Fortran) a <- c(1,3,5) a[1]; a[2]; a[3] # (This is unlike Python and C which use ZERO-based indexing.) # - Vector indexes: a[c(1,3)] # or, equivalently, except for the allocation of another vector 'b': b <- c(1,3); a[b] # - Vector expansion: amounts to mapping the indexes, using 'a' as a map. a[c(1,2,1,2,1,1,1,3,3,3)] # - Exclusions with negative numeric indexes: a[-1] d <- a[-1] a[c(-1,-3)] b <- c(-1,-3); a[b] # dito a[c(-1,-2,-3)] # Nothing left. a <- a[-1] # Actually remove the first element of 'a'. # - Logical selection: a <- c(1,3,5) a[c(T,F,T)] b <- c(T,F,T); a[b] # dito a>2; a[a>2] # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! b <- (a>2); a[b] # dito # Caution: If the index vector is not of equal length, # it will be cyclically repeated: a[F] # c(F,F,F) 'F' is repeated 3 times a[T] # c(T,T,T) a[c(T,F)] # c(T,F,T) a[c(T,T,F,F,F)] # If too long, the index vector is truncated. (1:12)[c(T,T,F)] # Leave out every third item. # (The above scheme can be used to create arbitrary repeat patterns.) # - Vectors can be indexed repeatedly: a[c(1,3)][2] # Select item two of a[c(1,3)], i.e. item 3 of 'a'. # (Looks like a matrix element in C, but isn't!!) (a[c(1,3)])[2] # This is what the previous expression really means. # Think of a[c(1,2)] as the result of some selection function. a[c(1,3)][c(1,2,1,2)] # - Vector indexing and subsetting can be used for assignment: a[1:2] <- c(10,20); a # Print 'a' to see the effect of the assignment. a[c(1,3)] <- c(100,200); a a[-2] <- c(1000,2000); a a[c(F,T,T)] <- c(10000,20000); a a[2:3] <- 0; a # "0" is repeated to fill both vector elements. b <- 1:10 b[3:6] <- c(10,20); b # "c(10,20)" is cyclically repeated. # If the length does not divide, there is a warning message, # but cyclical fill-in is done anyway. # #--- # # * SOME FUNCTIONS THAT CREATE VECTORS # # - Manual entry of a vector: x <- c(-1,2,5) # # - Equispaced sequences of numbers: 3:10 10:3 seq(3, 10) seq(3, 10, by=1/3) # Third argument is names "by" seq(3, 10, len=8) seq(letters) # List of indexes into 'letters' # # - Repetitions: rep(3, 10) rep(1:3, 5) # Here is something more complex that "rep" can also do: rep(c(1,2,3,4), c(2,3,2,4)) rep(1:3, rep(2,3)) # # - Logical values: rep(T,5) ((1:10) > 5) (1:10)[(1:10) > 5] # # - Random numbers: x <- runif(5); x # Five uniform random numbers in [0,1]; see below. y <- rnorm(5); y # Five standard normal random numbers; see below. # # - Random permutations and samples: x <- sample(10); x # Random permutation of the numbers 1,2,...,10. sample # can also do sampling with replacement... sample(letters, 10, replace=T) # - Read a vector from file: The file 'sp.dat' is a very long time series; # download it from the class website. x <- scan("DATA/sp.dat", n=1000) # First n values x <- scan("DATA/sp.dat") # Reads 734656 numbers; may take a few sec. # 'x' contains 8 bytes for each element, hence this many MB: 8*length(x)/2^20 # (2^20 equals one MegaByte) object.size(x) # size in bytes; why is it not exactly 8*length(x)? # You may not want to keep this 'x' around because of its length: rm(x) # Btw, if you run out of memory, try this: options(memory=1E10, object.size=1E10) # #--- # # * AUTOMATIC VECTOR EXTENSION: # # Array-bound errors for vectors do NOT exist for positive indexes!!! # Vectors are shown as NA if indexed out of bounds, and # automatically extended if assigned out of bounds. x <- runif(5) # x has length 5. x[6] # NA, not an error message. x[10] <- 9.99 # Not an error! Element 10 now exists. x # So do elements 6,7,8,9, filled in with NA. length(x) # Assignment can extend the length. # However, negative indexes (used for exclusion) can be out of bounds: x[-11] # Out of bounds. x[-9] # Ok, due to fill-in after assigning element 10. # Automatic vector extension makes vectors very different from # matrices and arrays, for which automatic extension does not exist. # #--- # # * NAMED VECTOR ELEMENTS: # # - Vector elements can be named: x <- c(first=1,second=3,third=5,fourth=7); x # - Equivalently, naming elements can be performed in a separate step # by assigning the "names()" attribute: x <- c(1,3,5,7) names(x) <- c("first","second","third","fourth") x names(x) # - Element names can be used for indexing/subsetting: x["second"] x[c("second","fourth")] nm <- c("second","third","second","third"); x[nm] # - Named assignment/extension is possible: x[c("fifth","fourth")] <- c(10,11); x # Note: "fourth" existed and is re-assigned; # "fifth" did not exist and is an extension of "x". # - If names are not unique, the first matching name is selected: c(a=1,b=2,a=3)["a"] # - Example of automatic use: 'table()' x <- sample(letters,size=1000,replace=T) table(x) # # #--- # # * VECTORS WITH NAMED ELEMENTS AS ASSOCIATIVE ARRAYS: # # R's vectors with named elements are really a limited form # of 'associative arrays'. Associative arrays are best explained # in terms of an example: # Assume you want to look up, say, salaries of persons given by name. # Assume that salaries and names are stored in parallel vectors: slry <- c(35000, 100000, 12000) nm <- c("John", "Bill", "Jim") # Turn "slry" into an associative array by giving its elements names: names(slry) <- nm # Salaries are now "associated" with names for lookup by name, # the names are the "keys", the salaries the "values". # For example, Bill's salary is: slry["Bill"] # "Bill" is the key, slry["Bill"] the value. # # Another way to create the same: names arguments to c() slry <- c(John=35000, Bill=100000, Jim=12000) slry # # NOTE: If one wants to use numbers as keys, they have to be # converted to strings first. The conversion happens automatically # through coercion, as in names(slry) <- c(10,20,30); slry names(slry) # Thus the keys are now the strings "10", "20", "30": slry["20"] # Caution: In lookups, numbers are NOT coerced to strings, # because numbers act as indexes into the vector x <- seq(10,40,by=10) names(x) <- c(2,1,4,3) # Coerced to strings: "2","1",... x[2] # Second element of 'x'. x["2"] # Same as x[1], which has the name "2". # # #--- # # * RANKING, SORTING, REVERSING, PERMUTING: x <- c(5:1,5:10); x rank(x) sort(x) rev(x) sample(x) # random permutation of 'x'; can also sample(x, size=100, replace=T) # random sample of size 100 w. replt. sample(x, size=100) # why does this not work? # Here is one of the more important functions: order(x) # !!!!!!!!!!!!!!! sort(x) x[order(x)] # Same! # Sorts also character data lexicographically: x <- sample(letters); x # permute letters sort(sample(letters)) # This is how you perform parallel sorting: x <- runif(10) y <- -x - 100 # 'y' is just another vector of the same length as 'x'. x; y # Unordered parallel vectors ord <- order(x) x[ord]; y[ord] # Sorts both "x" and "y" in ascending order of "x". # # #--- # # * SIMPLE STATISTICS: length(x) sum(x) mean(x) var(x) sd(x) min(x) max(x) range(x) median(x) # #--- # # * CUMULATIVE OPERATIONS: x <- 1:10 cumsum(x) cumprod(x) x <- 1:10 * c(1,-1) cummax(x) cummin(x) # #--- # # * SIMPLE NUMERIC FUNCTIONS/TRANSFORMATIONS: # Most functions that accept one number and return another number # will naturally "vectorize" in R, namely, apply element by element. x <- runif(20, min=-10, max=+10) round(x) trunc(x) ceiling(x) abs(x) sqrt(x) # Comment? log(x^2) exp(1) exp(x/100) cos(pi) # "pi" is predefined; the number e=exp(1) is not. acos(0.5) # What is 'acos()'? # # ################ # # * MATRICES: # Matrices in R are vectors with additional "meta-information" # to structure them in a rectangular form. # The elements of the vector fill the matrix column by column. # ==> COLUMN MAJOR ORDER, as in Fortran, but unlike in C. # Reformatting as a matrix is achieved by giving the vector # a dimension attribute consisting of the numbers of rows and cols. # # - Reformatting vectors as matrices by filling successive cols or rows: matrix(1:12, ncol=4) # Column major (default) matrix(1:12, nrow=3) # Same; ncol is inferred matrix(1:12, ncol=4, byrow=T) # Row major, forced with "byrow". matrix(1:12, nrow=3, byrow=T) # Same matrix(0:1, nrow=2, ncol=4) # What happened? matrix(0, nrow=2, ncol=4) # " matrix(letters, ncol=2) # Elements are now of type 'character'. matrix(paste("Letter",letters), ncol=2) # When reading data in text files, 'byrow=T' is needed # for row-by-row input (download 'laser.dat' from the course page first): m <- matrix(scan("Data/laser.dat", skip=1), ncol=4, byrow=T) # ^^^^^^ Skip first line. m # Later we will see a more convenient function for reading # tabular data ('read.table()'). # - Whether something is a matrix can be checked with 'is.matrix()': is.matrix(matrix(1:12,3)); is.matrix(1:12) x <- 2; is.matrix(x) x <- 1:10; is.matrix(x) x <- matrix(0, nrow=3, ncol=5); is.matrix(x) is.matrix(matrix(0, nrow=3, ncol=5)) # tautology # # - The dimension attribute: it is the sole difference between # a vector and a matrix. It can be queried: dim(m) # Vector of length 2 with the two dimensions. dim(m)[1]; dim(m)[2] # Each dimension separately. nrow(m); ncol(m) # Same. # Vectors can be turned into matrices by assigning the # dimension attribute: m <- 1:12 # Just a vector. dim(m) <- c(3,4) # Now a matrix. m is.matrix(m) # TRUE dim(m) <- NULL # Stripping to a vector. m is.matrix(m) # FALSE # - The dimension name attributes: row- and col-names colnames(m) <- letters[1:4] rownames(m) <- LETTERS[1:3] m colnames(m) rownames(m) # - Indexing/subselecting rows and columns: (differs from C!) m <- matrix(1:12, ncol=4) m[1,4] # Element in row 1, column 4. m[1:3,] # First 3 rows. m[,3:4] # Last 2 columns. m[1:3,3:4] # Submatrix of size 3x2 (unlike Python!) m[c(1,2,1,2),] # Repeat rows 1 and 2. m[,c(1,2,1,2)] # Repeat columns 1 and 2. m[c(1,2,1,2),c(1,2,1,2)] # Repeat left-upper 2x2 matrix 4 times. m[-1,] # Select all but the first row. m[,-c(2,4)] # Select all but columns 2 and 4. m["A",] # Only works if col/rownames have been assigned. m[c("A","C"),] m[c("A","C"),"a"] # - ATTENTION: Selection of individual rows and cols generates vectors # who do no longer know that they were rows or columns. # R has no concept of col-vectors and row-vectors: m[,1] is.matrix(m[,1]) is.matrix(m[1,]) # Vector is vector, period (unlike Matlab). # You can force the issue by turning vectors into # Nx1 or 1xN matrices: x <- 1:10 dim(x) <- c(10,1); x dim(x) <- c(1,10); x # # - Index/subsetting can be used for assignment: m[1,2] <- 0 m[1,] <- 11:14 m[,1] <- 0 m[1,c(F,F,T,T)] <- c(7,10) # # - Associative array feature for matrices: rownames(m) <- c("x","y","z") # like 'names(vec)' colnames(m) <- c("a","b","c","d") m["x","d"] # number m["x",] # vector m[,"c"] # vector m[c("x","z"),c("c","a")] # submatrix (different from Python!) m[c("x","z","x","y"),c("c","a","a")] # col-row-rearrangement # # Roadmap: # - wrap up matrices # - arrays # - lists # - dataframes # - loops and conditionals # - character manipulation functions # - fcts related to distributions # - plotting functions # - writing FUNCTIONS # - querying object types # - Column and row binding: # Two functions that permit collecting cols and rows to form matrices. x <- runif(5) # Play data. cbind(1:5, x, x, rep(10,5)) # Column binding. rbind(1:5, x) # Row binding. # (Vectors are NOT thought of as columns or rows on their own; # they take on these roles inside the "cbind" and "rbind" functions.) # Both functions accept an arbitrary number of conforming arguments. # You can also bind whole matrices: cbind(rbind(1:3,11:13,21:23), cbind(1001:1003,2001:2003)) # A more friendly way of writing the same is: m1 <- rbind(1:3,11:13,21:23) # 3x3 m2 <- cbind(1001:1003,2001:2003) # 3x2 cbind(m1, m2) # 3x5 # Conforming for 'cbind()' means the arguments have equal number # of rows, for 'rbind()' it means equal number of columns. # If vector arguments are not conforming, R extends them cyclically # or clips them but may give you a warning if the shorter arguments # are not of fractional length of the longest argument: cbind(1:3,0) # second arg has size 1, = fraction of 3 cbind(1:6,1:3) # size 3 is a fraction of size 6 cbind(1:3,1:2) # size 2 is not a fraction of size 3 => warning cbind(1:3,matrix(11:14,2)) # clipping: the second arg dictates nrow # Don't rely on cyclic extension except for the simplest cases # such as repeating constants. # - Coercion of matrices to vectors: # A matrix can always be treated as a vector. # The following does not create an error message: m <- matrix(1:12, nrow=3) m[,1] # Recall the column-major convention for storing matrices in R. # # - Some useful functions for generating patterned matrices: diag(5) outer(1:3,1:4) outer(1:3,1:4, FUN=paste, sep="") x <- outer(1:3,1:4) row(x) col(x) row(x)>col(x) x[row(x)>col(x)] # ################ # # * ARRAYS: the generalization of matrices to more than 2 indexes # a <- array(1:24, dim=c(2,3,4)) a a[1,2,1] a[,2:3,c(1,4)] a[,,c(T,F,T,F)] a[,,-3] # Main use: contingency tables # e.g., table frequencies of sex by education by income bracket ... # The associative array feature exists also when a dimnames are given # To understand this, we need to talk about lists first. # # ################ read/study up to here for HW 1 ##################### # # * LISTS: # # - Recall: Vectors and matrices can carry only one basic type # of data at a time, numeric OR logical OR character. matrix(c(T,F),nrow=4,ncol=4) matrix(paste(LETTERS,letters)[1:25],nrow=5) # Lists are data structures without this restriction. # Lists are sequences of anything. They can contain: # basic data types, vectors, matrices, arrays, lists (recursion),... # In particular, they can contain arbitrarily nested lists of lists. # Examples: list(1,"1",T,F,1E100,NA,-Inf,NULL,1:5,letters[2:5],list(1,2,"a")) # Balanced binary tree: list(list(list(1,2),list(3,4)),list(list(5,6),list(7,8))) # mylist <- list(vec=1:3, mat=cbind(1:2,3:4), flags=c(T,T,F,T), lst=list(1:3,1:4)) mylist # Component names are optional. They can be glued on by force: x <- list(1,"a",T,NULL); names(x) <- LETTERS[1:4]; x # - Access to LIST ITEMS is by index with "[[..]]", or by "$name" # if names exist: mylist[[2]] mylist$mat mylist[["mat"]] # This is also how lists are printed: list(1,T) list(a=1,b=T) # # - Access to SUBLISTS is as if the list were a vector: mylist[2:3] # Sublist of length 2. mylist[2] # Sublist of length 1. mylist[c(1,4,1,4,2,3,2,3)] # Out-of-order and expansion. mylist[c("lst","mat","vec","mat")] # Named access. mylist["vec"] # Sublist of length 1. mylist[["vec"]] # Vector!!! mylist[c(T,F,T,F)] # Logical selection # # - Sublists are lists and require list item access to get at # the components: mylist[2:3][[1]] mylist[2:3]$mat mylist[2][[1]] # Again, "mylist[2]" is a list of length 1... mylist[2]$mat # Dito. # # - Primary use of lists in statistics: # collecting the results of a complex statistical or numerical # analysis, such as a regression or an eigendecomposition: eigen(matrix(1,3,3)) # is a list with two named items: $values and $vectors, # the first being a vector containing the eigenvalues, # the second being a matrix with the eigenvectors in the columns. # # # ################ # # * DATAFRAMES: # # - In statistics, data are usually rectangular tables, cases by variables. # Problem: Variables are not all of the same type. # Some may be quantitative, hence stored as numeric data,. # Other variables may be categorical and stored either # with numeric or character/string codes for categories. # => Matrices cannot accommodate variables of both types... # Solution: Data frames. They are similar to matrices, # but columns may differ in basic data types. # (The entries have to be basic, not complex.) # # Main use of dataframes: data tables with mixed-type variables # # Dataframes are printed like matrices, but they are internally # implemented as lists. # # - The function "data.frame()" can bundle conforming vectors, # matrices, other dataframes into a single dataframe: myframe <- data.frame(somenumbers=1:3, somecharacters=c("a","b","c"), somelogicals=c(T,F,T)) myframe # # - Auxiliary functions for dataframes: is.data.frame(myframe) # Checking type. as.data.frame(cbind(1:4,11:14)) # Coercing a matrix to a dataframe. # # - Many matrix manipulations carry over to dataframes, # with one exception: dim(myframe) nrow(myframe) ncol(myframe) rownames(myframe) colnames(myframe) myframe[,3] # Col 3 coerced to vector. myframe[2,] # << Sub-dataframe with row 3, NOT coerced to vector! myframe[,2:3] # Sub-dataframe consisting of cols 2 and 3. myframe[2:3,] # Sub-dataframe consisting of rows 2 and 3. # Why is 'myframe[2,]' not coerced to a vector? # Because the items might be of differing types! # Hence the row 'myframe[2,]' is still a dataframe... # # (Of course matrix multiplication does not work:) myframe %*% matrix(rep(1,10), nrow=5) # - 'read.table()' reads a tabular ASCII/txt file into a dataframe, # possibly with row and column names: read.table("laser.dat", header=T) # The file can be an URL: url <- "http://www-stat.wharton.upenn.edu/~buja/STAT-541/laser.dat" my.frame <- read.table(url, header=T) # This function has many other options for particular cases. # From the "help()" page: # read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", # row.names, col.names, as.is = FALSE, na.strings = "NA", # colClasses = NA, nrows = -1, # skip = 0, check.names = TRUE, fill = !blank.lines.skip, # strip.white = FALSE, blank.lines.skip = TRUE, # comment.char = "#") # - Since data frames are lists of columns, we can use "lapply()" (list-apply): lapply(myframe, class) # The following does not do the right thing: apply(myframe, 2, class) # Why? "apply()" expects a matrix in its first argument; # if it isn't, it gets coerced into a matrix; # but matrices can't have factors in their columns, only vectors, # so the factor columns are coerced to character columns. ################################################################ # # Flow control: loops, conditionals # m <- 1:12 #play data # # - if-conditional: if(length(m) > 15) { print("m is greater than 10") } # With "else" clause: if(length(m) > 15) { print("m > 10") } else { print("m <= 10") } # This sort of thing is most useful inside loops; see below. # - The vectorized "ifelse()" function: # Not a flow control construct, but often replaces a combination of # for-loop and if-else statements. ifelse(c(T,F,T), c(1,2,3), c(-1,-2,-3)) # The function runs down the three arguments in parallel, # checks each element in the first argument, # if true, picks the corresponding element in the second argument, # if false, picks the corresponding element in the third argument, # returns a vector/matrix/array of the size of the first argument. # If the second or third argument are not conforming, # they are cyclically repeated, as in this implementation of # 10 Bernoulli trials: ifelse(runif(10) > 0.5, "H", "T") # - Note the difference between 'if' and 'ifelse()': # * 'if' is a syntactic element that dispatches execution # depending on the outcome of a single logical outcome. # * 'ifelse()' is a function that takes a logical vector # and selects from two other vectors depending on the logicals. # - for-loop: runs over the elements of a vector. for(i in c(10,100,1000)) { j <- 2*i; print(j); } # The variable 'i' is the looping variable. It gets set to # each element of the vector in turn, and for each # the subsequent, usually composite ({...}) statement gets executed. # (Braces are optional if there is only one statement in the loop.) # - repeat-loop: needs a "break" for termination. repeat{ print("Trying..."); rnum <- runif(1); if(rnum > 0.9) { print(c("Found one:",rnum)); break } } # - while-loop: str <- "Not yet..." while(runif(1) < 0.99) { str <- paste(str,".",sep="") print(str) } print("Finally...") # # # - Looping over row or columns of matrices with 'apply()': m <- matrix(1:12, 3); m # * Looping over rows:: apply(m, 1, min) # Return a vector with minima of the rows. # ^^^ the "min()" function, applied to the 1st dimension. # * Looping over columns: apply(m, 2, mean) # Return a vector with the means of the columns. # ^^^^ the "mean()" function, applied to the 2nd dim. # * 'apply()' can also be used with functions that return vectors, # such as 'range()' and 'sort()': apply(m, 2, range) # There are complications with "apply(m, 1, range)": # the result is 2xnrow(m), not nrow(m)x2, as one would expect. # This requires transposition, 't()', to get the expected matrix: t(apply(m, 1, range)) # # # - Looping over elements of a list: lis <- list(numvec=1:10, charvec=letters, logicvec=rep(c(T,F),3), mat=cbind(1:3,1)) lis lapply(lis, length) # returns a list with one number per entry: the length unlist(lapply(lis, length)) # same but 'flattened' to a vector sapply(lis, length) # same but Simpler ('Simple apply') lapply(lis, dim) # returns a list with the dim attribute for each entry lapply(lis, sort) # returns a list of sorted vectors ################################################################ # # * STRING MANIPULATION: # # R has a number of functions that make the use of other scripting # languages such as awk, perl, python less and less necessary for # pattern matching problems. There are functions for searching # and substituting text patterns expressed as so-called # "regular expressions". Here is a collection: # # paste(..., sep = " ", collapse = NULL) # substr(x, start, stop) # substring(text, first, last = 1000000) # substr(x, start, stop) <- value # substring(text, first, last = 1000000) <- value # strsplit(x, split, extended = TRUE) # nchar(x) # string length # chartr(old, new, x) # tolower(x) # toupper(x) # match(x, table, nomatch = NA, incomparables = FALSE) # x %in% table # pmatch(x, table, nomatch = NA, duplicates.ok = FALSE) # charmatch(x, table, nomatch = NA) # grep(pattern, x, ignore.case=FALSE, extended=TRUE, value=FALSE) # sub(pattern, replacement, x, # ignore.case=FALSE, extended=TRUE) # gsub(pattern, replacement, x, # ignore.case=FALSE, extended=TRUE) # regexpr(pattern, text, extended=TRUE) # # String dimensions when plotting: # strwidth(s, units = "user", cex = NULL) # strheight(s, units = "user", cex = NULL) # # Example: download the dataset 'dict.dat' from the course page, then... dict <- scan("DATA/dict.dat",w="",quote="") sel <- grep("source", dict) # All words containing "source". dict[sel] # Stay tuned for HW 3. # # ################################################################ # # * DISTRIBUTIONS: PSEUDO-RANDOM NUMBERS, QUANTILES, CDFS, DENSITIES # # The following continuous distributions come with the standard # version of R, each with # . a random number generator, # . a quantile function, # . a cdf function, and # . a density fct. # - Uniform: # runif(n, min = 0, max = 1) # qunif(p, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE) # punif(q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE) # dunif(x, min = 0, max = 1, log = FALSE) # # - Normal: # rnorm(n, mean = 0, sd = 1) # qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) # pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) # dnorm(x, mean = 0, sd = 1, log = FALSE) # # - Student's t: # rt(n, df) # qt(p, df, lower.tail = TRUE, log.p = FALSE) # pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE) # dt(x, df, log = FALSE) # # - F: # rf(n, df1, df2) # qf(p, df1, df2, lower.tail = TRUE, log.p = FALSE) # pf(q, df1, df2, ncp = 0, lower.tail = TRUE, log.p = FALSE) # df(x, df1, df2, log = FALSE) # # - Chi-square: # rchisq(n,df,ncp=0) # qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) # pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) # dchisq(x, df, ncp = 0, log = FALSE) # # - Exponential: # rexp(n, rate = 1) # qexp(p, rate = 1, lower.tail = TRUE, log.p = FALSE) # pexp(q, rate = 1, lower.tail = TRUE, log.p = FALSE) # dexp(x, rate = 1, log = FALSE) # # - Cauchy: # rcauchy(n, location = 0, scale = 1) # qcauchy(p, location = 0, scale = 1, lower.tail = TRUE, log.p = FALSE) # pcauchy(q, location = 0, scale = 1, lower.tail = TRUE, log.p = FALSE) # dcauchy(x, location = 0, scale = 1, log = FALSE) # # - Beta: # rbeta(n,shape1,shape2) # qbeta(p, shape1, shape2, lower.tail = TRUE, log.p = FALSE) # pbeta(q, shape1, shape2, ncp = 0, lower.tail = TRUE, log.p = FALSE) # dbeta(x, shape1, shape2, ncp = 0, log = FALSE) # # - Gamma: # rgamma(n,shape,rate=1,scale=1/rate) # qgamma(p, shape, rate = 1, scale = 1/rate, lower.tail = TRUE, # pgamma(q, shape, rate = 1, scale = 1/rate, lower.tail = TRUE, # dgamma(x, shape, rate = 1, scale = 1/rate, log = FALSE) # # Some uses: # - Random number generators are used, for example, # . in simulations, as when evaluating the efficiency # of statistical methods; # . in Bootstrap inference for frequentist statistics; # . in Bayesian computing to simulate posterior distributions. # Base of simulations: law of large numbers (rel. freq. approx. probs.) # - The quantile functions are used # . to find critical values in statistical tests; # . in q-q-plots to check empirical against theoretical distributions # (see below). # - The cdf functions are used for calculating p-values in # statistical tests. # - The normal density is sometimes used as a "kernel" in # smoothing methods. # # - Simulating discrete distributions and sampling: # The function sample # serves many purposes. It does sampling with and without replacement. # # - Bernoulli trials and Binomials: e.g., 10 head/tails, biased coin flip, ifelse(runif(10)<.55, "H", "T") sample(c("H","T"), size=10, repl=T, prob=c(0.55,0.45)) rbinom(n=10, size=1, p=.55) # (Sampling WITH replacement is also called i.i.d. sampling.) # # - Multinomial: e.g., 10 draws from uniform distribution on 4 objects sample(1:4, size=10, repl=T, prob=rep(1/4,4)) dmultinom; rmultinom # # - Sampling WITH replacement: sample(1:10, 5) # Samples w/o replacement (default). sample(10) # Random permutation of 1:10. # - The function can also be applied to non-numeric data: sample(letters) # A permutation of letters. sample(letters, 10) # 10 distinct random letters. sample(letters, 10, repl=T) # 10 i.i.d. samples of letters. sample(letters, 100) # Error! Sampling w/o replacement. sample(letters, 100, repl=T) # Fine! Sampling with replacement. # # 'sample()' is used to implement modern resampling methods such as # the bootstrap and permutation tests (see below). # # . Important note on seeding random number generators: # The "seed" stores the state of the pseudo-random number generator. .Random.seed runif(1) .Random.seed # The seed has changed. Random.seed.store <- .Random.seed # Store the seed. runif(1) # (*) Remember this random number. runif(10) # Now push the generator 10 times. .Random.seed <- Random.seed.store # Force it back to recreate (*) runif(1) # Indeed, this is the same random number as in (*). # Storing the seed is important for simulations that must be # reproduceable, for example, when putting simulation results # in a publication. # ################################################################ # # * PLOTTING: # # - The par() function for querying and setting plotting parameters. help(par) # # Some important parameters to set: par(mfrow=c(3,4), # 3x4 layout of 12 plots, filling rows first (mfcol) mgp=c(2.0, 0.5, 0), # distance of axis title ("x","y"), # numbers at the ticks, border line, # (in margin lines) mar=c(3,2.5,2.5,1.5), # margins: bottom-left-top-right (in lines) oma=rep(1,4), # outer margin around the collection # of plots on the page (in lines) mex=0.8, # expansion of margin line width, # relative to default cex=0.8) # expansion of characters, relative to default # # - List of all possible plot characters: windows(width=10, height=3) plot(1:25, rep(1,25), pch=1:25, xlab="", ylab="", yaxt="n", cex=2) # # ################################################################ # # WRITING FUNCTIONS: # # Syntax: simplest case fun1 <- function(x) { plot(x) } fun2 <- function(x) { 2*x } # Semantics 1: fun1 creates a 'side effect', plotting the input argument; # fun2 returns a result, the doubled input argument. fun1(1:10) # plots 1,2,...,10 fun2(1:10) # returns the vector containing 2,4,...,20 # Semantics 2: Functions are 'first class citizens', that is, # they can be assigned, printed, passed as arguments,... fun2 # Print the function. print(fun2) # Same. funx <- fun2; funx(1:5) # Application: len <- length; len(1:10) # What is this? (function(x) { 2*x })(1:10) # It's getting weird: funp <- function(f) print(f) funp(funp) # Assignment: x <- fun2(1:10); x # Fine. x <- fun1(1:10); x # Assigns NULL, but plots. 'fun1()' is for side effect. # Syntax: The following are equivalent ways of writing fun2(). fun2 <- function(x) { 2*x } # The original. fun2 <- function(x) 2*x # Single statement bodies don't need braces. fun2 <- function(x) { return(2*x) } # Make the return explicit. fun2 <- function(x) { y <- 2*x; y } # Value of last statement is returned. fun2 <- function(x) { y <- 2*x; return(y) } # Dito # Functions with more arguments and complex bodies: fun3 <- function(x, y) { windows(width=5, height=5) par(mgp=c(1.8,.5,0), mar=c(3,3,2,1)) plot(x, y, pch=16, cex=.5) } fun3(1:100, rnorm(100)) # Function calls with named arguments: independent of order of arguments fun3(y=rnorm(100), x=1:100) # Functions with defaulted arguments: defaults can use preceding arguments fun4 <- function(y=rnorm(100), x=1:length(y), cex=.5, pch=16, w=5, h=5, new=T) { if(new) windows(width=w, height=h) par(mgp=c(1.8,.5,0), mar=c(3,3,2,1)) plot(x, y, pch=pch, cex=cex) # 1st 'cex': argument name # 2nd 'cex': value to be passed to plot } fun4() # All arguments defaulted. fun4(1:100, rnorm(100)) # 1st and 2nd argument passed by order. fun4(1:100, rnorm(100), new=F) # Last argument is 7th by order, so needs name. fun4(1:100, rnorm(100), new=F, cex=.2) fun4(1:100, rnorm(100), new=F, cex=1, pch=3) fun4(x=1:100, y=rnorm(100), pch=2, w=10, h=2) # x= and y= not needed but more readable fun4(cex=.2) # Functions returning complex data structures: fun5 <- function(x) { list(N=length(x), location=c( mean=mean(x), mean.tr.05=mean(x,trim=.05), mean.tr.10=mean(x,trim=.10), mean.tr.25=mean(x,trim=.25), median=median(x) ), dispersion=c( sdev=sd(x), mad=mad(x), IQR=IQR(x) ) ) } summaries <- func(rnorm(10000)) summaries # Semantics 3: All passing of arguments is by value, not reference/name. # In other words, all arguments get copied into the body of the function. fun5 <- function(x) { x[1] <- Inf; x } x <- 1:3; fun5(x); x # In the above line, fun5() is working on a copy of x, also called x. # Semantics 4: Function bodies have access to outside variables. fun6 <- function() { x[1] <- Inf; x } x <- 1:3; fun6() x # 'x' was not changed. The body of fun6() worked on a copy of x. # Rule: Do not use this feature of R semantics. # Pass all data into a function as arguments. # (See Section 10.7 of Venables & Smith on "scoping rules" in R.) # Functions called inside functions: fun7 <- function(x) { rbind(x, fun2(x)) } fun7(1:4) # (This is an exception to the rule of not relying on the scoping rules of R.) # Functions defined inside functions: fun8 <- function(x) { fun9 <- function(x) { 2*x }; fun9(x) } fun8(1:5) fun9(1:5) # Does not exist; defined only inside the body. # Functions passed as arguments into other functions: x <- matrix(rnorm(1000),100,10) fun9 <- function(x) quantile(x, prob=c(.25,.50,.75)) apply(x, 2, fun9) # Equivalent: apply(x, 2, function(x) quantile(x, prob=c(.25,.50,.75))) ################################################################ # # * QUERYING R OBJECTS: # # Asking for the primary data type: class(1:10) class(T) class(letters) class(swiss) # Like 'letters', 'swiss' comes with R. # # Queries about basic data types: is.numeric(1:10) is.character(letters) is.logical(F) is.na(NA) # !!!!!!!!!!! use for removal of NAs: x <- rnorm(100); x[rnorm(100)>1] <- NA; is.na(x); x[!is.na(x)] is.infinite(-Inf) is.null(NULL) # # Queries about complex data types: is.vector(letters) is.array(array(1:12,c(2,3,2))) is.matrix(swiss) is.data.frame(swiss) is.list(swiss) is.function(mean) # # All the query functions that start with "is.": apropos("^is[.]") # '.' is a special character in regular expressions. # # Querying components of lists and columns in data frames: names(swiss) # Querying row and column names in matrices and data frames: colnames(swiss) # Same as names() for data frames. rownames(swiss) # ################################################################ # # * MORE ON R: A MENTAL MODEL FOR R # # black box with inner state described by # * Variables x, y, z, err, ..., pi, letters, LETTERS, ... # * Their values, that is, the data structures such as # values, vectors, matrices,... # to which the variables point. # * The black box has an engine, the intepreter, # that acts on the state of the box. # * The state of the black box is changed by executing # assignments, such as x <- runif(100) # * On quitting q() you decide whether to save the newly # formed variables and their values. # * Learn about the state of the black box by executing: # . ls() to learn what variables are bound # . print(x) to learn what the value of the variable is # * There are two additional black boxes: # . par(), ... to learn about the values and data structures # of two sub-boxes: # - Plotting box: # . State: plotting parameters such as character size, margin width,... # . Reports the settings of the plotting parameters: par() # . State can be changed by executing, say, par(mfrow=c(2,2)) # - Random number generator: # . State: the seed vector .Random.seed # . State changes when 1) a random number is generated, or # 2) .Random.seed is set by the user # - Printing and other parameters, e.g., line width, number of digits... options() # # ################################################################ # # Changing the working directory: # In the standard interface, one can change the working directory... # - temporarily: -> File -> Change Dir # - permanently: R-click on R icon -> Edit Properties -> Start in: # ################################################################