----------------------------------------------------------------
Ideas for next time:
- major data cleaning project with "Marketing_Science.txt"
- exploratory computing for analytical problems;
  comparison of risk curves, e.g. needs quadrature
----------------------------------------------------------------

STATISTICS 540: STATISTICAL METHODS AND COMPUTING

Prerequisites:
* hardware:
  - access to a MS Windows machine
  - privileges to install software
  - high-speed internet access (direct, cable modem, DSL)
  - a few hundred megabytes of disk for the software tools
* skills:
  - knowledge of basic operation of an MS Windows PC
  - preferably some experience with some kind of programming

Purpose of the class:
* acquiring a set of computational tools for
  - data handling
  - statistical data analysis
  - data visualization
  - simulations
  - publication
* gaining proficiency in using these tools
  - at the technical level, but also
  - at the level of deeper comprehension:
    . recognizing what tools are needed in a given situation
    . making reasonable use with good judgment
* learning to use programming languages as opposed to point-and-click
  interfaces
  why programming?
  . it offers infinitely more possibilities of action
  . it permits repeated application of sets of operations
    to new situations (! photoshop)
  . it gives an understanding of what point-and-click interfaces do (! Splus)
* learning to learn new languages yourself by
  - finding resources:
    . colleagues with the needed expertise (! study of high-performers)
    . local online documentation
    . tutorials on the web
    . books, manuals
  - recognizing data structures and their use:
    . vectors, matrices, arrays, lists, strings,...
    . their use for linear algebra and geometry
    . their use for constructing higher-level data structures
      such as trees, queues, associative arrays
  - recognizing common syntactic constructs, e.g.,
    arithmetic and logical operations, iteration, conditionals,
    assignments, functions
  - understanding semantic notions, such as scoping rules,
    functional programming, object-orientation,...

Style of the class: much doing, little lecturing (? well...)
* frequent projects
* assigned readings and presentation to the class, seminar style
* helping each other is encouraged, but only in the form of discussions;
  doing projects for others and copying projects of others is strictly
  prohibited under the honor code!!!
* there is no excuse "you didn't teach this"; if so, exercise your
  resourcefulness and find what you need
* no required textbooks

Software tools: freeware you need to download and install
- cygwin: unix shell for Windows
  why unix?
  . powerful set of file handling tools that mimic database operations
    on the level of ASCII files (sort, uniq, paste, join)
  . programmable shell language which allows looping over files, for example
  . has a pipe concept for passing data between applications
  . comes with an X11 implementation that allows you to open windows
    (xterms, editors) into unix servers such as ljsavage.wharton.upenn.edu
- gnu emacs: a powerful text editor
  why emacs?
  . extensible and customizable because it is built on top of the lisp
    language
  . understands syntax of many languages such as html, latex, C, R/Splus,
    python, perl, shell, lisp, java
  . uses the syntax for intelligent highlighting, and for helping with
    execution, compilation,... (! R demo)
- python: a high-level scripting language
  why python? (? wavering again...)
  . a prettier alternative to perl (which is ugly, built according to
    the model of the equally ugly shell syntax)
"scripting" means working with high-level constructs such as files, directories, URLs . often used (like perl) for cgi-scripts that permit server actions based on web input (e.g., showing a computed plot from a users browser input, such as clicks, dials, sliders, entered text) - R: a high-level statistical language why R? . freeware version of Splus, used a lot in stats depts . a powerful group of academics are committed to the future development of R . used, among other things, for scientific computing statistical data analysis static visualization simulation - ggobi: an interactive data visualization system why ggobi? . the instructor is co-designer of the system... . direct-manipulation of pictures of data . dynamic tools for 3D rotations and higher-D projections . implements linking between displays ("linked brushing") - C: (maybe) low-level compiled language why C? . almost the only game in town, together with its extension, C++ (fortran a distant second, rarely pascal, what else?) . for efficiency of critical pieces of code; ideally used as an R or python function ---------------------------------------------------------------- TEXT EDITOR: EMACS ---------------- Instructions for installing emacs: create a directory for emacs, such as C:\emacs browser: http://ftp.gnu.org/gnu/windows/emacs/latest/ select emacs-21.2-fullbin-i386.tar.gz this file is 17Mb, so may take a while put it in C:\emacs unpacking: if you do not have winzip on your PC yet, get a free evaluation version from http://www.winzip.com/ open up the ...tar.gz file, and let winzip extract the contents takes about 107Mb of disk space create a shortcut: right-click in an explorer the file C:\emacs\emacs-21.2\bin\runemacs.exe and select shortcut; then drag the created shortcut to the desktop you can edit the icon by right-clicking it and selecting "Properties" to change the name: tab "General" and edit the name to change the start-up of emacs: "Shortcut" and edit the target for example, you can get reverse video (black background) by appending " -rv" to "...runemacs.exe": "...runemacs.exe -rv" and if the .emacs file is not found, add it explicitly by adding the flag "-l C:\yourhomedirectory\.emacs" the icon should finally point at something like this: C:\emacs\emacs-21.2\bin\runemacs.exe -l C:\yourhomedirectory\.emacs C:\your homedirectory ---------------- Learning emacs: Start up emacs (using its icon). Top bar of emacs window: click "Help" and select "Emacs Tutorial" Follow its instructions carefully! 
Principles:
* in the beginning you read a file into a "buffer", or you write into a
  buffer and write it out to a new file
* by default you are always in input mode, except for the reserved
  keystrokes that use the modifiers "ctl" and "alt"

A summary of the most frequent strokes:
  write c-x instead of ctl-x, a-x for alt-x

* Cursor movements: depressing causes multiple punch
    ctl-b (back, <-)
    ctl-f (forward, ->)
    ctl-p (previous, up)
    c-n (next, down)
    c-a (beginning of line)
    c-e (end of line)
    a-< (beginning of buffer)
    a-> (end of buffer)
    c-v (next page)
    a-v (previous page)
    c-l (reposition the window so the cursor position is in the center)
* Moving text around:
    c-<space> (sets a mark at the present cursor location;
      now move around and you will see a highlighted area being swept
      from the mark, called the "region")
    c-w (erase the region and put its text in the so-called "yank buffer")
    a-w (put the region text in the so-called "yank buffer" w/o erasing)
    c-y ("yank" the text in the yank buffer into the current position
      of the cursor; presumably you moved the cursor)
* Search, replacement:
    c-g (abort search, replacement, erroneous key strokes; rings bell)
    c-s xyz c-s c-s ... (incremental repeated forward search for xyz;
      if terminated with c-g, jump back; if terminated with a cursor
      move, stay there)
    c-r xyz c-r c-r ... (ditto, but backward)
    a-% xyz <Enter> abc <Enter> ... (repeated replacement of xyz with abc,
      one replacement for each <space>; "n" instead of <space> skips one
      replacement; "!" does all replacements without repeated query)
* Files:
    c-x c-f filename (read file with name "filename" into a new buffer)
    c-x c-w filename (write buffer into a file with name "filename")
    c-x c-s (save buffer to the file from which it was read or to which
      it was previously written)
* dired ("directory editor"):
    c-x c-f directoryname
    (This prompts you with the current directory in the "minibuffer".
     Edit it to the path of the directory you want to visit.
     Remove the trailing "/" and hit <Enter>.
     This gives you a listing of files in the directory. Then use
       n to move down the file/directory list
       p to move up
       f to read this file into a buffer
       d to mark for deletion
       x to expunge files marked for deletion
       R to rename a file
       C to copy a file
       ? to remind yourself of the dired-specific command letters
       h to get complete and sophisticated help/documentation about dired)
* File/directory name completion:
    c-x c-f
    Shows the current directory path in the "minibuffer".
    Permits you to edit the path with emacs commands (c-f, c-b, c-d,...)
    You can use "path completion" by hitting <tab>, which completes a
    path as far as there is no other choice. Example:
      Assume you have two files:  c:/you/540/homework1.txt
                                  c:/you/540/homework2.txt
      Edit the path to            c:/you/540/h
      Hitting <tab> completes to  c:/you/540/homework
      Add "2"                     c:/you/540/homework2
      Complete with <tab> to      c:/you/540/homework2.txt
      Hit <Enter> and the file is read into a buffer
* Moving between buffers:
    c-x b (switch to previous buffer)
    c-x c-b (buffer listing, edit like dired: n, p, d, x, f, ?)
* Initializations:
    certain things you want in every session should be put in an
    initialization file called ".emacs". This file should contain
    Emacs Lisp code, and it should be in your home directory or else
    in C:/
    For an example file with a few goodies, go to
      http://www-stat.wharton.upenn.edu/~buja/STAT-540/.emacs
* Running a shell in an Emacs buffer:
    a-x shell
    It starts up in the directory that contains the previous file or
    dired. By default this is a Win2K shell; if you want a bash window
    instead, you have to put something in your .emacs file.
    See the example file on the class web page.
* Swapping c-h and <backspace>:
    See also the example file on the class web page.
* Multiple windows:
    c-x 2 (2 windows)
    c-x 1 (back to 1 window)
    c-x o (move cursor and input to other window)
* Defining a keyboard macro:
    c-x ( ...[emacs commands]... c-x )
  Using the keyboard macro:
    c-x e
  Example: copying the current line to the other window
    c-x ( c-a c-k c-y c-a c-n c-x o c-y c-x o c-x )
  (This mimics command execution support if it does not exist,
   for example for commands in a shell file and a shell window.)

homework: ~buja/STAT-540/hw-emacs.txt

----------------
A keyboard problem: the ctl and alt modifiers

Because of the extensive use of the ctl modifier in emacs, many people
would like to use the caps lock key as the ctl key. I have done that on
my laptop, and I also gave the ctl key the alt function, so my little
left finger can easily reach both ctl (on the caps lock key) and alt
(on the ctl key). These keyboard remappings, however, will pervade your
whole machine, and even ctl-alt-del will be remapped... Just the same,
if you have the courage and desire, you will find instructions in
  ~buja/STAT-540/kbd-remapping-instructions.txt
Let me know if something goes wrong. Messing with the so-called registry
in MS Windows can wreak havoc if not done carefully.

----------------------------------------------------------------
WEB PAGES: HTML

Concepts:
.........

HTML: the universal mark-up language/protocol of web pages
      (uses "mark-ups" or tags to format text and images)
Client: computer that downloads web content
Server: computer that provides web content
Server for your web pages: ljsavage
Directory for your web pages: ~/public_html
URL: http://www-stat.wharton.upenn.edu/~yourloginnameonljsavage/filename

HW: the web publishing mechanism
  logon to ljsavage
  create the directory public_html by saying   mkdir public_html
  copy any text file into public_html:         cp filename ~/public_html
  look at it in a browser:
    http://www-stat.wharton.upenn.edu/~yourloginname/filename
  edit the file with name "filename" on ljsavage in emacs:
    emacs ~/public_html/filename
  click the reload icon in the web browser and check whether your edits
  show up

Online tutorials:
* http://www.cwru.edu/help/introHTML/intro.html
* http://archive.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html
The first is a little old but still very good.
The second covers a little more material but is basic, too.

Recommended book: "HTML for the world wide web" by Elizabeth Castro,
4th edition, in the "Visual Quickstart Guide" series, Peachpit Press. ($20)

HTML syntax:
* define containers of web content with (usually matching pairs of) tags
  such as <b>....</b>, which will show the content "..." in bold
* whitespace in webcontent is ignored; i.e., line breaks cannot be
  forced with Newline; use instead the tag
  <br> for Newline or <p> for new paragraph;
  if whitespace must be respected, it is called "preformatted text" and
  enclosed in matching <pre> ... </pre> tags (good for handmade tables)
* every web page needs
  - a bracing pair of protocol tags <html> ... </html> and inside
    two disjoint containers called
  - a header: invisible in browsers, but used in web searches, bookmarks
      <head> ... </head> and inside <title> ... </title>
  - a body: the visible material
      <body> ... </body>
* optionally you can insert invisible comments: <!-- ... -->
* inside the body, structure your content in hierarchies of sections
  headed by titles with tags
    <h1> ... </h1> for sections
    <h2> ... </h2> for subsections
    <h3> ... </h3> for subsubsections
    ...
* inside sections, structure your text in paragraphs with initial <p>
  tags; the closing tags </p> can be omitted, but they do not hurt
* for bullet lists ("unordered lists"), use
    <ul>
    <li> first bullet
    <li> second bullet
    <li> third bullet
    </ul>
  for numbered lists ("ordered lists") use
    <ol> <li> ... </ol>
* for tabular layouts, specify rows with vertically aligned cells:
    <table>
    <tr> <td> ... </td> <td> ... </td> <td> ... </td> </tr>
    <tr> <td> ... </td> <td> ... </td> <td> ... </td> </tr>
    <tr> <td> ... </td> <td> ... </td> <td> ... </td> </tr>
    </table>
* include images anywhere in the content with <img src="..."> tags
* load other web pages with anchors, for example:
    <a href="http://www-stat.wharton.upenn.edu/~buja/">Buja's web page at Wharton</a>
* HW 2
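To tie the tags together, here is a sketch of a complete minimal page;
the filename hello.html and the title are made up for illustration.
Type it into ~/public_html/hello.html with emacs and point your browser
at the corresponding URL:

    <html>
    <head> <title> My first page </title> </head>
    <body>
    <h1> Hello </h1>
    <p> A first paragraph, with one word in <b>bold</b>.
    <ul>
    <li> first bullet
    <li> second bullet
    </ul>
    </body>
    </html>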
----------------------------------------------------------------
Instructions for installing cygwin:
  you need a high-speed line, phone and modem will be awfully slow
  make a new folder/directory for cygwin, such as C:\cygwin
  browser: www.cygwin.com
    click "Install Cygwin now"
    download "setup.exe" into your new directory
  point your browser or other folder viewer at C:\cygwin
    double click on "setup.exe"; this will execute the actual installation
    click yourself through a few "Next"s till you are asked to pick a
      download site; pick, for example, ftp://mirrors.rcn.net
    then you are asked to select packages; use the defaults ("Next")
    this will take a while (> 1hr)
    have an icon installed, click "Finish"
  You should now be able to click on the cygwin icon and get a window
  with a unix shell.
  The fully installed cygwin may take 971Mb; if this is too much, you
  may "uninstall" for example the XFree86 packages with the X11 window
  system.

Path problem: "Command not found"
  When you try to use unix commands in cygwin, the commands may not be
  found. The reason is that the cygwin unix interpreter does not know
  where to look for them. You should create an initialization file
  called .bash_profile in your home directory (cygwin knows how to find
  that). Edit the following line into the file:
    export PATH=:.:/cygdrive/c/cygwin/bin:$PATH

----------------
LEARNING UNIX: CONCEPTS AND COMMANDS
....................................

Execution of commands:
  create a cygwin window and type commands such as   ls
  You will have emacs line editing available with the following strokes:
    c-p, c-n, c-f, c-b, c-a, c-e, c-d, c-k
  Typically you call the previous command back with c-p and edit it.
  Under emacs you can also run unix shell windows by saying   a-x shell
  However, certain differences to regular shell windows exist:
  - The command "more" does not work, but it is not needed because you
    are in emacs and can page back with a-v.
  - Getting the previous command back is done with a-p, not c-p, because
    c-p gets you to the previous line in the buffer.

Directory/folder conventions:
  c:/Documents and Settings/Default User/
  (MS Windows uses \; unix uses /)

Manual pages for a command:
  man commandname
  Examples:  man ls
             man cat

Changing directories ("folders" in Windows):
  cd dirname (e.g., "cd c:", "cd Docu<tab>" completion)
  cd .. (to parent directory)
  cd (to home directory, probably c:/cygwin/home/yourname)

Learning where you are in the directory hierarchy:
  pwd

Listing files ("dir" in Windows):
  ls (all files in current directory)
  ls dirname (files in a subdirectory)
  ls partialfilename* (files with pattern in current dir)
  ls *partialfilename (ditto)
  ls -l (long listing with size)
  ls -d (listing of directories only)

Making a new directory:
  mkdir dirname

Typing a text file to the shell window:
  cat filename
  more filename (1 page at a time, hit "space" to continue, "q" to quit)
  less filename (like "more", but also allows paging backwards)

Putting text into a file: this is better done with an editor
  cat > filename (hit "Enter")
  (enter your text, multilines permitted)
  ctl-d (EOF signal for "cat")

Concatenating two files (this is where "cat" got its name):
  cat oldfile1 oldfile2 > newfile

Removing files:
  rm filename
[WIN2K: shortcuts]
................

  a-<esc>    cycle through windows
  a-<tab>    offer a menu of windows for selection
  c-a        select all
  c-z        undo
  c-o        open selected item
  F2         rename selected item
  F5         refresh active window
  c-<esc>    display Start menu
  a-<space>  display the system menu of the window

WIN2K AND CYGWIN
................

In a cygwin bash, you can say
  cmd
and you will enter a W2K command interpreter. Exit by saying
  exit
Alternatively, you can run short commands in W2K and return immediately
to bash by saying "cmd /C ...". Here are a few examples of calls to W2K
commands:
  cmd /C 'dir'
  cmd /C 'date /T'
  cmd /C 'time /T'
  cmd /C 'ver'
  cmd /C 'whoami'
  cmd /C 'path'
There are useful commands such as "fc", "find", "findstr" whose analog
we will use later in cygwin.

COMPARISON OF UNIX WITH WIN2K
.............................

  W2K/cmd  ~  Unix/bash
  help     ~  man
  dir      ~  ls -l
  type     ~  cat (type does not know stdio: <, >)
  more     ~  more
           ~  less
           ~  head
           ~  tail
  ren      ~  mv
  del      ~  rm
           ~  pwd
  copy     ~  cp
  cd /D    ~  cd
  mkdir    ~  mkdir
  rmdir    ~  rmdir
  xcopy    ~  cp -R
  fc       ~  diff
  sort     ~  sort
  findstr  ~  grep
  set      ~  set
  date /T  ~  date
  path     ~  echo $PATH
  cmd      ~  c:\cygwin\cygwin.bat (cmd in cygwin/bash, cygwin.bat in cmd)
           ~  wc
  date /R  ~  date
           ~  ps

File and folder/directory names: examples
  Win2K: C:\cygwin\etc\passwd
         C:\cygwin\bin\locate.exe
  Bash:  C:/cygwin/etc/passwd
         C:/cygwin/bin/locate.exe
  File extensions: ".exe", ".bat", ".txt", ...

FILE NAME EXPANSION:
....................

  *      wildcard symbol for any sequence of characters
         example: ls *.exe
  ?      wildcard symbol for any single character
         example: ls ?-*.exe
  [a-l]  matches any of the letters a through l
  [0-3]  matches 0, 1, 2, 3
  [abc]  matches any of the letters a, b, c
  Examples:
    ls ??
    ls [a-c]*

STANDARD I/O, PIPES, REDIRECTION:
.................................

* standard input    <, 0< file
* standard output   >, 1> file
* standard error    2> file
* redirection: 2>&1 error to standard output
* pipe: cat xyz | sort
This is somewhat similar in Unix and Win2K.

Examples: under cygwin/unix
  cat > xyz
    enter some text
    can be multilines
    ...
    ^d
  cat xyz
  cat < xyz
  ls >> xyz
  sort xyz > xyz.sort
  more xyz.sort
  ls -l | sort -n -k 5 (same as: ls -l -S -r)
  ls -l | more
  set | more

">" can be used as a command to create an empty file or blank it out:
  rm gaga
  ls -la gaga
  >gaga
  ls -ls gaga

PIPES:
......

Many commands work both
* with file arguments and
* with standard I/O
  example: cat file ~ cat < file
If a command works with standard I/O, it can be used in a pipe
  example: cat * | more
I.e., standard output of the first command is standard input for the
second.

ENVIRONMENT VARIABLES:
......................

Used to define parameters of the operating system. Most importantly:
"PATH", i.e., where to look for commands/executables. But variables
have a role in your own shell programming as well.

Bash/Cygwin/Unix: "dereference" (=substitute the value) with "$variable"
  echo $variable (query)
  variable=string (local definition for this shell)
  export variable=string (global definition)

Win2K: dereference with %variable%.
  create/edit with point-and-click:
    right-click on icon "My Computer" -> Properties -> Advanced
      -> Environment Variables
  or set in "cmd":
    set variable=string (definition)
    set variable (query)
    echo %variable% (query)
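The difference between a local and an exported ("global") definition in
bash can be seen by asking a child process what it knows; a minimal
sketch, with made-up variable names FOO and BAR:

  FOO="local only"        # local: this shell only
  export BAR="exported"   # global: child processes inherit it
  bash -c 'echo "FOO=[$FOO]  BAR=[$BAR]"'

The child shell prints an empty FOO but sees BAR.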
The PATH Variable: where to search for commands.
- Win2K: semicolon-separated list of folders
    query:  path
    append: path %PATH%;C:\garbagefolder;
    alternatively use the point-and-click interface (see above)
- Bash/Cygwin/Unix: colon-separated list of directories
    query:  echo $PATH
    append: export PATH=$PATH:/cygdrive/c/garbagefolder

Examples:
  ME="Andreas Buja"
  Substitution ("dereferencing"): Use $VAR or ${VAR}
  ({} for string concatenations)
    echo ME
    echo $ME
    echo ${ME}
    echo $MEa
    echo ${ME}a
    echo $ME"a"
  foo=gaga
  ls > foo
  more foo
  echo $foo
  more gaga

Some predefined variables:
  echo $HOME
  echo $PATH
  echo $PS1
  echo $PS2
  echo $IFS

Listing all variables:
  set
  set | more

Undefine a variable:
  unset GAGA

Undefined variables are empty:
  echo $GAGA

Prepending to your path: (why "pre"?)
  export PATH=:/cygdrive/e/buja/stat-540:$PATH
  echo $PATH

Problem with appending to variables:
the first does not work, the second does.
  echo $PS1a
  echo ${PS1}a

EVALUATION:
...........

In order to put the std output of a command into a variable, one needs
a syntactic construct: `....` (a matched pair of back quotes)
Examples:
  FOO=`date`
  echo $FOO
  FOO=${FOO}" "`date`
  echo $FOO
  echo $FOO | wc
Pick the time of day out of date:
  set `date`; echo $4

SHELL SCRIPTS:
..............

* A simple script: basically "ls" but with a report of how many files
  there are. Edit the following text into a file with name "lw":
    tmp=`ls $*`
    echo $tmp
    echo "number of files: "`echo $tmp | wc`
  Then try
    lw
  Not the real thing, but ok for now; should really extract the second
  field.
  Make executable (not needed under cygwin because the protections are
  missing):
    chmod +x lw

SHELL SCRIPT ARGUMENT HANDLING:
...............................

- arguments: $1, $2,...
- list of all args: $* "${@}" (=$1 $2 $3 ...)
- number of args: $#
- shell script file: $0
- exit status of last command: $? (0=normal; nonzero=failed)

Put this text into a file "ll":
  ls -l $*
Or:
  ls -l "$@"
The last creates the arguments $n individually quoted, i.e., unevaluated.
  ll ../*

The "shift" command drops $1 and renumbers:
  set a b c
  echo $*
  shift
  echo $*

At the top level, $* can also be set by "set":
  set a b c
  echo $*
  echo $2

This puts the list of names of files in the current directory in $*:
  set - *
  echo $*

MULTI-LINE INPUT W/O ^D :

What if you want to input multi-line text to a command inside a script?
You cannot use ^D to terminate. Instead, the following is permitted
("here documents"):
  cat > foo <<anyword
  a;lksjdf
  `date`
  a;slk;laj
  anyword
  more foo
With <<\anyword instead of <<anyword, the input is taken literally,
i.e., `date` is not evaluated:
  cat > foo <<\anyword
  a;lksjdf
  `date`
  a;slk;laj
  anyword
  more foo

THE test COMMAND: SETTING EXIT STATUS ACCORDING TO PREDICATES

  test string
  test -f file
  test -d dir
See "man test" for all possible predicates: string, file, numeric
tests, as well as connectives "not", "and", "either or"
Examples:
  >gaga
  test -f gaga; echo $?
  [ -f gaga ]; echo $?
  rm gaga
  [ -f gaga ]; echo $?
Intended use: in "if", "while", "until" constructs
  if [ -f gaga ]; then echo gaga; fi
  while test " "; do sleep 2; date; done
Stop with ^C.
  >gaga; while [ -f gaga ]; do sleep 2; ls -l gaga; done
  >gaga; while [ -f gaga ]; do sleep 2; set " "`ls -l gaga`; echo $5 ; done
Then do in another shell:
  cat >> gaga
  ...
  ...
  ^D
  rm gaga
Every line of input increases the length of the file "gaga".
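As a small application of test and exit status, here is a sketch of a
gentle "rm" that reports instead of complaining; the script name "zap"
is made up. Put this in a file "zap" and call it as   zap filename

  if test -f $1
  then
      rm $1
      echo "removed: $1"
  else
      echo "no such file: $1"
  fi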
THE if CONDITIONAL:

Put this in a file "foo":
  if test -f $*; then ls -l $*; fi
Better presentation:
  if test -f $*
  then ls -l $*
  fi
Then try
  foo gaga

THE for LOOP:

Example: create or empty files for all filenames
a0, a1, a2, a3, b0, b1, b2, b3:
  for i in a b; do
    for j in 0 1 2 3; do >$i$j; done
  done
  for i in a b; do
    for j in 0 1 2 3; do tmp=$i$j; >$tmp; ls -l $tmp; done
  done
Better formatting is easier to read:
  for i in a b
  do
    for j in 0 1 2 3
    do
      tmp=$i$j
      >$tmp
      ls -l $tmp
    done
  done
  ls -l [ab][0-3]

Rename a set of files:
  for i in `ls`
  do
    tmp=`echo $i | sed 's/i/j/'`
    mv $i $tmp
  done
This complains for the files whose names contain no "i": old and new
names are identical. Move only the non-identical ones:
  for i in `ls *i*`
  do
    tmp=`echo $i | sed 's/i/j/'`
    if test ! $i = $tmp
    then
      mv $i $tmp
      echo "renamed \"$i\" to \"$tmp\""
    fi
  done

--------------------------------
PURPOSE OF SHELL:
- File manipulation
- Environment management
- Job control
We will focus on the first two.
(Btw: Environment = shell variables that rule the behavior of the shell
and of applications that have their own, such as latex.)

TOOLS:
- Basic data types:
  . characters
  . words (contiguous characters w/o whitespace)
  . strings (= whitespace-separated words, e.g., "word1, word2, word3")
  . files
  . integers
  . arrays
- Shell variables
- Unix commands

CONSTRUCTS:
- Commands with flags and arguments:
    ls -l a*
  Commands operate on strings, file content, file names, variables,
  and they generate a "return code" depending on success or failure.
- I/O:
  . /dev/stdin (file descriptor 0; deflt.: keyboard)
  . /dev/stdout (file descriptor 1; deflt.: terminal)
  . /dev/stderr (file descriptor 2; deflt.: terminal)
  . /dev/null (null device, sink)
  . redirection: files to stdin, stdout to file
      cat < a* > allfiles
    (for "cat", this is the same as:)
      cat a* > allfiles
      ls y*
      ls y* 2>/dev/null
  . pipes:
      ls -l a* | more
- Variable evaluation: $var ${var}
- Command evaluation: `...` or $(...)
    echo `ls | wc` $(ls | wc)
- Arithmetic evaluation:
    echo $((2+3))

MOVING CONTENT:
  . strings
  . integers
  . file contents (stdin, stdout)
  . filenames (filename expansion)
  . variables
- string -> stdout: echo
    echo "this string"
    echo -e "this string \n ... and this string"
- stdin -> string: `...` (cmd. subst.) $(...)
    echo `cat l*`
    echo $(cat l*)
- string -> variable: = (assgnmt.)
    var="this string"
- strings -> $*, $n, $# (positional variables): set, unset, shift
    set `ls | head -3`; echo "$*| $2| $#"
- variable -> string: $... (var. subst.) ${...}
    echo $var
    echo ${var}
- keyboard -> variables: read
    echo -n "Enter something: "; read; echo $REPLY
    read v1 v2 vrest; echo "$v1; $v2; $vrest"
- filenames -> strings: filename expansion of *, ?, [...], [!...]
    echo a*
    echo ??
    echo ?[!0-9]*
- files to files: sort, uniq, join, cut, split, csplit

EXPANSIONS:
* Brace Expansion
    echo {aa,bb,cc}{1,2,3}
* Variable Expansion
    echo $HOME
* Command Substitution
    echo `ls`
    echo $(ls)
* Arithmetic Expansion
    echo $((2+3))
* Word Splitting: on blanks, tabs, newlines (content of $IFS)
* Filename Expansion/ Pattern Matching:
  Any word containing *, ?, [...] is a pattern.
  It will be expanded into a list of filenames matching the pattern.
    ls *a
* Execution follows all the expansions. $PATH is searched for commands.

QUOTES:
* single characters: \ preserves literal character
    echo "$HOME"
    echo "\$HOME"
* strings: '...' preserves literal string
    echo '$HOME'
* strings: "..." allows partial evaluation
    echo "$HOME ---- `date`"

GROUPING, LOOPING, CONDITIONALS:

* Grouping:
  OR: cmd1 || cmd2 (cmd1; if it fails: cmd2)
    wc a* 2>/dev/null || echo "no files starting with \"a\""
    wc z* 2>/dev/null || echo "no files starting with \"z\""
  AND: cmd1 && cmd2 (cmd1; if it does not fail: cmd2)
    wc a* 2>/dev/null && echo "there are files starting with \"a\""
    wc z* 2>/dev/null && echo "there are files starting with \"z\""
  sequential: cmd1; cmd2
    ls -l a*; wc a*
  asynchronous: cmd1 & cmd2
    ls -l a* & wc a*
  grouping for joint redirection:
    (ls -l a*; wc a*) > foo

* Looping: "for" loop for lists:
    for var in ....; do cmd1; cmd2; ...; done
  Or more readable, w/o semicolons, but Newlines instead:
    for var in ...
    do
      cmd1
      cmd2
      ...
    done
  Examples:
    for file in *; do echo "------------- file: $file--------"; cat $file; done
    for file in [ab]*; do echo $file; done
    for i in 1 2 3; do echo $i; done
    for i in "1 2 3"; do echo $i; done
    for i in {0,1,2,3}{0,1,2,3,4,5,6,7,8,9}; do echo $i; done

* Looping: "for" loop for integers, C-like syntax
    for((...)); do cmd1; cmd2; ...; done
  Examples:
    for((i=0; i<10; i++)); do echo $i; done
    for((i=1; i<1000; i=2*i)); do echo $i; done

* Looping: "while" and "until"
    while ...; do cmd1; cmd2; ...; done
    until ...; do cmd1; cmd2; ...; done
  Example:
    while [ -f gaga ] ; do sleep 5; done
  For condition syntax, see below.

* Conditionals: "if"
    if [ ... ]; then cmd1; cmd2; ...; fi
    if test ... ; then cmd1; cmd2; ...; fi
  Example:
    if test -f a* ; then ls -l a*; fi
  For condition syntax, see below.

* Conditionals: "case"
    case ... in
      pattern1) cmds... ;;
      pattern2) cmds... ;;
      pattern3) cmds... ;;
      ...
    esac
  Example: put the following in a script file.
    echo -n "Enter the name of an animal: "
    read ANIMAL
    echo -n "The $ANIMAL has "
    case $ANIMAL in
      horse | dog | cat) echo -n "four";;
      man | kangaroo ) echo -n "two";;
      *) echo -n "an unknown number of";;
    esac
    echo " legs."

*** Practice: Interpret the "man"ual command in the intro by Bourne,
    sec. 2.10. Check out c:/cygwin/usr/man along with it.
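Grouping combines naturally with loops; a sketch that reports per file
instead of once, reusing the a*/z* patterns from above:

  for f in a* z*
  do
    wc -l $f 2>/dev/null || echo "no file matching: $f"
  done

If no z* files exist, the pattern stays unexpanded and the || branch
reports "no file matching: z*".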
]]; then echo "yes"; else echo "no"; fi if [[ string == strin[fgh] ]]; then echo "yes"; else echo "no"; fi if [[ string == strin[!h] ]]; then echo "yes"; else echo "no"; fi if [[ string != ??? ]]; then echo "yes"; else echo "no"; fi if [[ string != strings ]]; then echo "yes"; else echo "no"; fi Above, the right hand side can be a pattern using *,?,[...],[!...]. * Is one string lexicographically before another string? if [[ string1 < string2 ]]; then echo "yes"; else echo "no"; fi if [[ stringb < stringa ]]; then echo "yes"; else echo "no"; fi if [[ stringa < stringaa ]]; then echo "yes"; else echo "no"; fi * Comparing integers: -eq, -ge, -le, -gt, -lt, -ne i=6; if [[ i -eq $((3*2)) ]]; then echo "yes"; else echo "no"; fi i=6; j=12; if [[ $((i*j)) -eq $((2*i**2)) ]]; then echo "yes"; else echo "no"; fi * Composite conditionals: && ("and"), || ("either or"), ! ("not"), (...) (overriding precedence) i=1; if [[ i -le 6 && i -ge 2 ]]; then echo "yes"; else echo "no"; fi str=abc; if [[ str == abc || str == ab || str == a ]]; then echo "yes"; else echo "no"; fi ARITHMETIC EXPRESSIONS: within $((...)), as in: i=2; j=$((i**2)); echo $((10*i+j)) C-like syntax: id++ id-- variable post-increment and post-decrement ++id --id variable pre-increment and pre-decrement - + unary minus and plus ! ~ logical and bitwise negation ** exponentiation * / % multiplication, division, remainder + - addition, subtraction << >> left and right bitwise shifts <= >= < > comparison == != equality and inequality & bitwise AND ^ bitwise exclusive OR | bitwise OR && logical AND || logical OR expr?expr:expr conditional evaluation = *= /= %= += -= <<= >>= &= ^= |= assignments expr1 , expr2 comma ARRAYS: * bash arrays are zero-based, as in C. Indices have to be integers >=0. * Arrays come into existence by assignments such as arr[2]=zzzzzzzzz arr[4]=abcdef arr[0]="zeroth element" * Print an array element: echo ${arr[2]} {} are necessary to avoid shell variable substitution. Any reference w/o index points to the zeroth element. echo ${arr} * Print the whole array: echo ${arr[*]} echo ${arr[@]} * Length of the whole array: echo ${#arr[*]} Warning !!!! This is the number of non-empty elements, NOT the highest index plus 1. The above two assignments to indices 0, 2 and 4 created an array of length 3. Therefore, looping from 0 to ${#arr[*]} generally wrong. * Looping over an array: for item in ${arr[@]}; do echo $item; done Ooops! The zeroth element has an embedded blank, so when looping over the array it seems that the array elements are concatenated and the loop is over the whitespace-separated words. To get it right, quote the array: for item in "${arr[@]}"; do echo $item; done (Go figure...) * Blanking out an array: unset arr ============== Processing of Text Data ============== -------------- egrep: PATTERN SEARCH -------------- * egrep: Find lines with regular expression pattern. Reads from named files and stdin, writes to stdout. egrep 'pattern' files ... | egrep 'pattern' * Patterns are specified in terms of "regular expressions" (see above). * egrep is the extended regular expression version of the simpler grep. fgrep is a simpler and faster version without regular expressions. * egrep has a few flags that add power: -A 3 Print 3 lines of trailing context after matching lines. Application: address book -- grep a name and print the following lines that contain the address of the name. -B 3 Similarly, print 3 lines of preceding context. -a Treat a binary input file as if it were text. 
  -c        Do not print matching lines, instead print the number of
            matches for each input file.
  -f pfile  Read patterns from a patternfile "pfile" which must contain
            one pattern per line. An empty file matches nothing.
  -F        The patterns should be interpreted as simple strings, not
            regular expressions.
  -h        Suppress prefixing filenames when multiple files are input.
  -i        Ignore case distinctions in input files and patterns.
  -m 5      Stop reading a file after 5 matching lines.
  --mmap    Read input more efficiently (with memory mapping as opposed
            to read I/O).
  -n        Prefix each line with a line number within its input file.
  -o        Show only the part of a matching line that matches the
            pattern.
  -r        Recursively descend directories to find input files.
  -v        Invert: print non-matching lines.
  -P        Pattern is a Perl regular expression.

* Examples: uses of the flags
    egrep -n four less.txt
      Lines containing "four", preceded by line number.
    egrep -n -o four less.txt
      Ditto but show only the matched word "four".
    egrep -v -i '(\<the\>|\<a\>)' less.txt | head -20
      Lines without The, the, A, a (as words), ignoring case.

--------------
REGULAR EXPRESSIONS
--------------

* Patterns of varying complexity are used by programs such as bash,
  grep, sed.
* See class page for resources on regular expressions.
* See also "man grep" for the regular expressions specific for grep.
  Here is a summary:
  Range:
    .       match any SINGLE character
    [abc]   match any of the listed characters
    [5-9]   match any of the range 5 ... 9
    [^0-9]  match any non-numeric character
  Position:
    ^x      match a pattern x at the beginning of a line
    x$      match a pattern x at the end of a line
    \<x     match x at the beginning of words
    x\>     match x at the end of words
  Repetition: note that <space> is also a character that is counted.
    x?      match x one or zero times
    x*      match x zero or more times
    x+      match x one or more times
    x{n}    match x repeated n times
    x{n,}   match x repeated n or more times
    x{n,m}  match x repeated >=n and <=m times
  Grouping:
    (x)     match x; useful
  Concatenation: match any string that is the concatenation of matches.
    xyz...  match if x is matched, followed by a match of y, followed ...
  Alternation: regular expressions joined by "|";
    x|y|z... match any string with at least one match among x, y, z, ...
  Quotation: ^ ? + { | ( have special meaning. To use them in their
  literal meaning, they must be quoted with \ or listed with []:
    \? \+ \{ \| \(
  or easier to read:
    [?] [+] [{] [|] [(]

* Examples of regexps:
    ls e:/emacs/emacs-21.2/src/a*.c
    egrep '/\*.*\*/' e:/emacs/emacs-21.2/src/a*.c
      Find comments in several C programs.
    egrep '\<one\>|\<two\>|\<three\>|\<four\>|\<five\>|\<six\>|\<seven\>' less.txt
      Find a list of words (number words).
    ls -l .. | egrep ^d
      Find directories.
    ls -l | grep '^...x'
      Find files that are executable by owner.
    ps -ef | grep 'bash'
      Find your bash jobs.
    egrep '.*' foo.grep
      .* matches anything, including nothing
    egrep '^$' foo.grep
      Finds nothing under DOS/Windows ............
    egrep '^.$' foo.grep
      Finds blank lines!! Why not single char lines?

* IMPORTANT: line terminations <CR>=<carriage return>, <LF>=<line feed>
    DOS/Windows: <CR><LF>
    Unix/Linux:  <LF>
    Mac:         <CR>
  egrep counts <CR> as a character, hence:
    ^.$ matches a blank line under DOS/Windows !
    ^$  matches a blank line under Unix/Linux !
  CONSEQUENCE: Always use ".$" as line terminator, not "$" alone !!!
  [I have not found a flag or shell variable that would correct this
   problem. Tell me if you do!]
  EXAMPLE: foo.grep contains a line with content "abc"
    egrep '^abc$' foo.grep
      finds nothing
    egrep '^abc.$' foo.grep
      finds the line with "abc" ....
    egrep '^.?.?.?$' foo.grep
      .?.?.?  zero to two characters plus the <CR> (DOS/Windows)
    egrep '^.{1,3}$' foo.grep
      .{1,3}  ditto
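You can reproduce the carriage return effect on any machine; a sketch
that manufactures a small DOS-style file with printf (the filename
foo.crlf is made up):

  printf 'abc\r\n\r\nxyz\r\n' > foo.crlf   # two text lines, one "blank" line between
  egrep -c '^$' foo.crlf                   # 0: the blank line still contains a <CR>
  egrep -c '^.$' foo.crlf                  # 1: the <CR> is the single character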
* Further Examples:
    egrep '\<b[aeiou]g\>' my.dic
      matches bag, beg, big, bog, bug
    egrep '[Cc]ompan(y|ies)' foo.grep
    egrep '(C|c)ompan(y|ies)' foo.grep
      same thing: capital/lower case, singular/plural of company
    egrep '"*word"*' foo.grep
      word with or without quotes
    egrep '^[A-Z]+' less.txt
    egrep '^[[:upper:]]+' less.txt
      one or more upper case letters at beginning of line
    egrep '\<[a-z]*the\>' less.txt
      words ending in "the", like the, blithe, breathe

* Word games: What is this?
    egrep '^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$' \
      "C:/Program Files/WinEdt/Dict/English/eng_com.dic"
    egrep '^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z$' \
      "C:/Program Files/WinEdt/Dict/English/eng_com.dic" | egrep '...+'

* Neat problem: How many sentences are there in a book?
  Start with "." as proxy.
    egrep '\.' less.txt | head -100
      Visual clutter.
    egrep -o '\.' less.txt | head -100
      Not informative.
    egrep -o '.........\.' less.txt
      This IS informative.
    egrep -o '\.' less.txt | wc
      There are 3047 "."s altogether.
  But this includes other uses such as "e.g.". Exclude them by requiring
  a preceding word with more than two letters. Also permit "!" and "?"
  as sentence endings:
    egrep -o '\<[0-z]{2,}\>[.!?]' less.txt
  Now what were the non-sentence ending cases of "."?
    egrep '\<.\>\.' less.txt
  Problem: "don't" triggers "t" as a single word, ditto "'s" in "year's".
  Fix it by permitting "'" two positions before [.!?]:
    egrep -o '\<[0-z]*(\047|[0-z])[0-z]\>[.!?]' less.txt
    egrep -o '\<[0-z]*(\047|[0-z])[0-z]\>[.!?]' less.txt | wc
  Octal \047 is the value for a single quote '. This is one way to get
  a single quote quoted.

  IMPORTANT 1): According to the bash reference manual (3.5.8.1), the
  order of characters is locale-dependent (it could be aAbB...). To
  achieve the "traditional" locale-independent order, you must add the
  following line to your .bash_profile:
    export LC_ALL=C
  See the file "ascii-list" on the class website for the traditional
  order.

  IMPORTANT 2): It would be more cautious to use [:alnum:] instead of
  [0-z], again because of the locale-dependence of ASCII order. Here
  are the most useful POSIX range definitions:
    [:alnum:], [:alpha:], [:digit:], [:upper:], [:lower:],
    [:punct:], [:blank:], [:space:]
  Cumbersomely, these ranges have to be placed inside brackets again.
  Here is the above sentence finder with POSIX ranges:
    egrep -o '[[:alnum:]]*(\047|[[:alnum:]])[[:alnum:]][.!?]' less.txt

- Find comments in a C program:
    egrep '\/\*' e:/emacs/emacs-21.2/src/alloc.c
      Beginning of comments.
    egrep '\*/' e:/emacs/emacs-21.2/src/alloc.c
      End of comments.
    egrep '(/\*)|(\*/)' e:/emacs/emacs-21.2/src/alloc.c
      Either.
    egrep '/\*.*\*/' e:/emacs/emacs-21.2/src/alloc.c
      Both on the same line.
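The sentence finder above is worth keeping around. A sketch that
bundles it as a one-line script using a here document; the script name
"sentences" is made up:

  cat > sentences <<'EOF'
  egrep -o '\<[0-z]*(\047|[0-z])[0-z]\>[.!?]' "$@" | wc -l
  EOF
  sentences less.txt     # count of sentence endings in less.txt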
--------------
sed: STREAM EDITOR
--------------

* Invocations:
  . always writes to stdout
  . can read from stdin or from a named infile
  . can work off a script in a quoted string ('...', with -e option if
    multiple quoted scripts) or off a script in a scriptfile (-f option)
  That is, the possible forms of invocation are:
    sed 'script' infile > outfile
    ...| sed 'script' > outfile
    sed -f scriptfile infile > outfile
    ...| sed -f scriptfile > outfile

* Substitution: the most important application of sed.
    sed 's/.../.../g' (g="global", i.e., in whole line, optional)
  Examples:
    sed 's/[aeoui]/e/g; s/[AEOUI]/E/g' less.txt | head -100
      Replace all vowels with e, E.
    sed '12s/[aeoui]/e/g; s/[AEOUI]/E/g' less.txt | head -20
      Same in line 12.
    sed '5,15s/[aeoui]/e/g; s/[AEOUI]/E/g' less.txt | head -20
      Same from line 5 to line 15.
    sed '/\./s/[aeoui]/e/g; s/[AEOUI]/E/g' less.txt | head -20
      Same in lines with a period (needs quoting).
    sed -n '/\./s/[aeoui]/e/gp; s/[AEOUI]/E/gp' less.txt | head -20
      Same but print only when replacement happened.
      Ooops, this doubles lines! Solution:
    cat less.txt | grep '\.' | sed 's/[aeoui]/e/g; s/[AEOUI]/E/g' | head -20
      grep for "." first.
    sed -n 's/.*four.*/four/gp' less.txt
      Get "four" out for counting occurrences.
    egrep -o 'four' less.txt
      Ditto but simpler.
    sed -n 's/.*four.*/four/gp' less.txt | wc
      Do the counting.

* Printing subsets of lines: "p" (print) and "q" (quit after...).
  For "p" we need to suppress default printing with -n flag; not so
  for "q".
    sed -n '11p' less.txt
      Print line 11.
    sed -n '15,19p' less.txt
      Print lines 15 through 19. "-n" suppresses default line printing.
    sed -n '1,11p' less.txt
      Print up to line 11.
    sed -n '10,$p' less.txt | head -20
      Print from line 10 on. "$" means end of file.
    sed -n '/and/p' less.txt > foo1
      Print all lines containing "and". Same as:
    grep 'and' less.txt > foo2
      Check with "diff" (file differences):
    diff foo1 foo2
    sed '/and/q' less.txt
      Print and quit ("q") after the first "and".
    sed '/\./q' less.txt
    sed '/[.]/q' less.txt
      Both print up to the end of the first sentence.

* Deletion:
    sed '/^[ ]*$/d' less.txt | head -20
      Remove blank lines.
    sed '/^[*]$/d' less.txt | head -20
      Remove lines consisting of a single "*".

* Line numbers for pattern matches:
    sed -n '/good/=' less.txt
    sed -n '/good/= ; /good/p' less.txt | head

--------------
AN APPLICATION OF EDITING FILENAMES
--------------
###############  Think about the following application!  #######
Gingerly work your way up to copying/moving files wholesale!!

Make copies of all files with name "z.filename":
  ls
  for file in *; do echo $file; cp $file "z."$file ; done
  ls
Now we want to change the names "z.filename" to "a.filename", but first
we check that we get it right: feed the filename in "$file" to sed for
substitution and print.
  for file in z.*; do echo $file; echo `echo $file | sed 's/^z/a/'` ; done
We should use nicer presentation, and maybe use some temp variables.
This is more civilized:
  for old in z.*
  do
    new=`echo $old | sed 's/^z/a/'`
    echo "Will move: "$old" -> "$new
  done
Now we dare change names wholesale:
  for old in z.*; do echo $old; mv $old `echo $old | sed 's/^z/a/'` ; done
Or again more readable:
  for old in z.*
  do
    new=`echo $old | sed 's/^z/a/'`
    mv $old $new
    echo "Moved: "$old" -> "$new
  done
################################################################
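The renaming loop generalizes into a reusable sketch; put this in a
file, say "ren-sed" (the name is made up): the first argument is a sed
program, the remaining arguments are the files to rename.

  prog=$1
  shift
  for old in "$@"
  do
    new=`echo $old | sed "$prog"`
    if test ! $old = $new
    then
      mv $old $new
      echo "renamed \"$old\" -> \"$new\""
    fi
  done

The example above then reads:   ren-sed 's/^z/a/' z.*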
--------------
tr: CHARACTER TRANSLATION
--------------

* This is a simple filter, stdin -> stdout; no named arguments allowed.
* Good for letter-to-letter translation, squeezing repeats, deleting
  letters.
* Translate list to list: tr [...] [...]
    tr [aiouAIOU] [eeeeEEEE] < less.txt | head -20
      Our eE substitution exercise.
    tr [a-z] [A-Z] < less.txt | head -20
      Everything upper case.
    tr [:lower:] [:upper:] < less.txt | head -20
      Ditto, but internationally portable, locale-independent.
      See "man tr" and "info tr" for many other range definitions.
* Squeezing: tr -s '...' or tr -s '...' 'x'
    tr -s '\n' < less.txt | head -50
      Squeeze blank lines.
    tr -s ' \t' < less.txt | head -50
      Squeeze repeats of blanks and tabs.
    tr -s '[:blank:]' < less.txt | head -50
      Ditto, internationally portable.
    tr -s '[:space:][:punct:]' '\n' < less.txt | head -100
      Squeeze space and punctuation to <newline>s.
* Deletions: tr -d '...'
    tr -d [.] < less.txt | head -20
    tr -d \'\`- < less.txt | head -20
    tr -d [:punct:] < less.txt | head -20
      Get rid of punctuation.
    tr -d [:blank:] < less.txt | head -20
      Get rid of horizontal space. Not very successful.
    tr -d [:space:] < less.txt | tail -c 200
      Gets rid of ALL space, vertical also. Not very successful.

--------------
sort: TEXT AND NUMERIC SORT
--------------

* Sorts by whole line, priority left to right, byte by byte
* Priority: whitespace < numeric < capitals < lower case
  (Observes the traditional order for sure if "export LC_ALL=C" is in
  .bash_profile.)
* Flags:
  -r      reverse sort
  -k6     sort by key starting with field 6
  -u      unique (remove duplicates)
  -c      check whether the input is sorted (but do not sort)
  -n      numeric (interpret sort key as numeric, if numbers are not
          lined up by positions)
  -t SEP  use SEP as key separator
          (default: transition from nonwhite to whitespace)
  -b      ignore leading blanks
  -f      ignore case
  -i      ignore non-printing characters

* Sorting numeric matrices:
  This sort of matrix should really be reformatted (wait for gawk).
    sort foo.sort
      Sorts lines lexicographically.
    sort -n foo.sort
      Sorts first column numerically, which is what we want.
      But notice the inversion of " 4" and " 3.4" in column 2:
      lexicographic; blank or nothing precedes anything alphanumeric.
    sort -k1 -n -k2 -n -k3 -n foo.sort
      Finally the desired order: need to list each key/column with -n.
    sort -k3 -n -k2 -n -k1 -n foo.sort
      Sorts in reverse order of columns: 3rd first, 2nd second, 1st last.
      Not clear what happens when there are fewer than 3 entries in a
      line.

* Sorting the words of a book:
    cat less.txt | tr -s '[:punct:][:space:]' '\n' |
      tr [:upper:] [:lower:] > less.words
      Create a list of all words, one per line.
    sort -r less.words > less.sort
      Sort words of the book. Striking speed!
      It would be interesting to sort by word length first, but for
      this we need gawk...
    tail less.sort
      Need to remove digits and a blank line.
    mv less.sort foo
    cat foo | tr -d '[:digit:]' | tr -s '\n' | tail
      Check: First we remove digits, then blank lines they left behind.
    cat foo | tr -d '[:digit:]' | tr -s '\n' > less.sort
    tail less.sort
      Did it.

* Make your own American English dictionary:
    cd "C:/Program Files/WinEdt/Dict/English/"
    pwd
    sort eng_com.dic center.dic color.dic ize.dic labeled.dic yze.dic \
      > e:/buja/stat-540/my.dic
      Change the target path to suit you.
    cd e:/buja/stat-540
      Ditto.
    head -50 my.dic
      There are headers marked by "%". Weed them out:
    mv my.dic foo
    sed '/%/d' foo > my.dic
    wc my.dic
      This dictionary has 155415 words. English is a "large" language!
      Look at my.dic in emacs.
  Now it would be nice if my.dic could be sorted first according to
  word length. Not possible with current tools yet... Need gawk.
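The -t flag matters for files whose fields are separated by something
other than whitespace; a sketch using the passwd file seen earlier,
whose fields are ":"-separated:

  sort -t: -k1 /etc/passwd | head -5     # sort accounts by user name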
--------------
uniq: UNIQUE OCCURRENCES, ADJACENT DUPLICATES, COUNTING
--------------

* Call: uniq -flags file
  without flags redundant with "-u" in sort, but useful in addition are:
  -d  show duplicates
  -c  counts
* Assumes the file is sorted to actually get uniques.
  Otherwise it finds only adjacent duplicates.
* Examples:
    cat foo.uniq
      Look at the file.
    uniq foo.uniq
      Removed adjacent duplicate rows.
* Tallying the words in the book:
    uniq less.sort > less.uniq
      The list of unique sorted words in the book. Not very useful
      because "sort -u" can do that already.
    head less.uniq
    tail -50 less.uniq
    uniq -c less.sort > less.counts
      For each word we get a count of occurrences. Now that is useful!
    sort less.counts > less.sortedcounts
      Sort by counts.
    tail -100 less.sortedcounts
The more involved problem of counting words for each sentence is part
of homework 4.
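The whole tally can also be compressed into a single pipeline; a sketch
assuming less.txt as before:

  cat less.txt |
  tr -s '[:punct:][:space:]' '\n' |      # one word per line
  tr [:upper:] [:lower:] |               # fold case
  sort | uniq -c | sort -n | tail -10    # the ten most frequent words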
################################################################################
################################################################################

--------------
awk: A PATTERN MATCHING LANGUAGE WITH C-LIKE SYNTAX
--------------

* Some folks consider awk obsolete in light of Perl and Python, but:
  - awk is still unique in its ability to provide concise one-line
    filters on the shell level.
  - The awk syntax is pleasing: similar to C, but without declarations,
    in contrast to Perl with ugly shell-like syntax and Python with
    non-free-format syntax.
  - awk comes with every Unix/Linux system.
  - The Gnu-version of awk, "gawk", has some modern features added.
  - There exists a version of awk, called awka, which translates to C
    and compiles. This is potentially very fast.
  - Why awk in addition to sed? sed is faster, but awk is lots more
    expressive.

* awk is a pattern matching language that can be used for general
  purpose computing. Simplest examples:
    gawk '/abc/' foo.grep
    gawk '/abc$/' foo.grep
    gawk '/abc.$/' foo.grep
  This recreates egrep, with the familiar problem of line termination
  in DOS/Windows. The regular expressions are the same as in egrep.

* The default action of awk is to print a line if it is a match. But
  more general actions can be added in a "{...}" clause:
    gawk '/abc/ {print ++i " " $0;}' foo.grep
  Here a variable "i" initialized to zero came into existence. The
  variable "$0" denotes the current line. The three arguments ("++i",
  the space, and the line) are concatenated and printed with "\n".

* A number of default variables are provided:
    $0                 the current line
    NF                 the number of fields
    $1, $2, ..., $NF   the white-space-separated fields contained in
                       the current line
    NR                 the line (record) number
    FS                 field separator string (default: whitespace)
    RS                 record separator string (default: "\n")
    OFS                output field separator (!= FS)
    ORS                output record separator (!= RS)
  All can be set/redefined. In particular, FS and RS can be regular
  expressions!!!!!! Example: FS=":", RS="[\r\n]+"
  This feature is very important because it allows us to redefine the
  record structure of a file.
  Variables can also be set in the gawk call. Compare:
    gawk '{gsub("\r","___<>"); print $NF;}' < foo.awk
    gawk --assign RS="[\r\n]+" '{gsub("\r","___<>"); print $NF;}' < foo.awk
  or:
    gawk '/^[[:digit:]]$/' foo.awk
    gawk --assign RS="[\r\n]+" '/^[[:digit:]]$/' foo.awk
  This solves the carriage return problem on DOS-Windows machines!!!
  The carriage return character "\r" is now recognized as part of the
  record separator.

* Patterns can be missing:
    gawk '{print NR ": " $0;}' foo.awk

* Instead of clauses one can use conditions:
    gawk '$1 ~ /[0-9]/ {print NR ": " $0;}' foo.awk
    gawk 'NR >= 2 && $0 ~ /[[:alpha:]]/ {print NR ": " $0;}' foo.awk
    gawk '{ if( NR >= 2 && $0 ~ /[[:alpha:]]/ ) { print NR ": " $0;} }' foo.awk

* There can be BEGIN and END actions for initializations and
  terminations:
    gawk 'BEGIN{RS="a";} {print "<<<" $0 ">>>\n";}' foo.awk
    gawk '{print;} END{print "---------\nNumber of lines = " NR;}' foo.awk
  In fact, either can be a whole program. For example:
    gawk 'BEGIN{ fact=1; for(i=1;i<=10;i++) {fact=fact*i; print "2^" i "=" 2^i " " i "!=" fact; }}'
  This example does not even require an input file because input is
  opened only after the BEGIN clause is executed.

* Arbitrary numbers of clauses/actions can be listed in a single awk
  program and spread over multiple lines:
    cat foo.awk | gawk '/^1/ { print $1; }  /j.$/ { print; }'
  (Recall the CR-LF problem in Windows that requires ".$" for end of
  line.)

* Printing: three forms of printing, the last is formatted printing,
  C-like.
    gawk '/l.*k/ {print ++i " " NR " " $0;}' foo.awk
    gawk '/l.*k/ {print("Item:", ++i, ": Line: ", NR, " ", $0);}' foo.awk
    gawk '/l.*k/ {printf("Item: %2d: Line: %2d %s\n", ++i, NR, $0);}' foo.awk

* Short course in format statements: Take the last gawk statement
  above: the first string is the "format string", containing clauses of
  the form "%2d" or "%s". These are "formats" which get filled in by
  the following arguments to the print function. Rule: there must be as
  many arguments following a format string as there are "%"-formats.
  Meaning of the formats:
    %s     a string
    %5s    a string of length 5
    %d     integer
    %3d    integer of length 3
    %f     float
    %8.3f  float with 8 digits altogether, 3 after the period
    %e     float in exponential format
  Examples:
    gawk 'BEGIN{ printf("1st: %d 2nd: %f 3rd: %10.2f 4th: %10.3E\n", 123, 1.23E+3, -12.34567, 3000.); }'
    gawk 'BEGIN{ printf("1st item: %d 2nd item: %f\n", 123, 1.23E+3); }'

* One can also put the code in an awk script file (w/o quotes of
  course):
    echo '/^1/ { print $1; }' > scr.awk
    echo '/j.$/ { print; }' >> scr.awk
    cat scr.awk               # check content
    awk -f scr.awk foo.awk    # use scr.awk as a script and apply to foo.awk

* Or use an awk script as a shell command:
    echo "awk '" > sh.awk
    echo '/^1/ { print $1; }' >> sh.awk
    echo '/j.$/ { print; }' >> sh.awk
    echo "' \"\$@\"" >> sh.awk
    cat sh.awk                # check content
    sh.awk foo.awk            # apply sh.awk to foo.awk

* Useful one-liners from Aho-Kernighan-Weinberger (the authors of awk):
    gawk 'END {print NR;}' foo.awk
    gawk 'NR==2' foo.awk
    gawk '{print $NF;}' foo.awk
    gawk '{field=$NF} END{print field;}' foo.awk
    gawk 'NF > 2' foo.awk
    gawk '$1 ~ /^[0-9]+.$/ ' foo.awk
    gawk '/[0-9]/ {n=n+1} END{print n}' foo.awk

* There are built-in functions such as "gsub" for global substitution
  in a string (default: $0) and "length" for string length:
    gawk '{ print ; gsub("^al","zz",$1); print " -->> "$0;}' foo.awk
    gawk 'length($0) > 10' foo.awk
  "gsub" can substitute fixed strings for regular expressions, as in
  this example: "zz" is substituted only if "al" is at the beginning of
  the line.
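With gawk in hand, the word-length sort promised in the sort section
becomes a one-liner, and gsub doubles as a counter because it returns
the number of substitutions it made. Two sketches, assuming my.dic and
less.txt from above:

  gawk '{ print length($0), $0; }' my.dic | sort -n | gawk '{ print $2; }' | tail
    # prefix each word with its length, sort numerically, strip the
    # length again; tail shows the longest words
  gawk '{ n += gsub("four","four"); } END{ print n " occurrences of \"four\""; }' less.txt
    # substituting "four" by itself changes nothing but counts matches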
* Further examples:

- Convert time of day to seconds of the day:
    echo '9:20:30pm' | gawk 'BEGIN{ FS=":"; } \
      /am/ { gsub("am","",$3); print $1*3600 + $2*60 + $3; } \
      /pm/ { gsub("pm","",$3); print ($1+12)*3600 + $2*60 + $3; } '

- Convert date to day of the year: put the following code in a file
  "day-of-yr":
    gawk 'BEGIN{ jan=0; feb=jan+31; mar=feb+28; apr=mar+31;
                 may=apr+30; jun=may+31; jul=jun+30; aug=jul+31;
                 sep=aug+31; oct=sep+30; nov=oct+31; dec=nov+30; }
    { gsub(",",""); }
    { if($3%4==0) { leap=1; } else { leap=0; } }
    { if(length($1)>=3) { mon=$1; day=$2; } else { mon=$2; day=$1; } }
    { printf("Year="$3" Month="mon" Day="day" Day of Year="); }
    /Jan|jan/ { print jan+day; }
    /Feb|feb/ { print feb+day; }
    /Mar|mar/ { print mar+day+leap; }
    /Apr|apr/ { print apr+day+leap; }
    /May|may/ { print may+day+leap; }
    /Jun|jun/ { print jun+day+leap; }
    /Jul|jul/ { print jul+day+leap; }
    /Aug|aug/ { print aug+day+leap; }
    /Sep|sep/ { print sep+day+leap; }
    /Oct|oct/ { print oct+day+leap; }
    /Nov|nov/ { print nov+day+leap; }
    /Dec|dec/ { print dec+day+leap; }
    ' "$@"
  Then use it in examples such as:
    echo '02 Mar 2002' | day-of-yr     # British
    echo 'March 31, 00' | day-of-yr    # American
  Or put two dates in a file "tmp" and pass it to "day-of-yr":
    echo 'March 31, 00' > tmp; echo '02 Mar 2002' >> tmp; day-of-yr tmp

- Second of day at beginning of script:
    gawk 'BEGIN{"date" | getline; print;}'
    gawk 'BEGIN{"date" | getline; print $4;}'
    gawk 'BEGIN{"date" | getline; split($4,arr,":");
                print arr[1]*3600+arr[2]*60+arr[3]}'
  We used here a command pipe "date" to place a text line in $0.
  IMPORTANT: If you want to use a system command twice in such a pipe,
  you must first close the pipe after the first use!!!
  Example: time before and after 10^7 runs through a loop.
    gawk 'BEGIN{
      "date" | getline; split($4,arr0,":"); close("date");
      print $0 > "/dev/stderr";
      for(i=1; i<=10000000; i++) {
        if(i%1000000==0) print i > "/dev/stderr"; }
      "date" | getline; split($4,arr1,":"); close("date");
      print $0 > "/dev/stderr"; }'
  We wrote to "stderr" because "stdout" seems to be buffered and shows
  up only after termination. That is not useful if one wants to learn
  about the progress of a program.
  [Remark after teaching: there seems to exist a function in gawk that
   did not exist in previous awks -- systime() -- which returns the
   seconds since 1970. Use this in homework 5 instead of the date pipe
   above!]
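Following up on the remark, a timing sketch with systime() (a gawk
extension; seconds since 1970):

  gawk 'BEGIN{ t0=systime();
               for(i=1; i<=10000000; i++) x++;
               print "elapsed: " (systime()-t0) " seconds"; }'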
The file "35_xref.txt" lists - for each initial bidder (many dropped out early) - the FCC code (used in the data files), - the FCC account (they had to deposit sizable $s to buy "eligibility"), - the official name (often a cover for a large company), and - the "small business and minority credit" if any (a boondoggle). Download this file with the (extremely useful) command "wget", but beforehand some preparations (change the paths for your machine): mkdir e:/buja/stat-540/Auction/ cd e:/buja/stat-540/Auction/ We will be using this URL a lot, so we store it in a shell variable: BASEURL='http://wtbwww13.fcc.gov/PCS/Broadband/BTA/Auction_35/Results/txt/' wget $BASEURL"35_xref.txt" We used the command "wget" to download the file. It allows us to do in programs what we usually do with clicks in a browser. Look at "35_xref.txt" in emacs; remove the top line. The company names have spaces; replace them with underscores to see what is there. You can do that in emacs or with "tr": cat 35_xref.txt | gawk 'NR!=1' | tr " " "_" > tmp mv tmp 35_xref.txt Be sure not to overwrite the same file with tr! Now the stats: in emacs move to the last line and read off the line number, or wc 35_xref.txt 88 auctioneers! Most of them small, though. Their initial bidding power can be found in the initial "eligibility" file: click -> T001_051 -> 35_001e.txt Column 4, called "max_elig", is the initial number of households for which the bidder bought permission to bid. As the bidding rounds go by, eligibility may drop because the FCC requires that if a bidder bids below 80% of eligibility, eligibility will be decreased (except if the bidder takes one of a max of 5 waivers). The bidding fraction is raised by the FCC to 90%, 95%,... as the rounds go on to force the bidders to show their true intent and true willingness to pay up. This avoids "pouncing" big hitters. (Other rules: only FCC-prescribed bidding amounts are permissible, reset by the FCC after every round, to avoid signalling with the low digits.) * First task: make eligibility files human readable by replacing the columns "bidder_num" and "fcc_acct" with the bidder name. Download first an eligibility file for experiments: wget $BASEURL"T001_050/35_001e.txt" wc 35_001e.txt A new problem: we need to process two files in gawk, "35_xref.txt" and "35_001e.txt" and use the names in the former to adorn the latter. It appears that the bidders are listed in the same order in both files, but this would not work in the bid files (for example, "35_001s.txt"). Instead, we use a programming technique that is independent of the ordering of the files, and which would also work for the bid files. We read the labels in "35_xref.txt" into an "associative array", consisting of keys and values, the keys being the bidder numbers, the values being the bidder names. Then we can look up bidder names from their bidder numbers. We build the code up step by step. First we use "getline" for input from an arbitrary file: gawk 'BEGIN{ while ("cat 35_xref.txt" | getline) { print $0; } }' The "while" loop tests for open the presents of records for "getline", which returns +1 if a record is present, 0 if not. Next we stick two columns of "35_xref.txt" into associative arrays as follows: - The bidder number ($1) is the key. - The labels ($3) and the credit ($4) are the values of two arrays. 
gawk 'BEGIN{ while ("cat 35_xref.txt" | getline) { bnum[$1]=$1; blabel[$1]=$3; bcredit[$1]=$4; } for(bkey in bnum) { printf("%5d %2d %s\n", bkey, bcredit[bkey], blabel[bkey]); } }' The associative arrays jump into existence without declaration. What makes them "associative" is the fact that we can use arbitrary indices or "keys" to store and access elements, not just integers. But there might be a tedious detail if you have not take care of it yet: the carriage return problem. Seemingly bkey and bcredit are not printed because bcredit was at the end of the line and inherited the carriage return character, which moved blabel to the left and overwrote bkey and bcredit... Re-execute the gawk code above: it will work. It worked before, but the CR created misleading output. Now that we know how to create the associative arrays indexed by bidder_num, we can reformat the eligibility file 35_001e.txt: cat 35_001e.txt | gawk ' BEGIN{ while ("cat 35_xref.txt" | getline) { blabel[$1]=$3; bcredit[$1]=$4; } } { if(NR==1) { printf("max_elig req_act cred Company\n") } else { printf("%10d %10d %2d %s\n", $4, $8, bcredit[$2], blabel[$2]); } }' This is worth saving: cat 35_001e.txt | gawk ' BEGIN{ while ("cat 35_xref.txt" | getline) { blabel[$1]=$3; bcredit[$1]=$4; } } { if(NR==1) { printf("max_elig req_act cred Company\n") } else { printf("%10d %10d %2d %s\n", $4, $8, bcredit[$2], blabel[$2]); } } ' > 35_001e_massaged.txt Look at the file in various orders: sort -k4 35_001e_massaged.txt # alphabetically by company name sort 35_001e_massaged.txt # by size, i.e., purchased eligibility In reality "Salmon_PCS" is Sprint PCS under the cover of an NW Indian tribe to take advantage of minority credits (25%), and similarly "Alaska_Native_Wireless" is AT&T Wireless under cover of an Alaskan Inuit tribe. By comparison, Voicestream and Verizon are paying up without cover. In the end, it was Verizon who came out strongest by far. We must now massage all the eligibility files. First some wholesale downloads: for fn in 35_0{0,1,2,3,4,5}{0,1,2,3,4,5,6,7,8,9}e.txt; do echo $fn; done # testing... for fn in 35_0{0,1,2,3,4,5}{0,1,2,3,4,5,6,7,8,9}e.txt; do wget $BASEURL"T001_050/"$fn; done for fn in 35_0{5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}e.txt; do wget $BASEURL"T051_100/"$fn; done wget $BASEURL"T051_100/35_100e.txt" # missed this one in the above enumeration wget $BASEURL"35_101e.txt" # the last one is in the top directory ls 35_*e.txt # check ls 35_*e.txt | wc The loops run over some non-existing files, such as 35_000e.txt, but that is not a problem: wget gives an error message but the loop continues. Now we need to bundle the gawk code in scripts to work on all files: for fn in 35_*e.txt do fnroot=`echo $fn | gawk '{ split($0,fnsplit,"."); print fnsplit[1]; }'` echo $fnroot # the root part of the file name, extension ".txt" removed cat $fn | gawk ' BEGIN{ while ("cat 35_xref.txt" | getline) { bnum[$1]=$1; blabel[$1]=$3; bcredit[$1]=$4; } } { if(NR==1) { printf("max_elig req_act cred Company\n") } else { printf("%10d %10d %2d %s\n", $4, $8, bcredit[$2], blabel[$2]); } } ' > $fnroot"_massaged.txt" done By the end of the auction, only few big hitters remained (35), as the zero eligibility numbers of the final round show: sort 35_101e_massaged.txt This is the end of the case study. In the real thing, all four types of files had to be downloaded and reformatted a couple of times a day: eligibility files, bid files, high-bid files, and waiver files. 
After downloading and reformatting, the files were read into Splus/R and
subjected to about 400 lines of code to turn them into visual summaries of
the history of overall activity, of bids per bidder, and of high-bids per
license.

--------------------------------------------------
Reminders:

- Emacs: if you are not using them yet, recall some conveniences from ".emacs"
  . whipping one line to the other window with "c-x c" in any mode,
  . shrinking one of two or more windows with "a-o".
- Emacs: "a-x re-builder" allows you to build and debug regular expressions.
  (A limitation: the curly quantifiers such as "{3,4}" do not work.)
- Emacs: To learn about what there is in Emacs, it helps to browse the list
  of key bindings, by way of the Help menu in the top bar of the emacs
  window, or by hitting "c-h b".
- Shell: Bash recognizes an initialization file in which useful things can
  be placed.  The file is .bash_profile and it should be placed in the home
  directory.  Here are two of its uses:
- Shell: adding to PATH by adding statements like
    PATH=/cygdrive/e/emacs/bin/:$PATH
  Cygwin intelligently provides "/cygdrive/e" as an equivalent for "e:"
  because of the conflict with the Unix-style field separator ":" in PATH.
  Recall Win2K uses the separator ";" in
    My Computer -> Properties -> Advanced -> Environment Variables
- Gawk: solve the carriage return problem with the following in .bash_profile:
    alias gawk='gawk --re-interval --assign RS="[\r]*[\n]"'
  Two things are remarkable here:
  . As for the shell: an alias does not recurse in self-substitutions.
    One can redefine, e.g.,
      alias ls='ls -l'
      alias rm='rm -i'
    which are indeed two popular examples.
  . As for gawk: The "--assign" flag permits definition of variables outside
    the code.  We redefined the record separator "RS" to be a regular
    expression (!!!) that recognizes any number of carriage returns followed
    by one newline.  By allowing regular expressions in "RS", "FS", one has
    great freedom in chopping up files in different ways.

------------------  CASE STUDY: SOCIAL NETWORK DATA  --------------------

# Examining "Marketing_Science.txt", which was saved from "Marketing Science.xls":
# You can download this file from the usual place.
# Because these data are intellectual property of my wife
# (Nermin Eyuboglu at Baruch College, CUNY),
# I don't want to see any anchors pointing to this data.
# The data contain co-authorships of articles published between 1981 and 2001
# in the journal "Marketing Science".
#
# Our task is to filter the data into a form that gives us
# information about the structure of the social network formed
# by co-authorship.  We will get as far as determining all connected
# components in the network.
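# As a preview of where we are headed, here is the connected-component idea
# in miniature, sketched in Python (a language introduced later in these
# notes).  The tiny pair list and the function name are made up for
# illustration; the real computation is developed below, step by step, in gawk.

pairs = [("A","B"), ("B","C"), ("D","E")]       # made-up co-author pairs

def component(seed, pairs):
    # grow "tree rings" outward from the seed author (breadth-first)
    edges = pairs + [(b,a) for (a,b) in pairs]  # symmetrize the pair list
    ring = {seed: 1}                            # author -> ring number
    r = 1; grew = 1
    while grew:
        grew = 0
        for (x,y) in edges:
            if ring.get(x) == r and y not in ring:
                ring[y] = r+1; grew = 1
        r = r+1
    return ring

print component("A", pairs)                     # rings: A=1, B=2, C=3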
# Header from "Marketing_Science_bang.txt", one field per line: Journal | Volume| #| Year| Page # | Article | Subj.Code| First Name | Last Name | Affiliations | PhD | # Examine one field at a time: # i=2: volumes 1-20 # i=3: 4 numbers per volume # i=4: years 1982-2001 # i=6: articles; if sorted by frequency one sees the articles with the most co-authors i=6; cat Marketing_Science.txt | gawk --assign FS="\t" --assign i=$i '{print $i}' | sort | uniq -c # last names: cat Marketing_Science.txt | gawk --assign FS="\t" '{print $8}' | sort | uniq -c | sort # with first names: see, e.g., Schmittlein, Bradlow cat Marketing_Science.txt | gawk --assign FS="\t" '{printf("%s, %s \n", $9, $8);}' | sort | uniq -c | sort | wc # with affiliations: shows a problem -- multiple ways of writing affiliations cat Marketing_Science.txt | gawk --assign FS="\t" '{printf("%s, %s (%s) \n", $9, $8, $10);}' | sort | uniq -c | sort # Plan: # 0) check consistency of author names # 1) for each article, collect the list of authors on a line, one line per article # 2) create an array that lists pairs of co-authors and their frequencies # 3) extract list of authors: these are "co"-authors, # i.e., excluding those who are exclusively single-authors # 4) extract the social network of a given author: # his/her connected component in the graph of co-authorship # 5) Determine the sizes of all connected components # 0) check cleanliness of author names cat Marketing_Science.txt | gawk --assign FS="\t" '{print $9 $8}' | sort -u # need to eliminate trailing blanks: cat Marketing_Science.txt | gawk --assign FS="\t" '{gsub(" $","",$8); print $9, $8}' | sort -u # still a problem with first names and middle initials; leave as is for now and clean up later; # continue as if data were clean; rerun the code later on clean data # 1) for each article, collect the authors on a line, # with trailing blank removal, and blanks mapped to "_": cat Marketing_Science.txt | gawk ' BEGIN{ FS="\t"; } $6 !="" && NR!=1 { this_article=$6; if(old_article!="") { if(this_article!=old_article){ printf("\n"); } else { printf(" "); } } gsub(" $","",$8); gsub(" ","_",$8); gsub(" ","_",$9); printf("%s,_%s", $9, $8); old_article=this_article; } END {printf("\n")}' > co-authors.txt # 2) create an array that lists pairs of co-authors and their frequencies: # preserve symmetry by including both pairs (x,y) and (y,x) cat co-authors.txt | gawk ' { for(i=1; i<=NF; i++) for(j=1; j<=i; j++) { key=$i" "$j; freq[key]++; key=$j" "$i; freq[key]++; } } END{ for(key in freq) print(key,freq[key]); } ' > co-author-pairs.txt # 3) extract list of authors: these are "co"-authors, # i.e., excluding those who are exclusively single-authors cat co-author-pairs.txt | gawk '{print $1; print $2;}' | sort -u > author-list.txt # by comparison, a list including single authors is: cat co-authors.txt | gawk '{for(i=1; i<=NF; i++) print $i; }' | sort -u > author-list-all.txt # 4) extract the social network of a given author: # his/her connected component in the graph of co-authorship gawk ' BEGIN{ arr["Schmittlein,_David_C."]=1; while("cat co-author-pairs.txt" | getline) { i++; list1[i]=$1; list2[i]=$2; }; do { grew=0; ring++; for(i in list1) { if(arr[list1[i]]==ring && arr[list2[i]]==0) { arr[list2[i]]=ring+1; grew=1; } } } while(grew==1); for(key in arr) { if(arr[key]>0) {print key, arr[key]; } } }' | sort > gaga.txt # 317 authors in this component # # another approach, with "co-authors.txt": gawk 'BEGIN{ arr["Schmittlein,_David_C."]=1; while("cat co-authors.txt" | getline) { N++; 
    coauthors[N]=$0; }
  do {
    grew=0; ring++;
    for(i=1; i<=N; i++) {
      split(coauthors[i], ca)
      for(j in ca) {
        if(arr[ca[j]]==ring) {
          for(k in ca) {
            if(ca[k]!=ca[j] && arr[ca[k]]==0) { arr[ca[k]]=ring+1; grew=1; } } } } }
  } while(grew==1)
  for(key in arr) { if(arr[key]>0) { print key, arr[key]; } } }' | sort > gagag.txt
# again 317 authors in this component
# it's the same list:
diff gaga.txt gagag.txt
# Even the "tree rings" are the same in both algorithms.
# Why?  Following a sharp suggestion by Juntian and Peng, I modified the
# algorithms so they would grow only from previous tree-layers,
# not from elements of the layer that's currently being added.
# The latter was the case when we used the test
#    "if(arr[ca[j]]>0)"
# instead of
#    "if(arr[ca[j]]==ring)"
# This introduced a dependence on the order of the inputs.
# The new algorithm avoids this dependence.
#
# The algorithm is an example of "breadth-first search", meaning,
# we exhaust all possibilities of going one step out from the current bag.
# In "depth-first search" we'd follow the descendents of every acquired
# element right away.
# Depth-first search lends itself to recursion; it produces a "forest".
# Breadth-first search has the advantage of producing systematic "tree rings",
# i.e., it measures distance from the initial starting point.
#
# A good (although still emerging) resource:
# "Dictionary of algorithms and data structures" ("dads")
#    www.nist.gov/dads/

# 5) Determine the sizes of all connected components:
gawk 'BEGIN{
  while("cat author-list-all.txt" | getline) { N++; auth[N]=$0; };
  while("cat co-author-pairs.txt" | getline) { i++; list1[i]=$1; list2[i]=$2; };
  while(1==1) {
    seed="";
    for(i=1; i<=N; i++) { if(arr[auth[i]]==0) { seed=auth[i]; } }
    if(seed=="") { break; }
    else { arr[seed]=1; print seed; Mprev=M; M++; }
    do {
      grew=0;
      for(i in list1) {
        if(arr[list1[i]]>0 && arr[list2[i]]==0) {
          arr[list2[i]]=arr[list1[i]]+1; grew=1; M++; print list2[i]; } }
    } while(grew==1);
    print M-Mprev" <<<<<<<<<<< component size"; } }' > MS-components.txt

# "histogram" of component sizes:
cat MS-components.txt | gawk '/<<</ { print $1; }' | sort -n | uniq -c

----------------------------  PYTHON  ----------------------------

# Start the interpreter by typing "python" in a shell, and try a few
# expressions; note the floating point surprises:
1>=0.5
1!=1
1.0000000000000001==1    # true: a double carries only ~16 significant digits
1.000000000000001==1     # false: this difference is still representable
1E15==(1E15+1)
1E16==(1E16+1)
(2.)**52==(2.**52+1)
(2.)**53==(2.**53+1)     # true: 2**53+1 is not representable as a double
1J*1J==-1                # complex numbers are built in

i=1
while(i<=35): print [i, 2**i]; i+=1

# The above requires a blank line to get going.
# The standard is indentation, though.  Python
# is NOT a free-form language.  It tries to impose
# rules of readable writing by indenting blocks
# that would be in braces "{}" in C and Gawk:

i=1
while(i<=35):
    print [i, 2**i]
    i+=1

# Again this requires termination with a blank line.
# Here is an if-statement:
# Note how python mode helps out with indentations
# by knowing the reserved keywords.

x=10
if x < 0:
    print 'negative'
elif x == 0:
    print 'zero'
else:
    print 'positive'

if x < 0:
    print 'negative'
else:
    if x == 0:
        print 'zero'
    else:
        print 'positive'

# Yet again: need a blank line to terminate the "else" block.
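# A small combined example (made up here) to rehearse "while" with "if/elif":
# classify the integers 1 through 15 by divisibility.

i = 1
while i <= 15:
    if i % 15 == 0:
        print i, 'divisible by 3 and 5'
    elif i % 3 == 0:
        print i, 'divisible by 3'
    elif i % 5 == 0:
        print i, 'divisible by 5'
    else:
        print i, 'divisible by neither'
    i += 1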
# Here are some "for" loops: for x in ['cat','dog','mouse','snake','supercalifragilistikexpialidocious']: print x,"has",len(x),"characters" for i in range(1,10): print i, 2**i # This brings us to data structures: ################ # - Strings, i.e., sequences of characters # inside "...", '...', """...""" "abc" 'abc' "'a'b'c" '"a"b"c"' """ a multi-line string""" ################ # Lists: # elements can be anything; # elements can be assigned with zero-based addressing, C-style a=[1,2,3] a [a,4,5,6] [a,[a,a],[[a]]] a[0]=-1 a[3]=10 a[2]=10 a ################ # Dictionaries: # hash tables or associative arrays # indexing with arbitrary keys h={'one':1, 'two':2, 'three':3} h h['two'] h['five'] h['four']=4 h h[1000]='thousand' h h[1000] h[2**100]='very long integer...' h h[1.2E3]='float' h h[1.2E3] h[1+1j]='complex' h h[1+1j] # What about this: h[[1,2,3]]='list' # Doesn't work. So what keys are permitted? # Answer: "immutable" objects. # These are: strings, numbers of any type, and "tuples"... ################ # Tuples: # "immutable lists" # formed with parens or nothing t=(1,3,4) t t=1,3,5 t=2 t[1] t[0] t[1]=2 # This is what makes it immutable: assignment of elements # changing length are not possible. One can re-assign, though: t=(1,3,2) t # Immutability makes it usable as a key in a dictionary: h[t]='a tuple...' h h[t] # Therefore dictionaries an be used for "sparse matrix storage": m={(1,2):1.2, (2,3):2.3, (1000,501):1000.501} m[(1,2)] m[1,2] m[2,3] ################ # The "atom problem" of tuples: # a syntactical confusion caused by the use of parens for tuples. # Parens are also used as expression delimiters, as in (1+2)*3 (1<=2) & (3<4) # Hence (1) # does not make a tuple with one element but simply the number 1. # Syntactic solution provided by the author of Python: append a comma 1, (1,) # However, an empty pair of parens makes an empty tuple: () a=(); len(a) # The "atom problem" does not exist for lists and strings: [1] "1" ################################ # Strings, lists and tuples are "sequence" types, and they share # access methods: indexing and slicing # Running examples: s='12345' l=[1,2,3,4,5] t=(1,2,3,4,5) # --- C-style indexing, zero-based --- s[0] # first element l[0] t[0] s[2] # third element l[2] t[2] s[-2] # second element from the end (this is beyond C-style, though) l[-2] t[-2] # --- Slicing: s[i:j] --- # # access by intervals of indices, addressed by "fence posts" # example: s = "abcde" # # s = | a | b | c | d | e | # fence posts addr: 0 1 2 3 4 5 # s[2:4] = | c | d | # # equivalence: s[i] = s[i:(i+1)] # i.e., single indexing fetches the element to the right of the fence post # # reverse fence post addressing: # # s = | a | b | c | d | e | # reverse addr: -5 -4 -3 -2 -1 # s[-3:-1] = | c | d | # # empty strings: s[i:i], i.e., both ends point to the same fence post. # # # Warning: s[3:1] = '' # for reversing a string, use a loop to produce, # or a detour via mutable lists that offer the "reverse" method: s = "abcde" tmp=list(s); tmp.reverse(); from string import join; join(tmp,sep="") # # WARNING: -0 = 0 = left-most !!!! << | # Thus s[-3:-0] produces an empty string, not the tail from -3. 
#
# mixed addressing permitted:
#   s[2:-1] = s[-3:4] = s[2:4] = s[-3:-1]
#
# heads and tails: s[:i], s[i:]
#   s =              | a | b | c | d | e |
#                    0   1   2   3   4   5
#                   -5  -4  -3  -2  -1
#   s[:3] = s[:-2] = | a | b | c |
#   s[3:] = s[-2:] =             | d | e |
#
# Examples:
s[:3]
l[:3]
t[:3]
s[1:3]
l[1:3]
t[1:3]
s[1:-2]
l[1:-2]
t[1:-2]
s[-2:]
l[-2:]
t[-2:]
#
# --- Assignment to indices and slices ---
#
# - Assignment into lists is permitted because they are "mutable".
# - Assignment into strings and tuples is not permitted because they are "immutable".
#
# The semantics of single index and slice assignment differ:
# - Single index assignment replaces an ELEMENT with an OBJECT.
# - Slice assignment replaces a SLICE with a LIST.
# Consequences:
# - Single index assignment preserves the length.
# - Slice assignment can increase or decrease the length.
#
# Examples:
l = [1,2,3,4,5]
# Single index:
l[2] = [10,11,("a",20)]
l
# Slice:
l[2:3] = [10,11,("a",20)]
l
# Contraction:
l[2:5] = []
l
# Insertion:
l[2:2] = [33]
l
# Expansion:
l[2:3] = [33,34]
l
# Emptying out a list without changing its location:
l[0:len(l)] = []
l
# With index and slice assignment, other structures and symbols
# with shared reference keep seeing the reference of l.
# With reassignment of l, sharing of reference is lost.
# (This is a rather deep issue of reference and the identity of data structures.)
# Compare:
l = [1,2,3,4,5]; l2= l   # l and l2 share reference.
l = []                   # Reassignment of l
l2                       # We're losing the sharing of reference between l and l2.
l = [1,2,3,4,5]; l2= l   # One more time: l and l2 share reference.
l[0:len(l)] = []         # Slice assignment of l.
l2                       # We're preserving the sharing of reference between l and l2.
# See below the "is" and "is not" comparison for mutable data structures.
# It checks physical identity in memory.

################################
# Sequence unpacking:
# nothing deep, just a convenience, when the length of a sequence is known.
s='12345'
a,b,c,d,e = s   # Unpack a string.
a
l=[1,2,3,4,5]
a,b,c,d,e = l   # Unpack a list.
b
t=1,2,3,4,5
a,b,c,d,e = t   # Unpack a tuple.
c
# Obviously, the length of the sequence must match the number
# of variables.

################################
# Comparison of sequences:
# - lexicographic (1st element, 2nd element, 3rd element...)
# - numbers < lists < strings < tuples (a Python convention)
# - empty sequence < non-empty sequence of the same type
1 < "A"
"A" < "a"
1 < []
[] < ""
[] < "a"
"" < ()
"a" < (1,2)
[1,2] <= [1,2,3]
[1,2,3] <= [1,2]
(1,2) <= (1,2,3)
(1,2,3) <= (1,2)
[1,2] == [1,2]
[1,2] < (1,2)
[[1,2],3] >= [1,2,3]   # true
[[1,2],3] <= [1,2,3]   # false ... why?
[1] < [[1]]
[] < [1]
"" < "1"
[1,2] < [[1],2]
# Comparison with "==" does NOT check identity.
# Instead, it checks isomorphism of the structure and identity of the leaves.
# Example:
a = [1,"ab",(2,3),[[[4]]]]
b = a
c = [1,"ab",(2,3),[[[4]]]]
a==b   # true
a==c   # true, too!!
# a, b, c are equal in the sense of isomorphism and identity of leaves
# In order to check identity of reference, one needs a stronger comparison:

################################
# Identity comparisons: "is" and "is not"
#
# Check whether data structures are identical
# in the sense that they occupy the same memory locations.
a=[1,2]; b=a; c=[1,2]
a is b       # true
a is not c   # true
a=(1,2); b=a; c=(1,2);
a is b       # true
a is not c   # true
a={"one":1, "two":2}; b=a; c={"one":1, "two":2}
a is b       # true
a is not c   # true
# Unfortunately, strings do not behave this way.
# For strings, "is" is the same as "==" "ab" is "ab" # true # For integers and floats, identity is the same as "==": -10 is -10 # true 8L is 8L # true 1E10 is 1E10 # true -10 is not -10L # true: type difference 10 is not 1E1 # true: dito # Strangely, this is not so for complex numbers: 1+2j is not 1+2j # true, why???? # It would have been more consistent if identity comparison with "is" # behaved the same for all data types, including strings, integers and floats. # # However, the real use of "is" and "is not" is for MUTABLE data structures. # Avoid "is" and "is not" for data structures other than lists and dictionaries. ################################ # Discuss "Quick Reference": # - "Operations on all sequence types (lists, tuples, strings)" # - "Operations on mutable (=modifiable) sequences (lists)" # - "Operations on mappings (dictionaries)" # Examples that came up in the discussion: (2 or 12) (12 or 2) ([] or {}) # Concatenation: [1,2]+[] # Repetition: [1,2]*3 "lingfeng"*3 # Re-assignment of list elements a=[1,2,3] a[2]=a[2]+1 # sensible; does not trample a[2] inadvertently a # min and max follow the comparison rules for "<" and "<=" min("abc") min([1,2,3]) min([[],()]) # append and extend for lists are just syntactic sugar for slice assignment: a.append(3) # add one object at the end of the list a.extend([3]) # add a list at the end # sort: IN PLACE!!! does not return anything a=[2,3,0,1] a.sort() a a=a.sort() # This trashes a!!! # copying dictionaries: a={"l":[1,[1]], "i":2, "c":1+1j, "d":{1:"a",2:"b",3:"d"}, "t":(1,2,[3,4]), "f":1E10} a b = a c=a.copy() c c==a # true c is a # false b is a # true a.clear() # in place! a is b # still true c == a # no longer true ################################################ # Functions: # Defining a function is not a compiler directive (like in C), # but a runtime executable statement # Example: def add3(a,b,c): x = a+b+c return x # There are two ways of calling a function: # 1) by position: add3(10,1,11) # 2) by keyword: add3(b=1,c=-1,a=0) # There are some rules regarding calls with mixed positional and keyword args: # 3) positional arguments have to appear BEFORE all keyword arguments, # which implies that only sequences of initial arguments can be positional: add3(1,c=-1,b=0) add3(1,0,c=-1) add3(1,2,b=2) # incorrect, causes TypeError #### # IMPORTANT: # Mutable arguments are passed "by reference", # but numbers are passed "by value". # Passing a mutable data structure as argument into a function # can therefore affect the data structure with "side effects". # Example: def nix1(a): a[0:1]=[] a = [1,2,3] nix1(a) # Returns nothing but has a side effect on a. a # But trying to affect a single number does not work: def nix0(x): x=0 a = 2 nix0(a) a # Remains unaffected. # Therefore, if side effects are wanted, do NOT pass numbers as arguments, # pass only mutable data structures. [Thanks to Juntian for raising the issue.] # On the other hand, beware of unintended side effects!! #### # An important issue in the semantics of functions are "scoping rules", # i.e., where are variables found. def fun1(x): y=3 return(x+y+z) # Note: # - x is a formal argument (as opposed to an actual argument in a call to the function), # - y a local variable, # - z a global variable. z=10 fun1(2) del z # remove z fun1(2) # bombs... # Rule: # - local variables are found first, # - then global variables. # Consequence: global variables of the same name are shadowed by local variables. 
# Example: a = "abc" # global variable a def foo(): a = [1,2,3]; print a # local variable a foo() # the local variable is seen, the global variable is shadowed #### # A function can process an unlimited number of arguments: # they get collected in a list. def adder(*numbers): s=0 for num in numbers: s+=num return s adder(1,2,3,4) adder(3) adder(3,[3]) # bombs! # If you want to write more robust code, insert type checks at the # beginning of the code, such as type(num)==type(0) or .... #### # Arguments can be given default values, so they can be omitted from the function call: def add_named(a=1, b=2, c=3, d=4): return a+b+c+d add_named() # all args defaulted add_named(d=10) # keyword arg, all others defaulted add_named(1,1,1,1) # all positional args add_named(1,1,d=0) # two positional args and one keyword arg add_named(1,d=0,1) # doesn't work: positional args must precede named args (see earlier) !!! #### # Arbitrarily many keyword arguments can be given to functions defined as follows: def pr_named(**args): for key in args.keys(): print "key="+str(key)+" val="+str(args[key]) pr_named(one=1, two=2, three=3) # This could be emulated by a dictionary argument, which is a bit more cumbersome: def pr_named(d): for key in d.keys(): print "key="+str(key)+" val="+str(d[key]) pr_named({"one":1, "two":2, "three":3}) #### # Defaults are executed only once at the time of the definition of the function: z = 2 def f(x=z): print x f() # 2 z = 3 f() # still 2 #### # Functions are first class citizens in that they can be passed around by variables: def f(x): return(2*x) f(2.3) g = f # Make g point to the same function. g(2.3) # Works. # Let's try something to push the envelope: Collect functions in lists. funlist = [] for i in range(11): def f(x,i): return(i*x) funlist.append(f) funlist[0](2) funlist[1](2) i i=2 funlist[0](2) funlist[1](2) # So this doesn't work. "i" is a global variable. # The idea here would be to give each function internal state, # but this is not the mechanism for achieving it. # Due to Cengiz, we now have a solution to the problem: funlist = [] for i in range(11): def f(x,j=i): # <<< return(j*x) # <<< funlist.append(f) funlist[0](2) funlist[1](2) i i=2 funlist[0](2) funlist[1](2) # Why does this work? Because everytime the body of the for-loop # is executed, a copy of the value of "i" is deposited in the # default clause "j=i", i.e., we effectively execute "def f(x,j=0)" # "def f(x,j=1)", "def f(x,j=2)",... # Note that every time thru the loop we create a new function # and redefine f. Hence the elements of funlist are NOT identical, # although they all do the same thing. 
funlist[0] is not funlist[1]   # true
funlist[0] is funlist[0]       # true

# Example of the use of functions for internal state and encapsulation:
def make_account():
    balance = [0.0]   # simple value doesn't work; need a pointer to pass on
    def account(action, amount=0, balalksdjfksadljfalksdaslkd=balance):
        if action[0] == "d":
            print "deposit: " + str(amount)
            if amount > 0:
                balalksdjfksadljfalksdaslkd[0] += amount
            print "balance: " + str(balalksdjfksadljfalksdaslkd[0])
        elif action[0] == "w":
            print "withdraw: " + str(amount)
            if amount > 0 and balalksdjfksadljfalksdaslkd[0] - amount > 0:
                balalksdjfksadljfalksdaslkd[0] -= amount
            print "balance: " + str(balalksdjfksadljfalksdaslkd[0])
        else:
            print "balance: " + str(balalksdjfksadljfalksdaslkd[0])
    return(account)

# This is a hack: lacking lexical scoping rules (nested reference to the next
# outer namespace), a hard-to-guess argument was introduced which is not
# volunteered to the customer.  But it works:
my_account = make_account()
your_account = make_account()
my_account("d", 100)
your_account("d", 1000)
my_account("w", 60)
your_account("w", 600)
my_account("w", 10)
your_account("w", 100)

####
# Anonymous functions that execute one single expression:
f = lambda x: x**0.5
f(2)
f = lambda x,y: x+y
f(1,2)
(lambda x,y: x+y)(1,2)
# Historically lambda forms are inherited from Lisp.
# In Python the body of a lambda form is limited to a single simple expression,
# NO block statements such as if's or for's or while's are permitted.
# Lambdas are often useful for simple throw-away functions that are used just once.

####
# Functions can be defined and used inside functions, which may be useful if
# the function on the inside has no global use.
# Example: a function that vectorizes the evaluation of the Gaussian density
# based on an internal function that evaluates the density for one number:
def normal_density(x,m=0,s=1):
    from math import pi, e
    def n_d(y,m,s):   # the local function
        return(e**(-((y-m)/s)**2/2) / ((2*pi)**0.5 * s))
    if x < []:        # we permit calls with both single numbers and lists
        return(n_d(x,m,s))
    else:
        n = len(x)
        if m < []: m = [m]*n   # extend the single mean to a vector of means
        if s < []: s = [s]*n   # ditto for a single stdev
        list_of_density_values = []
        for i in range(len(x)):   # now assume that x, m, s are of the same length
            list_of_density_values.append(n_d(x[i],m[i],s[i]))
        return(list_of_density_values)

normal_density(0.2)
normal_density([0.2,0.3])
normal_density((0.2,0.3))

################
# Advanced list operations:

####
# - List comprehension
#   syntax:
#
#     [<expr> for <var> in <list> if <condition>]
#
#   where <expr> and <condition> are usually about <var>.
# Make up a list:
a = range(1,11)*3
a
# Remove all elements with value 1:
b = [item for item in a if item !=1]
b
# Recalling the comparison conventions, one can get rid of nested elements:
a = [1,2,3,[1,2,3],[[4]],4.0,5.0,[[[6]]],(10,11),"ab"]
a
[item for item in a if item < []]
# Or the opposite, get all composite items:
[item for item in a if item >= []]
# Or by type:
[item for item in a if type(item) == type("")]
[item for item in a if type(item) == type([])]
[item for item in a if type(item) == type(())]
[item for item in a if type(item) == type(0.0)]
# Numeric examples:
[item**3 for item in range(0,101)]               # cubes
[(item, item%7==0) for item in range(100,151)]   # flag with divisibility by 7
# Parallel lists can be processed as well, using the "zip" function:
zip([1,2,3],[4,5,6])
[x*y for (x,y) in zip([1,2,3],[4,5,6])]
# We note that in the syntax "[expr for var in expr]" the variable "var" can be multiple variables really.
This is not new syntax because # "var in ..." means "var = item for each item in list", and when "item" # is a tuple, it can be unpacked into multiple variables with a single assignment. # Nested loops (possibly each with an "if" condition) run over set-theoretic products: [[i1,i2] for i1 in [1,2,3] for i2 in [4,5]] # Application: a 5-dim cube a = [[i1,i2,i3,i4,i5] for i1 in [0,1] for i2 in [0,1] for i3 in [0,1] for i4 in [0,1] for i5 in [0,1]] a.sort() a len(a) # This worked. Now build it up to arbitrary dimensions with iteration: cube = [[0],[1]] # Initialize to a "1-D cube". for i in range(5): # Makes a 6-D cube because we started with a "1-dim cube". cu = cube; cube = [] # Must make sure we're not trampling on the previous cube. for vec in cu: cube.extend([vec+[0],vec+[1]]) cube len(cube) # correct: 2**dim # Iterated loops with dependence: # Although iterated loops run over product sets, it is possible to collect # only a subset, such as ordered tuples: ord_tup = [(i,j) for i in range(10) for j in range(10) if i < j] ord_tup len(ord_tup) # Indeed: 45 = 10 choose 2. #### # - Filtering: often a special case of list comprehension, possibly faster, # returns lists or tuples, according to input (!= comprehension) # We introduce this not for your use but so you are able to read others' code. filter(lambda x : type(x)==type(1), [0.0, 0, 1.0, 1]) def f(x): return(type(x)==type(1)) filter(f, (0.0, 0, 1.0, 1, [], (), {}, (1,2), [10])) # Default function is to weed out False values (0, 0.0, (), [], {}, ""): filter(None, (0.0, 0, 1.0, 1, [], (), {}, (1,2), [10], "", "a")) # Returns value, not side-effect: a = [0.0, 0, 1.0, 1, [], (), {}] filter(None, a) a # The above examples can be as clearly done with comprehension: [x for x in [0.0, 0, 1.0, 1] if type(x)==type(1)] [x for x in (0.0, 0, 1.0, 1, [], (), {}, (1,2), [10], "", "a") if x] #### # - Mapping: Again often a special case of list comprehension, possibly faster, # Again, we introduce this not for your use but so you are able to read others' code. map(lambda x: x**3, range(1,11)) # Multiple lists can be processed in parallel: map(lambda x,y: x**y, [1,2,3,4,5], [.1,.2,.3,.4,.5]) # Again, this can be emulated as clearly with comprehension: [x**3 for x in range(1,11)] [x**y for (x,y) in zip([1,2,3,4,5], [.1,.2,.3,.4,.5])] #### # - Reduce: This is different! # It is a generalization of sum_i a[i], for operations other than additions. # Examples: # Summing, in its various Python meanings: def f(x,y): return(x+y) reduce(f, range(11)) # Fine. reduce(f, string.letters) # Do "import string" if necessary. reduce(f, [[1],[2],[3],[4]]) # Ok.. # Factorials: def f(x,y): return(x*y) reduce(f, range(1,6)) # This reveals the order in which accumulation is done: left to right. def f(x,y): return((x,y)) reduce(f, range(11)) # The function doesn't have to be symmetric in its arguments. # With asymmetry one can form arbitrary summations: def f(x,n): return(x + 1/float(n)**2) reduce(f, range(11)) # We don't even have to use both arguments. Here's the geometric series: z = 0.5 def f(x,foo): return(x*z + 1.) reduce(f, range(1,15)) # Not bad: We know for z=0.5 the sum is 2.0. # This function does not even use the arguments; it has side effects instead: z = 0.5; r = [1] def f(foo1,foo2): r.append(r[-1]*z+1) # Accumulates the values of the geometric series. reduce(f, range(11)) # Can be executed repeatedly; every time 10 more terms are added. 
r ################ # dir(): Finding content and documentation; # applies to ANY object: numbers, data structures, functions, modules,... dir(3) # Lists the methods and variable belonging to the object "3". # Methods have this syntax: object.method(args) # Examples: print (3).__doc__ # not a method but a variable (3).__add__(5) # methods... (3).__sub__(2) (3).__mul__(2) (5).__div__(2) (5).__mod__(2) (5).__divmod__(2) (3).__str__() # same as str(3) (3).__hex__() # could be useful (3).__oct__() # dito # What about lists? Here are the list methods and constants: dir([]) # Examples: print [].__doc__ [1,2,3].__len__() [1,2,3].__str__() a = [1,2,3] a.reverse(); print a a.pop(); print a a.__imul__(2) # Weird: this both returns a result and has a side effect. a # Same for dictionaries: dir({}) print {}.__doc__ a = {1:10, 2:20, 3:30} a.copy() # shallow copy ################ # String operations: # #### # - Recall addition and multiplication: "abcde" + "ABCDE" "abcde"*10 #### # - Format operator "%": numbers -> strings # This is the equivalent of the "sprintf" function in C. # Examples: 'example of an int: %3d' % 234 'example of a float: %10.3f' % 2.23 print """\n Example of 1) a string: '%s', 2) a float: %5.2f, 3) a long int: %20d """ % ("foo", 2.3, 123456789123456789) #### # - String representations for ALL data structures, including complex ones: # str(...), repr(...), `...`. # Often these functions are the same, but # . str() tries to give human-readable strings, while # . repr() tries to give interpreter-readable strings. # Pairs of backquotes `` are equivalent to repr(). str(range(10)) repr(range(10)) `range(10)` #### # Find out what else there is in ways of string manipulations: # List the whole module 'string': got to import it first (see later) import string print string.__doc__ dir(string) ['_StringType', '__builtins__', '__doc__', '__file__', '__name__', '_float', '_idmap', '_idmapL', '_int', '_long', 'ascii_letters', 'ascii_lowercase', 'ascii_uppercase', 'atof', 'atof_error', 'atoi', 'atoi_error', 'atol', 'atol_error', 'capitalize', 'capwords', 'center', 'count', 'digits', 'expandtabs', 'find', 'hexdigits', 'index', 'index_error', 'join', 'joinfields', 'letters', 'ljust', 'lower', 'lowercase', 'lstrip', 'maketrans', 'octdigits', 'printable', 'punctuation', 'replace', 'rfind', 'rindex', 'rjust', 'rstrip', 'split', 'splitfields', 'strip', 'swapcase', 'translate', 'upper', 'uppercase', 'whitespace', 'zfill'] #### # String variables: string.whitespace # a string containing all characters considered whitespace string.lowercase # a string containing all characters considered lowercase letters string.uppercase # a string containing all characters considered uppercase letters string.letters # a string containing all characters considered letters string.digits # a string containing all characters considered decimal digits string.hexdigits # a string containing all characters considered hexadecimal digits string.octdigits # a string containing all characters considered octal digits string.punctuation # a string containing all characters considered punctuation string.printable # a string containing all characters considered printable #### # String methods: string.capitalize("abc, def AaB aaB") string.capwords("abc, def AaB aaB") string.upper("abc, def AaB aaB") string.lower("abc, def AaB aaB") string.swapcase("abc, def AaB aaB") print string.find.__doc__ string.find("abcde abcde", "cd") string.find("abcde abcde abcde", "cd", 6, 11) print string.rfind.__doc__ string.rfind("abcde abcde", "cd") print 
string.join.__doc__ string.join(["abc","d ef"," ghi "], sep="") # "+" for arbitrarily many words print string.split.__doc__ print string.rjust.__doc__ string.rjust("2.3",5) print string.ljust.__doc__ string.ljust("abc",5) string.ljust("abcxxxxxx",5) # doesn't truncate print string.center.__doc__ string.center("abc",20) print string.replace.__doc__ string.replace("abcde abcde", "ab", "xyz") string.replace("abcde abcde", "ab", "xyz", 1) string.atoi("201") # 201 as decimal number string.atoi("201", 8) # 201 as octal number string.atoi("101", 2) # 101 as binary number string.atol("123440000000000000000000000000") string.atof("2.3") string.atof("1E5") # The documentation of the string module said that most of the functions are # obsolete and now provided as methods for string objects. Let's see: dir("") ['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__ge__', '__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__repr__', '__rmul__', '__setattr__', '__str__', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper'] # Indeed, many string functions are found here as string methods. # This reduces the usefulness of the string module mostly to its # variables (e.g., "string.uppercase"), as well as the conversion # function "atoi" which allows conversion to any base (in addition to 2, 8, 10). # Example: print three columns with right-justification for x in range(1,10): print string.rjust(`x`,2), string.rjust(`x**2`,3), string.rjust(`x**3`,3) # Or, equivalently, with methods: for x in range(1,10): print `x`.rjust(2), `x**2`.rjust(3), `x**3`.rjust(3) ################ # Files, I/O: # Except for opening a file, functions on files have "methods" syntax. # # Write: The file will probably get written in your home directory; check! fn=file("this_file.txt", "w") dir(fn) # What can an open file do? write, writelines,.... fn.write("supercalifragilistikexpialidocious\n") # Check now: there's nothing in the file yet. fn.close() # Now there's something in the file. # One can not rely on seeing written content in the file before it is closed. # Closing has the effect of flushing lingering I/O buffers. # Here is the documentation for "open", which it turns out is an alias for "file": print open.__doc__ #----- Here is the __doc__ string: file(name[, mode[, buffering]]) -> file object Open a file. The mode can be 'r', 'w' or 'a' for reading (default), writing or appending. The file will be created if it does not exist when opened for writing or appending; it will be truncated when opened for writing. Add a 'b' to the mode for binary files. Add a '+' to the mode to allow simultaneous reading and writing. If the buffering argument is given, 0 means unbuffered, 1 means line buffered, and larger numbers specify the buffer size. Note: open() is an alias for file(). #----- # Write multiple lines: # Although it says "writelines" you have to add your own "\n". # It really means writing out strings that sit in a list or tuple. 
fn=file("this_file.txt", "w") fn.writelines(["supercalifragilistikexpialidocious\n","second line\n","third line\n"]) fn.close() # Read lines, one by one, till "" is returned: fn=file("this_file.txt", "r") print fn.readline() print fn.readline() # End of file. fn.close() # A typical piece of code for reading a file in a loop: fn=file("this_file.txt", "r") while 1: a = fn.readline() if not a: break print a fn.close() # Read the whole file into one long string: fn=file("this_file.txt", "r") a = fn.read() fn.close() print a # Read the whole file into a list, one line per element: fn=file("this_file.txt", "r") a=fn.readlines() fn.close() # Look at the content wholesale or line by line: print a for item in a: print item, # Recall the comma at the end: to prevent "\n" ################ # Modules: # - Modules are sets of variables, functions, classes, # collected in ONE file for a shared purpose. # - A module file must have extension ".py", as in "math.py". # - During the first load of a module, a byte-compiled version with extension ".pyc" # is created. Byte-compiled code is portable, unlike true compiled code. # (Java is another example of byte-compilation; # C, Fortran do true machine- and OS-dependent compilation.) # - The module files are found along a Python-internal search path, # which itself is bound to a variable in the "sys" module: sys.path. # You can add to, or delete from, this list. # - Examples: math, string, sys, os, re # - Two types of importing of modules: # "from math import *" # "import math" # - "from"-import executes the complete module code right away, # i.e., you get a full copy right away. # - plain "import" executes only when reference to the module is made, # and then only those parts that have been referenced. # - "from"-import makes variables accessible in "unqualified" form, # as in "exp(10)", "log(10)". # - Plain "import" requires the module name as a prefix, as in # "math.exp(10)", "math.log(10)". # - Plain import permits "dir(module)" and "reload(module)"; # "from"-import doesn't. # - It becomes clear that "from"-import is somewhat dangerous because # it may clobber variables in your current namespace. # Example of "from"-import: from math import log # import the "log" function from module "math", nothing else log(10) # log can now be used "unqualified", i.e., w/o prefix, "math.log()" from math import * # import everything exp(1) # among all else, "exp" can be used w/o qualification e; pi # some useful constants # Example of plain import: import math math.exp(10) # Requires "qualified" reference. math.log(10) math.e; math.pi dir(math) # Listing content of the module. reload(math) # Reloading permitted. # Reloading makes more sense for modules that you write and modify. #### # Sticking your own function into a module: fn_addnums = file("addnums.py", "w") fn_addnums.write(""" def my_adder(*numbers): s=0 for num in numbers: s+=num return s """) fn_addnums.close() import addnums dir(addnums) addnums.my_adder(1,2,3) # Qualified. from addnums import * my_adder(1,2,3) # Unqualified. # If this doesn't work, there may be a problem with your python path: import sys sys.path sys.path.append("e:\\buja\\stat-540") # The current directory -- adapt to your setup. 
# Now it should work: import addnums addnums.my_adder(1,2,3) # In what follows we look at some other useful modules that come with Python: # sys, os, operator, glob, random, pickle, re # in addition to the above: # math, string ################ # - "sys": import sys print sys.__doc__ # Contains stdin, stdout, stderr, path, argv, for example. sys.modules # Dictionary of loaded modules. sys.path # Python's search path. # The following are the devices for communicating with the outside from within a Python script: sys.argv # Contains the arguments, both flags and files. sys.stdout # So you should say "import sys" or "from sys import *" in a script. dir(sys.stdout) # Try the "write" method, for example: sys.stdout.write("This is garbage: ;alsjdflksadjflkdjfsalkdjfsalkdfd \n") ################ # - "os": # Operating system functions that help you make code platform-independent. import os dir(os) # Useful functions in this module: os.system("ls *.txt | wc") # Calls the shell. # The "system" function is of limited use. # If you want to capture the output of the calls to the shell, use "popen": os.popen("ls *.txt") os.popen("ls *.txt").read() print os.popen("ls *.txt").read() # A more typical idiom for popen is: for fn in os.popen("ls *.txt").readlines(): fn = fn.rstrip() # Right strip trailing whitespace, including "\n" and "\r" first_line = file(fn,"r").readline() first_line = first_line.rstrip() # Again, right strip. print fn + ": >>>" + first_line + "<<<" # The above loop prints the first line of every ".txt" file. # The doc string explains the exported variables. # I don't think you are supposed to set them; it would simply have no effect. # Instead use os functions such as print os.__doc__ ################ # - "operator": # a re-implementation of intrinsics such as # "+", "-", ">", "*",... # but in function syntax: # "add()", "sub()", "gt()", "repeat()" (for non-numbers) or "mul()" (for numbers),... # This is obviously useful for "reduce(fun,list)", where fun must be a function. import operator print operator.__doc__ dir(operator) reduce(operator.mul, range(1,6)) # 5! factorial ################ # - "glob": # Shell-style file name expansion. The only relevant function is "glob.glob()". import glob dir(glob) print glob.__doc__ print glob.glob.__doc__ # Try something in your own directory. glob.glob("Notes*") glob.glob("*.txt") ################ # - "pickle": # Dump a data structure to a "pickle" file, and reverse, # load a data structure from a "pickle" file. # There is a slight complication: you should import two modules, # "pickle" and "cPickle"; the former has the documentation, # but the latter is a LOT faster because it is a reimplementation in C. import pickle, cPickle print pickle.__doc__ # essentially just a "dump" and "load" function # Example: a = range(1000000) fn = "python_pickle_a" cPickle.dump(a, file(fn,"w")) del a a = cPickle.load(file(fn,"r")) len(a) # Regular pickle did not finish in a useful time, # so cPickle is a necessity for data this size. ################ # - "random": # Random numbers and random selection. import random print random.__doc__ # Surprisingly long documentation, but it's all about "thread safety". dir(random) # Many distributions, some under two names: "gauss", "normalvariate" # Examples: a = range(1,21) random.choice(a) # 1 U[1,2,...20] [random.choice(a) for i in range(10)] # 10 i.i.d. 
U[1,2,...20]
random.shuffle(a); a   # random permutation of [1...20], in place
a[0:10]                # 10 U[1,2,...20] sampled withOUT replacement
                       # (the first 10 of a random permutation)
[random.random() for m in range(10)]          # 10 i.i.d. U(0,1), the basis of everything else.
[random.uniform(-1,1) for m in range(10)]     # 10 i.i.d. U(-1,1)
[random.gauss(0,1) for m in range(10)]        # 10 i.i.d. N(0,1)
[random.betavariate(5,7) for m in range(10)]  # 10 i.i.d. Beta(5,7)
[random.gammavariate(1,3) for m in range(10)] # 10 i.i.d. Gamma(1,3)
[random.expovariate(2) for m in range(10)]    # 10 i.i.d. Exponential(2)
# Derived:
[(random.random() < 0.25) for m in range(10)] # 10 i.i.d. Bernoulli(p=.25)
from operator import add
reduce(add, [(random.random() < 0.25) for m in range(20)])   # 1 Binomial(n=20,p=.25)
# 10 Bin(n=20,.25):
[reduce(add, [(random.random() < 0.25) for m in range(20)]) for m in range(10)]
# This is too opaque; we should define a binomial generator (assume "import operator").
def one_binomial(n,p):
    return(reduce(operator.add, [(random.random() < p) for m in range(n)]))

def binomial(N,n,p):
    import operator
    return([one_binomial(n,p) for m in range(N)])

binomial(1000, 20, .25)   # 1000 i.i.d. Bin(n=20,p=.25)
binomial(10, 20000, .25)  # 10 i.i.d. Bin(n=20000,p=.25)
# What would be a plausible approximation to Bin(n=20000,...)?
[int(random.gauss(20000*.25,(20000*.25*(1.-.25))**.5)) for i in range(10)]
# Homework: Poisson (among others)

# In what follows are a few more examples of distributions and Monte Carlo
# simulation techniques.

# - Negative Binomial NB(prob,Nfail): distribution of the number of successes
#   before Nfail failures are reached.
def neg_bin(prob, Nfail):
    fail_count = 0; succ_count = 0
    while fail_count < Nfail:   # stop as soon as the Nfail-th failure arrives
        bernoulli = (random.random() < prob)
        succ_count += bernoulli
        fail_count += (1-bernoulli)
    return(succ_count)

neg_bin(0.2, 10)
[neg_bin(0.5, 10) for i in range(20)]
[neg_bin(0.5, 10000) for i in range(20)]

# - Acceptance-Rejection sampling:
#   Assume we know how to sample from one density g(x), but not from another density f(x).
#   Assume further that we know f(x) <= c*g(x) for some c>0.
#   Then we can generate samples from f(x) by sampling from g(x) and accepting a sample if
#   U < f(x)/(c*g(x)) for a uniform random number U.
#   Example: Let f(x) be the linear density  f(x) = 2*x  on the unit interval.
#            Let g(x) be the uniform density g(x) = 1    on the unit interval.
#            Hence f(x) <= 2*g(x), and f(x)/(c*g(x)) = (2*x)/(2*1) = x.
# Generate 100 samples:
samples = []
while len(samples) < 100:
    x = random.random()       # Sample from g(x).
    if random.random() < x:   # The auxiliary random number U.
        samples.append(x)

samples.sort()
for x in samples: print "%5.3f" % x

# - The inverse-cdf method would have been easier in this example:
#   If U ~ U[0,1] and if F(x) is the cdf of the density f(x), then F^-1(U) ~ f.
# If f(x) = 2*x on the unit interval, then the cdf is F(x) = x**2,
# with inverse F^-1(U) = U**0.5:
samplesI = [random.random()**0.5 for i in range(100)]
samplesI.sort()
# Compare:
for i in range(len(samplesI)): print "%5.3f %5.3f" % (samples[i], samplesI[i])

# - Bivariate Gaussian with given correlation and standardized marginals (m=0,s=1):
def gauss_2d(rho):
    gauss1 = random.gauss(0,1); gauss2 = random.gauss(0,1)
    return((gauss1, gauss1*rho + gauss2*(1-rho**2)**0.5))

gauss_2d(0.5)
bi_gauss_data = [gauss_2d(0.9) for i in range(100)]
bi_gauss_data
# If you want to write this out to a file:
fn = file("bivariate_gaussians.txt", "w")
for line in bi_gauss_data: fn.write("%8.6f %8.6f \n" % line)
fn.close()

# - Samples from the distribution of selected order statistics for a Gaussian sample of size N:
def gauss_orderstats(N, orderstats):
    sample = [random.gauss(0,1) for i in range(N)]
    sample.sort()
    return(tuple([sample[ord] for ord in orderstats]))

gauss_orderstats(100, [0, 10, 25, 50])
[gauss_orderstats(100, [25, 50]) for i in range(100)]
[gauss_orderstats(100, [50])[0] for i in range(100)]

# - An example of a permutation test: 2-sample comparison
# Make up some data:
x = [random.gauss(0.0,1.0) for i in range(200)]
y = [random.gauss(0.3,1.0) for i in range(300)]
# Question: Knowing only x and y, would we say they are from the same distribution or not?
# Permutation test: under the null hyp, we can shuffle the x and y values against each other.
# Use the t-statistic as a test statistic
# (up to a constant factor, which does not matter for a permutation test):
from operator import add
def t_stat(x_y_pair):
    x, y = x_y_pair
    meanx = reduce(add, x) / float(len(x))
    meany = reduce(add, y) / float(len(y))
    numerator = meanx - meany
    rss = 0.0
    for xval in x: rss += (xval - meanx)**2
    for yval in y: rss += (yval - meany)**2
    denominator = (rss / (len(x) + len(y) - 2.0))**0.5   # pooled standard deviation
    return(numerator/denominator)

# Observed t-stat:
observed = t_stat([x,y]); observed
# One sample from the permutation distribution of the union of values:
def perm_null_sample(x_y_pair):
    x, y = x_y_pair
    z = x+y; random.shuffle(z)
    return([z[:len(x)],z[len(x):]])

# Finally, a sample from the permutation null distribution of the t-statistic:
perm_null = [t_stat(perm_null_sample([x,y])) for i in range(500)]
# How many null values are to the left of the observed value?
reduce(add, [item < observed for item in perm_null])
# Zero: significant with a p-value of less than 0.2%.
# We will do this in greater detail in Stat 541.

# - Bootstrap standard error estimates:
#   Given a random sample and a statistic computed from it,
#   construct a standard error for the statistic by
#   . repeatedly sampling the data with replacement,
#   . evaluating the statistic on the resampled data,
#   . collecting these bootstrap values of the statistic, and
#   . calculating their standard deviation.
# Example: standard error of the median
def median(x):
    z = x[:]; z.sort(); return(z[len(z)/2])

def bs_sample(x):
    return([random.choice(x) for i in range(len(x))])

def std_dev(x):
    mean = reduce(add, x)/float(len(x))
    rss = 0.0
    for xval in x: rss += (xval - mean)**2
    return((rss/(len(x)-1.0))**0.5)

bootstrap_stderr = std_dev([median(bs_sample(x)) for i in range(500)])
bootstrap_stderr
# Future homework:
# Build on this code and write code that estimates the variance of
# the bootstrap estimate of standard error of the median.

# - Random walk: on integers,
#   biased with a given prob for stepping up,
#   truncated at a maximum number of steps.
#   Sample from the distribution of first crossing times above a fixed threshold.
def crossing_time(barrier = 10, prob=0.5, max_steps=10000): walk = 0 for step in range(max_steps): walk += 2*(random.random() < prob) - 1 if walk >= barrier or step+1 >= max_steps: break return(step+1) crossing_times = [crossing_time(max_steps=100) for i in range(20000)] crossing_times.sort() crossing_times[0:200] # Tabulate the simulated distribution: use the shell's "uniq -c". fn = file("tmp","w") fn.writelines(map(lambda(x): str(x)+"\n", crossing_times)) fn.close() import os # Need "system()" to call the shell. os.system("uniq -c tmp") # The first column is the frequency, the second the number of steps. # Even with 20,000 samples it's still a rough estimate of the true first-crossing distribution. # This sort of task seems to call for a little bit of C code so one can run more simulations. ################################################################ # - "re": # Regular expressions. import re dir(re) print re.__doc__ print re.sub.__doc__ print re.subn.__doc__ print re.template.__doc__ print re.compile.__doc__ print re.findall.__doc__ print re.split.__doc__ print re.search.__doc__ print re.match.__doc__ # Useful is also looking at the code of a function: print re.search.func_code # On my machine this returned: # # Read this file into an emacs buffer, but you may have to translate the path to # c:/cygwin/lib/python2.2/sre.py # This is where you find a useful summary of Python regular expressions: r"""Support for regular expressions (RE). This module provides regular expression matching operations similar to those found in Perl. It supports both 8-bit and Unicode strings; both the pattern and the strings being processed can contain null bytes and characters outside the US ASCII range. Regular expressions can contain both special and ordinary characters. Most ordinary characters, like "A", "a", or "0", are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'. The special characters are: "." Matches any character except a newline. "^" Matches the start of the string. "$" Matches the end of the string. "*" Matches 0 or more (greedy) repetitions of the preceding RE. Greedy means that it will match as many repetitions as possible. "+" Matches 1 or more (greedy) repetitions of the preceding RE. "?" Matches 0 or 1 (greedy) of the preceding RE. *?,+?,?? Non-greedy versions of the previous three special characters. {m,n} Matches from m to n repetitions of the preceding RE. {m,n}? Non-greedy version of the above. "\\" Either escapes special characters or signals a special sequence. [] Indicates a set of characters. A "^" as the first character indicates a complementing set. "|" A|B, creates an RE that will match either A or B. (...) Matches the RE inside the parentheses. The contents can be retrieved or matched later in the string. (?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below). (?:...) Non-grouping version of regular parentheses. (?P...) The substring matched by the group is accessible by name. (?P=name) Matches the text matched earlier by the group named name. (?#...) A comment; ignored. (?=...) Matches if ... matches next, but doesn't consume the string. (?!...) Matches if ... doesn't match next. The special sequences consist of "\\" and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. \number Matches the contents of the group of the same number. \A Matches only at the start of the string. 
\Z Matches only at the end of the string. \b Matches the empty string, but only at the start or end of a word. \B Matches the empty string, but not at the start or end of a word. \d Matches any decimal digit; equivalent to the set [0-9]. \D Matches any non-digit character; equivalent to the set [^0-9]. \s Matches any whitespace character; equivalent to [ \t\n\r\f\v]. \S Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v]. \w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus characters defined as letters for the current locale. \W Matches the complement of \w. \\ Matches a literal backslash. This module exports the following functions: match Match a regular expression pattern to the beginning of a string. search Search a string for the presence of a pattern. sub Substitute occurrences of a pattern found in a string. subn Same as sub, but also return the number of substitutions made. split Split a string by the occurrences of a pattern. findall Find all occurrences of a pattern in a string. compile Compile a pattern into a RegexObject. purge Clear the regular expression cache. escape Backslash all non-alphanumerics in a string. Some of the functions in this module takes flags as optional parameters: I IGNORECASE Perform case-insensitive matching. L LOCALE Make \w, \W, \b, \B, dependent on the current locale. M MULTILINE "^" matches the beginning of lines as well as the string. "$" matches the end of lines as well as the string. S DOTALL "." matches any character at all, including the newline. X VERBOSE Ignore whitespace and comments for nicer looking RE's. U UNICODE Make \w, \W, \b, \B, dependent on the Unicode locale. This module also defines an exception 'error'. """ # So there are extensions to the regex of egrep and gawk. # Examples: Here is a string on which we do searching and replacing. str = "111111111 abc 2222222 abc 33333 4444444 abc 555555555def6666666" # "search" looks for a match with a string or a regular expression: print re.search.__doc__ # The function returns a "match object" if there is a match, None if not: re.search("abc|def", str) # Match: returned something... re.search("(abc)|(def)", str) # The same. re.search("(xxx)|(yyy)", str) # None. # A match object can return the location of the first match: dir(re.search("abc|def", str)) # Has methods "start", "end", "span", "group". re.search("abc|def", str).start() # The fence post where the first match starts. re.search("abc|def", str).end() # ... where the first match ends. re.search("abc|def", str).span() # Both start and end in a 2-tuple. re.search("abc|def", str).group() # Says what string was actually matched. # There is a new distinction between greedy and non-greedy matching (see docu above): re.search("(abc)+", " abcabc ").span() # greedy: maximal match "abcabc" re.search("(abc)+?", " abcabc ").span() # non-greedy: first match "abc" re.search("(abc)*?", " abcabc ").span() # (not very meaningful: empty match at beginning) # While we know "?" as the 0-1 quantifier from egrep and gawk, # it still has this role in Python, but it also has the additional # function of qualifying other quantifiers to be non-greedy. # "re.findall" is related to "search" but returns the list of all matched strings # in some form, but it wouldn't tell you where the matches occurred: re.findall("abc", str) # Returns "abc", matched 3 times. re.findall("abc|def", str) # This is a genuine regular expression re.findall("xxx|yyy", str) # Returns an empty list. 
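# "re.findall" told us what matched but not where.  If the positions matter,
# a little loop with a compiled pattern and match objects does the job;
# this is a sketch of our own, using the same test string as above:
import re
s = "111111111 abc 2222222 abc 33333 4444444 abc 555555555def6666666"
pat = re.compile("abc|def")
pos = 0
while 1:
    m = pat.search(s, pos)      # search from fence post "pos" onward
    if m is None: break
    print m.group(), m.span()   # what was matched, and where
    pos = m.end()               # resume right after this match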
# "re.sub" is a generalization of thre "string.replace" function/method # to regular expressions. It substitutes all matches with a replacement string: re.sub("abc|def", "---", str) # "subn" is a variant that returns a 2-tuple containing string and number of substitutions: re.subn("abc|def", "---", str) # A new (compared to egrep and gawk) and very powerful feature is that # "re.sub" can take a FUNCTION instead of a substitution string. # You can make variable substitutions depending on the match. # The function must be written to accept a match object and return a string. # Example: def substitutor(match_obj): if match_obj.group()=="abc": return("xxx") elif match_obj.group()=="def": return("yyy") # Obviously, the function should know what to do with the potential matches. # Test: substitutor(re.search("abc|def", str)) # The intended application to "str" defined above: re.sub("abc|def", substitutor, str) # And indeed it works: "abc" -> "xxx", "def" -> "yyy" # There is a generalization of splitting, "re.split", to regular expressions: re.split("abc", str) # Same as string.split(str, "abc") re.split("abc|def", str) # This is like gawk. # Compilation of regular expressions for speed-up: regex = re.compile("abc|def") # Use the compiled expression instead of the string: # You then use the methods for search, split, sub, subn, findall: re.search(regex, str).span() re.search(regex, str).group() re.findall(regex, str) re.sub(regex, "xxx", str) # WARNING: Beware of backslashes! # Python interprets things like "\n" and "\b" and replaces them with # Newline or Backspace, for example. # If you need them uninterpreted, use RAW STRINGS: r"\ \\..." r'\ \\ \\\ ' r""" \n \b ...""" # The backslashes are doubly quoted when Python returns them in a "cooked" (= not raw) string. # Regular expressions are often entered as raw strings. # Example: re.findall("\b\w\w\w\b", str) # No match: \b is interpreted as backspace not "begin word" re.findall(r"\b\w\w\w\b", str) # Three matches of "abc": begin word, 3x alphanums, end word. # If you need to mix interpreted and raw strings, add them: re.findall(r"\b\w\w\w\b"+"|\b", str+"\b"+str) # Indeed, the "\b"s in the raw string are begin/end-of-word markers, # and the "\b" in the cooked string is a backspace. # WARNING: Beware of parens! # Recall in egrep/gawk parens played no role other than setting precedence. # In Python regex's, parens indicate subgroups for which matching is reported. # We start with known behavior: re.search("abc", str).group() re.search("abc", str).span() # New in Python regex: # Define, for example, a subgroup "(bc)" of "abc" by saying "a(bc)". # Group 0 is the whole regex: re.search("a(bc)", str).group(0) # 0 is the default, so this is the same as .group() re.search("a(bc)", str).span(0) # " # Group 1 is the subgroup indicated by parens: re.search("a(bc)", str).group(1) re.search("a(bc)", str).span(1) # With two groups, nested: re.search("a(b(c))", str).group(0) # The whole regex. re.search("a(b(c))", str).span(0) # " re.search("a(b(c))", str).group(1) # Group 1. re.search("a(b(c))", str).span(1) # " re.search("a(b(c))", str).group(2) # Group 2. re.search("a(b(c))", str).span(2) # " re.search("a(b(c))", str).groups() # All sub groups in the match. 
re.search("a((b)(c))", str).groups()# " # With two non-nested groups: re.search("a(bc)|d(ef)", str).group(0) re.search("a(bc)|d(ef)", str).span(0) re.search("a(bc)|d(ef)", str).group(1) re.search("a(bc)|d(ef)", str).span(1) # Similarly, compare: re.findall("abc|def", str) # Returns a list of matched strings. re.findall("a(bc)|d(ef)", str) # Returns tuples: one of ('bc','') and ('','ef') per match. re.findall("(abc)|(def)|(ghi)", str) # Similar, returns triples. # What's the purpose of groups? # One is backreference: how do you, for example, look for word repetitions? # You need a backreference to a preceding word and ask for an identical match. # Reference to groups is made with "\1", "\2", ... : # \1 asks for an exact match with the preceding match of group 1. # Example: re.findall(r"(\b\w+\b)\s\1", "There are are silly typos such as as word repetitions.") # word^^^^^^^ ^^refers to matched word. # Better with compilation, where one can ask to ignore case: word_rep = re.compile(r"(\b\w+\b)\s\1", re.IGNORECASE) re.findall(word_rep, "The the silliest typo is is word repetition.") # In light of the complexity of regular expressions, # a verbose mode in the compilation function is provided. # The regex string can be commented and spread over multiple lines. # Whitespace characters are NOT interpreted, unless quoted or in a range "[ ]". word_rep = re.compile(r""" ( # open group \b # | beginning of word \w+ # | | non-empty word \b # | end of word ) # end group \s # whitespace ( # open group (need the repetition in a group for substitution, see below) \1 # | another match of the above group -> word repetition ) # end group """, re.VERBOSE | re. IGNORECASE) # Note how multiple flags are bitwise OR-ed in the last argument. re.findall(word_rep, "The the silliest typo is is word repetition.") # Let us fix it: First we make sure we understand what type of match object we get. re.search(word_rep, "The the silliest typo is is word repetition.").group() re.search(word_rep, "The the silliest typo is is word repetition.").group(1) # Ready for the substitution: def sub_fun(x): return(x.group(1)) re.sub(word_rep, sub_fun, "The the silliest typo is is word repetition.") # Or with an anonymous lambda: re.sub(word_rep, lambda x: x.group(1), "The the silliest typo is is word repetition.") # All this can be done with plain programming, of course. # It is not too hard to split a string into words, check for adjacent word identities # (modulo upper/lower case), toss the repetition from the list, and rejoin the list # back to a single string. Well, maybe the result would not be identical to the # regex solution because splitting and re-joining may not reproduce the original whitespace. # Now for a bigger example: the book in the file "less.txt". fn = file("less.txt", "r") txt = fn.read() # Read the whole thing into a loooong string. len(txt) # That's not so terribly long: 382,771 bytes. txt[0:1000] # The first 1,000 bytes, but that's not readable. print txt[0:1000] # Now it's readable. # As earlier with the shell command "tr", we can remove all Newlines: txt_nonewlines = re.sub("\r\n|\n|\r", " ", txt) print txt_nonewlines[1:1000] txt_linepersentence = re.sub('[.]"|[.]', r".\n", txt_nonewlines) # Should do better than this, like in the homework: print txt_linepersentence[1:1000] ################################################################ # Classes, Object-Oriented (OO) programming # Motivating example: # Formalize computationally a statistical model, e.g., simple linear regression. 
# You'd like to collect the input data x, y, and combine them in a single data structure
# together with regression coefficients, their t-values, RSS, residuals, outliers,...
# Simple approach:
# - data structure = dictionary
# - a function to compute the regression quantities
x = [random.gauss(0.,1.) for i in range(100)]
y = [x[i] + random.gauss(0.,1.) for i in range(100)]   # y = b*x + a + eps; b=1, a=0
regr = {"x":x, "y":y}
def simp_regr(x,y):
    n = len(x)
    if len(y) != n: print "x and y have unequal length"; return
    mx = reduce(add, x)/float(n)   # Assume "from operator import *", in ".init.py".
    my = reduce(add, y)/float(n)
    sxx = reduce(add, [(x[i]-mx)*(x[i]-mx) for i in range(n)])
    if sxx < 1E-100: print "var(x) too small"; return
    sxy = reduce(add, [(x[i]-mx)*(y[i]-my) for i in range(n)])
    b = sxy / sxx
    a = my - b*mx                  # Use: my = a + b*mx
    r = [y[i] - (a + b*x[i]) for i in range(n)]
    s = math.sqrt(reduce(add, [r[i]*r[i] for i in range(n)])/(n-2))
    return {"a":a, "b":b, "s":s, "r":r, "x":x, "y":y}

regr_example = simp_regr(x,y)
# Reasonable estimates, given the truth is a=0, b=1, s=1:
regr_example["a"]
regr_example["b"]
regr_example["s"]

# Another way to organize a model: bundle the data structure and function in a class.
class simple_regr:
    "Class for creating simple regression objects"
    a = None; b = None; s = None; R2 = None; w = 1.0
    def fit(self):
        n = len(self.x)
        mx = reduce(add, self.x)/float(n)   # Assume "from operator import *", in ".init.py".
        my = reduce(add, self.y)/float(n)
        sxx = reduce(add, [(self.x[i]-mx)*(self.x[i]-mx) for i in range(n)])
        if sxx < 1E-100: print "var(x) too small"; return
        sxy = reduce(add, [(self.x[i]-mx)*(self.y[i]-my) for i in range(n)])
        syy = reduce(add, [(self.y[i]-my)*(self.y[i]-my) for i in range(n)])
        self.b = sxy / sxx       # Slope
        self.a = my - self.b*mx  # Intercept (use: my = a + b*mx)
        self.r = [self.y[i] - (self.a + self.b*self.x[i]) for i in range(n)]            # Residuals
        self.s = math.sqrt(reduce(add, [self.r[i]*self.r[i] for i in range(n)])/(n-2))  # Stdev
        self.R2 = sxy*sxy/(sxx*syy)
    def __init__(self, x, y):
        if len(y) != len(x): print "x and y have unequal length, not initialized!"; return
        self.x = x; self.y = y
        self.fit()
    def show(self):
        print "Slope = %4f, I'cept = %4f, Stdev = %4f, R^2 = %5.3f" % \
              (self.b, self.a, self.s, self.R2)

# This is the "class instantiation" that creates an object, or class instance:
xy_reg = simple_regr(x,y)
# Sanity check: the data made it into the object correctly.
xy_reg.x == x; xy_reg.y == y
# Checking some "instance variables":
xy_reg.a
xy_reg.b
xy_reg.s
xy_reg.R2
# Using an "instance method" for showing itself:
xy_reg.show()

# The regression example was too complex for starters, so let's step back:
# A simpler example is a "date class" containing year, month, day.
# Again, it could be implemented as a dictionary and a function:
def make_date(y,m,d):
    return {"y":y, "m":m, "d":d}
today = make_date(2002,"Nov",5)
today
# But a more satisfying way of bundling is with a date class.
# This approach allows us to implement data integrity checks and
# various representations.
class date:
    "Class to check, maintain and show dates"
    def check(self):
        if type(self.year) != type(1):
            print "(year is improperly specified)"
        if self.month not in ("Jan","January","Feb","February","Mar","March","Apr","April",
                              "May","Jun","June","Jul","July","Aug","August","Sep","Sept",
                              "September","Oct","October","Nov","November","Dec","December"):
            print "(month is improperly specified)"
        if not (type(self.day) == type(1) and self.day >= 1 and self.day <= 31):
            print "(day is improperly specified)"
    def set(self, y=2002, m="Jan", d=1):
        self.year = y; self.month = m; self.day = d
        self.check()
    def __init__(self, y=2002, m="Jan", d=1):
        self.set(y, m, d)
    def set_year(self, y=2002):   self.year = y;  self.check()
    def set_month(self, m="Jan"): self.month = m; self.check()
    def set_day(self, d=1):       self.day = d;   self.check()
    def get(self):
        return (self.year, self.month, self.day)
    def get_gb(self):
        self.check()
        return "%d %s, %d" % (self.day, self.month, self.year)
    def get_us(self):
        self.check()
        return "%s %d, %d" % (self.month, self.day, self.year)

# Creation of a date:
today = date(m="November", d=5)
yesterday = date(m="November", d=4)
# Use of the two methods:
today.get_us()
yesterday.get_us()
today.get_gb()
# You can reset the instance variables anytime:
today.month = "Nov"
today.get_us()
# Now this was dangerous, because you bypassed the checks!
# But we programmed the "show" methods to do a check:
today.month = "Novem"
today.get_us()
# Good programming is to provide and use "set" methods:
today.set_month("Novem")
today.set_year(2002.0)
today.get_us()

# Comments:
# - Class instances generate their own "namespace" which must be
#   qualified with the instance name (NOT the class name).
#
# - Within the class definition, the instance name is not available,
#   which is why "self" is used as a stand-in.
#
# - At runtime, "self" is replaced with the instance name.
#
# - The method "__init__(self,...)" is called when the class
#   is instantiated; the method receives the arguments with
#   which the class was called. Example: The class "simple_regr"
#   is called by "simple_regr(x,y)", so "x" and "y" are passed to
#   the "__init__(self,x,y)" method.
#

################
# Misc. details:
# - To start up with an initialization file, ".init.py", say, type into a bash:
python -i .init.py
# - To list all 256 characters in ASCII order, with their decimal, octal, hex,
#   repr and string forms, do this:
for i in range(256): "%3d = %3o = %3x = %1r = %3s" % (i,i,i,i, chr(i))

################
# "Exceptions": Errors that can be intercepted.
#
# Example: a variable is not defined; type:
xcvxs
# The interpreter answers:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'xcvxs' is not defined
# "NameError" is the name of this type of exception.
# You can intercept it as follows:
try: xcvxs
except NameError: xcvxs = []
# Now "xcvxs" is defined:
xcvxs
# If you want to intercept a certain type of error in your code,
# generate an example to learn what type of exception it generates,
# then intercept it with try/except clauses.
# Another example: zero division, if you're not sure whether some
# denominator has become zero. Try:
1/0
# Generates a "ZeroDivisionError" exception. Deal with it:
a = 1.
b = 0.
try: a/b
except ZeroDivisionError: b = 1E-10
c = a/b
# Another example is stepping outside the index range of a list or tuple or string:
s = "abc"
s[5]                        # Index error:
try: s[5]
except IndexError: s = s.ljust(6)
s[5]
# Last example: undefined key in a dictionary
d = {"a":1, "b":2}
d["c"]                      # "KeyError":
try: d["c"]
except KeyError: d["c"] = 3
d["c"]
# try/except is often used, not as a hack, but as an accepted idiom.
# If you don't want to do anything in the "except" clause, use a "pass" statement:
try: xcvxs
except NameError: pass

################################################################
# Python scripts:
# Stick the code between the two lines "#-----" in a file,
# "pyscript", say, and make it executable ("chmod +x pyscript" in a bash).
# It is something like an egrep, but gives the previous line as context,
# and separates hits with a line of dashes.
#---------------- Start script with next line.
#! /usr/bin/python
import sys, re
line_num = 0
pattern = re.compile(sys.argv[1])
prev_line = ""
for line in sys.stdin:
    line_num += 1
    if re.search(pattern, line):
        print "--------------------------------"
        if line_num > 1:
            print "Prev. Line " + str(line_num-1) + ": " + prev_line,
        print "This Line " + str(line_num) + ": " + line,
    prev_line = line
#---------------- End script with previous line.
# The following are examples of the use of the script.
# (Remember that these lines need to be copied into a bash.)
pyscript regex < Notes-Stat-540.txt
pyscript "html|web" < Notes-Stat-540.txt
# The book file we analyzed earlier: another two checks for end of sentences.
pyscript "\s[.]\s" < less.txt    # Isolated periods.
pyscript "[.]\w+" < less.txt     # Periods followed by alphanumerics.
pyscript "[.]\S+" < less.txt     # Periods followed by non-whitespace.

################################################################
# Vector/matrix/array computation: module "numarray"
# must be downloaded and installed from www.python.org ("numpy")
import numarray     # Keep them qualified so we know what we're fetching.
dir(numarray)       # Tons of functions.
doc(numarray)       # Rather disappointing.
n = numarray        # To save us some typing and for readability.
# Examples: Creating and shaping arrays.
v = n.array(range(10))                 # By default a vector = 1-dim array
v
m = n.array(range(24), shape=[4,6])    # A matrix = array with 2 dims.
m
# We note that the last index runs fastest, ~ C, !~ Fortran.
ar = n.array(range(24), shape=[4,3,2]) # A 3-dim array.
ar
# Again, the last index (dim 3) runs fastest, the first (dim 1) slowest.
n.shape(ar)           # Self-explanatory.
n.rank(ar)            # Abuse of terminology, != rank in lin.alg.
n.size(ar)            # Corresponds to "len".
ar.setshape([2,3,4])  # Reshaping, but dims must be compatible.
ar
# Types: In contrast to lists, arrays have a fixed, uniform, numeric type.
n.array(["a","bc"])   # Bombs.
# One can force types:
n.array(range(24), shape=[2,3,4], type="Complex")
# "Promotion" (coercion) is done as needed:
#     int + long -> long;  either + float -> float;  either + complex -> complex
# Examples of promotion:
n.array([1,1.0])      # Becomes float.
n.array([1,1.0,1j])   # Becomes complex.
ar * (1+1j)           # "
# Indexing and slicing of arrays: zero-based, with some novel flexibility.
# With vectors:
v = 10*n.arange(10)
v
v[2:4]                # Familiar slicing.
v[2]                  # Familiar indexing.
# A new type of indexing:
v[[2]]                # A one-element vector, NOT a number.
v[[2,3]]              # A two-element vector.
v[[5,4,3,3,4,5,6]]    # Think of this as mapping [5,4,3,3,4,5,6] with "v".
ind = [0,1,0,1,0,1]; v[ind]   # Another example.
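# [Aside, not from the original notes: this "indexing as mapping" should agree
# with the "take" function, which numarray presumably provides with the same
# semantics as Numeric's "take" used later -- an assumption worth checking on
# your install:
v[[5,4,3,3,4,5,6]]
n.take(v, [5,4,3,3,4,5,6])   # Expected to return the same vector as the line above.
# ]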
# With matrices:
m = 10*n.arange(12, shape=[3,4])
m
# We can subselect with slices, but be careful:
m[,0:2]     # Bombs.
m[:,0:2]    # Works: need slicing in all dimensions.
# Now this may seem a little strange:
# A matrix can take two index arrays of arbitrary but same shape.
# The result is another array of the same shape but mapped through the matrix
# element by element:
m[[1,1,2],[0,0,3]]
# -> a vector of length 3 formed from the index pairs (1,0), (1,0) and (2,3) of "m".
# A more complex example:
m[[[1,2],[0,3]],[[0,3],[0,0]]]
# -> 2x2 matrix formed from the index pairs (1,0), (2,3), (0,0), (3,0) of "m".
# In summary, this type of indexing allows the use of "m" as a multi-argument mapping.
# (Btw, notice we didn't have to convert the lists to arrays. They got coerced automatically.)
# The idea of indexing as mapping carries over to higher orders:
i1 = [[0,1],[2,3]]; i2 = [[3,2],[1,0]]; i3 = [[1,1],[0,0]]
ar[i1,i2,i3]    # -> a 2x2 matrix, same shape as i1, i2, i3.
# Odd: a single index is permitted and is interpreted in the left-most position:
m[1]; m[2]
m[[1,2,1,2]]
m[[3,2,1,0]]
ar[1]
ar[[1,2]]
ar[[[1,2],[0,0]]]
# This last example is getting out of hand.
# We should probably stay away from this for clarity.
# Getting back to the idea of indexing as mapping, one can fill arrays
# in terms of mapping conditions:
i1 = n.array([[0,1],[1,0]])
i2 = n.array([[0,0],[1,1]])
i3 = n.array([[1,1],[0,0]])
val = n.array([[10,20],[30,40]])
x = n.zeros(shape=[2,2,2])
x[i1,i2,i3] = val
x
# Of course one can fill subsets of an array with slices:
y = n.zeros(shape=[2,3,4])
y
y[0:1,0:2,2:3] = 1    # A slice consisting of 2 entries.
y
# Then there is something called strides:
m = n.array(range(24), shape=[4,6])
m
m[::2]
m[4::-2,:]
# There is a notion of "ufuncs", universal functions, that get automatically
# applied to all elements of an array:
ar = ar + 1
ar * 2
ar**3
n.log(ar)     # n.log != math.log
n.exp(ar)
# Binary operations on multiple arrays:
br = n.ones(shape=[2,3,4])   # "ones" and "zeros" make constant arrays.
br + ar       # Elementwise addition.
br * ar       # Elementwise multiplication (NOT matrix mult.!)
br ** ar      # "          power          (NOT matrix powers!)
br < ar       # Elementwise boolean array.
# The "numarray" module was able to redefine the built-in arithmetic operations.
# The operations come from numarray classes such as n.add and n.multiply.
# They work even on lists, not just arrays, but they always return arrays:
n.add([1,2,3], [3,6,9])
n.add([[1,2],[4,5]], [[3,6],[10,11]])
# The lists have to have a natural interpretation as arrays, though:
n.add([1,2,[4,5]], [3,6,[10,11]])   # Bombs.
# Outer product: a disappointment.
n.outerproduct([1,2,3], [1,2])   # Fine
n.outerproduct([1,2], ar)        # What is this? "ar" got strung out to a vector.
# Fortunately, there are other "outer products" that work on arbitrarily shaped arrays.
# They work not only with multiplication:
#     C_ijklm = A_ijk * B_lm,
# but with addition as well:
#     C_ijklm = A_ijk + B_lm.
# Therefore they are methods of classes that implement 2-argument operations:
dir(n.add)
dir(n.multiply)         # Both have an "outer" method.
oa = n.add.outer        # Abbreviate for readability.
om = n.multiply.outer   # Ditto.
oa([1,2], ar)   # A 4-dim array, as it should be.
om([1,2], ar)   # This is the traditional outer product.
om([1,2,3,4], oa([0,1,2], [0,1]))   # A 3-dim array.
# Outer products should be associative:
om([1,2,3,4], om([1,2,3], [1,2])) == om(om([1,2,3,4],[1,2,3]), [1,2])
# True. Comparison is elementwise, hence this is a 3-dim Boolean array.
# Why are outer products important?
# a_ij = x_i*y_j form the so-called rank-one matrices.
# In particular,
# a_ij = x_i*x_j form orthogonal projections if (x_i) is unit length.
# Recall that an eigendecomposition is a sum of orthogonal projections
# onto eigenvectors, weighted by eigenvalues.
# Next to "outer", ufuncs have "reduce" and "accumulate" methods:
a = n.ones(shape=[2,3,4,5,6])
a = n.array(range(720), shape=[2,3,4,5,6])
n.shape(a)
n.shape(n.add.reduce(a, dim=0))
n.shape(n.add.reduce(a, dim=1))
n.shape(n.add.reduce(a, dim=2))
n.shape(n.add.reduce(a, dim=3))
n.shape(n.add.reduce(a, dim=4))
n.add.reduce(a, dim=0)
n.add.reduce(a, dim=1)
n.add.reduce(a, dim=2)
# (There was a bug in numarray: to make "reduce" work on the last dimension,
# I had to comment out line 746 in c:/cygwin/lib/python2.2/site-packages/numarray/ufunc.py
# but this leaves other bugs: after a reduce, one is not guaranteed any specific order of
# dimensions/axes.)
n.add.accumulate(a, dim=0)
n.add.accumulate(a, dim=1)
n.add.accumulate(a, dim=2)
# There is matrix multiplication:
a = [[0,1,],[2,3]]; b = [[1,1,1],[2,2,3]]
n.matrixmultiply(a, b)   # 2x2 mmult 2x3
# There is a shorter synonym:
n.dot([[0,1,],[2,3]], [[1,1,1],[2,2,3]])
n.dot; n.matrixmultiply      # Have the same physical address.
n.dot is n.matrixmultiply    # Hence they are identical.
# But there does not seem to exist true linear algebra:
# decompositions such as QR, Cholesky, eigen, svd,...
# For this, you need to download for example the "scipy" module,
# which, however, is pretty hairy.

################################################################
################################################################
#
# ---------------- The C Language ---------------
#
# We lower the language level and turn to C. C is compiled and fast,
# but usually more laborious to code.
# C and its object-oriented version C++ have replaced most other
# compiled languages.
# (The other language that is still around is Fortran, but mostly
# because there exists so much legacy code, mostly in the sciences.
# Pascal is probably on the way out.)
# If you can, avoid programming at the C level; use it if you have to,
# but even then only for those parts of your problem that depend on
# speed. Use Bash or Python to glue the pieces together. Often it is
# possible to write relatively small C programs that perform narrowly
# defined jobs, and Bash or Python stitch them together. For example,
# it often makes no sense to open dozens of files from within C.
# Write C code for a single input file, preferably "stdin", and let
# Bash or Python loop over the files and repeatedly call the C
# program. Even Bash, for what it does, is generally more readable
# than C. For readability Python is usually best.
# Even if you have to ultimately code in C, you might still want to
# prototype the algorithms in Python or Gawk first, to get the ideas
# straight before you get bogged down in the arcane details of C.
# [The above are opinions; feel free to disagree.]

# ** C syntax:
# - Free form: Statements are terminated by ";" and/or enclosed in curly
#   braces "{" and "}" to form blocks (recall Gawk).
#   Unlike Python, no indentation is enforced, but it is part of good style.
#   Emacs c-mode helps.
# - Comments are enclosed in matching pairs "/*" and "*/" which can
#   span multiple lines. Compare: The sharp sign "#" marks the
#   beginning (to the end of the line) of a comment in Bash, Gawk,
#   Python.

# ** Compilation:
# - Cygwin comes with the "gcc" compiler in c:/cygwin/bin.
#   Our standard Bash incantation for compilation of one file is:
gcc -o prog prog.c   # prog is the executable file, prog.c the program file.
prog                 # Assuming "prog" needs no arguments, you just call it to run.
# - Example: Put the following in the file "prog.c":
#include <stdio.h>
main()
{
  printf("\n\n Hello World!\n\n");
}
# and compile and run it as shown above.

# ** Data structures and variables in C:
# - While in Gawk and Python variables and their bindings to data
#   structures jump into existence by simple assignment, this is not so
#   in C. You have to declare every variable and its type ("int",
#   "double", "char") before the first use, including looping variables:
int i, j, k;
double weight, height;
char c1, c2, c3;
i = j = k = 0;
weight = 325; height = 74;
c1 = 'a'; c2 = 'b'; c3 = 'c';   /* char literals are single quoted */
# - "High-level" data structures in C are arrays and 'structs'.
#   We only introduce arrays.
# - Arrays must have one uniform type (int, double, char,...),
#   and their length has to be declared at compile time.
#   [There exists dynamic memory allocation, but that is for later.]
double arr[100];
char name[30];   /* This is a string! */
int table_months_by_states[12][50];
# - Strings are one-dim character arrays. [Compare: Python has no
#   character type; strings are fundamental, although they are
#   sequences; Python characters are strings of length 1; in C the
#   character 'a' is different from the string "a" that contains 'a' as
#   its only element (actually: "a" has a null character for termination).]
# - Array indices are zero-based:
char mid_init[3];
mid_init[0] = 'J'; mid_init[1] = '.'; mid_init[2] = ' ';
int num[3];
num[0] = 10; num[1] = 20; num[2] = 30;
# - Multidimensional arrays are arrays of arrays:
int table[12][50];
#   This means that table[0], table[1], ..., table[11] are all
#   1-dim arrays of length 50.
#   As a consequence, the rightmost index runs the fastest.
#   Multi-dim indexing uses brackets for each dimension separately:
table[0][0] = 10; table[0][1] = 20;
# - Arrays, including strings, are mutable in content, but not in length.
#   [Compare with Python: strings and tuples are immutable,
#   but lists are mutable in content AND length because of
#   permitted insertions and deletions.]
char a[2];
a[0] = 'x'; a[1] = 'y';
a[0] = 'a'; a[1] = 'b';
#   Actually, for all practical purposes strings can be shorter than
#   the declared length: terminating with a zero character '\0'
#   tells the print functions to stop here.
#   Example: add this to the file "prog.c":
char c[10];
c[0] = 'a'; c[1] = 'b'; c[2] = '\0';
c[3] = 'x'; c[4] = 'y'; c[5] = '\n';
printf(c);
#   The part "xy\n" will not show because printf stops at the zero character.
# - Strings can be initialized in the declaration:
char a[] = {"Yes, Sir, Sir!\n"};
#   But saying a[] = {"..."} later on is not legal;
#   i.e., this syntax works only for initialization, not assignment.
# - for-loops: just like in Gawk -- "for(init.; term.cond.; incr.) {...}".
#   (Actually, just the opposite: Gawk is like C.) Try this in prog.c:
char a[] = {"Yes, Sir, Sir!\n"};
int i;
for(i=0; a[i] != '\0'; i++) printf("%c\n", a[i]);   /* Write one character per line. */
#        ^^^^^^^^^^^^^ end-of-string test; more typical is a test of the form "i < n".
#   (Format strings are pretty similar across Gawk, Python, C.)
# - Reading from stdin and writing to stdout: scanf and printf.
#   "printf" is familiar from Gawk, but "scanf" corresponds to Gawk's "getline".
#   Both are formatted ("...f") I/O.
main()
{
  double r;
  printf("\n Give me a real number: ");
  scanf("%lf", &r);
  printf("\n The number is %f\n\n", r);
}
# We will see later why "scanf" gets "&r": it wants a pointer, not a value.
# This is how side effects are achieved: "scanf" writes the value it reads
# into the location that "&r" points to, so after the call "r" holds a new value.
# - Looping over stdin:
#   "scanf" not only has a side effect, it also returns 1 as long as there is valid input.
#   This leads to the following typical idiom for reading stdin, and files in general:
main()
{
  double r;
  printf("\n Give me a real number: ");
  while(scanf("%lf", &r) == 1)
    printf("\n The number is %f.\n Give me another number: ", r);
}

# ** First small project: Tabulating a file containing real numbers.
# We will read the data from "stdin" and write the tabulation to "stdout".
# We will hardcode bin boundaries into the program;
# the tabulation will be an array containing the counts of reals in each bin.
# We first produce a sample file of input "data" as uniform random numbers
# in Python:
python -i
import random, os
os.chdir("e:/buja/stat-540")   # My Python gets confused about directories.
fn = file("unif_random_numbers.txt","w")
fn.writelines([str(random.uniform(0,1))+"\n" for i in range(100)])
fn.close()
# Now we prototype the tabulation algorithm in Python:
posts = [.2,.4,.6,.8]
nposts = len(posts)
tab = [0]*(nposts+1)
for r in [float(s) for s in file("unif_random_numbers.txt","r").readlines()]:
    for i in range(nposts+1):
        if i==nposts or r < posts[i]:
            tab[i] += 1; break
print tab
# Now the tabulation in C:
#include <stdio.h>
main()
{
  double posts[4] = {.2,.4,.6,.8};
  int nposts = 4;             /* cannot be found out; has to be put in by hand */
  int tab[5] = {0,0,0,0,0};   /* ditto */
  int i;
  double r;
  /* This line is the idiom for looping over a file: assignment and test in one. */
  while(scanf("%lf", &r) == 1)
    for(i = 0; i <= nposts; i++)
      if(i == nposts || r < posts[i]) { tab[i]++; break; }
  for(i = 0; i <= nposts; i++) printf("%d ", tab[i]);
  printf("\n");
}
# Put the above in a file "tab.c", compile, execute:
gcc -o tab tab.c
tab < unif_random_numbers.txt
# In a real-world context, the above program was the seed of a 700-line program
# that could perform the following tasks (at AT&T Labs):
# - Read gigabytes of binary data (unlike the above program that reads ascii data)
#   containing billions of telephone records.
# - Tabulate the data into multi-dimensional tables according to bins defined in terms of
#   . time of day (e.g., 6am,9am,noon,1pm,4pm,6pm,9pm), with circular bins covering midnight,
#   . day of the week (e.g., Mo,Tu,We,Th,Fr,Sa,Su, or weekday vs weekend), also circularly,
#   . duration (e.g., 5 bins defined by the boundaries 5sec,1min,10min,1hr),
#   . area codes of caller (212, 215, 415, 973,...),
#   . area codes of callee (ditto),
#   . a categorical variable RBU for caller (business/residence/unknown),
#   . same for callee (business/residence/unknown),
#   then write the resulting table to an ascii file.
# - Extract features of given phone numbers by creating tables of
#     time of day x day of week x duration x area code of callee x RBU of callee
#   for each phone number in a given list of phone numbers
#   (e.g., of size 100,000 for a classification problem, and
#   of size 80,000,000 for a scoring problem),
#   then write these "features" to a large binary file.
# - Read both the table structure and the list of phone numbers
#   beforehand from auxiliary input files.
# - Write an ascii file with Splus code for reading the table
#   and/or feature array into Splus arrays, with pretty labels and all.
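# [Aside, not from the original notes: with many bin boundaries, the inner
# linear scan over "posts" can be replaced by bisection, cf. the "bisect"
# module used for dictionary lookup later in these notes. A minimal Python
# sketch; "tabulate" is a made-up name, and "posts" must be sorted:
import bisect
def tabulate(values, posts):
    tab = [0]*(len(posts)+1)
    for r in values:
        # bisect_right(posts, r) = smallest i with r < posts[i],
        # which is exactly the bin the C program finds by linear scan.
        tab[bisect.bisect_right(posts, r)] += 1
    return tab

tabulate([.1, .5, .9, .3], [.2,.4,.6,.8])   # -> [1, 1, 1, 0, 1]
# ]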
# ** Second small project: Filtering ascii files to binary files.
#
# The reason for doing this is to relieve Python of the need to do ascii conversion,
# which as we will see is very slow. You may be able to read a file into Python
# but you may choke in the ascii-to-number conversion.
# (The same holds for Splus and R, which choke on I/O of large files.)
# There will be three versions of filters:
#    a2b_double: convert ascii to binary double floats
#    a2b, b2a:   convert ascii to binary and back, with flags for the data type
#
# The first filter, a2b_double, is easy and only requires the C function "fwrite()"
# in addition to what we already know.
#
# The next two filters give us examples of argument passing from the shell
# to a C program. The filters a2b and b2a require flags to indicate the data type.
# a2b_double.c:
# You can say "man fwrite" to bash for a terse documentation of "fwrite()", or "info libc"
# (down to "Library Index", <RET>, Emacs-search "^s fwrite", <RET>).
# We also use "ferror()" to learn in which item things went wrong if they did ("man ferror").
#---------------- begin a2b_double.c
/* ascii-to-binary filter for double floats
   Usage:    a2b_double < ascii-infile > binary-outfile
   Every 250,000 items a message is printed on stderr.
   At the end the total number of items read is printed on stderr.
   Compile:  gcc -o a2b_double a2b_double.c
*/
#include <stdio.h>
main()
{
  double dbuffer;
  long items_read;
  items_read = 0;
  while (fscanf(stdin,"%lf",&dbuffer) == 1)       /* read ascii as doubles */
    {
      fwrite(&dbuffer,sizeof(dbuffer),1,stdout);  /* binary write */
      items_read++;
      if(items_read % 250000 == 0) fprintf(stderr,"%ld items read\n",items_read);
    }
  if( ferror(stdin) )
    { fprintf(stderr,"?Read error, record %ld\n",items_read+1); exit(1); }
  fprintf(stderr,"Finished; %ld elements read\n",items_read);
  exit(0);
}
#---------------- end a2b_double.c
# In order to test this on a larger scale, we also need some large ascii files.
# Here is another C routine that writes uniform random numbers to stdout:
#---------------- begin rand2a.c
/* Writes n uniform [0,1] random numbers to 6 digits on stdout.
   Writes a message to stderr every 100,000 numbers.
   Compile:  gcc -o rand2a rand2a.c
   Usage:    rand2a 10000000 > bigdoubles.txt
*/
#include <stdio.h>
#include <stdlib.h>
main(argn,args)
int argn; char *args[];
{
  int n, i;
  n = atoi(args[1]);
  for(i=0; i<n; i++) {
    /* one uniform number per line, 6 digits (the loop body was lost in this
       copy of the notes; this reconstruction follows the header comment) */
    printf("%8.6f\n", rand()/(RAND_MAX+1.0));
    if((i+1) % 100000 == 0) fprintf(stderr,"%d numbers written\n",i+1);
  }
}
#---------------- end rand2a.c
# In a bash, say
rand2a 10000000 > bigdoubles.txt               # for generating 10,000,000 floats, and
a2b_double < bigdoubles.txt > bigdoubles.bin   # for filtering them to binaries.
# Of course we could have written the random numbers to binary right away,
# but the point is that we usually get ascii data, and we speed things up
# for analysis software if we can filter ascii to binary, as we will see next.

################################################################
#
# Large data issues: query, I/O
#
# We will play with the English dictionary:
# Recall the English dictionary that comes with WinEdt on W2K?
# If you search this file for "English dictionary" (Emacs: ^R...),
# you will find processing of the following directory and files:
words_dir = "C:/Program Files/WinEdt/Dict/English/"
words_files = ["eng_com.dic", "center.dic", "color.dic", "ize.dic", "labeled.dic", "yze.dic"]
# We will redo some in Python, and then move on to questions of lookup.
# Because "dictionary" means "hash table" in Python, we are not using this term here,
# although we did the first time we looked at these files.
words = []
# Remember: adding lists needs the "extend" method, not "append".
for fn in words_files:
    words.extend(file(words_dir+fn).readlines())
len(words)   # As we noted earlier: English is a "large language".
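# [Aside, not from the original notes: a two-line reminder of why "extend",
# not "append", is needed above -- "append" would nest the list:
a = [1,2]; a.append([3,4]); a   # -> [1, 2, [3, 4]]
b = [1,2]; b.extend([3,4]); b   # -> [1, 2, 3, 4]
# ]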
print words[0:100]
# Clean out the lines starting with "%" which are obviously headers,
# but do it conservatively, in case something goes wrong:
wordsx = [item.replace("\n","") for item in words if item[0] != "%"]
# Worked; let "words" point to the clean dictionary:
words = wordsx
# Q: Why does the following not remove the dictionary?
del wordsx
# A: ...
# The dictionary is not sorted yet:
words.sort()   # Takes 1 second!
# Algorithmic problem: What is a fast way of looking up items in a sorted sequence?
# Answer: Bisection
#   Method: recursively, halve the sequence to the half that contains the query item.
#   Analysis: The number of comparisons is on the order of the binary log of len(sequence).
#     For example, len=1,000 requires about 10 comparisons; len=1,000,000 about 20.
#   Generality: Works for any ordered sequence: numbers, strings, tuples, lists,....
#   Note: The particular ordering is irrelevant; it is used as a purely computational device.
# There exists a binary search in the module "bisect":
import bisect
B = bisect
dir(B)
doc(B.bisect)       # This one returns a fence post past the query item...
doc(B.bisect_left)  # ... and this one before the query.
# If we bisect for "gibberish" in the sequence, we get this fence post:
B.bisect_left(words,"gibberish")
# The look up is:
words[B.bisect_left(words,"gibberish")]
# If one looks up an item that is not in the sequence, one still gets a fence post:
qitem = "giberisch"   # (This would be the German spelling.)
words[B.bisect_left(words, qitem)]
# Now what is "gibers"? Anyhow, it seems to be English, and it does not match the query:
words[B.bisect_left(words, qitem)] == qitem
# How fast is this if we have to do a lot?
count = 0
for i in range(1000):
    count += ("gibberish" in words)
# A faster way to establish whether we had a hit is with bisection and a comparison,
# as opposed to the linear "in" test:
count = 0
qitem = "gibberish"
for i in range(1000):
    count += (words[B.bisect_left(words, qitem)] == qitem)

# Next problem:
# Usually, we do not wish to look up an item in a list, but a record in a data table.
# The lookup is by one of the fields in the table, such as name or SSN.
#
# We construct a data table consisting of 4 columns (fields): a running integer
# below 1,000,000, a shuffled large integer, a random float, and a random letter.
# That is, the records are just quadruples.
# (You may have to choose a smaller number on your machine, e.g., 100,000.)
import string, copy, random, bisect
S = string; C = copy; R = random; B = bisect
f0 = range(1000000)
f1 = [i+10000000 for i in f0]; R.shuffle(f1)
f2 = [R.uniform(0,1) for i in f0]
f3 = [R.choice(["f","m"]) for i in f0]
tab = zip(f0, f1, f2, f3)
tab[:5]
len(tab)
#
# We would like to look up a list of records in tab according to the second field. How?
#
# Answer 1: linear search
set = [record for record in tab if record[1] in range(10000000,10000010)]            # slow
set = [record for record in tab if 10000000 <= record[1] and record[1] < 10000010]   # faster
set = [record for record in tab if record[2] < .01]
set = [record for record in tab if record[3] == "m"]
len(set)
#
# Answer 2: Create an "index" in the sense of databases,
#   that sorts the table according to the query field;
#   then use bisection.
#   An index can be interpreted as a sorted list of pairs,
#   consisting of the query field and the list index into the table.
# Example: An index for the second field of "tab" would be constructed as follows.
# List of pairs (field,index):
index = [(tab[i][1], i) for i in range(len(tab))]   # (Takes a while.)
# Sort, implicitly by the field:
index.sort()
# Now "index" is a database index. The following sorts the table by the second field:
tab_s2 = [tab[i] for f,i in index]
# Indeed:
tab_s2[0:5]
# Lookups with bisection are as follows:
# If we want to look up the record with second field = 10407321,
# we need to bisect "index" so the first field is 10407321.
# But "index" is a list of pairs sorted by the first field,
# which forces us to augment 10407321 to a pair such as (10407321,0)
# that can be compared with elements of "index".
# Therefore,
B.bisect_left(index, (10407321,0))
# gives us the index into "index" whose pair has first element = 10407321.
# The second element is the index into the table "tab":
index[B.bisect_left(index, (10407321,0))][1]
# And indeed, this record has second field = 10407321:
tab[index[B.bisect_left(index, (10407321,0))][1]]
# It may be reasonable to hide this arcane piece in a function:
def get_rec_with_f2_eq_to(f2):
    return tab[index[B.bisect_left(index, (f2,0))][1]]
# Now for some comparisons with linear lookup:
# Look up a single item:
[tab[i] for i in range(len(tab)) if tab[i][1] == 10407321][0]   # lin
get_rec_with_f2_eq_to(10407321)                                 # bis
# Clearly bisection is faster. This becomes more visible as we look up more items, such as 3:
set = [tab[i] for i in range(len(tab)) if tab[i][1] in [10500000,10400000,10300000]]   # lin
set = [get_rec_with_f2_eq_to(item) for item in [10500000,10400000,10300000]]           # bis
# It gets even more pronounced as we look up 100 items:
set = [tab[i] for i in range(len(tab)) if tab[i][1] in range(10500000,10500100)]   # lin
set = [get_rec_with_f2_eq_to(item) for item in range(10500000,10500100)]           # bis
# This is a range query, though, which can be sped up some with inequalities:
set = [tab[i] for i in range(len(tab)) if 10500000 <= tab[i][1] and tab[i][1] <= 10500100]
# Overall message: if a lot of queries have to be processed, bisection wins hands down.

################
#
# -- I/O of large datasets --
#
# How are we going to get a million records in and out of Python?
#
# Try to dump and re-load the table "tab" with pickle:
import cPickle
cPickle.dump(tab, file("tab.pickle","w"))
# One time this crashed after half the file was written.
# Another time, after freshly starting up Python, it took about 3 minutes.
# The file is 54,776,586 bytes long.
#
# What about direct file I/O?
# All conventional I/O needs ASCII conversion:
tab_str = str(tab)
# Again, one time this took about two minutes, another time it crashed.
# If it works, this produces one long string:
len(tab_str)    # 45889384 characters
tab_str[:500]
# Over 45 MB. Now the actual output:
file("tab.dat","w").write(tab_str)
# This was fast. And we got it all out on disk:
os.system("ls -l tab.dat")   # 45889384 bytes
# A more human-readable form of output is with formats:
tab_strlst = ["%6d %8d %8.6f %1s\n" % rec for rec in tab]   # Took about a minute.
file("tab_format.dat","w").writelines(tab_strlst)           # This was very fast.
# Due to limiting the precision of the real numbers in the format ("%8.6f"),
# we waste less space than with the free format; in addition we have Newlines
# appended, and the file is human-readable with lined-up columns:
os.system("head tab_format.dat")
os.system("tail tab_format.dat")
os.system("ls -l tab_format.dat")   # 27000000 bytes = 27 MB
#
# We summarize the big message: The time sink is ASCII conversion, not I/O!!
#
# Next we will read the data back in from the formatted file:
tab_in = file("tab_format.dat","r").readlines()   # Fast.
# And correct:
tab_in[:5]
# The conversion:
tabx = [(int(line[0:6]), int(line[7:15]), float(line[16:24]), line[25:26])
        for line in tab_in]
# Again, this took longer, but not extremely so. (Another time, though, it crashed.)
# It appears possible to read in, and unformat, a file of 27 MB.

################
#
# -- Arrays (not numarray's):
#
# An alternative storage possibility is with arrays.
# Recall: Arrays are different from other sequence types in that
# array elements must be of the same type.
# One could use the module "numarray" (see earlier) or "Numeric" (see later) to this end,
# but there is also a simpler module "array" that comes with the standard Python distribution
# and therefore does not need extra downloading:
import array; A = array
doc(A)     # def doc(x): print x.__doc__   (Put this in an init file.)
# Arrays use single-character type codes.
dir(A)     # Has just a single method: "A.array(...)"
doc(A.array)
# Examples of arrays:
A.array("i", [1,2,3])
A.array("f", [1.,2.,3.])
A.array("c", "abc")
#
# Variables/columns in data tables typically have a uniform type, as do arrays.
# This suggests that variables be implemented with arrays.
# So far we have implemented data tables in terms of lists of RECORDS REPRESENTING CASES,
# but an alternative is in terms of lists of ARRAYS REPRESENTING VARIABLES.
# Or, if the number of variables is small, one can just keep the variables
# as a set of independently named arrays.
#
# Compare:
# - Storing variables in arrays = storing columns of a matrix (column major).
#   This is our orientation in statistics: variables before cases.
# - Storing cases as tuples = storing rows of a matrix.
#   This is the orientation in databases: records (=cases) before attributes (=variables).
#
# As an example, we reformat "tabx" (row major) to "taba" (col major):
taba = [ A.array("i", [rec[0] for rec in tabx]),
         A.array("i", [rec[1] for rec in tabx]),
         A.array("f", [rec[2] for rec in tabx]),
         A.array("c", [rec[3] for rec in tabx]) ]
# Bearable in terms of time.
# Note that access to "taba" is reversed from "tabx":
taba[1][5] == tabx[5][1]
#    var^ ^case  case^ ^var
#
# Before going on, we get rid of unnecessary data to avoid memory problems:
del tab_in
del tabx
# Takes surprisingly long, which indicates that the garbage collector was busy reclaiming
# released memory.
#
# Back to arrays:
# We reproduce searching/querying with indexing and bisection, now implemented with arrays.
# Index construction:
index = zip(taba[1],range(len(taba[1])))   # Takes a while.
index.sort()
# In case we lose it:
cPickle.dump(index, file("tab_index.pickle","w"))
# In case we need to retrieve it:
index = cPickle.load(file("tab_index.pickle","r"))
# A query example:
index[B.bisect_left(index, (10500000,0))][0]
# The query 10500000 fetches 10500000 indeed.
#
# Again, we hide the tedium of a query in a function:
def get_rec_with_f2_eq_to(f2):
    i = index[B.bisect_left(index, (f2,0))][1]
    return (taba[0][i],taba[1][i],taba[2][i],taba[3][i])
# Example of retrieval: case/record with field 2 = 10500000
get_rec_with_f2_eq_to(10500000)
# A disadvantage of plain arrays (non-numarray) is that they cannot be indexed by lists:
taba[1][[1,2,3]]    # (bombs)
# whereas this would work in numarray.
# So we have to do list comprehension if we want to retrieve multiple records:
[get_rec_with_f2_eq_to(f2) for f2 in range(10500000,10500100)]
# which is instantaneous, as opposed to linear search:
[(taba[0][i],taba[1][i],taba[2][i],taba[3][i]) for i in range(len(taba[0]))
 if taba[1][i] in range(10500000,10500100)]
# which takes forever.
#
# A useful feature of arrays, however, is binary I/O.
# One can write an array to a file in binary format:
taba[1].tofile(file("taba1","w"))
# 4 bytes per integer, no overhead:
os.system("ls -l taba1")
# Reading back in: works only as appending to an existing array (a touch unfriendly),
# and one has to know the number of items to be read (same).
taba1 = A.array("i",[])
taba1.fromfile(file("taba1","r"),1000000)
# But it works, the array is correctly retrieved:
taba1 == taba[1]
#
# Happily, we have the "a2b_double" conversion filter, which we can use
# to convert lists of ascii numbers to binary numbers, which
# can then be read into an array efficiently.
#
# For demonstration, we write 10 million floats to a file.
# This takes about 10 minutes, so be patient.
# We write out 10 million in all, 100 chunks of size 100,000 each,
# so we have a sense of progress:
fn = file("bigdouble","w")
i = 0
while i < 100:
    fn.writelines(["%6.4f\n" % random.random() for j in range(100000)])
    i += 1; print i
fn.close()
# 70 MB, as it should be:
os.system("ls -l bigdouble")
os.system("head bigdouble")
os.system("a2b_double < bigdouble > bigdouble.bin")
# This took a couple of minutes, even though it is a C program!
# (Un)Formatting is not easy even for C code.
os.system("ls -l bigdouble.bin")   # 80 MB, as it should be.
# Binary read into an array:
bigdouble = A.array("d",[])
bigdouble.fromfile(file("bigdouble.bin","r"), 10000000)
# Took just a few seconds!!!
len(bigdouble)
bigdouble[:10]
bigdouble[-10:]

################
#
# Summary -- "large" data handling in a high-level language such as Python
#
# 1) What size data your machine is able to handle depends largely
#    on the amount of main memory available.
#    On a machine with 512 MB of memory it is possible to handle
#    data structures that take up on the order of 50 MB of space.
#    One reason why this may not be more is that many operations
#    need space to allocate a copy of the original data structure,
#    which runs the memory needs up to 100 MB for a 50 MB structure.
#    Example: reading from file                    -> string version of the data
#             ASCII-to-float/int conversion        -> numeric version
#             reshaping to a different matrix form -> structured version
#             derivation of selected indices       -> auxiliary data of comparable size
#             ...
#
# 2) Memory being the constraining factor, it may be good advice
#    to release data structures that are no longer needed,
#    so the garbage collector can reclaim their memory.
#    In Python, releasing memory is achieved with the "del" statement,
#    which may take surprisingly long to execute (a couple of seconds)
#    for this reason: garbage collection.
#    [Modern languages such as Python maintain "reference counts" for
#    all data structures; when the count sinks to zero, the structure
#    is garbage-collected because it has dropped out of the user's
#    universe of reference. I.e., none of the user's variables and
#    pointers can reach the structure anymore.]
#
# 3) A major consideration in the handling of "large" data (~50-100 MB) is I/O.
#    Of the two I/O tasks -- reading/writing disk files and (un-)formatting
#    (= converting between ASCII and binary form) -- the time-consuming part
#    is formatting; pure disk access is comparatively fast.
#    It can therefore be desirable to convert ASCII data to binary form with a filter
#    written in the C language, before reading the data into a high-level system such as Python.
#    Similarly, it may be desirable to write the data to disk in binary form, and
#    convert the binary file to human-readable or machine-independent form with a C filter.
#
# 4) Data we see in statistics is usually stored in some form of
#    rectangular data structure. Nomenclature:
#    - In statistics we call the data structure a "data matrix";
#      its rows are called "cases"; its columns "variables".
#    - In databases, the data structure is called a "table" or "relation";
#      its rows are called "records"; its columns "attributes".
#      [A relational database is a collection of related tables.]
#    More important than nomenclature is the difference in storage priority:
#    - In statistics, we analyze relationships between variables;
#      we therefore prefer to store tables such that variables are easily handled.
#      => Column-major storage; data = list of variables.
#    - In databases, the priority is querying the data, i.e., efficient access to cases.
#      One therefore prefers to store tables such that cases are easily handled.
#      => Row-major storage; data = list of cases.
#
# 5) The problem of querying data can be solved efficiently if the data fits in main memory.
#    Data that fits in main memory is small enough that sorting it takes very little time.
#    Querying can then be done with bisection in base-2-log-of-N time,
#    after sorting/indexing the data according to the query attribute.
#    High-level software typically offers bisection functions.
#
# 6) In statistics we can often make the assumption that a set of variables is
#    of uniform numeric type. It is then efficient to store the data in matrix/array form.
#    Linear algebra software can be applied to analyze the relations between variables.

################################################################
#
# NUMERICAL LINEAR ALGEBRA, in Python:
#
# We go over a few basic matrix operations as implemented in Python.
#
# You should have downloaded and installed the "Numeric" module from
# www.python.org (-> "The Vaults of Parnassus" -> "math").
#
# Import and abbreviate:
import array;         A = array
import multiarray;    M = multiarray
import Numeric;       N = Numeric
import LinearAlgebra; L = LinearAlgebra
#
# Notes: - This assumes that Numeric has been downloaded from www.python.org.
#        - "Numeric" is the precursor of "numarray", which is still somewhat incomplete.
#        - "Numeric" comes with "LinearAlgebra".
#        - "LinearAlgebra" is a set of wrapper functions for the corresponding functions
#          in Lapack, the standard low-level library for numerical linear algebra.
#        - "array" has vectors only, essentially lists of uniform type, but faster.
#        - "multiarray" is the multidim extension with matrices (2D) and arbitrary-dim arrays.
#        - You have to check which functions are available where. For example,
#          "array" has no sort, but "multiarray" has.
#          "LinearAlgebra" has no matrix multiplication, but "Numeric" and "multiarray" do.
#        - The impression is that Numeric supersedes both "multiarray" and "array",
#          but it is not part of the standard distribution of Python yet.
#
# So what linear algebra is in "Numeric" and "multiarray"?
dir(N)
dir(M)
N.matrixmultiply(M.array([[1,2],[3,4]]), N.array([[1,1],[1,0]]))
M.matrixproduct(M.array([[1,2],[3,4]]), N.array([[1,1],[1,0]]))
N.matrixmultiply([[1,2],[3,4]], [[1,1],[1,0]])
M.matrixproduct([[1,2],[3,4]], [[1,1],[1,0]])
# So M-matrices, N-matrices, and lists with matrix structure can be multiplied
# interchangeably with "Numeric" and "multiarray".
# Unfortunately, the two do not have the same name for matrix multiplication.
#
# We continue with "Numeric" only.
# Tests of a few simple matrix manipulations:
N.array([[1,2,3],[4,5,6]])                # Form an array, a matrix in this case.
N.transpose(N.array([[1,2,3],[4,5,6]]))
N.trace([[1,2],[3,4]])                    # Trace = sum of diagonal.
N.innerproduct([1,2,3],[1,1,1])
N.negative([1,2,3])
N.diagonal([[1,2],[3,4],[5,6]])
N.sort([[1,2],[-1,1],[3,1],[2,1]],axis=0) # Sorts along axis 0: each column sorted independently.
N.sort([[1,2],[-1,1],[3,1],[2,1]],axis=1) # Sorts along axis 1: each row sorted independently.
N.searchsorted([10,11,12],[12,11,11,12])  # Bisects the sorted 1st arg for the queries in the 2nd.
N.concatenate( ([[1,1],[2,1],[3,1]], [[4,1],[5,1]], [[6,1]]) )   # 1 arg, a tuple of sequences.
#
# Matrix multiplication and transposition are so useful that we introduce abbreviations
# to unclutter notation:
N.m = N.matrixmultiply
N.t = N.transpose
# Now for the real linear algebra:
# What is in "LinearAlgebra"?
dir(L)
# You can find the source in the file
L.__file__
# or, actually, in the ".py" version of this file (it is a compiled ".pyc" file).
# In my case, L.__file__ says
#     /usr/lib/python2.2/site-packages/Numeric/LinearAlgebra.pyc
# which should be translated to
#     c:/cygwin/lib/python2.2/site-packages/Numeric/LinearAlgebra.py
# because the cygwin directory "/usr/" is "c:/cygwin/" for W2K.
#
# Here is a list of functions defined in this file:
os.system("grep 'def ' c:/cygwin/lib/python2.2/site-packages/Numeric/LinearAlgebra.py")
def _commonType(*arrays):
def _castCopyAndTranspose(type, *arrays):
def _fastCopyAndTranspose(type, *arrays):
def _assertRank2(*arrays):
def _assertSquareness(*arrays):
def solve_linear_equations(a, b):
def inverse(a):
def cholesky_decomposition(a):
def eigenvalues(a):
def Heigenvalues(a, UPLO='L'):
def eigenvectors(a):
def Heigenvectors(a, UPLO='L'):
def singular_value_decomposition(a, full_matrices = 0):
def generalized_inverse(a, rcond = 1.e-10):
def determinant(a):
def linear_least_squares(a, b, rcond=1.e-10):
#
# Unfortunately, "LinearAlgebra" is unfriendly in that it does not coerce/promote lists
# to arrays:
L.inverse([[1,2],[3,4]])            # Bombs.
L.inverse(N.array([[1,2],[3,4]]))   # Works.
# Checking that this is indeed an inverse:
N.m(N.array([[1,2],[3,4]]), L.inverse(N.array([[1,2],[3,4]])))
# = pretty much the identity matrix.
#
# LS:
doc(L.linear_least_squares)
# Build up design matrix x (= arg a) and response y (= arg b):
x = N.t([[1]*20, [random.random() for i in range(20)]])
x
y = N.array([1.0+2.0*xrow[1] for xrow in x])
lls = L.linear_least_squares(x, y)
lls       # = (coeffs, RSS, rank(x), singular values of x)
lls[0]    # = coeffs
lls[1]    # = RSS
lls[2]    # = rank(x)
lls[3]    # = singular values of x
#
#
# Before we go any further, we want to brush up on linear algebra in a more orderly fashion.
# Therefore, we step back to basics and build up our intuitions and interpretations
# of linear algebraic concepts step by step.
#
################
#
# A computational/geometrical/statistical recap of Linear Algebra:
#
# The following assumes that you know linear algebra.
# It is a condensed summary and interpretation of linear algebra
# with computation, high-dim geometry, and statistical application in mind.
#
#
# - Forming vectors and matrices:
#
# 1) We can create vectors and matrices as 1D and 2D arrays from lists and lists of lists.
#    We abbreviate array creation in the spirit of "N.m" and "N.t" above:
N.a = N.array
#    Vectors: from lists of elements
vec = N.a([1,2,3])
vec
#    Matrices: from LISTS OF ROWS
mat = N.a([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
mat
#    If a matrix is given as a list of columns, use them as rows first and transpose:
N.t([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
#    Note transposition does promotion, so you do not need to use "N.a" first.
#
# 2) If a linear list is to fill a matrix of a certain size
#    (in row-major order only!), use "N.resize":
mat = N.resize([1,2,3,4,5,6,7,8,9,10,11,12], [4,3])
mat
#    which created a 4x3 matrix.
#
# 3) Odds and ends: checking shape and overall size (=length)
N.shape(vec)
N.shape(mat)
N.size(vec)
N.size(mat)
#
#
# - Selections/permutations:
#
# 1) Selecting elements: indexing
vec[2]      # element 3 of the vector "vec"
mat[1,2]    # element in row 2 and column 3 of the matrix "mat"
#
# 2) Selecting columns and rows: slicing/indexing
mat[:,1]    # column 2 as a vector
mat[2,:]    # row 3 as a vector
#
# 3) Selecting submatrices: slicing with fence posts
mat[1:3,:]  # rows 2 and 3
mat[:,2:]   # cols 3 to the end, here just the third column, but as a 4x1 matrix
#
# 4) Selecting/repeating/permuting arbitrary rows/cols of a vector or matrix:
N.take(vec, [2,0])            # a vector containing elements 3 and 1 of "vec"
N.take(mat, [3,2,3,0])        # a matrix containing rows 4,3,4,1 of "mat"; default: axis=0
N.take(mat, [2,1,0], axis=1)  # a matrix with cols in reverse order of "mat"
N.take(mat, range(N.shape(mat)[0])*2)          # 2 copies of "mat" stacked on top of each other
N.take(mat, range(N.shape(mat)[1])*2, axis=1)  # 2 copies of "mat" side by side (default: axis=0)
#
#
# - Inner products:
#
N.i = N.innerproduct
# Example:
a = N.a([0,1,2,3])
b = N.a([1,1,1,1])
N.i(a,b)
# It is really the same as:
N.sum(a*b)   # ("*" is the Numeric version of element-by-element multiplication)
# The simplicity of the definition belies its geometric importance.
# The most important aspect of inner products is that they determine
# the metric concepts of geometry:
#     NORM (length), ANGLE, ORTHOGONALITY, ORTHOGONAL PROJECTION.
# In what follows we give a gentle re-introduction to the geometry of metric p-space,
# more precisely, Euclidean p-space/n-space.
#
#
# - Pythagoras:
#
# It all starts with the Pythagorean theorem: c^2 = a^2 + b^2 if angle(a,b) = right angle.
# After introducing p-space as the space of coordinate vectors x = (x_1,...,x_p),
# one generalizes the Pythagorean theorem by declaring the coordinate axes to be orthogonal.
# [Note: this is an act of creation, not derivation!]
# Therefore, one defines the squared norm or length to be:
#     |x|^2 = x_1^2 +...+ x_p^2
# This is the squared distance of the point at x from the origin, (0,...,0).
# For measuring distance between the points given by two arbitrary vectors,
# x and y=(y_1,...,y_p), say, we need to compute:
#   dist(x,y)^2 = |x-y|^2 = (x_1-y_1)^2 +...+ (x_p-y_p)^2
# which calls for algebraic expansion:
#   = (x_1^2 +...+ x_p^2) + (y_1^2 +...+ y_p^2) - 2*(x_1*y_1 +...+ x_p*y_p)
# This in turn calls for the introduction of the inner product, as a pure convenience:
#   <x,y> = x_1*y_1 +...+ x_p*y_p
# We can now define squared distances in terms of inner products:
#   |x|^2 = <x,x>   and   |x-y|^2 = <x,x> + <y,y> - 2*<x,y>
# but we can also write inner products in terms of squared distances:
#   <x,y> = ( |x|^2 + |y|^2 - |x-y|^2 )/2
# So the concepts are one-to-one:
#   every inner product defines a squared distance,
#   every squared distance defines an inner product,
# but inner products are more versatile.
# The inner product is a "bilinear" function, the squared norm a "quadratic" function.
# The bilinear function is needed for the expansion of quadratic expressions.
#
#
# - Computations relating to inner products and norms:
#
# The norm or length of a vector is obtained with the function
N.norm = lambda x: N.sqrt(N.i(x,x))
# [Recall: lambda = anonymous single-expression function]
# Because of the Pythagorean relation, one often needs the squared norm,
# which is worth a function of its own:
N.sqnorm = lambda x: N.i(x,x)
# [We added "norm" and "sqnorm" to Numeric to have it all under one roof.]
# Examples:
N.sqnorm([1,1,1])
N.sqnorm([1,1,1,1])
N.norm([1,1,1])
N.norm([1,1,1,1])
# A frequently necessary operation is standardization to unit norm:
N.std = lambda x: N.a(x) / N.norm(x)
# Examples:
N.std([1,1])
N.std([1,1,1,1])
#
# Statistical interpretation of the inner product and the squared norm:
#   covariance and variance, second-order cross-moment and moment,
#   modulo centering of the vectors and division by n or n-1.
# Example:
# We generate normal data, center them, and compute the variances and covariance.
R = RandomArray           # comes with Numeric; creates random arrays, not just numbers
dat1 = R.normal(0,1,10)   # n=10 in what follows.
dat2 = R.normal(0,1,10)
N.center = lambda x: N.a(x) - N.sum(x)/N.size(x)   # function to center a vector "x"
dat1 = N.center(dat1)
dat2 = N.center(dat2)
# These are the variances and covariance (using the n-1 convention):
N.sqnorm(dat1)/9
N.sqnorm(dat2)/9
N.i(dat1, dat2)/9
#
#
# - Angle:
#
# In the Pythagorean heuristic there is an intuitive notion of one particular angle:
# the right angle.
# The general notion of angle follows again a Pythagorean heuristic:
# If the vector with coordinates (a_1,a_2) in the plane is of unit length: a_1^2 + a_2^2 = 1,
# we call a_1 the cosine and a_2 the sine of the "angle" between (1,0) and (a_1,a_2).
# We note: <(1,0),(a_1,a_2)> = a_1
# Generalizing this to angles between arbitrary unit-length vectors x and y in p-space,
# we define
#   cos(angle(x,y)) = <x,y>
# If the vectors are not of unit length, they need to be standardized first:
#   cos(angle(x,y)) = <x/|x|, y/|y|> = <x,y>/(|x|*|y|)
#
# Statistical interpretation of the cosine: CORRELATION,
# again only if applied to centered vectors:
N.cor = lambda x,y: N.i(x,y) / (N.norm(x) * N.norm(y))
N.cor(dat1, dat2)
# The angle is often a more intuitive measure of angular closeness than its cosine:
N.ang = lambda x,y: N.arccos( N.i(x,y) / (N.norm(x) * N.norm(y)) )
# which gives the angle in radians. [More cautious programming would intercept zero norms...]
# 360-degree-based angles are given by
N.ang360 = lambda x,y: N.arccos( N.i(x,y) / (N.norm(x) * N.norm(y)) ) /math.pi * 180.
# Example:
N.ang360(dat1, dat2)
# Special cases are:
N.ang360([1,0,0,0],[2,0,0,0])   # parallel vectors
N.ang360([1,0,0,0],[0,3,0,0])   # orthogonal vectors
#
#
# - Some special vectors:
#
# 1) Canonical basis vectors:
N.a([0,1,0,0])
N.i([0,1,0,0], [1,2,3,4])   # = 2nd coordinate of the 2nd vector: the i-th basis vector picks out coordinate i
#
# 2) Sum and mean vectors:
N.a([1,1,1,1])      # "sum vector" in 4D
N.a([1,1,1,1])/4.   # "mean vector" in 4D
# Why these names? Because
N.i([1,1,1,1], [1,2,3,4])
# is the sum of the elements in the second vector;
N.i(N.a([1,1,1,1])/4., [1,2,3,4])
# is the mean of the elements in the second vector.
# Note that the mean vector does NOT have unit norm:
N.norm(N.a([1,1,1,1])/4.)
N.norm(N.a([1,1,1,1,1])/5.)
N.norm(N.a([1,1,1,1,1,1])/6.)
# Its norm goes to zero as the dimension goes to infinity.
#
# 3) The following are examples of "contrast" or generalized difference vectors:
N.a([1,-1,0,0])
N.a([-.5,-.5,.5,.5])
N.a([1,-1./3,-1./3,-1./3])
# Contrast vectors are defined as the vectors orthogonal to the mean/sum vector:
N.i([1,1,1,1], [-.5,-.5,.5,.5])
# Equivalently:
N.sum([-.5,-.5,.5,.5]) == 0
# Statistical interpretations of contrast vectors:
# a) coefficients whose inner product with a set of numbers forms a generalized difference;
# b) a set of numbers with zero mean;
# c) residual vectors after fitting a mean.
#
#
# - Special matrices:
#
# 1) The identity matrix "I" of size p (=5):
N.identity(5)
#
# 2) Diagonal matrices "D" of size p (=5):
diagelements = [1,2,3,4,5]
N.identity(5)*diagelements   # The list is promoted to a vector.
# Left-multiplication with a diagonal matrix multiplies the rows with the diagonal elements:
mat = N.a([[1,2],[3,4],[5,6]])
diag = N.identity(3)*[1,2,3]
N.m(diag, mat)
# Right-multiplication with a diagonal matrix multiplies the cols with the diagonal elements:
mat = N.a([[1,2],[3,4],[5,6]])
diag = N.identity(2)*[2,3]
N.m(mat, diag)
# This is often used to represent the correlation matrix in terms of the covariance matrix:
#   cormat = diag * covmat * diag, where "diag" has 1/std.dev(columns) in the diagonal.
#
# 3) The "one" or summation matrix "E" of size pxq (=5x3):
summat = N.ones([5,3])
N.m(summat, [1,2,3])   # a constant vector of sums
#
# 4) The averaging matrix of size pxq (=5x3):
avemat = N.ones([5,3])/3.
N.m(avemat, [1,2,3])   # a constant vector of "fitted means"
#
# 5) Rank-one matrices, or outer products:
# Given two vectors a and b of length p and q, respectively,
# their outer product is a matrix C of size pxq with elements:
#   C_ij = a_i * b_j
# If the vectors are interpreted as px1 and qx1 matrices,
# the outer product in terms of matrix multiplication is:
#   C = a b^t
# which is a matrix of size pxq.
# Outer product matrices are of rank one, i.e., their column/row spaces are 1-dim:
#   C = (b_1*a, b_2*a,..., b_q*a)
# Note that the sum matrix is the outer product of two sum vectors.
# Computations:
avec = N.a([1,2,3,4])
bvec = N.a([1,2,4])
N.outerproduct(avec, bvec)
#
#
# - Interpretations of matrix multiplication:
#
# If A is pxq and B is qxr, the matrix product C = AB is pxr and has entries
#   C_ik = sum_j A_ij B_jk
# There are two interpretations of matrix products that are both useful:
#
# 1) C is the matrix of all INNER PRODUCTS of the ROWS of A with the COLUMNS of B.
# 2) C is the sum of OUTER PRODUCTS of COLUMNS of A and ROWS of B.
#
# The two interpretations derive from different settings of parentheses:
#
#   C_ik = (sum_j A_ij B_jk) = sum_j (A_ij B_jk)
#          ^^^^^^^^^^^^^^^^          ^^^^^^^^^^^
#          inner product             outer product
#          of A_i* and B_*k          of A_*j and B_j*
#
# We denoted by A_i* and B_j* the rows, and by A_*j and B_*k the columns, respectively.
# We have not said anything yet about outer products, but this will change below
# when we talk about projections and eigen/singular value decompositions.
# Here are a couple of illustrations:
amat = N.a([[1,1,1],[1,1,-1],[1,-1,1],[1,-1,-1]])
bmat = N.a([[1,2],[3,4],[5,6]])
# matrix product:
cmat = N.m(amat, bmat)
cmat
# matrix of inner products:
cmat = N.zeros([4,2])
for i in range(4):
    for k in range(2):
        cmat[i,k] = N.i(amat[i,:], bmat[:,k])

cmat
# sum of outer products:
cmat = N.zeros([4,2])
for j in range(3):
    cmat += N.outerproduct(amat[:,j], bmat[j,:])

cmat
#
# When the inner-product interpretation is used, one often sees matrix products
# in which the first matrix is transposed, as in:
R = RandomArray
amat = N.resize(R.normal(0,1,24), [8,3])   # 8x3
bmat = N.resize(R.normal(0,1,32), [8,4])   # 8x4
N.m(N.t(amat), bmat)                       # 3x4 matrix of inner products of columns of both
# The product of transposed "mat1" and untransposed "mat2" is
# the matrix of all inner products of the COLUMNS of "mat1" and the COLUMNS of "mat2".
#
# In statistics we use this interpretation to form the covariance matrix of two data matrices:
N.m(N.t(amat), bmat)/(8-1)
# which would be the covariance matrix if the columns of both matrices were centered.
# As it stands, it is the matrix of 2nd order cross-moments.
#
#
# - Pythagorean decompositions, again:
#
# If two vectors are orthogonal, such as
N.i([1,1,2], [1,-1,0])
# the squared length of their sum is the sum of their squared lengths:
N.sqnorm(N.a([1,1,2]) + N.a([1,-1,0]))
N.sqnorm([1,1,2]) + N.sqnorm([1,-1,0])
# This generalizes to arbitrarily many vectors that are pairwise orthogonal.
# In this case, it is convenient to collect the vectors as columns of a so-called "frame",
# i.e., a matrix of size pxq where p>=q (taller than wide), and check
# orthogonality of the columns with a matrix multiplication.
# Example:
mat = N.t([[1,1,2], [1,-1,0], [1,1,-1]])   # columns are pairwise orthogonal
N.m(N.t(mat), mat)
# which is indeed diagonal, i.e., the columns are pairwise orthogonal.
#
# Hence we have the Pythagorean relation: sum of squared norms == squared norm of the sum
N.sqnorm(mat[:,0]) + N.sqnorm(mat[:,1]) + N.sqnorm(mat[:,2])   # sum of squared norms
N.sqnorm(mat[:,0] + mat[:,1] + mat[:,2])                       # squared norm of the sum
# Since the squared norm of a vector is just the sum of the squares of its elements,
# the first line is just the sum of squares of all elements of the matrix "mat".
# This quantity is called the (squared) "Frobenius norm" of "mat":
N.sqfrob = lambda mat: N.sqnorm(N.reshape(mat, [N.product(N.shape(mat))]))
N.sqfrob(mat)
N.frob = lambda mat: N.norm(N.reshape(mat, [N.product(N.shape(mat))]))
N.frob(mat)
# Interpretations of the Frobenius norm:
# 1) It is just the regular vector norm if the matrix is strung out to a vector.
#    It can be used to define the Euclidean distance between two matrices:
#      dist(A,B) = |A-B|_F
# 2) Statistics:
#    The squared Frobenius norm is often thought of as (proportional to)
#    the total variance of the matrix, i.e., the sum of variances of the columns,
#    modulo division by n or n-1, if the matrix is of size nxp, and modulo centering.
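#
# A small illustration of interpretation 1): the Frobenius norm turns the space of
# matrices into a metric space. (A quick sketch; "amat2"/"bmat2" are our own throwaway
# names, and "N.frob" is the session's definition from above.)
amat2 = N.a([[1.,2.],[3.,4.]])
bmat2 = N.a([[1.,2.],[3.,5.]])   # differs from amat2 by 1 in a single entry
N.frob(amat2 - bmat2)            # = 1.0, the Euclidean distance between the two matrices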
#
#
# - Orthogonal projections:
#
# Generalizing the intuition behind <(1,0),(a_1,a_2)> = a_1
# we call the value a_1 the orthogonal projection of the vector (a_1,a_2) onto (1,0).
# Strictly speaking, a_1 is only the value of the projection.
# The projected vector is
#   P((a_1,a_2)) = a_1 * (1,0) = (a_1,0)
# For an arbitrary unit-norm vector x, the orthogonal projection of y on x is:
#   P_x(y) = <y,x> x
# If x is not unit-norm, the projection is obtained by standardizing x to unit-norm:
#   P_x(y) = <y, x/|x|> x/|x| = <y,x> / |x|^2 * x
# which we recognize as the formula for simple regression without intercept of y on x.
# Indeed, the solution of the Least Squares problem
#   |y - t * x|^2 = min_t
# is
#   t = <y,x> / |x|^2
# or if x has unit norm:
#   t = <y,x>
#
# P_x is a linear mapping from p-space to p-space, so it must be representable as
# a matrix multiplication. Indeed, for unit-norm x we have:
#   P_x(y) = <y,x> x = x (x^t y) = (x x^t) y
# where (x x^t) is the outer product of x with itself:
#   (x x^t)_ij = x_i*x_j
# This can be literally understood as the matrix product of
# the px1 matrix (= column vector) x with the 1xp matrix (= row vector) x^t,
# where the inner products collapse to simple products of the numbers x_i and x_j.
#
# Example: projection onto the mean vector
x = N.std([1]*10)           # standardized mean vector
y = R.normal(0,1,10)        # "data" vector
Px = N.outerproduct(x, x)   # projection matrix
N.m(Px, y)
# which is the same as:
N.i(y, x) * x
#
# So far we have looked at orthogonal projections onto one dimension only.
# A general projection onto a subspace of more than one dimension is a linear map that
# 1) leaves the subspace fixed, and
# 2) maps orthogonally onto the subspace.
# These two properties can be cast as follows:
# 1) P P = P (idempotence)
N.m(Px, Px)
# 2) P^t = P (symmetry)
N.t(Px)
# We prove the equivalence:
# 1) Idempotence is just another way of saying that the image space is fixed.
# 2) Given idempotence, symmetry is equivalent to orthogonal mapping:
#      <Py, y-Py> = (Py)^t (y-Py) = y^t P^t (y-Py) = y^t P^t y - y^t P^t P y = 0
#    because P^t = P^t P.
# Comment: Idempotence is the defining property of a projection.
#          Symmetry makes it an orthogonal projection.
#          [Symmetry means <Py,z> = <y,Pz> for all y and z,
#           hence symmetry is really w.r.t. an inner product.]
#
# Construction of orthogonal projections:
# Obtain an orthonormal basis of the subspace onto which you want to project.
# Collect the basis vectors as columns of a nxp frame U (n>=p).
# Orthonormality means:
#   U^t U = I (size pxp)
# Then the orthogonal projection is
#   P = U U^t
# One shows easily idempotence and symmetry:
#   1) (U U^t) (U U^t) = U (U^t U) U^t = U U^t   and   2) (U U^t)^t = U U^t
#
# Other properties of orthogonal projections:
# a) P = sum_j (u_j u_j^t) where u_j are the columns of U = (u_1,...,u_p).
#    That is, a p-dim projection is the sum of p 1-dim projections.
#    [Proof: matrix product = sum of outer products.]
# b) If P is the orthogonal projection onto the p-dim subspace V,
#    then I-P is the orthogonal projection onto the orthogonal complement of V.
#    [Proof: (I-P)(I-P) = I-P, (I-P)^t = (I-P), (I-P)P = 0.]
#    Statistical interpretation:
#      P projects onto the space of fits,
#      I-P projects onto residual space.
#
# Q: How do we obtain orthonormalized basis vectors
#    if the space is given as the span of a set of oblique vectors?
# A: singular value decomposition (svd), or Q-R decomposition, or Gram-Schmidt.
#    (See below for the svd; a small Gram-Schmidt sketch follows right here.)
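#
# Since Gram-Schmidt was just mentioned but not shown, here is a minimal sketch of
# classical Gram-Schmidt orthonormalization in the session's notation ("N.a", "N.i",
# "N.std" from above; "gram_schmidt" is our own name, and no care is taken to guard
# against linearly dependent input, where a division by a zero norm would occur):
def gram_schmidt(vectors):
    # "vectors": a list of oblique vectors (lists or 1D arrays) spanning the subspace
    basis = []
    for v in vectors:
        v = N.a(v)*1.            # promote to a float array (and copy)
        for u in basis:
            v = v - N.i(v,u)*u   # subtract the component along u
        basis.append(N.std(v))   # standardize the remainder to unit norm
    return basis

# Example: an orthonormal basis for the span of two oblique vectors:
gram_schmidt([[1.,1.,0.], [1.,0.,1.]])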
# The question above really asks for the solution of the Least Squares problem:
#   |y - x|^2 = min_x subject to x in subspace V;  or
#   |y - Xb|^2 = min_b where the columns of X span the subspace V.
# If you have an orthonormal frame U whose columns also span V,
# the solution of the first formulation is
#   x = (U U^t) y = U (U^t y)
# and the solution of the second problem is obtained by solving for b in
#   U^t X b = U^t y
# There are two ways of making the solution easy:
# 1) Find U such that R = U^t X is triangular.
#    This is the point of the Q-R decomposition, whose name stems from
#    the decomposition X = Q R, where Q is our U, and R is triangular.
# 2) Find U such that U^t X = D V^t, where D is diagonal and V is orthogonal (pxp).
#    Recall that orthogonal matrices are easy to invert: V^(-1) = V^t
# We will only introduce the svd below, not the Q-R decomposition.
# Take note that the usual formula, b = (X^t X)^(-1) X^t y , is not considered
# a numerically stable way of solving the LS problem.
#
#
# - Orthogonal matrices:
#
# Orthogonal matrices are square matrices that define linear maps n-D -> n-D
# that preserve the norm, and hence the inner product:
#   |U x| = |x|  or equivalently  <Ux,Uy> = <x,y>  (prove the equivalence).
# In particular, if x and y are orthogonal, so are Ux and Uy.
# Equivalent to the defining property of an orthogonal matrix is
#   U^(-1) = U^t, that is, the inverse is just the transpose.
#
# Often one uses orthogonal matrices to "change the coordinate system" without
# changing the metric/norm. In this interpretation, writing x = U x' does
# not mean that we map x' to x, but that x and x' are coordinate vectors for
# the same abstract point in different bases.
# For x the (old) basis is given by the columns of the identity matrix I.
# For x' the (new) basis is given by vectors whose coordinates in the old basis are
# the column vectors of U.
################
#
# CHOLESKY DECOMPOSITION:
#
# - If S is a square symmetric non-negative definite matrix, the Cholesky decomposition is
#     S = C C^t
#   where C is lower (or upper) triangular.
# - Cholesky is a sort of a matrix square root.
#   Like the regular square root, which bombs for negative numbers,
#   it bombs if S is not non-negative definite.
# - Cholesky is relatively cheap to obtain.
# - Cholesky is useful because it yields stable solutions of linear systems
#   w/o computing a matrix inverse.
#   Reason: Systems of equations with coeffs that form triangular matrices are trivial to solve.
# - Example of application: X^t X in LS is non-negative definite, hence has a Cholesky.
#   [Cholesky is NOT the preferred way to compute LS, though.
#    Forming X^t X can incur loss of significant digits, compared to working off X directly.
#    Modern software uses a so-called Q-R decomposition:
#      X = Q R, where Q has orthonormal columns and R is triangular.
#    A more expensive but also more informative alternative is the singular value
#    decomposition introduced below.]
#
# Example:
s = N.array([[10,2,3],[2,30,4],[3,4,50]])
c = L.cholesky_decomposition(s)
N.m(c, N.t(c))   # Indeed.
# Here is an example that bombs:
s = N.array([[1,2,3],[2,3,4],[3,4,5]])
c = L.cholesky_decomposition(s)
#
#
# EIGENDECOMPOSITION:
#
# - This is a decomposition of a general symmetric matrix, not necessarily definite.
# - If S is symmetric of size nxn, the eigendecomposition is
#     S = U D U^t, where U is nxn orthogonal: U^t U = I;
#   D is nxn diagonal with real entries D_i.
# - A more parsimonious eigendecomposition would use r = rank(S), and let
#   U of size nxr, U^t U = I (rxr), D of size rxr, and D_i != 0 (i=1...r).
# - Rewriting the eigendecomposition as
#     U^t S U = D
#   shows that U changes coordinates such that S becomes diagonal if interpreted
#   as a quadratic form: Q(x) = x^t S x;  x = U y  =>  Q*(y) = Q(x) = y^t (U^t S U) y.
# - The columns of U contain the eigenvectors, the diagonal of D the eigenvalues:
#     S U = U D,  or:  S u_i = u_i D_i
# - The eigenvectors u_i are the stationary points of the Rayleigh quotient:
#     R(x) = (x^t S x) / (x^t x)
#   and the eigenvalues are the stationary values:
#     R(u_i) = D_i
#   Hence the eigenvectors are the extremal/stationary directions of the quadratic form
#   x^t S x subject to x^t x = const (=1, e.g.).
# - Another interpretation of the eigendecomposition:
#     S = U D U^t  <=>  S = sum_i D_i (u_i u_i^t)
#   where (u_i u_i^t) is a rank-one matrix that represents the orthogonal projection onto
#   the direction of u_i.
#   [Recall: P = u u^t  =>  P x = (u u^t) x = u (u^t x) = <x,u> u ]
#
# Example:
s = N.array([[1,2,3],[2,3,4],[3,4,5]])
e = L.eigenvectors(s)
e[0]   # eigenvalues
e[1]   # eigenvector matrix
# Unfortunately, e[1] is the transpose of the conventional U: e[1] = U^t, not U,
# hence e[1] stores the eigenvectors in the rows, not columns.
# Confirm this in an example:
N.m(N.m(e[1], s), N.t(e[1]))   # ~ U^t S U = D
# For confirmation that e[1] is orthogonal, it does not matter: U^t U = I <=> U U^t = I
N.m(e[1], N.t(e[1]))           # ~ I, hence U orthogonal
#
# Another example:
s = N.array([[1,1,1],[1,1,1],[1,1,1]])
# Guess at least one eigenvector before you compute the answer...
e = L.eigenvectors(s)
e[0]   # eigenvalues
e[1]   # eigenvector matrix
# We notice another oddity of the implementation: the eigenvalues are NOT sorted.
#
#
# SINGULAR VALUE DECOMPOSITION (svd):
#
# - The svd is a decomposition of a matrix of ARBITRARY size.
# - If A is of size nxp, n>=p, and r = rank(A), then the svd of A is
#     A = U D V^t, where U^t U = I, V^t V = I, D is diagonal and D_i > 0,
#   and U is nxr, V is pxr, D is rxr.
# - Without loss of generality one can assume the singular values D_i are sorted
#   in descending order: D_1 >= ... >= D_r
# - The singular vectors u_i and v_i and the singular values D_i satisfy
#     A v_i = D_i u_i
# - An equivalent statement to the svd is: A V = U D
# - The pairs of singular vectors (u_i,v_i) are the stationary/extremal points of
#     u^t A v subject to u^t u = 1 and v^t v = 1.
# - The left-singular vectors u_i are the eigenvectors of A A^t,
#   the right-singular vectors v_i are the eigenvectors of A^t A.
# - The squares of the singular values, D_i^2, are the non-zero eigenvalues
#   of both A A^t and A^t A.
# - Two geometric facts:
#     columnspace(A) = columnspace(U)
#     rowspace(A) = columnspace(V)
# - Corollary: If you have to find an o.n. basis of the columnspace of a matrix A,
#   the columns of the U-matrix of the svd of A will do the job.
# - [More geometry: Let b_1,...,b_p be o.n. vectors in n-space, and ditto c_1,...,c_q.
#    Let B = (b_1,...,b_p) and C = (c_1,...,c_q) be matrices with the respective columns.
#    Note: B^t B = I_p, C^t C = I_q (orthonormality of the vectors).
#    Fact: The largest singular value of A = B^t C is the cosine of the smallest angle
#    between span(b_1,...,b_p) and span(c_1,...,c_q).]
# - The svd of a symmetric matrix S ALMOST yields the eigendecomposition of S:
#     S symmetric nxn  =>  u_i = (+-)v_i, both nxr (r=rank(S)), and
#     D_i = |i-th eigenvalue of S| (absolute value).
# - Assuming A is full rank (r = p), the following is called the condition number of A:
#     cond(A) = D_1 / D_p
#   If cond(A) is very large (1E10, say), A is nearly degenerate (nearly not of full rank).
#   Intuition: Compared to D_1, D_p is virtually zero, hence A u_p ~ 0.
# - Application of the condition number:
#   Numerical algorithms for the svd will rarely be able to find a perfect zero singular value.
#   You detect it by inspection of the condition number.
#   Subsequently you can reduce the svd to its proper rank by tossing the columns of
#   U and V belonging to the smallest singular values.
# - Plug: The svd is an EXTREMELY useful tool!!!
#
# Example:
a = N.array([[1,2,3], [4,5,6], [7,8,9], [10,11,12]])
L.singular_value_decomposition(a)
u,di,vt = L.singular_value_decomposition(a)   # (Recall unpacking of tuples.)
# Unfortunately, again a matrix is returned violating a convention:
# the last is not V, but V^t, as we will see.
u    # 4x3, ~ U
di   # Length 3, sorted, >0, the last singular value is almost zero.
vt   # 3x3
# Reshape:
d = N.identity(len(di)) * di   # Turns di into a diagonal matrix ~ D
v = N.t(vt)                    # ~ V
# Verification of this example:
N.m(N.t(u), u)          # U^t U = I_r
N.m(N.t(v), v)          # V^t V = I_r
N.m(a, v) - N.m(u, d)   # A V - U D
# Pretty good. Our interpretation of what is U, D, V seems to be accurate.
#
#
################
#
# Application of the svd to collinearity detection in multiple linear regression
#
# New data set, which you will see a lot next semester: Boston Housing data
boston = N.array([[float(word) for word in line.split()]
                  for line in file("boston.dat","r").readlines()])
N.shape(boston)
boston[:5,]
# The variable names are:
print file("boston.col","r").read()
# Last variable, "MEDV" (Median Housing Value), is the response.
# Many multiple linear regressions have been run on these data.
# In any multiple linear regression, one has to examine the degree
# of collinearity among the predictors.
# If X is the predictor matrix, collinearity analysis looks for
# directions v in predictor space for which X v is smallest in some sense.
# Obviously one has to impose a constraint on v, such as v^t v = 1, unit length,
# and the columns of X have to be scaled so they are comparable in some sense.
# One usually standardizes the columns of X to zero mean and unit variance.
# In summary, this sounds like an svd problem, where we look for the small singular values:
#   X v = d u
# If the singular value is very small, d ~ 0, the matrix X is nearly degenerate: X v ~ 0.
#
# Computations:
#
# 0) Remove the response:
boston_2svd = boston[:,:-1]
N.shape(boston_2svd)
boston_2svd[:2,]
boston[:2,]
#
# 1) Centering of the variables:
means = reduce(N.add, boston_2svd)/len(boston_2svd)
# Note that "add" acts on rows: boston_2svd is a list of rows, and len = #rows
boston_centered = boston_2svd - means   # Again, this acts on rows.
# Check that the columns sum to zero: ok.
reduce(N.add, boston_centered)
#
# 2) Standardization to unit variance:
sdevs = N.sqrt(reduce(N.add, boston_centered**2)/len(boston))
# (We use variance with "1/N" instead of "1/(N-1)", which is irrelevant here.)
a = boston_centered / sdevs
# Check unit variance:
reduce(N.add, a**2)/len(a)
#
# 3) Perform an svd on a, the standardized version of the predictor matrix X:
u,di,vt = L.singular_value_decomposition(a)
d = N.identity(len(di)) * di   # Turns di into a diagonal matrix ~ D
v = N.t(vt)                    # ~ V
# So we have u, v, and d that satisfy:
#   a = u d v^t,  or:
#   a v = u d
max(N.m(a, v) - N.m(u, d))   # Virtually zero -- ok.
#
# Now for the singular values:
di
# The condition number:
di[0]/di[-1]   # Not bad: a condition number of ~10 is quite acceptable.
# Here is a 1970s style printer plot profile of the singular values:
for dval in di: print "*"*int(dval)

# Sometimes one reports the condition number of a^t a, which looks a lot worse:
di**2
(di[0]/di[-1])**2
#
# Geometric interpretation:
# The singular values are proportional to the standard deviations
# of the projections of the predictor vectors onto the singular directions.
# The singular values describe the "thicknesses" of an ellipsoid fitted to the
# data cloud in the direction of the major axes of the ellipsoid.
# The condition number therefore indicates the extreme ratio of thicknesses or,
# in other words, the "thinness" of the data cloud compared to its longest dimension.
#
# When the smallest singular value is very small (i.e., the condition number is large),
# one is interested in what directions the thinness takes place.
# Hence one examines the coefficients that describe the "thin" singular direction.
# For the Boston Housing predictors, the coefficients are:
v[:,-1]
# or more readable:
N.around(v[:,-1], 2)
# It appears that much of the near-collinearity is due to two predictors, x_9 and x_10,
# whose coefficients are the largest ones by far: 0.67 and -0.68.
# If we interpret the smallest singular value as near-zero, a v ~ 0, we get roughly
#   0.67*x_9 - 0.68*x_10 ~ 0
# after setting the small v-coeffs to zero. In other words:
#   x_9 ~ x_10
# Their correlation is:
N.sum(a[:,8]*a[:,9])/N.sqrt(N.sum(a[:,8]**2) * N.sum(a[:,9]**2))
# which is very high indeed!
################################################################
#
# Application of the svd to dimension reduction: Principal Components Analysis (PCA)
#
# New data set, from a marketing problem in telecom:
mktg = N.array([[float(word) for word in line.split()]
                for line in file("mktg_grp.dat","r").readlines()])
N.shape(mktg)
mktg[:5,]
# The variable names are:
print file("mktg_grp.col","r").read()
# The last variable was a grouping variable found with so-called "k-means clustering"
# for the purpose of market segmentation. My problem was the interpretation of these
# "clusters".
# Idea: Look at a 2-dimensional reduction of the data; if the 2 dimensions are chosen
# properly, it should show the action of the clustering algorithm.
# Q: What is a good reduction of dimension?
# A: The dimensions should grab as much variance as possible.
# Method: PCA
# Derivation:
# - What is a dimension?
#     dimension = linear combination of the variables, z[n] = sum_j v[j] * x[n,j]
#               = projection onto a direction v=(v[j])_j
# - What is the variance of a dimension?
#     Var(dimension) = sum_n z[n]**2 / N, assuming the variables have been centered.
# - Problem: max_v Var(dimension) = infinity if the coefficients are unconstrained.
#   Solution: constrain the v[j] somehow.
#   How? sum_j v[j]**2 = 1, assuming the variables have been standardized to unit variance.
#   Why? Projections require unit length direction vectors.
#        Standardization is needed to make the marginal projections equal in size.
#        (We are not interested in differences between variances of variables.)
# - Maximize: Var(z) = v^t X^t X v / N = max_v subject to v^t v = 1
#   => v = eigenvector of the largest eigenvalue of the covariance matrix X^t X / N.
# - In this sense, the two most informative dimensions in the data are the two
#   eigenvectors belonging to the largest two eigenvalues.
# - Goal: Plotting the dimensions (=projections onto the two eigenvectors) against each other.
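#
# Aside: how good is a 2-dimensional reduction? A natural summary is the fraction of the
# total variance carried by the two leading dimensions. Since the eigenvalues of X^t X
# are the squared singular values, it can be read off the svd (a sketch in the session's
# notation; "di" is the vector of singular values as computed in step 3 below):
#   N.sum(di[:2]**2) / N.sum(di**2)   # fraction of total variance in the first 2 dimensions
#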
# Computations:
#
# 0) We don't want the clustering variable for PCA:
mktg_2pca = mktg[:,:-1]
N.shape(mktg_2pca)
mktg_2pca[:5,]
mktg[:5,]
#
# 1) Centering of the variables:
means = reduce(N.add, mktg_2pca)/len(mktg_2pca)
# Note that "add" acts on rows: mktg_2pca is a list of rows, and len = #rows
mktg_centered = mktg_2pca - means   # Again, this acts on rows.
# Check that the columns sum to zero: ok.
reduce(N.add, mktg_centered)
#
# 2) Standardization to unit variance:
sdevs = N.sqrt(reduce(N.add, mktg_centered**2)/len(mktg))
# (We use variance with "1/N" instead of "1/(N-1)", which is irrelevant here.)
a = mktg_centered / sdevs
# Check unit variance:
reduce(N.add, a**2)/len(a)
#
# 3) Instead of an eigendecomposition of X^t X/N, we perform an svd on X,
#    which entails an eigendecomposition of X^t X:
u,di,vt = L.singular_value_decomposition(a)
d = N.identity(len(di)) * di   # Turns di into a diagonal matrix ~ D
v = N.t(vt)                    # ~ V
# So we have u, v, and d that satisfy:
#   a = u d v^t,  or:
#   a v = u d
# The columns of v are the eigenvectors of a^t a,
# but a^t a = N * covariance matrix of a,
# hence the columns of v that go with the largest eigenvalues
# are the directions that give us the most informative dimensions.
# Here are all the dimensions, that is, all projections of the rows of a
# onto all eigenvectors:
z = N.m(a, v)
# But according to a v = u d this is the same as u d. True?
max(z - N.m(u, d))
# ("max" acts on the rows as entities, and seems to know that max of a number of rows
#  is the componentwise max.)
# Yes, a v and u d are the same modulo rounding.
# That means, u essentially is the projected data, up to standardization:
# The columns of u have unit length, which follows from u^t u = I,
# => the columns of u*sqrt(N) have unit variance.
# From a v = u d = (u*sqrt(N)) (d/sqrt(N)),
# it follows that d/sqrt(N) are the standard deviations of the dimensions:
di/N.sqrt(len(a))
# These values compare reasonably with the standardization of the columns of a:
# The largest values are larger than 1, the smallest smaller than 1.
#
# We write out the projected data and look at them with a graphics software:
file("mktg_pca.dat","w").writelines(
    [("%11.8f "*8+"%1d\n") % (tuple(z[i,])+(int(mktg[i,8]),)) for i in range(len(z))])
# We want to color a scatterplot of the largest principal components
# according to the clustering variable, so we can interpret the clusters
# in the principal component coordinates.
################################################################
#
# Data Visualization: ggobi
#
# The graphics software we use is "ggobi". It can be downloaded from "www.ggobi.org".
# Download and self-install should be no problem. If you use the defaults,
# put a line such as
#   PATH=/cygdrive/c/Program\ Files/ggobi:$PATH
# in your .bashrc.
#
# Then you can say:
os.system("ggobi mktg_grp.dat&")
os.system("ggobi mktg_pca.dat&")
# Or you click on the ggobi icon that gets installed by default.
# Do the usual File -> Open -> ... to read the data file into ggobi.
#### ggobi bugs ####
# in scatterplot matrices give an option
#   1) not to show the variables,
#   2) to show them only in the diagonal,
#   3) a dynamic tool tip that shows the variable pair for the plot under the mouse (?)
#   => all would be useful in very large matrices
# in scatterplot matrices allow choice of variables to be displayed
#   (probably already implemented)
# there should be a default for lines/edges: consecutive pairs,
#   and deletion of lines
#   (another useful default: minimum spanning tree from given coordinates)
################################################################

SUMMARY:

* freeware:
  - emacs editor, syntax help with highlighting and indentation
  - cygwin: Unix under Windows, including shells, utilities, interpreted languages
  - xgobi/ggobi: graphics for data analysis (only touched on)
  - general point: to show that some of the most powerful computational tools
    are not commercial but can be found in the freeware/open source/gnu license movement

* your own webpage: html

* shell programming: bash
  - file/directory handling: commands such as
    ls, wc, cd, mkdir, rm, cat, head, tail, more, less, diff
    (see also dired in emacs)
  - environment variables
  - use for scripting, i.e., writing shell commands, with arguments and options
  - filename expansion
  - pipes, redirection
  - grouping, looping, conditionals
  - general points:
    data handling when data is structured as a set of files
    setting up environments (variables, aliases) often needed to run specific software

* processing text data: books, dictionaries, tabular data,...
  - pattern search: egrep
  - expressing patterns: regular expressions
  - large-scale simple editing: sed, the stream editor
  - large-scale simple editing: tr, letter-to-letter translations,
    squeezing repeats, deleting
  - sorting, uniques, tabulation: sort, sort -u, uniq -c
  - general points:
    solving "bag of words" problems
    data cleaning/extracting/reformatting

* an interpreted language for text processing: gawk
  - primarily sequential processing like sed
  - splitting of files into lines (records), of lines into words (tokens)
  - many useful one-liners, e.g., for checking data consistency
    such as a uniform number of fields
  - searching/substituting regular expressions
  - general purpose programming language by (ab)use of the BEGIN clause
  - many language features directly imported from C, w/o the overhead of type declarations:
    for-loops, if-conditionals, formatted I/O,...
  - use for scripting
  - case studies: data massaging of files from
    . an FCC auction,
    . a social network of co-authorships, including a breadth-first search
      for connected components in a graph
  - general point: By now we have amassed a considerable set of tools for handling text data.

* a full-fledged high-level general-purpose language: Python
  - working inside an interpreter (gawk used the shell as interpreter)
  - clean syntax
  - high-level data structures: lists, strings, tuples, dictionaries;
    mutable vs immutable; list comprehension for filtering, mapping,...
  - I/O
  - more powerful regular expressions, searches and substitutions
  - writing functions: passing parameters by position-vs-keyword, by reference-vs-value,
    scopes of variables
  - modules for specific classes of functionality:
    string, math, sys, os, pickle, random, operator, re, array (with the Python distribution);
    Numeric with LinearAlgebra and RandomArray (download from Vaults of Parnassus)
  - object-oriented programming: bundling data structures and functions (methods),
    its use for expressing statistical models (regression)
  - general points:
    Python is general purpose and can serve for writing solutions
    or for prototyping solutions that will be implemented in part or all in low-level C.
    Python includes important programming paradigms and constructs.
* low-level programming: the C language
  - In view of shell/gawk/Python, low-level programming can be limited to time-critical tasks.
  - Examples:
    tabulating large files
    feature extraction on large files
    ascii conversion from/to binary

* large datasets:
  - for a small number of variables, we can handle a few tens of thousands of cases
    even in a high-level language such as Python,
    but we may need occasional low-level filters written in C
  - space/time limitations:
    memory constrains the size of data structures
    ascii conversions (not disk access) can be insurmountably slow
    => use C filters for ascii-to-binary conversion; then read binary into Python
  - algorithmic aspects:
    sorting is fast
    queries are fast with bisection on sorted data
    sort the data on attributes with indexes
  - space-saving data structures: arrays
    check details in Python: short ints, for example

* stochastic simulation: techniques and uses
  - random numbers from specific univariate distributions
  - sampling multivariate distributions with dependence: bivariate normal, order statistics,...
  - simulating a discrete random walk and its crossing times
  - permutation tests of independence
  - bootstrap sampling for nonparametric standard errors and confidence intervals

* numerical linear algebra: using the Python modules "array", "Numeric", "numarray"
  - arrays: requirement of uniform type for elements
  - high-level operations on vectors, matrices, general arrays
  - aspects of basic linear algebra: high-dim geometry, statistical meaning, computation
  - advanced linear algebra: Cholesky, eigen, singular value decompositions

**** guiding principles:
  - languages, languages, languages,... as tools,
    and to lower the threshold to future languages
  - solve computational data problems along the way, from file handling to the svd...

**** what is missing?
  - optimization
    however: statistics is rich in special-purpose optimization algorithms
    examples:
      Gauss-Newton for non-linear LS problems
      Fisher scoring for likelihood optimization (a variant of Newton)
      EM algorithm for likelihood optimization with missing data or latent variables
      k-means algorithm for clustering
      alternating optimization (backfitting) for fitting additive models
      power algorithms for nonparametric multivariate analysis
    these are best taught in regular stats courses
  - quadrature: next time I teach this, I will include it as part of computing for analytics
  - more data viz: see Stat 541
  - the R language: see Stat 541

################################################################
################################################################
# C/C++ from within Python: the 'weave' module
import weave
# Note the raw string: the "\n" escapes must reach the C compiler uninterpreted by Python.
weave.inline(r"""printf("\n\n------- Hello World! ------ %d \n\n", 10000);""")

################################################################
# Playing with Tk (via Tkinter):
from Tkinter import *
# The "command" option expects a callable, so we wrap the shell command:
w = Button(text="List files", command=lambda: os.system("ls"))
w.pack()
w.mainloop()
del w

################################################################
# scipy and related modules: lapack, fft,...
#
# problem installing: line 36 of scipy_base/__init__.py wants module fastumath.py
# search Google for fastumath.py