STATISTICS 961, Fall Semester 2017, Course Web Page
- Instructor's office hours:
- Goals and Non-Goals:
- There is just one goal -- prepare 1st year PhD students in statistics for research.
- This is not an applied statistics course.
- This is not an R course. Fluency in R is assumed.
- If you are not a 1st year PhD student in statistics, please note:
- Undergraduate and MBA students require an interview with the
- The course grade is heavily based on class participation (35%).
- PhD students from programs other than statistics
should not count on completing this class and should therefore sign
up for sufficient credit from other courses.
- The Course Selection Period ends on Monday, 2017/09/18.
- The Drop Period ends on Monday, 2017/10/09.
- First day of class:
- Friday, 3pm, 2017/09/01, room 1201 SH-DH
General honor code: You may discuss the problems with each
other in general terms, but you must write your own solution. All
sources, including friends and colleagues, must be cited. It is
important to get used to a stringent code of conduct in scientific
writing. On the other hand, use common sense and attribute where
honesty requires it. Two points worth special mention:
*** If you received an extension for a homework, do not consult
*** An offense would be consulting solutions of homeworks from
*** An exception is with regard to LaTeX and English language help:
avail yourself of as much help as you need from whichever source.
- Homework 1: Linear algebra (1) and LaTeX practice.
Due: Fri, 2017/09/22, 7pm
Edit the LaTeX source and submit a PDF file by email.
If you have never used LaTeX, you can first install some free software:
In the manual, pay special attention to Section 1.3.2 (special
characters) and Chapter 3 for math typesetting (math symbols:
Section 3.10). To produce a PDF from LaTeX, the LEd environment
requires you to click the green and blue right arrows in the tool
bar. Feel free to check out other free software and other
documents. If you find something particularly useful, please let
Unless instructed otherwise, homeworks should be e-mailed
as attachments to stat961.at.wharton[at-sign]gmail.com.
The format should be .R or .pdf or .doc depending on
Your checked and graded solutions are returned in e-mail attachments.
Search '#AB' to find comments.
A score such as 8/10 at the end means '8 out of 10 points'.
A deduction of 2 points does not mean you got two questions wrong; it is
only a relative measure of how much below optimal your solutions are.
- LA homeless data (courtesy of Richard Berk)
- FTSE Data 1991-05-13 through 2006-05-11
- A small book
- Spelling dictionary
- Speech fragment (3.3 MB!)
- Boston housing data
with a description of the variables
Geographical names are in this file.
- Pima Indians diabetes data
with a description of the variables
- Laser data
with a description of the variables
- Titanic survival data
- Marketing data
- Tips data; analysis
- Detergent data
- Places Rated data with a description
In a form suitable for xgobi or ggobi:
Say 'xgobi places_ggobi' or 'ggobi places_ggobi', depending on which you're using.
- Accounting-Market Rate data (Myers, p.16ff)
- Odd regression data
- Fabric failure data (Myers, p.329f)
- Vaso-restriction data (Myers, p.331f)
- FedEx data, reduced for elasticity demo
- Blocks data, reduced for fixed/variable cost demo
- Algal bloom data; some background is in the
- Webserver data
- Wine displayspace data
- Car Models 2003-4
- Body Fat data, variable names,
IMPORTANT: If a function in the class notes does not work or is not
found in your R session, check whether the function is in one
of the R code files below. If so, download and read the file into R
one more time, even if you thought you had done so earlier.
I allow myself to update the code all the time.
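The check-and-reload routine described above might be sketched as follows. The function name `my.smoother` and the file name `course-functions.R` are hypothetical placeholders, not actual names from the course code files; substitute the function you need and the file you downloaded. The stand-in file written to a temporary directory is there only so that the sketch runs on its own.

```r
# Hypothetical names: replace with the function you need
# and the path of the code file you downloaded.
fun.name  <- "my.smoother"
code.file <- file.path(tempdir(), "course-functions.R")

# For demonstration only: create a stand-in code file so the sketch runs as is.
writeLines("my.smoother <- function(x) x", code.file)

# Is the function already defined in this R session?
if (!exists(fun.name, mode = "function")) {
  message("'", fun.name, "' not found; reading ", code.file)
}

# Read the file again in any case: the code may have been updated
# since you last downloaded it.
source(code.file)

# After sourcing, the function should be available.
found <- exists(fun.name, mode = "function")
```

If `found` is still `FALSE` after sourcing, the function is probably defined in a different code file.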
- The paper on tree-based regression and classification
is in this PDF file.
- The paper on additive principal components is in this
Syllabus STAT 961
- Instructor: Andreas Buja
Email: stat961.at.wharton[at-sign]gmail.com (urgent: buja.at.wharton[at-sign]gmail.com)
Office hours by appointment.
Office: JMHH 471
Class Time: Fri, 3:00-5:50pm
Class Room: initially 1201 SH-DH
- Goal of this course: Prepare statistics
Ph.D. students for independent research.
Accordingly, the demands on conceptual thinking and quick uptake
will be considerable as the course progresses.
What this course is not: not an applied statistics course,
not an R course, and not a service course to other departments.
For a graduate-level applied statistics course, see Paul
Rosenbaum's 500-level statistical methods.
- Homework will be extensive.
It will be the heart of what you retain from this course.
- In-class quizzes will be held, sometimes announced and
sometimes unannounced. Students who miss a class with a quiz will
make it up at a later date. All students are bound by the honor code not to
exchange information about the quiz with a student who will make up
- Grades will be computed from homeworks, in-class quizzes
and class participation alone.
There will be no midterm or final exams.
- Topics (not necessarily in this order):
- Linear Models and inference
- Exploratory data analysis, data visualization
- Statistical testing exemplified with permutation tests
- Confidence intervals with bootstrap
- Tree-based regression
- ACE and additive models
- Principal Components
- Non-parametric curve fitting
- Bias-variance trade-offs
- Cross-validation for estimating prediction errors and selecting bandwidths and models
- Writing for research, including style and clarity, typesetting, web publishing
- Computing: As the tool of choice for the execution of data
analyses and simulations, we use the
R programming language.
Note: This is not an R class. Given the computational literacy
of this year's statistics Ph.D. students, R will not be taught
at all.
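As an illustration of the kind of simulation-based analysis meant here, two of the listed topics (permutation tests and bootstrap confidence intervals) can be sketched in a few lines of base R. The data are simulated, and the names (`x`, `y`, `n.perm`, `n.boot`) are illustrative assumptions, not taken from the class notes. The bootstrap interval shown is the simple percentile interval, which is only one of several possible constructions.

```r
set.seed(961)  # make the sketch reproducible

# Simulated two-sample data (illustrative only).
x <- rnorm(30, mean = 0)
y <- rnorm(30, mean = 1)
obs.diff <- mean(y) - mean(x)

# Permutation test: reshuffle the group labels and recompute
# the difference of group means under each relabeling.
pooled <- c(x, y)
n.perm <- 2000
perm.diff <- replicate(n.perm, {
  idx <- sample(length(pooled), length(x))  # indices relabeled as "x"
  mean(pooled[-idx]) - mean(pooled[idx])
})
# Two-sided p-value; the "+ 1" counts the observed labeling itself.
p.value <- (sum(abs(perm.diff) >= abs(obs.diff)) + 1) / (n.perm + 1)

# Bootstrap percentile interval for the difference of means:
# resample each group separately, with replacement.
n.boot <- 2000
boot.diff <- replicate(n.boot, {
  mean(sample(y, replace = TRUE)) - mean(sample(x, replace = TRUE))
})
ci <- quantile(boot.diff, c(0.025, 0.975))
```

Printing `p.value` and `ci` shows the permutation p-value and the approximate 95% interval for the mean difference.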
- A course in linear algebra:
basis changes and associated coordinate changes, linear maps,
inner product spaces, orthogonal projections, eigen
- Probability at the level of Stat 430:
thinking in random variables, limit theorems
- Statistical inference at the level of Stat 431:
statistical tests, confidence intervals, linear models,
- Programming experience in a high-level language
Examples of high-level languages: R, Splus, Matlab, Perl,
Python, Visual Basic
Examples of low-level languages: C, C++, Fortran, Java
Exclusive knowledge of SAS is outright detrimental to learning R
because SAS has a very different computational model.
- Undergraduate students contemplating this course: If you
do not already have a solid background in statistics and linear
algebra, as well as some programming experience, you should not take
this class or, at a minimum, not rely on credit from it for
graduation. As mentioned above, the goal is to prepare students for
statistics research, and there will be only one standard of
performance for all students.
- Publication-quality writing and mathematical typesetting are of
utmost importance for statistics research. To get used to the
standards of writing research papers in statistics, some homeworks
will be required to be typeset in LaTeX and submitted by e-mail
as a PDF file. You will have to learn LaTeX on your own with the help
of other graduate students, but getting started with LaTeX
will be facilitated by templates provided by the instructor, so all
you need to do is cannibalize the templates by filling in your
- The only required text for the course is a book about the art of writing:
- Required web documents:
- If you need more reading about R, look up the numerous
books about R
or the numerous free
web documents about R.
Yet another way to find R introductions is to do a search for
"Introduction to R". If you find something particularly
useful, please let the instructor know.
- Recommended texts:
As we go along, special topics books will be recommended.
- For regression: Seber and Lee, "Linear Regression Analysis"
(Wiley Series in Probability and Statistics)
- For linear algebra: Strang, "Linear Algebra and its Applications" (Academic Press)
Strangely, the most fundamental material is no longer in the recent edition:
"Linear Transformations, Matrices, and Change of Basis."
In older editions this used to be tucked away in the appendix.
For this reason, the material is now included in Homework 2:
You get to derive it yourself by following instructions.
- Venables and Ripley,
"Modern Applied Statistics with S-Plus" (Springer):
a recommended, terse book on a broad array of
applied statistics topics, based on the S/R language.
- Becker, Chambers, Wilks,
"The New S Language"
the original S book, also called "the blue book";
even for R users still a good place to start.
- Chambers, Hastie (eds.)
"Statistical Models in S"
on the statistical modeling language in S/R, also called "the white book"
- R. L. Harris,
an excellent overview of useful and common data visualizations.
More recent developments can be followed at the TED talks, several
another by McCandless.
- Supplemental material suggested by previous
participants in Stat 961:
- A fabulous and fun article on R by Patrick Burns:
Print it and keep it at your bedside!
- Here is a pointer to R