Andreas Buja's Home Page

The Liem Sioe Liong/First Pacific Company Professor of Statistics
Department of Statistics [Poster]
The Wharton School
University of Pennsylvania
Philadelphia, PA 19104-6340

Room: 471 JMHH
Office: (215) 898-8222 (Leave a note with the administrator.)
Email: Click here for an image of the address. (The address 'buja@wharton...' is obsolete.)
Curriculum vitae: [.pdf] (a more fun alternative from the 2004/5 MBA guide)

Doctoral Program in Statistics:
If you are interested in our Ph.D. program, please visit our program website.
Once you decide to apply, start your application at this website.

Teaching:

STATISTICS 101, Spring 2014: Introductory Business Statistics (I)
Here is the preliminary Syllabus.
STATISTICS 541, Fall 2014: Statistical Methodology, for first-year statistics Ph.D. students. (Class web page.)
STATISTICS 926, Fall 2014: Multivariate Methodology, for statistics Ph.D. students. (Class web page.)

Basic Probability in R: a small set of functions that implement discrete random variables. They can be loaded into R with the following line:

  source("http://stat.wharton.upenn.edu/~buja/STAT-101/src-probability.R")

Here are things that can be done:

  X <- make.RV(1:6, rep("1/6",6))                  # Create a fair die   (class: "RV")
  Y <- make.RV(1:6, c(0.1,0.1,0.1,0.1,0.2,0.4))    # Create a loaded die (   ''      )
  P(X>3);  P(Y>3)                                  # Probabilities of events
  E(X);  E(Y)                                      # Expected values
  V(X);  V(Y)                                      # Variances
  SD(X);  SD(Y)                                    # Standard deviations
  par(mfrow=c(2,1));  plot(X);  plot(Y)            # Plot as pin graphs
  S <- SofI(X,Y);  par(mfrow=c(1,1));  plot(S)     # Sum of two independent RVs
  S10 <- SofIID(X,10);  plot(S10)                  # Sum of 10 iid copies of X (works for many more => CLT)
  qqnorm(S10)                                      # Normal quantile plot for RVs to check the CLT effect
  X.sim <- rsim(1000, X)                           # Simulate from X  (class: "RVsim")
  plot(X.sim)                                      # Plot simulated data as pin graph
  probs(X);  props(X.sim)                          # Compare probabilities and simulated proportions
  E(X);  mean(X.sim)                               # Compare expected value and mean
  SD(X);  sd(X.sim)                                # Compare theoretical and observed std.dev.
  X2 <- X^2;  X2;  plot(X2)                        # univariate analytical transformation
  Yexp <- exp(Y);  Yexp;  plot(Yexp)               #             ''
  Yfair <- Y - E(Y);  Yfair                        # Centering a RV: creates a fair game from a loaded die
  Z <- (X - E(X))/SD(X);  Z;  plot(Z)              # z-scoring/standardizing a random variable
  Ybern <- con(ifelse(Y>3,1,0))                    # Create a Bernoulli variable; 'con()' contracts values/probs

Check the header of the source file for more explanations and examples.
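
As a rough illustration of what a sum of two independent discrete random variables involves, here is a minimal base-R sketch that does not use the source file above. The helper name sum_indep() and its value/probability interface are hypothetical and chosen only for this example; the functions in src-probability.R may well work differently.

```r
# Hypothetical helper (not part of src-probability.R): distribution of the sum
# of two independent discrete random variables given as value/probability vectors.
sum_indep <- function(x_vals, x_probs, y_vals, y_probs) {
  sums  <- as.vector(outer(x_vals, y_vals, "+"))     # all pairwise sums
  joint <- as.vector(outer(x_probs, y_probs, "*"))   # independence: joint prob = product
  probs <- tapply(joint, sums, sum)                  # merge pairs that give the same sum
  list(vals = as.numeric(names(probs)), probs = as.vector(probs))
}

fair   <- rep(1/6, 6)                       # fair die
loaded <- c(0.1, 0.1, 0.1, 0.1, 0.2, 0.4)   # loaded die, as in the listing above
S <- sum_indep(1:6, fair, 1:6, loaded)

sum(S$probs)             # probabilities add up to 1
sum(S$vals * S$probs)    # E(S), which equals E(X) + E(Y)
```

The same idea applied repeatedly gives the distribution of a sum of several iid copies.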

STATISTICS 102, Spring 2003: Introductory Business Statistics (II)
STATISTICS 540, Fall 2002: Statistical Computing (class web page).

Current Interests:

A Conspiracy of Random X and Nonlinearity against Classical Inference in Linear Regression

The submitted version, by Andreas Buja, Richard Berk, Lawrence Brown, Ed George, Emil Pitkin, Mikhail Traskin, Kai Zhang and Linda Zhao.
When predictors are random, statisticians seem comfortable conditioning on them and treating them as fixed. The underlying argument is that the predictors form an ancillary statistic. This argument is flawed, however, because it assumes the correctness of the model before even examining it. We reconstruct in our own way a piece of econometric theory to sort out the effects of model violations in the presence of random predictors.
The talk given at ETH 2014/11/25.
Two R animations demonstrate the conspiracy effect of nonlinearity and random X. Copy and paste one of the following lines into an R interpreter:
```
  source("http://stat.wharton.upenn.edu/~buja/PAPERS/src-conspiracy-animation.R")
  source("http://stat.wharton.upenn.edu/~buja/PAPERS/src-conspiracy-animation2.R")
```
The source code is in two files: the simple version showing nonlinearity only, and the augmented version showing nonlinearity and linearity.
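
Independently of the animations, the point that random predictors are not ancillary under nonlinearity can be sketched in a few lines of base R. This is only an illustration under made-up assumptions (a quadratic truth, uniform predictors, small Gaussian noise) and is not taken from the paper or its source files: when the true response surface is nonlinear, the fitted slope depends on where the predictor distribution sits, whereas for a truly linear response it does not.

```r
set.seed(1)
n <- 1e5

# Fit a straight line to y = f(x) + noise and return the slope.
ols_slope <- function(x, f) {
  y <- f(x) + rnorm(length(x), sd = 0.1)
  unname(coef(lm(y ~ x))[2])
}

x_low  <- runif(n, 0, 1)   # predictor distribution 1 (assumed for illustration)
x_high <- runif(n, 1, 2)   # predictor distribution 2 (assumed for illustration)

nonlinear <- function(x) x^2     # misspecified case: the truth is not linear
linear    <- function(x) 2 * x   # well-specified case

slopes <- c(
  nonlinear_low  = ols_slope(x_low,  nonlinear),
  nonlinear_high = ols_slope(x_high, nonlinear),
  linear_low     = ols_slope(x_low,  linear),
  linear_high    = ols_slope(x_high, linear)
)
# For y = x^2 and X ~ Uniform(a, b), the population slope is a + b,
# so the two nonlinear slopes are near 1 and 3; both linear slopes are near 2.
round(slopes, 2)
```

With a nonlinear truth, the target of estimation itself moves with the predictor distribution, which is why conditioning on the predictors is no longer harmless.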

Statistical Inference after Model Selection:

Valid Post-Selection Inference [ The Annals of Statistics, 2013, pdf] [2013 talk, pdf]
by Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao.
When predictors for statistical models are selected by looking at the data, statistical inference based on these models is in danger of being invalid. We show that confidence intervals may need to be widened considerably to protect against invalidation. This is a fundamental difficulty with statistical inference that has implications all the way down to how we teach statistics in introductory courses.
Software for computing PoSI constants: R source code and sample code.
Play with the sample code, then modify and apply to your predictor data.
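
To see why selection can invalidate naive inference, here is a small base-R simulation. It does not use the PoSI software, and its setup is an assumption made for illustration: pure-noise data with ten candidate predictors, where "selection" means reporting whichever coefficient looks most significant in the full model. The naive 95% test for the selected coefficient then rejects far more often than 5% of the time.

```r
set.seed(2)
n <- 50; p <- 10; n_rep <- 500

naive_reject <- replicate(n_rep, {
  X <- matrix(rnorm(n * p), n, p)
  y <- rnorm(n)                                   # global null: no predictor matters
  tvals <- summary(lm(y ~ X))$coefficients[-1, "t value"]  # drop the intercept row
  # Data-driven selection: keep the coefficient with the largest |t|,
  # then test it with the usual unadjusted cutoff.
  max(abs(tvals)) > qt(0.975, df = n - p - 1)
})

# Share of replications in which the selected coefficient looks "significant".
# If the ten tests were independent this would be about 1 - 0.95^10, roughly 0.4.
mean(naive_reject)
```

Widening the intervals enough to bring this rate back to the nominal level is the kind of correction the PoSI constants are meant to provide.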

Inference for EDA and Diagnostics: It is commonly thought that the visualization methods used in exploratory data analysis and model diagnostics are beyond, and even adverse to, statistical inference. This is not so. With some simple protocols it is possible to assign p-values to visual discoveries.

"Graphical Inference for Infovis" [pdf] (Wickham, Cook, Hofmann, Buja, 2010; IEEE Trans. on Visualization and Computer Graphics, Vol. 16, No. 6, Nov/Dec; Best Paper Award)
"Statistical inference for exploratory data analysis and model diagnostics" [pdf] with supplementary materials [pdf] (Buja, Cook, Hofmann, Lawrence, Lee, Swayne, Wickham, 2009; Philosophical Transactions of The Royal Society, A)
A precursor talk given at the Joint Statistics Meetings 1999, with Di Cook [.pdf].
A related topic is model checking with parametric bootstrap: I brought this up back in 2004 in a discussion of a paper by Andrew Gelman, who does the same with a posterior predictive approach. Here are Gelman's JCGS article, my discussion, and his rejoinder.
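
One such protocol is the "lineup": the plot of the real data is hidden among plots of data generated under a null hypothesis, and a viewer who has not seen the data is asked to pick the plot that stands out. The sketch below is a minimal base-R version under assumptions of my choosing for this example (made-up data, a permutation null of "no association", and a lineup of 20 plots); the names lineup and show_lineup() are hypothetical.

```r
set.seed(3)
n <- 60
x <- runif(n)
y <- 0.8 * x + rnorm(n, sd = 0.3)   # made-up "real" data with some association

m <- 20                             # lineup size: 1 real plot + 19 null plots
true_pos <- sample(m, 1)            # hidden position of the real plot

lineup <- lapply(seq_len(m), function(i) {
  if (i == true_pos) {
    data.frame(x = x, y = y)
  } else {
    data.frame(x = x, y = sample(y))   # permuting y simulates "no association"
  }
})

# Draw the m panels in a 4 x 5 grid, labelled by position only.
show_lineup <- function(lineup) {
  op <- par(mfrow = c(4, 5), mar = c(1, 1, 2, 1))
  on.exit(par(op))
  for (i in seq_along(lineup)) {
    plot(lineup[[i]]$x, lineup[[i]]$y, axes = FALSE, xlab = "", ylab = "", main = i)
    box()
  }
}
if (interactive()) show_lineup(lineup)

# Under the null, the real plot is exchangeable with the null plots, so a single
# viewer picks it with probability 1/m. A correct pick has p-value 1/m.
p_value <- 1 / m
```

Replacing the permutation step with draws from a fitted model turns the same scheme into a parametric-bootstrap model check.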

Multidimensional Scaling:

Lisha Chen's thesis resulted in the following two articles:
Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing (Chen and Buja; JMLR 2013) [.pdf]
Local Multidimensional Scaling for Nonlinear Dimension Reduction, Graph Drawing and Proximity Analysis (Chen and Buja; JASA 2009) [.pdf]
Data Visualization With Multidimensional Scaling (Buja, Swayne, Littman, Dean, Hofmann and Chen; JCGS 2008) [.pdf]
Visualization Methodology for Multidimensional Scaling (Buja and Swayne; JoC 2002) [.pdf]

Visualization of Large Correlation Tables:

A Tool for Mining Large Correlation Tables: 'Association Navigator' [pdf]
This is a report (joint with Abba Krieger and Ed George) written for the Simons Foundation - Autism Research Initiative (SFARI). The work under a SFARI grant was the reason why we created an interactive tool for visualizing correlation tables for many hundreds of variables. The report draws its examples from the 'Simons Simplex Collection' (SSC), a large database of autism phenotype data.
Association Navigator [R software] written in the R language (currently only for MS Windows). You can load the software into an R interpreter by executing the following expression:
  source("http://stat.wharton.upenn.edu/~buja/association-navigator.R")
Then follow the simple instructions on page 35 of the above report to apply the software to your own numeric data matrix.

Business Topics:

Quasi-Darwinian Selection in Marketing Relationships [.pdf] joint with N. Eyuboglu;
(Journal of Marketing, Oct 2007, featured JM blog article and a finalist for JM's 2007 Harold H. Maynard Award)
Along with the paper go a few scenario calculations that are not included in the article: [.pdf]
Different Worlds: Do Recommender Systems Fragment Consumers' Interest? [article] In his Ph.D. thesis, Dan Fleder devised a scheme whereby observational data about consumers of music before and after joining a recommender service could be interpreted as describing a natural experiment. As a consequence, he got as close as conceivable to causal inference about the effects of recommender systems on consumers in a particular setting. Here is a short version in Knowledge@Wharton:

Penalized Singular Value Decompositions with Jianhua Huang and Haipeng Shen:

The Analysis of Two-Way Functional Data Using Two-Way Regularized Singular Value Decompositions (Huang, Shen and Buja, JASA, 2009) [.pdf], with supplementary material [.pdf]
Functional principal components analysis via penalized rank one approximation (Huang, Shen and Buja; Electronic Journal of Statistics, 2008) [.pdf]

High-Dimensional Data Visualization with grand tours and guided tours, joint with Deborah Swayne, Di Cook, Dan Asimov, and Catherine Hurley:

Theory of Dynamic Projections in High-Dimensional Data Visualization [pdf] Describes the invariant Riemannian geometries on Stiefel manifolds.
Computational Methods for High-Dimensional Rotations in Data Visualization [pdf]
Appeared in "Handbook of Statistics" (eds. E. Wegman, C. R. Rao; 2005). (An older version that had both papers in one should be considered out of date.)
Differential geometry for dynamic projections: invariant Riemannian geometries on Stiefel manifolds [.pdf]
Computational Methods for High-Dimensional Rotations in Data Visualization [.pdf]
Software architecture for interactive dynamic data visualization systems [.pdf]
Methodology for viewing high-dimensional data with dynamic projections [.pdf]
XGobi manual [.pdf]
Grand tours and projection pursuit [.pdf]
Projection pursuit indices [.pdf]
"Prosections": Theory of the synthesis of projecting and sectioning multivariate data clouds [.pdf]
After Asimov's seminal article on grand tours, here is the first proposal for extending the idea to correlation and regression tours [.pdf]

Multivariate Analysis: A talk I gave at an econometrics workshop at Stanford. It tries to answer the question of how to choose the "reference metric" or the constraint in multivariate methods based on eigendecompositions. [.pdf]

Loss Functions for Binary Class Probability Estimation: Structure and Applications. (Former title: Degrees of Boosting) [.pdf] This paper started out as work on boosting but turned into something different; joint with Werner Stuetzle and Yi Shen.
Yi Shen's 2005 Ph.D. thesis on cost-weighted class probability estimation [pdf]

Cost-Weighted Boosting with Jittering and Over/Under-Sampling: JOUS-Boost [pdf]
(Journal of Machine Learning Research 8 (Mar), 409-439, 2007). On a simple modification of boosting, joint with David Mease and Adi Wyner.

Observations on Bagging [pdf], joint with Werner Stuetzle
(Statistica Sinica, Special Issue on Machine Learning, 16 (2), 323--352 (2006))
A preliminary version and a companion paper which I keep posted because others have started referring to them: The Effect of Bagging on Variance, Bias, and Mean Squared Error [.pdf] PPT slides,
Smoothing Effects of Bagging [.pdf]

Calibration for Simultaneity: (Re)Sampling Methods for Simultaneous Inference with Applications to Function Estimation and Functional Data [.pdf, 1.7MB], with Wolfgang Rolke.

Visual Comparison of Datasets using Mixture Decompositions [.pdf]
Alan Gous and Andreas Buja; Journal of Computational and Graphical Statistics, 13 (1), 1-19 (2004).
(We are permitted to post the color version of this paper. The printed version is b/w with gray-scale figures.)

Data Mining Criteria for Tree-Based Regression and Classification [.ps.gz]
A. Buja and Y.-S. Lee; Proceedings of KDD 2001, 27--36.

On writing:

Here is an article everybody should read:

The Science of Scientific Writing by Gopen and Swan [HTML] [.pdf]

originally published in the American Scientist, retyped and posted with permission.
It's the single best piece on writing in the sciences--no exaggeration!