ISB
SMMD: Statistical Methods for Management Decisions
Term 1, 2003

Instructors

Prof. Robert Stine	stine@wharton.upenn.edu
			444 Huntsman Hall, Statistics Department
			University of Pennsylvania
			Philadelphia, PA USA  19104-6340

Prof. Richard Waterman	waterman@wharton.upenn.edu
			400 Huntsman Hall, Statistics Department
			University of Pennsylvania
			Philadelphia, PA USA  19104-6340

Overview. Classes 1-7.

This initial portion of Statistical Methods for Management Decisions introduces you to statistical ideas as they apply to managers. I expect that students taking the course are familiar with basic statistical concepts as covered by Prof. Sukumar in the pre-term basic statistics course (e.g., measures of location and variation, descriptive graphics such as histograms, and the foundations of probability). In these first three weeks, we will consider two basic themes: first, recognizing and describing the variation in everything around us, and then modeling and making decisions in the presence of this variation. The concepts in this course reappear in many other classes and are fundamental in the remaining lectures that describe models that delve into explanations for the presence of variation. It will be worth your effort to understand the ideas, as well as the computational methods and software, from the class. Statistical Methods for Management Decisions is separated into 14 class sessions. Professor Stine will deliver the first 7 lectures, and Prof Waterman the second set of 7. In the class sessions, I will assume that you have read the assigned sections of the texts. You should also read the introductions of the examples to be covered in the coming class. Our focus is interpreting the meaning of the results in the management context. You will be using a PC to describe and analyze data using the JMP-IN software introduced by Prof Sukumar. You will find the software manual quite useful. This program resembles a spreadsheet in some ways but has many specialized graphical features not found in Excel and its cousins; working with JMP-IN should have some carry-over value for other courses.

Materials

Foster, Stine & Waterman (1998) Basic Business Statistics. Springer, New York.
Foster, Stine & Waterman (1998) Business Analysis Using Regression. Springer, New York.
Supplemental Materials
- Freedman, D., R. Pisani, and R. Purves. Statistics, Third Edition. Norton, New York.
- Hildebrand, D. and L. Ott. Basic Statistical Ideas for Managers. Duxbury, Belmont, CA.

Class	      Date			Lecture Topic				
1	Mon  May 12	Sources of variation.
2	Tues May 13	Quality control and standard error.
3	Wed  May 14	Confidence intervals.	
4	Thur May 15	Sampling (and introduction to tests).	
5	Mon  May 19	Hypothesis testing and making decisions.
6	Wed  May 21	Covariance and portfolios.
7	Fri  May 23	Correlation, covariance and regression.
8	Mon  May 26	Regression assumptions & diagnostics.
9	Wed  May 28	Introduction  to multiple regression.	
10	Fri  May 30	Understanding collinearity.	
11	Mon  June 2	Categorical variables (two levels).	
12	Wed  June 4	Multilevel categorical variables.	
13	Mon  June 9	Logistic regression.            
14	Wed  June 11	Time series.

Exams and Grades

There will be an exam that covers the material though Lecture 7, and another exam at the end of this course. While the second exam emphasizes the material in Lectures 8-14, this material is inherently cumulative and the ideas of the first seven lectures reappear frequently in later classes.

Classes

Class 1. Sources of variation. Monday May 12

This class introduces simple, effective ways of describing data. All of the computations and graphing will be done by software. Our task – and it is the important task – is to learn how to interpret the results and communicate them to others. The priorities are: first, displaying data in useful, clear ways; second, interpreting summary numbers sensibly. The examples of this class illustrate that data from diverse applications often share characteristic features, such as a bell-shaped (or normal) histogram. When data have this characteristic, we can relate various summary measures to the concentration of the data, arriving at the so-called empirical rule. The empirical rule is our first example of a useful consequence of a statistical model, in this case the normal model for random data. Whenever we rely upon a model such as this, we need to consider diagnostics that can tell us how well our model matches the observed data. Most of the diagnostics, like the normal quantile plot introduced in this lecture, are graphical. Why do data vary? When we measure GMAT scores, stock prices, or executive compensation, why don't we get a constant value for each? Variation arises for many reasons. Much of the statistical analysis of data focuses upon discovering sources of variation in data. Often groups of values can be clustered into subsets identified by some qualitative factor such as industry type, and we can see that the resulting clusters have different properties. Quantitative factors also can be used to explain the presence of variation using methods like regression analysis. Several examples in this class analyze data that flow in over time. Quite often, a crucial source of variation is the passage of time. Much of the total variation over a period of time often arises from a time trend or seasonal pattern. In the context of quality control, variation arises from the inherent capability of the process as well as from special factors that influence the properties of the output. We will continue our analysis of control charts in the next class.

Topics

Normal distribution, empirical rule, skewness and outliers.
Multiple boxplots for comparison.
Trends and seasonality.
Statistical independence; random variation versus systematic pattern.

Examples

Skewness and variation in executive compensation.
Patterns in early international airline passenger data.
Monitoring an automotive manufacturing process.

Readings

FSW, Basic Business Statistics, Class 2-3

Class 2. Quality control and standard error. Tuesday, May 13

Control charts are important tools for insuring quality in manufacturing. This class shows how control charts are constructed and used to monitor various types of processes. The key idea in developing these charts combines the empirical rule from Class 1 with our observation that summary measures such as the mean vary less than the original data. How much less? Today we show how variation in the average, as measured by its standard error, is related to variation in the individual values. We then determine, with the help of the normal model and the empirical rule, how to set one type of control limits. This is a hard class with some very important concepts that will appear frequently in the rest of this course.

Topics

Populations and samples.
Parameters (population mean $\mu$ and variance $\sigma^2$ ) and their estimators.
Standard error of the mean ( $s/\sqrt{n}$ and $\sigma/\sqrt{n}$ ).
Control limits in quality control charts; alternative rules.
Central Limit Theorem.

Examples

Control charts for motor shafts (continued).
Control chart analysis of car trunk seam variation.
Analysis of production of computer chips (supplemental)

Readings

FSW, Class 4

Class 3. Confidence intervals. Wednesday, May 14

We all sense that large samples are more informative than small ones, because of something vaguely called the “law of averages.” This class considers how to take samples, and explains the basic mathematical results that justify our intuition. The normal distribution plays a key role. The standard error of the sample average measures the precision of the average and is related to how close one might expect the average to approximate the unknown true process mean. A confidence interval allows us to quantify just how close we expect the sample average to be to the process mean. After discussing confidence intervals, we introduce several issues in sample survey design. Often, if we are dissatisfied with the width of the confidence interval and want to make it smaller, we have little choice but to reconsider the sample – larger samples produce shorter intervals. Larger samples also cost more and may not be the most efficient way to gather information.

Topics

Confidence vs. precision; confidence coefficient.
Checking assumptions.
Simple random sample.
Questionnaire design and sample validity.
Randomization and controlled experiments.

Examples

Interval estimates of the process mean (continued).
Purchases of consumer goods.

Readings

FSW, Class 5
Review Sections 6.1-6.3; skim Section 7.1; read Sections 7.2-7.4, 7.7.

Class 4. Sampling. Thursday, May 15

This class introduces the basic issues concerned with collecting information. The problem of making valid inference about a population based on information obtained in a sample will be the main focus. For this inference to be valid the sample needs to be representative. Though this sounds like common sense, subtle biases often creep in when sampling is done, and when this is the case, the inference may be very misleading. We will discuss some of these biases and explain sampling methods designed to avoid bias. A key idea is the concept of a random sample. The planned introduction of randomness into the sample allows valid and reliable inferences to be made. Naturally this randomness introduces some uncertainty, but the measures of variability introduced in previous classes provide a means for quantifying this uncertainty. As time permits, we’ll also start our study of statistical tests during the last hour in this class, so you should also look at the first readings for Lecture 5.

Topics

Sampling: the cost effective way of gathering information about a population.
Inference about the population; why the sample must be representative.
Random sampling schemes that ensure a representative sample.
Sample biases (as the sample size grows the estimates stay bad).
Non-response and missing data, confounding, length biased samples.
Hypothesis tests; null and alternative hypotheses.
False positives and false negatives (a.k.a. Type I and Type II errors).
One-sided and two-sided comparisons.

Example

Hotel satisfaction survey.

Reading

FSW, Class 6-7

Class 5. Hypothesis testing and making decisions. Monday, May 19

In quality problems, we need to be assured that the process is functioning as designed, and ideas related to the confidence interval for the process mean are fundamental. In making a decision between two competing alternatives, new issues arise. Statistics provides a number of tools for reaching an informed choice. Which tool, or statistical method, to use depends upon various aspects of the problem. In making a choice between two alternatives, questions such as “Is there a standard approach?”, “Is there a status quo?”, “What are the costs of incorrect decisions?”, “Are such costs balanced?” become paramount. Often, specially designed tests based on experiments can be more informative at lower cost (i.e., smaller sample size). As one might expect, using these more sophisticated procedures introduces trade-offs, but the costs are typically small relative to the gain in information. When faced with a comparison of two alternatives, a test based on paired data is often much better than a test based on two distinct (independent) samples. Why? If we have done our experiment properly, the pairing lets us eliminate background variation that otherwise hides meaningful differences. Getting proper samples is often simply not possible. Particularly when making two-sample comparisons, one often discovers that the two samples differ in more ways than one expects. In this case study, we consider how such confounding factors influence the outcome of statistical tests and show how, with a little luck and the right questions, we can avoid the worst mistakes.

Topics

Hypothesis tests; null and alternative hypotheses.
Two-sample and paired tests; experimental design.
Checking assumptions.
False positives and false negatives (a.k.a. Type I and Type II errors).
One-sided and two-sided comparisons.
Evidence vs. importance; statistical significance versus substantive importance.
Power of a test.
Confounding.

Examples

Selecting a painting process.
Taste test comparison of teas.
Pharmaceutical sales force comparison.
Wage discrimination.
Effects of re-engineering a food processing line. (supplemental)
Analysis of time for service calls. (supplemental)

Readings

FSW, Class 7-9

Class 6. Covariance, correlation and portfolios. Wednesday, May 21

Tests based on paired comparisons discussed in Class 5 often are able to detect differences that would require much larger samples without the pairing. Why does this work? Owners of stock often want to reduce the ups and downs of the value of their holdings – at least they want to reduce the downs! One method for reducing the risk associated with their holdings is to diversify these holdings by buying various stocks. How should the stocks for a portfolio be chosen? The notion of covariance as a measure of dependence provides a means to answer both of these questions and provides important insights that are fundamental to regression analysis, the key tool developed in the second half of this course.

Topics

Negative and positive association.
Independent and dependent measurements.
Covariance and correlation.
Scatterplot matrix.
Effect of dependence on the variance of an average.
Risk reduction via portfolios.

Examples

Building a portfolio.
Additional handout on portfolios.

Reading

FSW, Class 10

Class 7. Regression. Friday, May 23

This final class reviews the ideas covered in the first 6 lectures, and introduces regression in the context of finance. In particular, we consider how to fit regression lines that model the dependence between two variables, and then relate the slope and intercept of this line to the “beta” of a stock.

Examples

Performance of Mutual Funds
Value of diamonds

Reading

FSW, Class 11
FSW, Business Analysis Using Regression, Class 1-3.

Class 8. Regression assumptions and diagnostics Monday, May 26

Fitting lines and curves to data. Regression model assumptions. Here we will quantify relationships between variables, both linear and possibly non-linear. The benefits of fitting a model to data will be discussed. Prediction and confidence intervals in regression. Among the main goals of regression analysis are to predict new observations (to forecast) and to quantify the uncertainty in these predictions. This class will discuss the sources of uncertainty in forecasts.

Topics

Plan: second half of Case Book Chapter 2 + Chapter 3
Review assumptions in regression
Understanding outliers in regression
Prediction and confidence intervals

Examples

Display
Utopia
Direct
Cottages

Reading

FSW, Business Analysis Using Regression, Class 2-3.

Class 9. Introduction to multiple regression. Wednesday, May 28

Many real world relationships are driven by more than a single input variable. For example, costs may be driven by both labor and transportation. Multiple regression extends the idea of simple regression to the situation where there are two or more explanatory variables in the model.

Topics

Prediction and confidence intervals
Introduction to multiple regression, p.105
Collinearity - correlated x-variables, p.133

Examples

Car89

Reading

FSW, Business Analysis Using Regression, Class 4.

Class 10. Understanding collinearity. Friday, May 30

Multiple regression offers the potential to isolate the impact of one variable on an outcome variable, controlling for other factors. However, dependence between the explanatory variables, collinearity, means that care must be taken in the interpretation of the regression coefficients. We will study the impact of collinearity, ways to identify its presence in an analysis, and strategies for dealing with it.

Topics

Hypothesis testing for regression slopes - chapter 5 in case book
Using categorical X's in regression (2 groups) - chapter 6 in casebook

Examples

Stocks
Parcel

Reading

FSW, Business Analysis Using Regression, Class 5.

Class 11. Two level categorical variables in regression. Monday, June 2

A key business problem is to compare the average of one variable across divisions of a company. Creating models with these categorical predictor variables allows for comparative analyses such as benchmarking studies. This class will discuss how to include two level categorical predictors in a regression model.

Topics

Hypothesis test examples - chapter 5 in case book
Using categorical X's in regression (2 groups) - chapter 6 in casebook

Examples

Manager
Salary

Reading

FSW, Business Analysis Using Regression, Class 6.

Class 12. Multi-level categorical variables in regression. Wednesday, June 4

Not surprisingly, many problems have a categorical variable with more than two levels, for examples sales territories could be divided in North, East, South and West - a four level categorical. Including this sort of variable in a regression is the focus of this class.

Topics

Using multi-level categorical X's in regression (more than 2 groups) - chapter 7 in casebook
Notice the hyperlink to the example that shows the custom test dialog for a partial-F test

Examples

Prodtime

Reading

FSW, Business Analysis Using Regression, Class 7.

Class 13. An introduction to logistic regression. Monday, June 9

Logistic regression is concerned with modeling a binary outcome variable. A classic example is ``buy'' or ``don't buy'' a product in a marketing study. This technique is extremely useful, but the interpretations are a little harder than in regular regression models. The class will provide an introduction to this sort of modeling and use logistic regression as an example of a ``classifier'', providing for a discussion of classification problems, as opposed to regression problems.

Topics

Regression for a categorical response.

Examples

Orings

Reading

FSW, Business Analysis Using Regression, Class 11.

Class 14. Techniques for modeling time-series. Wednesday, June 11

Time series data are ubiquitous in business problems. These data-sets often violate the assumptions behind regular regression models. For this reason additional techniques need to be used in order to create a valid and useful model. These time-series techniques will be the subject of this class.

Topics

Modeling time series
Graphical for autocorrelation
Durban Watson statistic
Lags and differencing

Examples

Cellular
Compsale

Reading

FSW, Business Analysis Using Regression, Class 12.

Instructors
Overview. Classes 1-7.
Materials
Exams and Grades
Classes

Richard P. Waterman
2003-05-09