STATISTICS 926 - HOMEWORK 1 Due Mon 2014/09/22, midnight DATA VIZ PROJECT: INFERENTIAL DENSITY COMPARISON ---------------------------------------------------------------- Email an R package by the deadline to: stat926.at.wharton@gmail.com The subject line should contain the following: Subject: HW 01, 2014, LastName, FirstName ---------------------------------------------------------------- STATEMENT OF THE PROJECT: Create an R package that generates plots as shown in the four frames of http://stat.wharton.upenn.edu/~buja/STAT-926/fig-inferential-density-comparison.pdf The plots show density comparisons endowed with statistical inference. As shown they are intended for one quantitative variable and one two-group factor variable. You should, however, also provide for a multi-group factor variable with three or more groups (see instructions at the end). DETAILS: Inference is based on permutation testing, calibrated for simultaneity on a grid consisting of a large but finite number of locations on the quantitative axis. The black curve below the axis shows the magnitude of the difference of the densities, i.e., abs(d1.hat-d2.hat). The gray areas are the "insignificance zones". Wherever the black curve drops below the zones, there is statistically significant difference between the densities. The different shades of gray indicate 50%, 10% and 5% simultaneous (FWER) statistical significance. You should also provide for a simple spaghetti version of the insignificance zone whereby null density difference curves are plotted on top of each other. How is simultaneity for the density differences on a grid achieved? The idea is to generate a sensible 1-parameter nested family of insignificance zones and select from this family those zones that have an estimated 50%, 90% and 95% simultaneous coverage under the null hypothesis. The one-parameter family of insignificance zones is given by a family of functions f(x,t) >= 0, where x is in the grid points {x1,...,xn} and t is the parameter of the family. The function f(x,t) must be monotone increasing in t, which grants nestedness of the insignificance zones: I(t) = { d=(d1...dn) in R^n : di <= f(xi,t) forall i=1...n } so that I(t1) is a subset of I(t2) if t1 <= t2. (Note: 'n' is not the sample size but the number of grid points on which the densities are evaluated for plotting.) How do we construct insignificance zones? Here is a slight change from what was said in class: We are not going to use exact estimated pointwise quantiles of the permutation distribution; instead, we use multiples of the pointwise estimated sdev: f(xi,t) = t * sqrt(E[ (d1.hat(xi) - d2.hat(xi))^2 | H0,... ]) The makes sense because under H0 the population densities are the same: d1 == d2, hence d1.hat(xi) - d2.hat(xi) are asymptotically centered under H0. Therefore f(xi,t=1) is an estimate of the sdev of d1.hat(xi) - d2.hat(xi). If these differences are asymptotically normal (which they are), then f(xi,t) for t>0 are estimates of POINTWISE quantiles for 2-sided null coverage Phi(t)-Phi(-t) = 2*Phi(t)-1. Your task will be (1) to estimate f(xi,t=1) from the permutation distribution and (2) to find the values of t.50, t.90 and t.95 for which I(t.50) has estimated SIMULTANEOUS null coverage 0.50, and analogously for I(t.90) and I(t.95). Operationally, here is what needs to be done: (1) Simulate N.sim random permutations of the group labels, compute density estimates using 'density()' for both random groups, making sure that all calls to this function evaluate the density estimates at the same grid points. Store the absolute values of the evaluated density differences in a matrix of size N.sim x N.grid. Estimate f(xi,t=1) and store it in a vector of size N.grid. Determine the smallest t.max such that I(t.max) has 100% simultaneous null coverage under the simulation. Then step down on a rough ladder of multiples t.max*.99, t.max*.98, ... to find approximate values of t.95, t.90 and t.50, computing the estimated simultaneous coverage for each of these multiples. No need to be smarter about the search. (2) Draw the whole thing as in the four frames of the figure. Provide parameters in your function so the user can choose between the depiction of I(t.95), I(t.90) and I(t.50) as gray areas as shown in the left and bottom frames of the figure, or a crude spagghetti plot as shown in the top right frame of the figure (this does not require the ladder search of (1)). (3) Write two wrapper functions for categorical variables with more than two values as follows: (3a) One wrapper should produce a matrix of plots for all pairwise group comparisons. Since plots (i,j) and (j,i) should show the same plot, find out how to draw frames out of order so you are not computing the same thing twice. (3b) The second wrapper should produce a series of plots for the comparison of each group against the union of the rest. Try to be minimally smart about how you arrange the series of plots. You may provide a parameter for the maximum number of plots in a row and default it to 5. Use good aesthetics: All frames should be approximately square or only slightly flat Margins should not be wasteful. Titles should be informative. Use a dataset of your choice for illustration and put it in your package if it is not generally available in R distributions. ================================================================