Stat 7770, Module 06

Richard Waterman

February 2023

Objectives

Graphics.
Why look at graphs?
Matplotlib and the seaborn libraries.
Univariate summaries.
- Histograms.
- Kernel Density estimates.
- Boxplots.
- Bar charts.
- Pie charts.

Objectives (ctd.)

Bivariate summaries
- Scatterplots.
- KDE in 2 dimensions.
- Mosaic plots.
- Univariate plots, over the levels of a categorical variable.
- Cutting (binning) a continuous variable to view it as a categorical.
The ‘pairs’ plot.

Why graph the data?

It’s a cliché, but true: “A picture is worth a thousand words”.
The human eye and brain are very well developed to spot patterns and relationships.
If you have geographic data, it only makes sense to map it.
Reasons for graphical activities:
- Exploratory data analysis.
  - Speed and simplicity of the tools are of value here.
- Assumption checking from models.
  - Usually these are “canned” plots, a set of standards that we always do and follow.
- Presentation graphics:
  - For presentations, papers or books, these plots make your work stand out.
  - You want fine control over the output and you spend a long time tweaking the details.

The matplotlib and seaborn libraries

These are two very popular libraries.
Matplotlib is a scientific visualization library that takes advantage of numpy so it can deal with large data sets.
It provides very fine control of the output graph; fonts, axes and so on.
It was developed to provide “Matlab” like functionality in Python.
Seaborn sits atop matplotlib and provides a high level framework, to make good quality graphics, with minimal code.
Seaborn is integrated with pandas’ data structures (Series, DataFrame).
The overall plan would be to make decent graphics with seaborn, and if required, tweak and customize them by passing arguments down to matplotlib.

Univariate graphics

Start off by setting up the libraries and the data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from datetime import datetime, date

# Read in some data
os.chdir('C:\\Users\\water\\Dropbox (Penn)\\Teaching\\7770s2023\\DataSets') 
op_data = pd.read_csv("Outpatient.csv", parse_dates=['SchedDate', 'ApptDate'])
op_data['ScheduleLag']  = op_data['ApptDate'] - op_data['SchedDate']
op_data.columns

Index(['PID', 'SchedDate', 'ApptDate', 'Dept', 'Language', 'Sex', 'Age',
       'Race', 'Status', 'ScheduleLag'],
      dtype='object')
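
Subtracting one datetime column from another gives a timedelta column, and the `.dt.days` accessor then pulls out the whole days as integers. A minimal sketch on a made-up data frame (the dates are invented for illustration and are not from Outpatient.csv; only the column names mirror the real data):

```python
import pandas as pd

# Hypothetical appointment records; the dates are made up.
toy = pd.DataFrame({
    "SchedDate": pd.to_datetime(["2023-01-02", "2023-01-10", "2023-02-01"]),
    "ApptDate": pd.to_datetime(["2023-01-09", "2023-01-10", "2023-03-03"]),
})

# Subtracting datetime columns yields a timedelta column.
toy["ScheduleLag"] = toy["ApptDate"] - toy["SchedDate"]

# The .dt accessor exposes the whole-day component as integers.
toy["SL"] = toy["ScheduleLag"].dt.days
print(toy)
```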

Setting a theme

Seaborn comes with themes that provide an overall aesthetic for the plots.
We will start by using the default, then change it later.
Note that the boxplot easily identifies the gross outlier(s) and shows the skewness of the distribution.

sns.set() # The default theme.
sns.set(rc={'figure.figsize':(4.5,3.18)}) # A default plot size for axes level plots.
op_data['SL'] = op_data['ScheduleLag'].dt.days # Create a new variable that has schedule lag in days.
sns.boxplot(x='SL', data=op_data); # Create a default boxplot. The semi-colon is a trick to stop unwanted output on the terminal.
pass # Another way to suppress unwanted text output from the plot commands.
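
The points a boxplot draws individually are the ones that fall beyond the whiskers. With the default setting (whis=1.5) the cut-offs sit 1.5 times the interquartile range beyond the quartiles. A rough sketch of that rule, on made-up numbers (the values and variable names are for illustration only):

```python
import numpy as np

# Made-up schedule lags in days, with one gross outlier.
lags = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.percentile(lags, [25, 75])  # Lower and upper quartiles.
iqr = q3 - q1                           # Interquartile range.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones drawn individually as "fliers".
fliers = lags[(lags < lower_fence) | (lags > upper_fence)]
print(q1, q3, fliers)
```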

Changing some of the boxplot parameters

Below we change the boxplot color and the size of the data points.

# Note below that the data argument is not quoted, but the x argument is quoted.
# "Flier" is the name for the outliers.
sns.boxplot(x='SL', data = op_data, color='red', fliersize=1.0);  
pass

Passing arguments to matplotlib

The underlying boxplot command in matplotlib has the rather complicated form:

Axes.boxplot(self, x, notch=None, sym=None, vert=None, whis=None, positions=None, widths=None, patch_artist=None,
             bootstrap=None, usermedians=None, conf_intervals=None, meanline=None, showmeans=None, showcaps=None, showbox=None,
             showfliers=None, boxprops=None, labels=None, flierprops=None, medianprops=None, meanprops=None, capprops=None, 
             whiskerprops=None, manage_ticks=True, autorange=False, zorder=None, *, data=None)

We will add a ‘notch’ that shows a 95% confidence interval for the median (notch=True) and remove the outlier points (sym="").
The key idea is that you can simply pass these extra arguments down from seaborn to matplotlib.

The tweaked boxplot

g = sns.boxplot(x='SL', data = op_data, color='red', fliersize=1.0, notch=True, sym=""); 
print(type(g))

<class 'matplotlib.axes._subplots.AxesSubplot'>

Figure and axes level plots

The type of plot we just did is called an “axes-level” plot.
There are also some high level plots, called “figure-level”, that are meant for fast exploratory data analysis.
You can change the output of the figure-level plots, simply by changing the “kind” option.
These high-level figure plotting functions do most of the work for you, choosing good defaults for the parameters.

Using the catplot command

You can also make the box plot, from the high-level (figure-level) plotting function, ‘catplot’.
The plot size can be controlled with the height and aspect arguments.

g = sns.catplot(x='SL', data = op_data, kind="box", height = 2, aspect=2) # A boxplot, from the "catplot" figure level command. 
g.set_xlabels("Schedule lag") # You can tweak these plots using built in methods.
g.fig.suptitle('Distribution of schedule lags')
print(type(g));

<class 'seaborn.axisgrid.FacetGrid'>
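
To see the axes-level and figure-level distinction side by side, here is a small sketch on made-up data (the data frame and the title are invented; the 'Agg' backend is an assumption so that the code runs without opening a plot window):

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend, so no window is opened.
import matplotlib.axes
import pandas as pd
import seaborn as sns

# Made-up data; 'SL' mimics the schedule-lag column name.
toy = pd.DataFrame({"SL": [1, 3, 4, 7, 10, 15, 30]})

ax = sns.boxplot(x="SL", data=toy)                # Axes-level: returns a matplotlib Axes.
grid = sns.catplot(x="SL", data=toy, kind="box")  # Figure-level: returns a FacetGrid.

print(isinstance(ax, matplotlib.axes.Axes))
print(isinstance(grid, sns.FacetGrid))

# The FacetGrid wraps a matplotlib Axes, which is still available for tweaking.
grid.ax.set_title("Toy schedule lags")
```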

Changing the ‘kind’ argument

A violin plot shows the distribution of a variable, in a way that makes it easier to compare distributions across levels of a categorical variable. For now, we will do a single plot.
All we have to do here is switch out the ‘kind=“box”’ to ‘kind=“violin”’:

g = sns.catplot(x='SL', data = op_data, kind="violin",height=4) # A violin plot, from the "catplot" figure level command. 
g.set_xlabels("Schedule lag") # You can tweak these plots using built in methods.
g.fig.suptitle('Distribution of schedule lags');
pass

Histograms and distributions

The figure-level command for a histogram is displot(), which by default will create a histogram of a numeric variable.
It can also add a kernel density estimate (KDE) of the distribution. Basically, the KDE smooths the tops of the bins.

sns.displot(op_data['SL'], kde=True, height=4,aspect=2);
pass
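
One way to think about a KDE is that it places a small smooth bump at each observation and averages the bumps. A minimal sketch with a Gaussian kernel and a hand-picked bandwidth (both are assumptions for illustration, as seaborn chooses its own bandwidth; the data are made up):

```python
import numpy as np

# Made-up observations (illustrative only).
obs = np.array([2.0, 3.0, 3.5, 7.0, 8.0])
bandwidth = 1.0  # Assumed smoothing width, picked by hand for this sketch.

# Evaluate the estimate on a grid that extends well beyond the data.
grid = np.linspace(obs.min() - 5 * bandwidth, obs.max() + 5 * bandwidth, 2001)

# One Gaussian bump per observation, each with area 1.
z = (grid[:, None] - obs[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# The KDE is the average of the bumps, so its total area is also (about) 1.
density = bumps.mean(axis=1)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual observations more closely.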

Fine tuning the plot

Below we remove the KDE and add a “rug plot”, that marks the individual observations.

sns.displot(op_data['SL'], kde=False, rug=True,height=4, aspect=2); # Remove the kernel density estimate and add a rug plot.
pass

Saving a plot to a file

There are various graphic file formats available, such as .png, .jpeg, .svg.
You simply have to include the file extension in the name to choose between formats.

import os
os.chdir('C:\\Users\\water\\Dropbox (Penn)\\Teaching\\7770s2023\\Images') 
sns.displot(op_data['SL'], kde=True, height=4, aspect=2); 
plt.savefig("output_{0}.png".format('OP'))  # savefig method for png format.
plt.savefig("output_{0}.jpeg".format('OP')) # savefig method for jpeg format.
plt.savefig("output_{0}.svg".format('OP'))  # savefig method for svg format.
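
The same idea in a form that runs anywhere: a sketch that saves one made-up plot in several formats to a temporary folder (the folder, the file names and the data are assumptions for illustration, standing in for the Images folder above):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # Non-interactive backend; plots go to files only.
import matplotlib.pyplot as plt

out_dir = tempfile.mkdtemp()  # A scratch folder for the output files.

fig, ax = plt.subplots(figsize=(4, 3))
ax.hist([1, 2, 2, 3, 3, 3, 8], bins=4)  # Made-up data.

saved = []
for ext in ("png", "svg", "pdf"):
    path = os.path.join(out_dir, "output_{0}.{1}".format("toy", ext))
    fig.savefig(path)  # The format is inferred from the file extension.
    saved.append(path)
plt.close(fig)

print([os.path.basename(p) for p in saved])
```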

Plotting a categorical variable

Using the kind=“count” argument on the figure-level “catplot”, we get a barplot showing the frequency of each value.
But it needs some work!

sns.catplot(x='Dept', kind="count", data=op_data, height=4, aspect = 2);
pass

Creating a new version of the plot

Below we pull out the top 5 departments, and create a new data frame with just rows from these Departments.
Using ‘set_xticklabels’ to rotate the labels by 45 degrees greatly improves the readability.
When saving the plot, we adjust the lower margin to accommodate the long labels.

#### Prep the data
top_dept = op_data['Dept'].value_counts()[:5] # Identify the top 5 Departments
new_dept = op_data.loc[op_data['Dept'].isin(top_dept.index), 'Dept'] # Subset the rows using '.isin' and keep the 'Dept' column.
new_dept = pd.DataFrame(data=new_dept) # Cast the Series to a DataFrame.

Plot the data

We can change the size of the catplot directly through the ‘height’ and ‘aspect’ parameters.

#### Build the plot
g = sns.catplot(x='Dept', kind="count", palette="ch:.25", data=new_dept, height=3, aspect=2) #ch stands for a "cube-helix" color palette.
g.set_xticklabels(rotation=45, horizontalalignment='right')
plt.subplots_adjust(bottom=0.4) # This adds more white space to the bottom of the plot.
plt.savefig("output_{0}.png".format('Dept')) # savefig method.

Faceting

Faceting is essentially the term for creating a grid of plots, showing conditional relationships (the distribution of one variable, over the levels of another).
You can do this by row and/or columns.

Faceting by the Sex column

new_dept = op_data.loc[op_data['Dept'].isin(top_dept.index)] # Keep all of the columns now.
g = sns.catplot(x='Dept', kind="count", palette="ch:.25", data=new_dept, 
                col="Sex", height=4, aspect=2) # Note the 'col' argument.
g.set_xticklabels(rotation=45, horizontalalignment='right');
pass

Row and column facets

Here we condition on Sex (rows) and Status (columns).

g = sns.catplot(x='Dept', kind="count", palette="ch:.25", data=new_dept, 
                row = 'Sex', col="Status", height=2, aspect=1.5) # Note the 'row' and 'col' arguments.
g.set_xticklabels(rotation=45, horizontalalignment='right')
pass
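
The grid that faceting produces has one row of panels per level of the row variable and one column per level of the column variable. A small sketch on made-up records (the departments and status labels are invented; only the column names mirror the outpatient data, and the 'Agg' backend is an assumption so nothing is displayed):

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend, so no window is opened.
import pandas as pd
import seaborn as sns

# Made-up records with 2 levels of Sex and 3 levels of Status.
toy = pd.DataFrame({
    "Dept":   ["A", "A", "B", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
    "Sex":    ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],
    "Status": ["Arrived", "Arrived", "Canceled", "Canceled", "No Show", "No Show",
               "Arrived", "Canceled", "No Show", "Arrived", "Canceled", "No Show"],
})

g = sns.catplot(x="Dept", kind="count", data=toy,
                row="Sex", col="Status", height=2, aspect=1.5)

# The FacetGrid stores its panels in a 2-d array: rows by columns.
print(g.axes.shape)
```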

Making a pie chart

There isn’t a pie chart type in seaborn, but we can use one from pandas.

dept_counts = op_data['Dept'].value_counts()[:5] # Just work with the  frequencies here.
dept_counts.plot.pie(figsize=(6, 6)); # This is a pandas plot.
pass

Color in Seaborn

Seaborn comes with built in color palettes and unless you are artistically talented, it is probably best to stick with them.
You can view the default palette easily.

current_palette = sns.color_palette()
sns.palplot(current_palette)

sns.palplot(sns.color_palette("Paired")) # The paired palette.

A continuous color palette

When representing continuous or sequential data, rather than categorical data, a sequential palette may be more appropriate.

sns.palplot(sns.color_palette("Blues"))

A list of possible colormaps

Possible values are: Accent, Accent_r, Blues, Blues_r, BrBG, BrBG_r, BuGn, BuGn_r, BuPu, BuPu_r, CMRmap, CMRmap_r, Dark2, 
Dark2_r, GnBu, GnBu_r, Greens, Greens_r, Greys, Greys_r, OrRd, OrRd_r, Oranges, Oranges_r, PRGn, PRGn_r, Paired, Paired_r,
Pastel1, Pastel1_r, Pastel2, Pastel2_r, PiYG, PiYG_r, PuBu, PuBuGn, PuBuGn_r, PuBu_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples,
Purples_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, Set1, Set1_r, Set2,
Set2_r, Set3, Set3_r, Spectral, Spectral_r, Wistia, Wistia_r, YlGn, YlGnBu, YlGnBu_r, YlGn_r, YlOrBr, YlOrBr_r, YlOrRd,
YlOrRd_r, afmhot, afmhot_r, autumn, autumn_r, binary, binary_r, bone, bone_r, brg, brg_r, bwr, bwr_r, cividis, cividis_r, cool,
cool_r, coolwarm, coolwarm_r, copper, copper_r, cubehelix, cubehelix_r, flag, flag_r, gist_earth, gist_earth_r, gist_gray,
gist_gray_r, gist_heat, gist_heat_r, gist_ncar, gist_ncar_r, gist_rainbow, gist_rainbow_r, gist_stern, gist_stern_r, gist_yarg,
gist_yarg_r, gnuplot, gnuplot2, gnuplot2_r, gnuplot_r, gray, gray_r, hot, hot_r, hsv, hsv_r, icefire, icefire_r, inferno,
inferno_r, jet, jet_r, magma, magma_r, mako, mako_r, nipy_spectral, nipy_spectral_r, ocean, ocean_r, pink, pink_r, plasma,
plasma_r, prism, prism_r, rainbow, rainbow_r, rocket, rocket_r, seismic, seismic_r, spring, spring_r, summer, summer_r, tab10,
tab10_r, tab20, tab20_r, tab20b, tab20b_r, tab20c, tab20c_r, terrain, terrain_r, twilight, twilight_r, twilight_shifted,
twilight_shifted_r, viridis, viridis_r, vlag, vlag_r, winter, winter_r

Colorbrewer

Seaborn can build palettes from the ColorBrewer color schemes: pass a ColorBrewer name, and optionally the number of colors, to color_palette.
See color brewer for more information.

custom_palette = sns.color_palette("Reds", 4)
sns.palplot(custom_palette)

custom_palette = sns.color_palette("Greens", 6)
sns.palplot(custom_palette)

Check out the green graphic

g = sns.catplot(x='Dept', kind="count", palette=custom_palette, data=new_dept, height=4, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');
pass

Cubehelix

Yet another way of creating sequential palettes is with the “cubehelix” command.
It takes many potential parameters and below is an example.

sns.palplot(sns.cubehelix_palette(n_colors = 8, start=0.8, rot=.4))

Defining your own colors

Computers view colors as made up of red, green and blue components.
How much of each there is determines the exact color.
Usually each RGB value goes between 0 and 255.
Often they are represented in hexadecimal notation (base 16), where two hexadecimal digits cover \(16^2 = 256\) values.
Below are four Wharton colors:
- Color 1: blue_one is (red = 0, green = 71, blue = 133).
- Color 2: blue_two is (red = 38, green = 36, blue = 96).
- Color 3: red_one is (red = 169, green = 5, blue = 51).
- Color 4: red_two is (red = 168, green = 32, blue = 78).

In hexadecimal notation these are

Note the “#” character, which indicates hexadecimal notation.
blue_one = "#004785"
blue_two = "#262460"
red_one = "#A90533"
red_two = "#A8204E"
You can use a ‘color dropper’ and then a decimal to hex converter to find the appropriate representation.
Decimal to hex converter: converter. Or use the built-in Python hex() function.
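
The conversion can also be done directly in Python. A short sketch using string formatting (the helper name rgb_to_hex is made up for this example; the RGB values are the four Wharton colors listed earlier):

```python
def rgb_to_hex(red, green, blue):
    """Convert 0-255 RGB components to a '#RRGGBB' string."""
    # '02X' means: two digits, zero padded, upper-case hexadecimal.
    return "#{:02X}{:02X}{:02X}".format(red, green, blue)

wharton_rgb = {
    "blue_one": (0, 71, 133),
    "blue_two": (38, 36, 96),
    "red_one": (169, 5, 51),
    "red_two": (168, 32, 78),
}

wharton_hex = {name: rgb_to_hex(*rgb) for name, rgb in wharton_rgb.items()}
print(wharton_hex)

# The built-in hex() converts a single component, with a '0x' prefix.
print(hex(133))
```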

The color dropper in MS Paint

Note the RGB code for the light blue color in the bottom right of the “Edit Colors” window.

wharton_colors = ["#004785", "#262460", "#A90533", "#A8204E"]
sns.set_palette(sns.color_palette(wharton_colors)) # Set the custom color palette.

Build the new graph with custom colors

We only have 4 colors, but 5 levels to the Department variable, so the colors get “recycled”.

g = sns.catplot(x='Dept', kind="count", palette=wharton_colors, data=new_dept, height=3, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');
pass
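
The “recycling” of colors can be pictured as wrapping around the palette once the colors run out. The sketch below mimics that effect with plain Python; it is an illustration of the idea, not seaborn's internal code, and the department names are placeholders:

```python
wharton_colors = ["#004785", "#262460", "#A90533", "#A8204E"]
departments = ["Dept 1", "Dept 2", "Dept 3", "Dept 4", "Dept 5"]  # Placeholder names.

# Wrap around with the modulo operator once the colors run out.
assigned = {
    dept: wharton_colors[i % len(wharton_colors)]
    for i, dept in enumerate(departments)
}

for dept, color in assigned.items():
    print(dept, color)
```

The fifth department ends up with the same color as the first, which is why a palette with at least as many colors as levels is usually preferable.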

Relationships between variables

It is very common to want to look at the distribution of a continuous variable over the levels of a categorical variable.
The seaborn command for this is ‘catplot’ which we saw before to do a box plot, but we will now do comparison boxplots.
I will work with the “top 5 departments” data frame:

Relationships between variables (ctd.)

sns.set_palette(sns.color_palette("flare")) # Use a different palette.
new_dept = op_data.loc[op_data['Dept'].isin(top_dept.index)] #Subset the data.
g = sns.catplot(x = 'Dept', y = "SL", data = new_dept, height=4, aspect=2) # The default "catplot"
g.set_xticklabels(rotation=45, horizontalalignment='right');
pass

Comparison boxplots

Here are box plots again, but now with one for each department.

sns.set_palette(sns.color_palette("Set2")) # Use a different palette.
g = sns.catplot(x='Dept',y="SL", kind="box", data=new_dept, height=3, aspect=2) # The comparison boxplots
g.set_xticklabels(rotation=45, horizontalalignment='right');
pass

Comparison violin plots

All we have to do is change the ‘kind’ argument.

g = sns.catplot(x ='Dept', y="SL", kind="violin", data=new_dept, height=3, aspect=2) # The comparison violin plots
g.set_xticklabels(rotation=45, horizontalalignment='right');
pass

Adding a second categorical variable to the plot

By using the “hue” argument, we can do the plot over the levels of another variable, to see how consistent the relationship is.

sns.catplot(x = 'Dept', y = "SL", hue ='Status', kind="box", data = new_dept, height=3, aspect=3); # Note the 'hue' argument.
pass

Another way to control the size of the plot

If you are using an axes level plot, you can set up the plot size in the following way:
- Use the boxplot command and the ‘ax’ argument.

f, ax = plt.subplots(1, 1, figsize = (10, 5)) # Set the size of the plot.
sns.boxplot(x = 'Dept', y = "SL", hue = 'Status', data = new_dept, ax=ax); # Note we are back to boxplot.
plt.savefig("output_{0}.png".format('Comps')) # savefig method for png format.

The ‘bar’ type

By default, a bar plot shows the average Schedule Lag, by Status, within each Department.
The red lines at the top of the bars are 95% confidence intervals for the mean.

g = sns.catplot(x='Dept', y="SL", kind="bar", hue='Status', data=new_dept, 
                height=4, aspect=3, errcolor="red", errorbar=('ci', 95)) # The 'bar' kind.
g.set_ylabels("Average Schedule lag");
pass
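
Seaborn computes the ‘ci’ error bars by bootstrapping. The idea can be sketched by hand as follows: resample the data with replacement many times, take the mean each time, and report the middle 95% of those means. The data, the seed and the number of resamples below are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(7770)  # Seeded so the sketch is reproducible.
lags = np.array([3, 5, 8, 13, 21, 2, 9, 30, 4, 6], dtype=float)  # Made-up lags.

n_boot = 1000  # Number of resamples (an assumed value).

# Resample with replacement and record the mean of each resample.
boot_means = np.array([
    rng.choice(lags, size=lags.size, replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the resampled means gives the interval.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(lags.mean(), lo, hi)
```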

Plotting the association between two continuous variables

The classic plot here is a scatterplot.
There is the option to do a KDE of the joint distribution.
As the outpatient data only has one continuous variable, we will use the car dataset instead.

os.chdir('C:\\Users\\water\\Dropbox (Penn)\\Teaching\\7770s2023\\DataSets') 
car_data = pd.read_csv("Car08_just_499.csv")
print(car_data.columns)

Index(['Make/Model', 'MPG_City', 'MPG_Hwy', 'Weight(lb)', 'Seating',
       'Horsepower', 'HP/Pound', 'Displacement', 'Cylinders', 'Origin',
       'Transmission', 'EPA_Class', 'Length', 'Fuel', 'HEV', 'Turbocharger',
       'Make', 'Model', 'GP1000M_City', 'GP1000M_Hwy'],
      dtype='object')

The default plot

This is a simple plot of the two variables.

sns.relplot(x="Weight(lb)", y="GP1000M_City", data=car_data,color="red");
pass

Coloring by a third variable

The ‘hue’ argument makes the points different colors according to another variable.

sns.relplot(x="Weight(lb)", y="GP1000M_City", hue="Transmission", data=car_data);
pass

Adding different markers

The ‘style’ argument changes the plotting character (which may be overkill).

sns.relplot(x="Weight(lb)", y="GP1000M_City", hue = "Transmission", style="Cylinders", data=car_data);
pass

Change the color palette

As usual, we can add a palette argument.

sns.relplot(x="Weight(lb)", y="GP1000M_City", hue="Transmission", style="Cylinders", palette="copper", data=car_data);
pass

Adding a size based component

We could potentially use another variable to determine the size of the points, via the ‘size’ argument.

sns.relplot(x="Weight(lb)", y="GP1000M_City", hue = "Transmission", size = "Horsepower", 
            sizes=(20, 200), style="Cylinders", palette="Reds", data=car_data, height=3, aspect=2);
pass

Plotting joint and marginal distributions

The ‘jointplot’ will also add the univariate distributions on the ‘margins’ (edges) of the scatterplot.

sns.jointplot(x="Weight(lb)", y="GP1000M_City", data=car_data, color="red", height=5);
pass

Using a KDE

We can show one- and two-dimensional KDEs by setting the ‘kind’ argument to “kde”.

sns.jointplot(x="Weight(lb)", y="GP1000M_City", data=car_data, color="red", kind="kde", height=5);
pass

The scatterplot matrix

A scatterplot matrix shows all of the pairwise bivariate relationships; in seaborn it is produced by the ‘pairplot’ command.

sns.set(style="whitegrid", font_scale=0.75) # Change the style
tmp_data = car_data[['GP1000M_City', 'Weight(lb)', 'Horsepower', 'Length', 'Transmission']]
sns.pairplot(tmp_data, hue="Transmission", height=2);
plt.savefig("output_{0}.png".format('Cars')) # savefig method for png format

Plotting two categorical variables

The usual plot for two categorical variables is called a “Mosaic plot” and plots the proportion in each level of a y-variable, over the levels of an x-variable.
Seaborn doesn’t have this plot, but we can find one in the statsmodels package.

from statsmodels.graphics.mosaicplot import mosaic
mosaic(op_data, ['Sex', 'Status']);
pass
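
The numbers behind a mosaic plot are a cross-tabulation: the share of each x-level in the data, and the proportion of each y-level within each x-level. A small sketch with pandas on made-up records (the values are invented; only the column names mirror the outpatient data):

```python
import pandas as pd

# Made-up records (illustrative only).
toy = pd.DataFrame({
    "Sex":    ["F", "F", "F", "M", "M", "M", "M", "F"],
    "Status": ["Arrived", "Canceled", "Arrived", "Arrived",
               "No Show", "Arrived", "Canceled", "Arrived"],
})

counts = pd.crosstab(toy["Sex"], toy["Status"])  # Raw counts in each cell.

# Share of each Sex in the data (how much room each x-level gets).
x_share = counts.sum(axis=1) / counts.values.sum()

# Proportion of each Status within a Sex (the split inside each x-level).
y_given_x = pd.crosstab(toy["Sex"], toy["Status"], normalize="index")

print(counts)
print(x_share)
print(y_given_x)
```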

Plotting a categorical (y) against a numeric variable (x)

One approach to this is to “discretize” the numeric variable.
This means creating buckets from it.
Once the buckets are made, we can do a mosaic plot again.
Pandas has two functions for this. One is “cut” and the other “qcut”.
We will use qcut which will discretize the data into buckets with equal numbers of observations.
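
To see the difference between the two functions, here is a sketch on a small made-up series with a long right tail (the values are invented for illustration):

```python
import pandas as pd

lags = pd.Series([1, 2, 3, 4, 10, 20, 50, 100])  # Made-up, right-skewed values.

equal_width = pd.cut(lags, 4)   # 4 buckets of equal width.
equal_count = pd.qcut(lags, 4)  # 4 buckets with equal numbers of observations.

width_counts = equal_width.value_counts().sort_index()
count_counts = equal_count.value_counts().sort_index()

print(width_counts)  # Most observations pile up in the first bucket.
print(count_counts)  # Each bucket holds the same number of observations.
```

With skewed data such as the schedule lags, equal-width buckets can leave some levels nearly empty, which is one reason to prefer qcut here.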

A mosaic plot for the discretized schedule lag

We will do some fine tuning of the plot, with font size and color.

os.chdir('C:\\Users\\water\\Dropbox (Penn)\\Teaching\\7770s2023\\Notes\\images') # This is where the plot will be saved.

op_data['SL_8Cut'] = pd.qcut(op_data['SL'], 8) # Create 8 levels with equal numbers in each category.
fig, ax1 = plt.subplots(figsize=(18, 9)) # Controlling plot size using matplotlib. This returns a figure to plot on.
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.rcParams['font.size'] = 12  
plt.rcParams['text.color'] = 'black'  
mosaic(op_data, ['SL_8Cut', 'Status'],  ax=ax1); # Plots the mosaic plot on the axes "ax1".
plt.savefig("mosaic_{0}.png".format('OP'))  # savefig method for png format.
plt.close() # This will stop the plot being displayed here.
pass

The finished mosaic plot

The structure of your data

Seaborn assumes that your data will be “well” structured.
Well structured is sometimes called “tidy”, and means that there is one row for every observation and a column for each variable.
In practice, a lot of datasets do not follow this structure and you may have to manipulate the data (reshape) before the seaborn graphics will work as expected.
My suggestion is that you find a working example, look at how the data is structured, and then mimic that for your particular use case.
To read a bit more detail, here’s a link to a paper that lays out the ideas in detail: tidy data
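
As an illustration of the kind of reshaping that may be needed, the sketch below turns a made-up “wide” table into a tidy one with the pandas melt method (the departments, months and counts are invented for this example):

```python
import pandas as pd

# A made-up "wide" table: one row per department, one column per month.
wide = pd.DataFrame({
    "Dept": ["Cardiology", "Dermatology"],
    "Jan": [120, 80],
    "Feb": [130, 75],
})

# Reshape so that each row is one (Dept, Month) observation.
tidy = wide.melt(id_vars="Dept", var_name="Month", value_name="Visits")
print(tidy)
```

In the tidy version each variable (Dept, Month, Visits) has its own column, so each one can be mapped to a plot role such as x, y or hue.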

Summary

Graphics.
Why look at graphs?
Matplotlib and the seaborn libraries.
Univariate graphics.
Bivariate graphics.
The pairs plot.

Seaborn figure-level plots

displot: for the distribution of a variable.
catplot: for the relationship between a numerical variable and one or more categorical variables.
relplot: for the relationship between two variables.
jointplot: a single pair-wise relationship plus marginal distributions.
pairplot: for all pairwise relationships and marginal distributions.

Next time

Statistical analysis and models