# The inline flag will use the appropriate backend to make figures appear inline in the notebook.
%matplotlib inline
import pandas as pd
import numpy as np
# `plt` is an alias for the `matplotlib.pyplot` module
import matplotlib.pyplot as plt
# import seaborn library (wrapper of matplotlib)
import seaborn as sns
# Load car loan data into a pandas dataframe from a csv file
filename = 'table_i702t60.csv'
df = pd.read_csv(filename)
# Checking to make sure we don't have NaNs in our dataframe
# It is not easy to directly plot data that contains nans
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   month             60 non-null     int64
 1   starting_balance  60 non-null     float64
 2   interest_paid     60 non-null     float64
 3   principal_paid    60 non-null     float64
 4   new_balance       60 non-null     float64
 5   interest_rate     60 non-null     float64
 6   car_type          60 non-null     object
dtypes: float64(5), int64(1), object(1)
memory usage: 3.4+ KB
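`df.info()` reports non-null counts per column, which is an indirect way to spot missing values. A more direct check is to count them explicitly. The sketch below assumes a small stand-in DataFrame (`demo_df` and its values are made up for illustration and are not the loan table):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; not the car loan table used above.
demo_df = pd.DataFrame({
    "month": [1, 2, 3],
    "interest_paid": [200.0, np.nan, 190.0],
})

# Number of missing values in each column
nan_counts = demo_df.isna().sum()
print(nan_counts)

# True if any value anywhere in the DataFrame is missing
has_nans = bool(demo_df.isna().any().any())
print(has_nans)
```

On the loan table itself, the same two calls on `df` should report zero missing values, consistent with the 60 non-null entries shown by `df.info()`.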
# Converting these columns into numpy arrays to make them readily graphable
# .loc[:, 'column'] selects all rows of one column as a pandas Series
# .values turns that Series into a numpy array, which is assigned to the variable
# We are doing this for three columns separately
month_number = df.loc[:, 'month'].values
interest_paid = df.loc[:, 'interest_paid'].values
principal_paid = df.loc[:, 'principal_paid'].values
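As a side note, the pandas documentation recommends `.to_numpy()` over `.values` for this conversion. A minimal sketch, assuming a made-up `demo_df` in place of the loan table, showing that both give the same array:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the numbers are made up.
demo_df = pd.DataFrame({
    "month": [1, 2, 3],
    "principal_paid": [480.0, 485.0, 490.0],
})

# The approach used above: select the column, then take .values
via_values = demo_df.loc[:, "month"].values

# Equivalent conversion with .to_numpy()
via_to_numpy = demo_df["month"].to_numpy()

print(type(via_values), type(via_to_numpy))
print(np.array_equal(via_values, via_to_numpy))
```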
# Not the prettiest plot
# month number goes to x axis and interest paid to y
plt.plot(month_number, interest_paid)
[<matplotlib.lines.Line2D at 0x7fc8b18fbe20>]
# You can also plot another line on the same graph
plt.plot(month_number, interest_paid)
plt.plot(month_number, principal_paid)
[<matplotlib.lines.Line2D at 0x7fc8c30f6530>]
We will use plt.style.available
to see which aesthetic styles we can select for our figures.
The default style is not the most aesthetically pleasing.
[MATPLOTLIB STYLE SHEETS REFERENCE](https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html)
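`plt.style.use` changes the style for everything plotted afterwards. If you only want to try a style on one figure, `plt.style.context` applies it temporarily. A minimal sketch with made-up data (the `Agg` backend line is only there so the code runs without a display; leave it out in a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

# Names of the style sheets available in the installed Matplotlib version
print(sorted(plt.style.available))

before = plt.rcParams["axes.facecolor"]

# The style applies only inside the with-block
with plt.style.context("ggplot"):
    inside = plt.rcParams["axes.facecolor"]
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [3, 1, 2])  # made-up data

after = plt.rcParams["axes.facecolor"]
plt.close(fig)

print(before, inside, after)
```

After the block exits, the settings return to what they were, so later plots are unaffected.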
# Our style choices
# plt.style.available => shows the styles we can choose from
plt.style.use('classic')
# 'classic' is the pre-2.0 Matplotlib look,
# so compare this plot with the default-style plots above
plt.style.use('classic')
plt.plot(month_number, interest_paid)
plt.plot(month_number, principal_paid)
[<matplotlib.lines.Line2D at 0x7fc8c316aa40>]
plt.style.use('fivethirtyeight')
plt.style.use('fivethirtyeight')
plt.plot(month_number, interest_paid)
plt.plot(month_number, principal_paid)
[<matplotlib.lines.Line2D at 0x7fc8b1c02890>]
plt.style.use('ggplot')
plt.style.use('ggplot')
plt.plot(month_number, interest_paid)
plt.plot(month_number, principal_paid)
[<matplotlib.lines.Line2D at 0x7fc8b1c6df60>]
plt.style.use('tableau-colorblind10')
plt.style.use('tableau-colorblind10')
plt.plot(month_number, interest_paid)
plt.plot(month_number, principal_paid)
[<matplotlib.lines.Line2D at 0x7fc8c2e29030>]
plt.style.use('seaborn')
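# Note: Matplotlib 3.6+ names this style 'seaborn-v0_8'; the plain 'seaborn' name was removed in 3.8
plt.style.use('seaborn')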
plt.plot(month_number, interest_paid)
plt.plot(month_number, principal_paid)
[<matplotlib.lines.Line2D at 0x7fc8c2e77f40>]
Here are some common marker types.
string | description |
---|---|
'.' | point marker |
',' | pixel marker |
'o' | circle marker |
'v' | triangle_down marker |
'^' | triangle_up marker |
'<' | triangle_left marker |
'>' | triangle_right marker |
's' | square marker |
'*' | star marker |
'+' | plus marker |
'x' | x marker |
plt.style.use('seaborn')
plt.plot(month_number, interest_paid, marker = '.', markersize = 10)
plt.plot(month_number, principal_paid, marker = '.', markersize = 10)
[<matplotlib.lines.Line2D at 0x7fc8c2ee7430>]
The c parameter accepts color strings.
string | color |
---|---|
'b' | blue |
'blue' | blue |
'g' | green |
'green' | green |
'r' | red |
'red' | red |
'c' | cyan |
'cyan' | cyan |
'm' | magenta |
'magenta' | magenta |
'y' | yellow |
'yellow' | yellow |
'k' | black |
'black' | black |
'w' | white |
'white' | white |
The parameter also accepts hex strings. For instance, green is '#008000'. You can also use RGB tuples whose components range from 0 to 1.
plt.plot(month_number, interest_paid,c = 'k', marker = '.', markersize = 10)
plt.plot(month_number, principal_paid,c = 'b', marker = '.', markersize = 10)
[<matplotlib.lines.Line2D at 0x7fc8c330a980>]
# Using hex strings
# '#000000' is black
# '#0000FF' is blue
plt.plot(month_number, interest_paid,c = '#000000', marker = '.', markersize = 10)
plt.plot(month_number, principal_paid,c = '#0000FF', marker = '.', markersize = 10)
# Using rgb tuples
# (0, 0, 0) is black
# (0, 0, 1) is blue
# plt.plot(month_number, interest_paid,c = (0, 0, 0), marker = '.', markersize = 10)
# plt.plot(month_number, principal_paid,c = (0, 0, 1), marker = '.', markersize = 10)
[<matplotlib.lines.Line2D at 0x7fc8c3381570>]
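To check that a name, a hex string, and an RGB tuple refer to the same color, `matplotlib.colors` provides conversion helpers. A short sketch:

```python
from matplotlib import colors

# Convert a named color to its hex string
green_hex = colors.to_hex("green")
print(green_hex)

# Convert the single-letter and hex forms of blue to RGB tuples (components 0 to 1)
blue_from_letter = colors.to_rgb("b")
blue_from_hex = colors.to_rgb("#0000FF")
print(blue_from_letter, blue_from_hex)

# Compare two color specifications directly
black_matches = colors.same_color("k", "#000000")
print(black_matches)
```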
Matplotlib has two different types of syntax.
MATLAB-style
This is a scripted interface designed to feel like MATLAB. Matplotlib maintains a pointer to the current (active) figure and sends commands to it.
Object-oriented
This is more often used in situations where you want more control over your figure.
Important note: you can, and often will, create plots using a combination of MATLAB-style and object-oriented syntax.
plt.style.use('seaborn')
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
[<matplotlib.lines.Line2D at 0x7fc8b1fe05b0>]
# plt.subplots returns a tuple (figure, axes), which is unpacked here
# pass it the number of rows and columns of subplots in the figure
fig, axes = plt.subplots(nrows = 1, ncols = 1)
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
fig, axes = plt.subplots(nrows = 1, ncols = 1)
plt.plot(month_number, interest_paid, c= 'k')
axes.plot(month_number, principal_paid, c = 'b')
[<matplotlib.lines.Line2D at 0x7fc8b20929b0>]
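Mixing the two syntaxes works because `plt` functions act on the current axes, and right after `plt.subplots` the current axes is the one that was just returned. A minimal sketch with made-up data (the `Agg` backend line is only so the code runs without a display; omit it in a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

x = [1, 2, 3]  # made-up data
fig, ax = plt.subplots(nrows=1, ncols=1)

# MATLAB-style call: goes to the current axes
plt.plot(x, [3, 2, 1], c="k")
# Object-oriented call: goes to the axes object explicitly
ax.plot(x, [1, 2, 3], c="b")

# Both lines ended up on the same axes
current_is_ax = plt.gca() is ax
n_lines = len(ax.get_lines())
print(current_is_ax, n_lines)

plt.close(fig)
```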
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
[<matplotlib.lines.Line2D at 0x7fc8b2101a80>]
# This isn't the most practical use of changing xlim and ylim
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
plt.xlim(left=1,right=70)
plt.ylim(bottom=0,top=1000)
(0.0, 1000.0)
# Adding labels to the x and y axes
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
plt.xlabel('Month')
plt.ylabel('Dollars')
Text(0, 0.5, 'Dollars')
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
plt.xlabel('Month')
plt.ylabel('Dollars')
plt.title('Interest and Principal Paid Each Month')
Text(0.5, 1.0, 'Interest and Principal Paid Each Month')
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
plt.xlabel('Month', fontsize = 15)
plt.ylabel('Dollars', fontsize = 15)
plt.title('Interest and Principal Paid Each Month', fontsize = 15)
Text(0.5, 1.0, 'Interest and Principal Paid Each Month')
# Changing tick font size
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
plt.xlabel('Month', fontsize = 15)
plt.ylabel('Dollars', fontsize = 15)
plt.title('Interest and Principal Paid Each Month', fontsize = 15)
plt.xticks(fontsize = 15) # This is the size for the increments on the axis
plt.yticks(fontsize = 15)
(array([ 0., 100., 200., 300., 400., 500., 600., 700.]), [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
fig, axes = plt.subplots(nrows = 1, ncols = 1)
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
fig, axes = plt.subplots(nrows = 1, ncols = 1)
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
axes.set_xlim(left =1 , right = 70)
axes.set_ylim(bottom = 0, top = 1000)
(0.0, 1000.0)
fig, axes = plt.subplots(nrows = 1, ncols = 1)
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
axes.set_xlabel('Month')
axes.set_ylabel('Dollars')
Text(0, 0.5, 'Dollars')
fig, axes = plt.subplots(nrows = 1, ncols = 1);
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
axes.set_xlabel('Month');
axes.set_ylabel('Dollars');
axes.set_title('Interest and Principal Paid Each Month');
fig, axes = plt.subplots(nrows = 1, ncols = 1);
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
axes.set_xlabel('Month', fontsize = 22);
axes.set_ylabel('Dollars', fontsize = 22);
axes.set_title('Interest and Principal Paid Each Month', fontsize = 22);
# Changing tick font size
fig, axes = plt.subplots(nrows = 1, ncols = 1);
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
axes.set_xlabel('Month', fontsize = 22);
axes.set_ylabel('Dollars', fontsize = 22);
axes.set_title('Interest and Principal Paid Each Month', fontsize = 22);
axes.tick_params(axis = 'x', labelsize = 20)
axes.tick_params(axis = 'y', labelsize = 20)
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
[<matplotlib.lines.Line2D at 0x7fc8907a3790>]
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
# The first argument switches the grid on; both axes are gridded by default
plt.grid(True)
# only horizontal grid lines
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
plt.grid(axis='y')
# only vertical grid lines
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
plt.grid(axis = 'x')
# change color of grid lines, transparency, and linestyle
plt.plot(month_number, interest_paid, c= 'k')
plt.plot(month_number, principal_paid, c = 'b')
plt.grid(c = 'g',
         alpha = .9,
         linestyle = '-')
fig, axes = plt.subplots(nrows = 1, ncols = 1);
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
# The first argument switches the grid on; both axes are gridded by default
axes.grid(True)
# only horizontal grid lines
fig, axes = plt.subplots(nrows = 1, ncols = 1);
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
axes.grid(axis='y')
# only vertical grid lines
fig, axes = plt.subplots(nrows = 1, ncols = 1);
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
axes.grid(axis='x')
# change color of grid lines, transparency, and linestyle
fig, axes = plt.subplots(nrows = 1, ncols = 1);
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
axes.grid(c = 'g',
          alpha = .9,
          linestyle = '-')
# if you are finding setting grids to be tedious, use a style that has grids
plt.style.use('seaborn')
fig, axes = plt.subplots(nrows = 1, ncols = 1);
axes.plot(month_number, interest_paid, c= 'k');
axes.plot(month_number, principal_paid, c = 'b');
The loc (legend location) parameter accepts strings, ints, and tuples.
string | int |
---|---|
'best' | 0 |
'upper right' | 1 |
'upper left' | 2 |
'lower left' | 3 |
'lower right' | 4 |
'right' | 5 |
'center left' | 6 |
'center right' | 7 |
'lower center' | 8 |
'upper center' | 9 |
'center' | 10 |
The parameter also accepts a 2-element tuple (x, y) that gives the position of the lower-left corner of the legend in axes coordinates, where (0, 0) is the lower-left corner of the axes.
# Obviously the legend is not in an ideal location
plt.plot(month_number, interest_paid, c= 'k', label = 'Interest')
plt.plot(month_number, principal_paid, c = 'b', label = 'Principal')
plt.legend()
<matplotlib.legend.Legend at 0x7fc8b2933b80>
# At least the legend is not overlapping with the graph
plt.plot(month_number, interest_paid, c= 'k', label = 'Interest')
plt.plot(month_number, principal_paid, c = 'b', label = 'Principal')
plt.legend(loc="center right")
<matplotlib.legend.Legend at 0x7fc8b25b15d0>
# You can move the legend outside of the plotting area.
# At least the legend is not overlapping with the graph
plt.plot(month_number, interest_paid, c= 'k', label = 'Interest')
plt.plot(month_number, principal_paid, c = 'b', label = 'Principal')
plt.legend(loc=(1.02,0))
<matplotlib.legend.Legend at 0x7fc8907a3bb0>
# Obviously the legend is not in an ideal location
fig, axes = plt.subplots(nrows = 1, ncols = 1);
axes.plot(month_number, interest_paid, c= 'k', label = 'Interest');
axes.plot(month_number, principal_paid, c = 'b', label = 'Principal');
axes.legend()
<matplotlib.legend.Legend at 0x7fc8b264c9d0>
# At least the legend is not overlapping with the graph
fig, axes = plt.subplots(nrows = 1, ncols = 1);
axes.plot(month_number, interest_paid, c= 'k', label = 'Interest');
axes.plot(month_number, principal_paid, c = 'b', label = 'Principal');
axes.legend(loc="center right")
<matplotlib.legend.Legend at 0x7fc8c3e34d30>
# You can move the legend outside of the plotting area.
fig, axes = plt.subplots(nrows = 1, ncols = 1);
axes.plot(month_number, interest_paid, c= 'k', label = 'Interest');
axes.plot(month_number, principal_paid, c = 'b', label = 'Principal');
axes.legend(loc=(1.02,0))
<matplotlib.legend.Legend at 0x7fc8c3ba5930>
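A related option is `bbox_to_anchor`. When it is given, the string `loc` picks which corner of the legend is pinned to that point, which makes it easy to align the top of the legend with the top of the axes. A minimal sketch with made-up data (the `Agg` backend line is only so the code runs without a display; omit it in a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

x = [1, 2, 3]  # made-up data
fig, ax = plt.subplots(nrows=1, ncols=1)
ax.plot(x, [3, 2, 1], c="k", label="Interest")
ax.plot(x, [1, 2, 3], c="b", label="Principal")

# Pin the legend's upper-left corner just outside the top-right of the axes
legend = ax.legend(loc="upper left", bbox_to_anchor=(1.02, 1), borderaxespad=0)

legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)

plt.close(fig)
```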
Saving your visualizations outside your Jupyter notebook is important because it lets you share them with others. Equally important is checking the saved file, since the graph may not look the same in the image file as it does in the notebook.
plt.style.use('seaborn')
# an image may look good in the notebook, but it may not save that way
plt.figure(figsize=(10, 5))
plt.plot(month_number, principal_paid, c = 'b', label = 'Principal')
plt.plot(month_number, interest_paid, c= 'k', label = 'Interest')
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)
plt.xlim(left =1 , right = 61)
plt.ylim(bottom = 0, top = 700)
plt.xlabel('Month', fontsize = 22);
plt.ylabel('Dollars', fontsize = 22);
plt.title('Interest and Principal Paid Each Month', fontsize = 24)
plt.legend(loc=(1.02,0), borderaxespad=0, fontsize = 20)
plt.savefig('mslegendcutoff.png', dpi = 300)
# tight_layout()
# automatically adjusts subplot params so that the subplot(s) fits in to the figure area
plt.figure(figsize=(10, 5))
plt.plot(month_number, principal_paid, c = 'b', label = 'Principal')
plt.plot(month_number, interest_paid, c= 'k', label = 'Interest')
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)
plt.xlim(left =1 , right = 61)
plt.ylim(bottom = 0, top = 700)
plt.xlabel('Month', fontsize = 22);
plt.ylabel('Dollars', fontsize = 22);
plt.title('Interest and Principal Paid Each Month', fontsize = 24)
plt.legend(loc=(1.02,0), borderaxespad=0, fontsize = 20)
plt.tight_layout()
plt.savefig('mslegend.png', dpi = 300)
# an image may look good in the notebook, but it may not save that way
fig, axes = plt.subplots(nrows = 1, ncols = 1, figsize=(10, 5) )
axes.plot(month_number, principal_paid, c = 'b', label = 'Principal')
axes.plot(month_number, interest_paid, c= 'k', label = 'Interest')
axes.tick_params(axis = 'x', labelsize = 20)
axes.tick_params(axis = 'y', labelsize = 20)
axes.set_xlim(left =1 , right = 61)
axes.set_ylim(bottom = 0, top = 700)
axes.set_xlabel('Month', fontsize = 22);
axes.set_ylabel('Dollars', fontsize = 22);
axes.set_title('Interest and Principal Paid Each Month', fontsize = 24)
axes.legend(loc=(1.02,0), borderaxespad=0, fontsize = 20)
fig.savefig('objectlegendcutoff.png', dpi = 300)
# tight_layout()
# automatically adjusts subplot params so that the subplot(s) fits in to the figure area
fig, axes = plt.subplots(nrows = 1, ncols = 1, figsize=(10, 5) )
axes.plot(month_number, principal_paid, c = 'b', label = 'Principal')
axes.plot(month_number, interest_paid, c= 'k', label = 'Interest')
axes.tick_params(axis = 'x', labelsize = 20)
axes.tick_params(axis = 'y', labelsize = 20)
axes.set_xlim(left =1 , right = 61)
axes.set_ylim(bottom = 0, top = 700)
axes.set_xlabel('Month', fontsize = 22);
axes.set_ylabel('Dollars', fontsize = 22);
axes.set_title('Interest and Principal Paid Each Month', fontsize = 24)
axes.legend(loc=(1.02,0), borderaxespad=0, fontsize = 20)
fig.tight_layout()
fig.savefig('objectlegend.png', dpi = 300)
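Another way to avoid a cut-off legend is to pass `bbox_inches="tight"` to `savefig`, which expands or shrinks the saved area to fit everything drawn on the figure. The sketch below uses made-up data and writes to an in-memory buffer instead of a file so it leaves nothing on disk (the `Agg` backend line is only so the code runs without a display; omit it in a notebook):

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

x = [1, 2, 3]  # made-up data
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 5))
ax.plot(x, [1, 2, 3], c="b", label="Principal")
ax.plot(x, [3, 2, 1], c="k", label="Interest")
ax.legend(loc=(1.02, 0), borderaxespad=0, fontsize=20)

# Save to an in-memory buffer; pass a filename instead to write an image file
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100, bbox_inches="tight")
png_bytes = buffer.getvalue()
plt.close(fig)

print(len(png_bytes), "bytes written")
```

As with `tight_layout()`, it is still worth opening the saved image to confirm the legend is fully visible.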
Matplotlib is a very popular visualization library, but it definitely has flaws.
In this video, we are going to make a more complicated visualization called a boxplot to show how helpful it is to work with the matplotlib wrappers pandas and seaborn.
A boxplot is a standardized way of displaying the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you whether your data is symmetrical, how tightly it is grouped, and whether and how it is skewed. If you want to learn more about how boxplots work, you can learn more here.
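To make the five-number summary concrete, the sketch below computes it with NumPy for a small made-up array. It assumes the common convention, which is also Matplotlib's default (`whis=1.5`), that the “minimum” and “maximum” whiskers extend to the most extreme points within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an outlier:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # made-up values

# Quartiles and interquartile range
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(whisker_low, q1, median, q3, whisker_high)  # 1 3.25 5.5 7.75 9
print(outliers)                                   # [100]
```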
The data used to demonstrate boxplots is the Breast Cancer Wisconsin (Diagnostic) Data Set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). The goal of the visualization is to show how the distribution of the column area_mean differs between benign and malignant diagnoses.
# Load wisconsin breast cancer dataset
# either benign or malignant
cancer_df = pd.read_csv('wisconsinBreastCancer.csv')
cancer_df.head()
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
# Looking at the Distribution of the Dataset in terms of Diagnosis
cancer_df['diagnosis'].value_counts(dropna = False)
B    357
M    212
Name: diagnosis, dtype: int64
# Creating filters for the data to compare and plotting based on those filters
# Use .values to turn into a NumPy array
malignant = cancer_df.loc[cancer_df['diagnosis']=='M','area_mean'].values
benign = cancer_df.loc[cancer_df['diagnosis']=='B','area_mean'].values
plt.boxplot([malignant,benign], labels=['M', 'B']);
Pandas can be used as a wrapper around Matplotlib. One reason why you might want to plot using Pandas is that it requires less code.
We are going to create a boxplot to show how much less syntax you need to create the plot with pandas vs pure matplotlib.
# Boxplot of area_mean grouped by diagnosis
cancer_df.boxplot(column = 'area_mean', by = 'diagnosis');
Sometimes you will find it useful to use Matplotlib syntax to adjust the final plot output. The code below removes the suptitle and title using pure matplotlib syntax.
# Same plot but without the suptitle and the area_mean title
cancer_df.boxplot(column = 'area_mean', by = 'diagnosis');
plt.title('');
plt.suptitle('');
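The same adjustment can be made with object-oriented syntax by creating the axes first and passing it to pandas through the `ax` parameter. This is a sketch under the assumption of a tiny made-up DataFrame (`demo_df` is hypothetical and stands in for `cancer_df`; the `Agg` backend line is only so the code runs without a display, so omit it in a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data; the values are made up.
demo_df = pd.DataFrame({
    "diagnosis": ["M", "M", "M", "B", "B", "B"],
    "area_mean": [1000.0, 1300.0, 1200.0, 400.0, 500.0, 450.0],
})

fig, ax = plt.subplots(nrows=1, ncols=1)

# pandas draws the grouped boxplot on the axes we pass in
demo_df.boxplot(column="area_mean", by="diagnosis", ax=ax)

# Clear the titles pandas added, then label the y axis
ax.set_title("")
fig.suptitle("")
ax.set_ylabel("area_mean")

title_after = ax.get_title()
ylabel_after = ax.get_ylabel()
print(repr(title_after), ylabel_after)

plt.close(fig)
```

Keeping a handle on `fig` and `axes` this way makes further tweaks (limits, tick sizes, saving) work exactly as in the earlier object-oriented examples.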
Seaborn can be seen as a wrapper on top of Matplotlib. Seaborn's website lists a bunch of advantages of using Seaborn including
import seaborn as sns
sns.boxplot(x='diagnosis', y='area_mean', data=cancer_df)
<AxesSubplot:xlabel='diagnosis', ylabel='area_mean'>