Lesson 5 Visualization Basics
In this lesson, you will use the wrangled dataset from Lesson 04 to explore feature distributions and relationships.
Visualization is a core data-science skill because well-designed plots help reveal clusters, trends, and outliers.
5.1 Lesson Overview
By the end of this lesson, you will be able to:
- Create univariate plots (histograms, boxplots)
- Create scatterplots to inspect relationships between features
- Use categorical coloring for pattern detection
- Generate pairwise plots using Seaborn
- Save figures for reports or presentations using the CDI pipeline
5.2 Notebook Setup
We use CDI publishing helpers so that figures are:
- saved incrementally to figures/
- embedded into the notebook output as PNGs
- safe for the pipeline: ipynb → md → Rmd → GitBook
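The code cells in this lesson call show_and_save_mpl(), but its definition is not shown here. To give a rough idea of what such a helper might do, here is a minimal sketch. It is an assumption, not the actual CDI implementation: FIG_DIR, LESSON_PREFIX, the counter, and the out_dir and dpi parameters are hypothetical names, and the sketch only saves and closes the figure (embedding it in the notebook output is left out).

```python
from itertools import count
from pathlib import Path

import matplotlib.pyplot as plt

FIG_DIR = Path("figures")  # assumed output folder, matching the paths printed in this lesson
LESSON_PREFIX = "05"       # assumed file-name prefix for this lesson
_fig_numbers = count(1)    # hypothetical running figure number: 1, 2, 3, ...


def show_and_save_mpl(fig=None, out_dir=FIG_DIR, dpi=150):
    """Hypothetical stand-in for the CDI helper: save a figure as the next numbered PNG."""
    if fig is None:
        fig = plt.gcf()  # fall back to the current figure when none is passed
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{LESSON_PREFIX}_{next(_fig_numbers):03d}.png"
    fig.savefig(out_path, dpi=dpi, bbox_inches="tight")
    print(f"Saved PNG → {out_path.as_posix()}")
    plt.close(fig)  # release the figure once the file is written
    return out_path
```

Numbering the files from a single counter keeps the figures in the order they appear in the notebook, which is one plausible reason for the 05_001.png, 05_002.png naming seen in the outputs.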
5.3 Load the Wrangled Dataset
We load the dataset saved in Lesson 04.
import pandas as pd
df = pd.read_csv("data/iris_wrangled.csv")
print(df.head())
print(df.shape)
print(df.columns.tolist())

   sepal_length  sepal_width  petal_length  petal_width species  sepal_area  \
0           5.1          3.5           1.4          0.2  setosa       17.85
1           4.9          3.0           1.4          0.2  setosa       14.70
2           4.7          3.2           1.3          0.2  setosa       15.04
3           4.6          3.1           1.5          0.2  setosa       14.26
4           5.0          3.6           1.4          0.2  setosa       18.00

   petal_ratio petal_size
0          7.0      small
1          7.0      small
2          6.5      small
3          7.5      small
4          7.0      small
(149, 8)
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species', 'sepal_area', 'petal_ratio', 'petal_size']
5.4 Univariate Distributions
Univariate plots help you understand the distribution of a single feature.
import matplotlib.pyplot as plt
df.hist(figsize=(10, 7), bins=12)
plt.suptitle("Iris — Feature Distributions", y=1.02)
plt.tight_layout()
show_and_save_mpl()

Saved PNG → figures/05_001.png
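To make the effect of the bins argument concrete, the sketch below counts values per bin with numpy.histogram, which splits the data range into equal-width intervals in the same way a default Matplotlib histogram does. The five values are the sepal_length entries from the df.head() output above, and bins=2 is chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# sepal_length values from the first five rows shown earlier
sepal_length_head = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

# Two equal-width bins spanning the range 4.6 to 5.1
counts, edges = np.histogram(sepal_length_head, bins=2)

print(edges)   # bin boundaries: minimum, midpoint, maximum
print(counts)  # two values fall below the midpoint, three at or above it
```

With bins=12 on the full dataset the idea is the same, only with narrower intervals, so the bar heights in the figure are counts of rows per interval.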

5.5 Boxplots by Category
Boxplots make it easier to compare distributions across groups.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 4))
# Build grouped arrays for a clean Matplotlib boxplot
groups = []
labels = []
for sp, subset in df.groupby("species"):
    groups.append(subset["petal_length"].values)
    labels.append(sp)
# Use tick_labels (Matplotlib 3.9+) to avoid deprecation warnings
ax.boxplot(groups, tick_labels=labels, showfliers=True)
ax.set_title("Petal Length by Species")
ax.set_xlabel("species")
ax.set_ylabel("petal_length")
ax.grid(True, axis="y", alpha=0.2)
show_and_save_mpl(fig)

Saved PNG → figures/05_002.png
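It can help to see the numbers a boxplot is built from. The sketch below computes them by hand, assuming Matplotlib's default settings (quartiles by linear interpolation, whiskers reaching to the most extreme points within 1.5 × IQR of the box). The function name box_stats is hypothetical, and the sample values are the sepal_area entries from the df.head() output above.

```python
import numpy as np


def box_stats(values, whis=1.5):
    """Return the quartiles, whisker ends, and fliers a default boxplot would show."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    fliers = values[(values < low_fence) | (values > high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),    # most extreme point still inside the fence
        "whisker_high": inside.max(),
        "fliers": fliers,               # points drawn individually when showfliers=True
    }


stats = box_stats([17.85, 14.70, 15.04, 14.26, 18.00])
print(stats)
```

Applied per species, for example to each array collected in groups, this gives the values behind each box in the figure.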

5.6 Bivariate Scatterplots
Scatterplots help you inspect relationships between two numeric features.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(df["sepal_length"], df["petal_length"], alpha=0.8)
ax.set_title("Sepal Length vs Petal Length")
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.grid(True, alpha=0.2)
show_and_save_mpl(fig)

Saved PNG → figures/05_003.png

5.6.1 Scatterplot Colored by Species
Coloring by category often reveals clusters.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 4))
for sp, subset in df.groupby("species"):
    ax.scatter(subset["sepal_length"], subset["petal_length"], label=sp, alpha=0.8)
ax.set_title("Sepal Length vs Petal Length (Colored by Species)")
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.legend(title="species")
ax.grid(True, alpha=0.2)
show_and_save_mpl(fig)

Saved PNG → figures/05_004.png
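The loop above lets Matplotlib pick the next color from its default cycle for each group. If the same species should keep the same color across several figures, one option is to fix the mapping up front. The following is a minimal sketch of that idea; the helper name category_colors is hypothetical and not part of the lesson's tooling.

```python
import matplotlib.pyplot as plt


def category_colors(categories):
    """Map each distinct category to one color from Matplotlib's default color cycle."""
    cycle = plt.rcParams["axes.prop_cycle"].by_key()["color"]
    ordered = sorted(set(categories))  # sorted, like the group order from df.groupby
    return {cat: cycle[i % len(cycle)] for i, cat in enumerate(ordered)}


# Repeated labels are fine; each species gets exactly one color
species_colors = category_colors(["setosa", "versicolor", "virginica", "setosa"])
print(species_colors)
```

Inside the plotting loop, the mapping would be used by passing color=species_colors[sp] to ax.scatter, so that a species keeps its color even when a plot shows only a subset of the groups.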

5.7 Pairwise Relationships (Optional)
A pairplot is a fast way to view many relationships at once.
import seaborn as sns
g = sns.pairplot(
    df[["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]],
    hue="species",
    corner=True,
    plot_kws={"alpha": 0.7},
)
g.fig.suptitle("Iris — Pairplot by Species", y=1.02)
show_and_save_mpl(g.fig)

Saved PNG → figures/05_005.png

5.8 Customizing Visual Styles (Optional)
Matplotlib offers built-in styles. In this course, we keep styling consistent, so use styles carefully and reset after experimentation.
Below is an optional demo using a style context, which automatically resets after the block.
import matplotlib.pyplot as plt
# Optional style demo (auto-resets after the context)
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.hist(df["sepal_length"], bins=12)
    ax.set_title("Histogram — sepal_length (ggplot style)")
    ax.set_xlabel("sepal_length")
    ax.set_ylabel("count")
    ax.grid(True, alpha=0.2)
    show_and_save_mpl(fig)

Saved PNG → figures/05_006.png
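To check that the style context really does reset, you can compare a setting before, inside, and after the block. The short sketch below uses axes.facecolor as the example setting; any rcParams key that the ggplot style changes would work equally well.

```python
import matplotlib.pyplot as plt

before = plt.rcParams["axes.facecolor"]

with plt.style.context("ggplot"):
    inside = plt.rcParams["axes.facecolor"]  # value while the style is active

after = plt.rcParams["axes.facecolor"]       # restored once the block ends

print("before:", before)
print("inside:", inside)
print("after: ", after)
print("ggplot available:", "ggplot" in plt.style.available)
```

The list plt.style.available shows which style names can be passed to plt.style.context on your installation.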

5.9 Exercise
- Create a boxplot of sepal_width by species
- Create a scatter plot of petal_width vs petal_length, colored by petal_size (if available)
- Save both plots using show_and_save_mpl()
import matplotlib.pyplot as plt
# 1) Boxplot: sepal_width by species
fig, ax = plt.subplots(figsize=(7, 4))
groups = []
labels = []
for sp, subset in df.groupby("species"):
    groups.append(subset["sepal_width"].values)
    labels.append(sp)
ax.boxplot(groups, tick_labels=labels, showfliers=True)
ax.set_title("Sepal Width by Species")
ax.set_xlabel("species")
ax.set_ylabel("sepal_width")
ax.grid(True, axis="y", alpha=0.2)
show_and_save_mpl(fig)
# 2) Scatter: petal_width vs petal_length, colored by petal_size (if present)
if "petal_size" in df.columns:
    fig, ax = plt.subplots(figsize=(7, 4))
    for size, subset in df.groupby("petal_size"):
        ax.scatter(subset["petal_width"], subset["petal_length"], label=size, alpha=0.8)
    ax.set_title("Petal Width vs Petal Length (Colored by petal_size)")
    ax.set_xlabel("petal_width")
    ax.set_ylabel("petal_length")
    ax.legend(title="petal_size")
    ax.grid(True, alpha=0.2)
    show_and_save_mpl(fig)
else:
    print("petal_size column not found. Create it in Lesson 04 or skip this exercise.")

Saved PNG → figures/05_007.png
Saved PNG → figures/05_008.png

5.10 Summary
- You used histograms and boxplots to inspect feature distributions
- You used scatterplots to explore relationships between features
- You used categorical coloring to reveal clusters
- You created a Seaborn pairplot for a high-level relationship scan
- You saved figures using the CDI publishing pipeline
Continue to Lesson 06 — Summary Statistics and Insights.