Lesson 05: Visualization Basics

In this lesson, you will use the wrangled dataset from Lesson 04 to explore feature distributions and relationships.

Visualization is a core data-science skill because well-designed plots help reveal clusters, trends, and outliers.

5.1 Lesson Overview

By the end of this lesson, you will be able to:

Create univariate plots (histograms, boxplots)
Create scatterplots to inspect relationships between features
Use categorical coloring for pattern detection
Generate pairwise plots using Seaborn
Save figures for reports or presentations using the CDI pipeline

5.2 Notebook Setup

We use CDI publishing helpers so that figures are:

saved incrementally to figures/
embedded into the notebook output as PNGs
safe for the ipynb → md → Rmd → GitBook publishing pipeline

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

# Lesson ID drives figure naming (e.g., figures/05_001.png)
_ = cdi_notebook_init(chapter="05", title_x=0)
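
The figure names shown in this lesson (figures/05_001.png, figures/05_002.png, and so on) suggest a chapter prefix followed by a zero-padded counter. The sketch below reproduces that naming pattern in plain Python. Note that next_figure_path is a hypothetical helper written for illustration; it is not part of cdi_viz, and the real helpers may work differently.

```python
from pathlib import Path


def next_figure_path(chapter: str, counter: int, out_dir: str = "figures") -> Path:
    """Hypothetical helper: build a chapter-prefixed, zero-padded PNG path."""
    return Path(out_dir) / f"{chapter}_{counter:03d}.png"


# Assumed pattern: the counter increases by one for each saved figure.
for n in range(1, 4):
    print(next_figure_path("05", n).as_posix())
```

Zero-padding keeps the files in plotting order when a directory listing sorts them by name.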

5.3 Load the Wrangled Dataset

Load the dataset saved in Lesson 04, then check its first rows, shape, and column names.

import pandas as pd

df = pd.read_csv("data/iris_wrangled.csv")

print(df.head())
print(df.shape)
print(df.columns.tolist())

   sepal_length  sepal_width  petal_length  petal_width species  sepal_area  \
0           5.1          3.5           1.4          0.2  setosa       17.85   
1           4.9          3.0           1.4          0.2  setosa       14.70   
2           4.7          3.2           1.3          0.2  setosa       15.04   
3           4.6          3.1           1.5          0.2  setosa       14.26   
4           5.0          3.6           1.4          0.2  setosa       18.00   

   petal_ratio petal_size  
0          7.0      small  
1          7.0      small  
2          6.5      small  
3          7.5      small  
4          7.0      small  
(149, 8)
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species', 'sepal_area', 'petal_ratio', 'petal_size']
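
Before plotting, it can help to confirm that the columns this lesson relies on are actually present. The check below is a minimal sketch in plain Python using the column list printed above; in the notebook you would likely pass df.columns instead of the hard-coded list. The choice of required columns is an assumption based on the plots in this lesson.

```python
# Column names as printed by df.columns.tolist() above.
columns = [
    "sepal_length", "sepal_width", "petal_length", "petal_width",
    "species", "sepal_area", "petal_ratio", "petal_size",
]

# Columns the plots in this lesson use (assumed from the code cells below).
required = {"sepal_length", "sepal_width", "petal_length", "petal_width", "species"}

missing = sorted(required - set(columns))
if missing:
    raise ValueError(f"Missing columns: {missing}")
print("All required columns are present.")
```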

5.4 Univariate Distributions

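Univariate plots show the distribution of one feature at a time. Calling df.hist() draws one histogram per numeric column, so the text columns species and petal_size are skipped.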

import matplotlib.pyplot as plt

df.hist(figsize=(10, 7), bins=12)
plt.suptitle("Iris — Feature Distributions", y=1.02)
plt.tight_layout()

show_and_save_mpl()

Saved PNG → figures/05_001.png
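
To make the bins argument concrete, the sketch below counts values in equal-width bins by hand. It uses only the five sepal_length values from the preview in 5.3 and four bins, so the numbers are small enough to check by eye. The function equal_width_counts is a hypothetical name, and the logic is a simplified picture of equal-width binning rather than the code Matplotlib or pandas actually runs.

```python
def equal_width_counts(values, bins):
    """Split [min, max] into equal-width bins and count the values in each."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the bin width would be zero")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, which is closed on the right.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# sepal_length values from the first five rows shown in 5.3
sample = [5.1, 4.9, 4.7, 4.6, 5.0]
edges, counts = equal_width_counts(sample, bins=4)
print(edges)
print(counts)
```

More bins show finer detail but leave more bins empty on small samples, which is why the bin count is worth experimenting with.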

5.5 Boxplots by Category

Boxplots summarize each group by its median, quartiles, and whiskers, which makes it easy to compare distributions across groups side by side.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 4))

# Build grouped arrays for a clean Matplotlib boxplot
groups = []
labels = []
for sp, subset in df.groupby("species"):
    groups.append(subset["petal_length"].values)
    labels.append(sp)

# tick_labels requires Matplotlib 3.9+; on older versions, pass labels=labels instead
ax.boxplot(groups, tick_labels=labels, showfliers=True)
ax.set_title("Petal Length by Species")
ax.set_xlabel("species")
ax.set_ylabel("petal_length")
ax.grid(True, axis="y", alpha=0.2)

show_and_save_mpl(fig)

Saved PNG → figures/05_002.png
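
The sketch below computes the numbers a boxplot draws: the quartiles, the whisker ends, and the fliers. It assumes the common 1.5 × IQR rule and linearly interpolated quartiles, which to my understanding matches Matplotlib's defaults. The function box_stats is a hypothetical name, and the sample is the five petal_length values from the preview in 5.3 plus one made-up value (1.9) added to trigger a flier.

```python
import statistics


def box_stats(values, whis=1.5):
    """Quartiles, whisker ends, and fliers under the whis * IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers stop at the most extreme data points still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    fliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "fliers": fliers,
    }


# Five preview values plus one made-up value (1.9) for illustration
sample = [1.4, 1.4, 1.3, 1.5, 1.4, 1.9]
stats = box_stats(sample)
for name, value in stats.items():
    print(name, value)
```

Points listed under fliers are the ones drawn as individual markers when showfliers=True.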

5.6 Bivariate Scatterplots

Scatterplots help you inspect relationships between two numeric features.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(df["sepal_length"], df["petal_length"], alpha=0.8)
ax.set_title("Sepal Length vs Petal Length")
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.grid(True, alpha=0.2)

show_and_save_mpl(fig)

Saved PNG → figures/05_003.png

5.6.1 Scatterplot Colored by Species

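Coloring points by category often reveals clusters that a single-color scatterplot hides. Drawing each species with its own ax.scatter call also gives each group its own legend entry.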

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 4))

for sp, subset in df.groupby("species"):
    ax.scatter(subset["sepal_length"], subset["petal_length"], label=sp, alpha=0.8)

ax.set_title("Sepal Length vs Petal Length (Colored by Species)")
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.legend(title="species")
ax.grid(True, alpha=0.2)

show_and_save_mpl(fig)

Saved PNG → figures/05_004.png
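
The loop above never names a color: each ax.scatter call without an explicit color takes the next one from Matplotlib's default cycle. The sketch below mimics that assignment in plain Python. The rows are a few illustrative values rather than the full dataset, and the mapping is a simplified picture that assumes groups are visited in sorted order, as pandas groupby does by default.

```python
from collections import defaultdict

# Illustrative (species, sepal_length, petal_length) rows, not the full dataset.
rows = [
    ("virginica", 6.3, 6.0),
    ("setosa", 5.1, 1.4),
    ("versicolor", 7.0, 4.7),
    ("setosa", 4.9, 1.4),
]

# Collect the points for each species, like df.groupby("species").
grouped = defaultdict(list)
for species, x, y in rows:
    grouped[species].append((x, y))

# "C0", "C1", ... are Matplotlib's shorthand names for the default cycle colors.
color_for = {species: f"C{i}" for i, species in enumerate(sorted(grouped))}
print(color_for)
```

If you need stable colors across figures, passing an explicit color per group (for example color=color_for[sp]) avoids depending on the call order.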

5.7 Pairwise Relationships (Optional)

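A pairplot draws a scatterplot for every pair of numeric features, plus each feature's own distribution on the diagonal, so you can scan many relationships at once. Setting corner=True keeps only the lower triangle and drops the mirrored duplicates.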

import seaborn as sns

g = sns.pairplot(
    df[["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]],
    hue="species",
    corner=True,
    plot_kws={"alpha": 0.7},
)
g.fig.suptitle("Iris — Pairplot by Species", y=1.02)

show_and_save_mpl(g.fig)

Saved PNG → figures/05_005.png
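
As a quick check on how much corner=True trims, the sketch below counts the panels for the four numeric features used above. It is plain arithmetic on the feature list and makes no Seaborn calls.

```python
from itertools import combinations

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

pairs = list(combinations(features, 2))  # unordered pairs: lower-triangle panels
diagonal = len(features)                 # one distribution panel per feature

corner_panels = len(pairs) + diagonal
full_panels = len(features) ** 2
print(corner_panels, full_panels)
```

The panel count grows quadratically with the number of features, which is why pairplots work best on a handful of columns.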

If Seaborn is missing on your system, install it from a notebook cell. The %pip magic installs into the environment of the running kernel, which !pip does not guarantee:

%pip install seaborn

5.8 Customizing Visual Styles (Optional)

Matplotlib ships with built-in style sheets. This course keeps styling consistent across lessons, so apply other styles sparingly and restore the defaults after experimenting.

The optional demo below uses a style context, which restores the previous settings automatically when the with block ends.

import matplotlib.pyplot as plt

# Optional style demo (auto-resets after the context)
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.hist(df["sepal_length"], bins=12)
    ax.set_title("Histogram — sepal_length (ggplot style)")
    ax.set_xlabel("sepal_length")
    ax.set_ylabel("count")
    ax.grid(True, alpha=0.2)

    show_and_save_mpl(fig)

Saved PNG → figures/05_006.png
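
The reset behavior of a style context follows the usual context-manager pattern: save the current settings, apply the overrides, and restore the saved values on exit. The sketch below shows that pattern with a plain dictionary standing in for Matplotlib's settings. The names settings and temporary_settings and the values are all illustrative assumptions, not Matplotlib's actual implementation.

```python
from contextlib import contextmanager

# Stand-in for a global settings store (toy values for illustration).
settings = {"axes.grid": False, "axes.facecolor": "white"}


@contextmanager
def temporary_settings(overrides):
    """Apply overrides, then restore the previous values on exit."""
    saved = dict(settings)
    settings.update(overrides)
    try:
        yield settings
    finally:
        # Runs even if the block raises, so the overrides never leak out.
        settings.clear()
        settings.update(saved)


with temporary_settings({"axes.grid": True, "axes.facecolor": "#E5E5E5"}):
    inside = dict(settings)

after = dict(settings)
print(inside)
print(after)
```

This is why the demo above creates the figure inside the with block: settings changed by the context apply only to work done before the block exits.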

5.9 Exercise

Create a boxplot of sepal_width by species
Create a scatter plot of petal_width vs petal_length, colored by petal_size (if available)
Save both plots using show_and_save_mpl()

import matplotlib.pyplot as plt

# 1) Boxplot: sepal_width by species
fig, ax = plt.subplots(figsize=(7, 4))
groups = []
labels = []
for sp, subset in df.groupby("species"):
    groups.append(subset["sepal_width"].values)
    labels.append(sp)

ax.boxplot(groups, tick_labels=labels, showfliers=True)
ax.set_title("Sepal Width by Species")
ax.set_xlabel("species")
ax.set_ylabel("sepal_width")
ax.grid(True, axis="y", alpha=0.2)

show_and_save_mpl(fig)

# 2) Scatter: petal_width vs petal_length, colored by petal_size (if present)
if "petal_size" in df.columns:
    fig, ax = plt.subplots(figsize=(7, 4))
    for size, subset in df.groupby("petal_size"):
        ax.scatter(subset["petal_width"], subset["petal_length"], label=size, alpha=0.8)
    ax.set_title("Petal Width vs Petal Length (Colored by petal_size)")
    ax.set_xlabel("petal_width")
    ax.set_ylabel("petal_length")
    ax.legend(title="petal_size")
    ax.grid(True, alpha=0.2)
    show_and_save_mpl(fig)
else:
    print("petal_size column not found. Create it in Lesson 04 or skip this exercise.")

Saved PNG → figures/05_007.png

Saved PNG → figures/05_008.png

5.10 Summary

You used histograms and boxplots to inspect feature distributions
You used scatterplots to explore relationships between features
You used categorical coloring to reveal clusters
You created a Seaborn pairplot for a high-level relationship scan
You saved figures using the CDI publishing pipeline

Continue to Lesson 06 — Summary Statistics and Insights.