Lesson 5 Visualization Basics

In this lesson, you will use the wrangled dataset from Lesson 04 to explore feature distributions and relationships.

Visualization is a core data-science skill because well-designed plots help reveal clusters, trends, and outliers.

5.1 Lesson Overview

By the end of this lesson, you will be able to:

  • Create univariate plots (histograms, boxplots)
  • Create scatterplots to inspect relationships between features
  • Use categorical coloring for pattern detection
  • Generate pairwise plots using Seaborn
  • Save figures for reports or presentations using the CDI pipeline

5.2 Notebook Setup

We use CDI publishing helpers so that figures are:

  • saved incrementally to figures/
  • embedded into the notebook output as PNGs
  • compatible with the publishing pipeline (ipynb → md → Rmd → GitBook)

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

# Lesson ID drives figure naming (e.g., figures/05_001.png)
_ = cdi_notebook_init(chapter="05", title_x=0)
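
The cdi_viz helpers are specific to this course, and their internals are not shown in this lesson. For readers following along without that package, here is a minimal sketch of how incremental figure naming could work. The function name save_numbered_figure, its parameters, and the dpi value are hypothetical and not part of cdi_viz; the only thing taken from the lesson is the naming pattern figures/05_001.png.

```python
from pathlib import Path

from matplotlib.figure import Figure


def save_numbered_figure(fig: Figure, chapter: str = "05", out_dir: str = "figures") -> Path:
    """Hypothetical stand-in: save fig as <out_dir>/<chapter>_<NNN>.png."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Next index = number of PNGs already saved for this chapter, plus one.
    index = len(list(folder.glob(f"{chapter}_*.png"))) + 1
    path = folder / f"{chapter}_{index:03d}.png"
    fig.savefig(path, dpi=150, bbox_inches="tight")
    print(f"Saved PNG to {path}")
    return path
```

This sketch only writes the file. It does not display the figure or embed it in the notebook output, and counting existing files is a simplification that can reuse a number if earlier figures are deleted.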

5.3 Load the Wrangled Dataset

We load the dataset saved in Lesson 04.

import pandas as pd

df = pd.read_csv("data/iris_wrangled.csv")

print(df.head())
print(df.shape)
print(df.columns.tolist())
   sepal_length  sepal_width  petal_length  petal_width species  sepal_area  \
0           5.1          3.5           1.4          0.2  setosa       17.85   
1           4.9          3.0           1.4          0.2  setosa       14.70   
2           4.7          3.2           1.3          0.2  setosa       15.04   
3           4.6          3.1           1.5          0.2  setosa       14.26   
4           5.0          3.6           1.4          0.2  setosa       18.00   

   petal_ratio petal_size  
0          7.0      small  
1          7.0      small  
2          6.5      small  
3          7.5      small  
4          7.0      small  
(149, 8)
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species', 'sepal_area', 'petal_ratio', 'petal_size']
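
Before plotting, it can help to confirm that the loaded table has the columns this lesson relies on and contains no missing values. The sketch below is one way to do that. The helper name check_wrangled is an assumption made for illustration, and the sample data is just the five rows printed above, used in place of the CSV file so the block runs on its own.

```python
import io

import pandas as pd

# First five rows copied from the df.head() output above (stand-in for the CSV file).
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species,sepal_area,petal_ratio,petal_size
5.1,3.5,1.4,0.2,setosa,17.85,7.0,small
4.9,3.0,1.4,0.2,setosa,14.70,7.0,small
4.7,3.2,1.3,0.2,setosa,15.04,6.5,small
4.6,3.1,1.5,0.2,setosa,14.26,7.5,small
5.0,3.6,1.4,0.2,setosa,18.00,7.0,small
"""

EXPECTED_COLUMNS = [
    "sepal_length", "sepal_width", "petal_length", "petal_width",
    "species", "sepal_area", "petal_ratio", "petal_size",
]


def check_wrangled(frame):
    """Hypothetical helper: return a list of problems; empty means the checks passed."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    # Count missing values per column and report only the columns that have any.
    null_counts = frame.isna().sum()
    for col, n in null_counts[null_counts > 0].items():
        problems.append(f"{col}: {n} missing value(s)")
    return problems


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(check_wrangled(sample))  # prints [] because the sample is complete
```

In the notebook, the same function could be called as check_wrangled(df) right after pd.read_csv.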

5.4 Univariate Distributions

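Univariate plots show the distribution of a single feature. DataFrame.hist draws one histogram per numeric column, so the text columns species and petal_size are left out of the grid below.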

import matplotlib.pyplot as plt

df.hist(figsize=(10, 7), bins=12)
plt.suptitle("Iris — Feature Distributions", y=1.02)
plt.tight_layout()

show_and_save_mpl()
Saved PNG → figures/05_001.png
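
To make the bins argument concrete, the following sketch counts values in equal-width bins by hand. It assumes the usual histogram convention (equal-width intervals between the minimum and maximum, with the last bin closed on the right). The function name and the input numbers are invented for illustration and are not taken from the iris data; floating-point values that sit exactly on a bin edge may land differently than in a plotting library.

```python
def equal_width_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: put everything in the first bin.
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right, so the maximum lands in it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Made-up values, chosen so the bin edges are 1, 3, 5, 7, 9.
print(equal_width_counts([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], bins=4))  # prints [3, 5, 1, 1]
```

Changing bins changes the bin width, which is why the same feature can look smooth with few bins and ragged with many.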

5.5 Boxplots by Category

Boxplots make it easier to compare distributions across groups.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 4))

# Build grouped arrays for a clean Matplotlib boxplot
groups = []
labels = []
for sp, subset in df.groupby("species"):
    groups.append(subset["petal_length"].values)
    labels.append(sp)

# Use tick_labels (Matplotlib 3.9+) to avoid deprecation warnings
ax.boxplot(groups, tick_labels=labels, showfliers=True)
ax.set_title("Petal Length by Species")
ax.set_xlabel("species")
ax.set_ylabel("petal_length")
ax.grid(True, axis="y", alpha=0.2)

show_and_save_mpl(fig)
Saved PNG → figures/05_002.png
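
The elements of a boxplot can also be computed directly, which helps when reading the figure. The sketch below assumes the common convention that Matplotlib uses by default: the box spans the first to third quartile, whiskers reach the most extreme points within 1.5 × IQR of the box, and anything beyond is drawn as a flier. The function name box_stats and the input numbers are illustrative only, and quartile interpolation can differ slightly between libraries. statistics.quantiles requires Python 3.8 or later.

```python
import statistics


def box_stats(values, whis=1.5):
    """Return quartiles, whisker ends, and fliers for one group of values."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers stop at the most extreme observations that are inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    fliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "fliers": fliers,
    }


# Made-up values with one obvious outlier.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(stats["fliers"])  # prints [30]
```

In the notebook, calling box_stats(subset["petal_length"]) inside the groupby loop would give the numbers behind each box.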

5.6 Bivariate Scatterplots

Scatterplots help you inspect relationships between two numeric features.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(df["sepal_length"], df["petal_length"], alpha=0.8)
ax.set_title("Sepal Length vs Petal Length")
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.grid(True, alpha=0.2)

show_and_save_mpl(fig)
Saved PNG → figures/05_003.png
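
A scatterplot shows the shape of a relationship; a correlation coefficient puts a single number on its linear strength. As a small sketch, the function below computes Pearson's r from its definition. The function name and the input values are made up for illustration; on the lesson's data, df["sepal_length"].corr(df["petal_length"]) from pandas computes the same quantity.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up values with a moderate positive trend.
print(round(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]), 3))  # prints 0.775
```

A value near 1 or -1 indicates a nearly linear relationship, but r can hide group structure, which is one reason to color the points by category.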

5.6.1 Scatterplot Colored by Species

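Coloring by category often reveals clusters. Each ax.scatter call in the loop below takes the next color from Matplotlib's default color cycle, so every species gets its own color and legend entry.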

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 4))

for sp, subset in df.groupby("species"):
    ax.scatter(subset["sepal_length"], subset["petal_length"], label=sp, alpha=0.8)

ax.set_title("Sepal Length vs Petal Length (Colored by Species)")
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.legend(title="species")
ax.grid(True, alpha=0.2)

show_and_save_mpl(fig)
Saved PNG → figures/05_004.png

5.7 Pairwise Relationships (Optional)

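A pairplot draws a scatterplot for every pair of numeric features, with each feature's own distribution on the diagonal, so you can scan many relationships at once. Setting corner=True keeps only the lower triangle and drops the mirrored duplicates.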

import seaborn as sns

g = sns.pairplot(
    df[["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]],
    hue="species",
    corner=True,
    plot_kws={"alpha": 0.7},
)
g.fig.suptitle("Iris — Pairplot by Species", y=1.02)

show_and_save_mpl(g.fig)
Saved PNG → figures/05_005.png

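If the import fails because Seaborn is not installed, run the following in a notebook cell and then rerun the import. The %pip magic installs into the environment of the running kernel, which !pip does not guarantee:

%pip install seaborn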
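
To find out whether a package is available without triggering an ImportError, one option is to ask Python's import machinery. The helper name is_installed below is an assumption made for this sketch, and it is meant for top-level package names only.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("seaborn"):
    print("seaborn is not installed in this environment; install it before running the pairplot cell.")
```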

5.8 Customizing Visual Styles (Optional)

Matplotlib ships with several built-in styles. This course keeps styling consistent across lessons, so apply other styles sparingly and restore the defaults after experimenting.

The optional demo below uses plt.style.context, which applies a style only inside the with block and restores the previous settings when the block exits.

import matplotlib.pyplot as plt

# Optional style demo (auto-resets after the context)
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.hist(df["sepal_length"], bins=12)
    ax.set_title("Histogram — sepal_length (ggplot style)")
    ax.set_xlabel("sepal_length")
    ax.set_ylabel("count")
    ax.grid(True, alpha=0.2)

    show_and_save_mpl(fig)
Saved PNG → figures/05_006.png
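
To verify that the style really is temporary, a short check can compare one setting before, inside, and after the context. The choice of axes.facecolor as the probe is arbitrary; any rcParams key that the ggplot style changes would work.

```python
import matplotlib.pyplot as plt

# Read one setting before, inside, and after the style context.
before = plt.rcParams["axes.facecolor"]
with plt.style.context("ggplot"):
    inside = plt.rcParams["axes.facecolor"]
after = plt.rcParams["axes.facecolor"]

print("before:", before)
print("inside:", inside)
print("after: ", after)
```

The before and after values should match, while the value inside the block reflects the ggplot style. By contrast, plt.style.use("ggplot") changes the settings for the rest of the session.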

5.9 Exercise

  • Create a boxplot of sepal_width by species
  • Create a scatter plot of petal_width vs petal_length, colored by petal_size (if available)
  • Save both plots using show_and_save_mpl()

One possible solution:

import matplotlib.pyplot as plt

# 1) Boxplot: sepal_width by species
fig, ax = plt.subplots(figsize=(7, 4))
groups = []
labels = []
for sp, subset in df.groupby("species"):
    groups.append(subset["sepal_width"].values)
    labels.append(sp)

ax.boxplot(groups, tick_labels=labels, showfliers=True)
ax.set_title("Sepal Width by Species")
ax.set_xlabel("species")
ax.set_ylabel("sepal_width")
ax.grid(True, axis="y", alpha=0.2)

show_and_save_mpl(fig)

# 2) Scatter: petal_width vs petal_length, colored by petal_size (if present)
if "petal_size" in df.columns:
    fig, ax = plt.subplots(figsize=(7, 4))
    for size, subset in df.groupby("petal_size"):
        ax.scatter(subset["petal_width"], subset["petal_length"], label=size, alpha=0.8)
    ax.set_title("Petal Width vs Petal Length (Colored by petal_size)")
    ax.set_xlabel("petal_width")
    ax.set_ylabel("petal_length")
    ax.legend(title="petal_size")
    ax.grid(True, alpha=0.2)
    show_and_save_mpl(fig)
else:
    print("petal_size column not found. Create it in Lesson 04 or skip this exercise.")
Saved PNG → figures/05_007.png

Saved PNG → figures/05_008.png

5.10 Summary

  • You used histograms and boxplots to inspect feature distributions
  • You used scatterplots to explore relationships between features
  • You used categorical coloring to reveal clusters
  • You created a Seaborn pairplot for a high-level relationship scan
  • You saved figures using the CDI publishing pipeline