Visualization Foundations

Published

Jun 2026

  • ID: DS-L05
  • Type: Lesson
  • Audience: Beginner / Intermediate
  • Theme: Foundational data visualization using clear, reproducible workflows

Visualization is one of the most important tools in data science.

A good plot helps reveal structure, compare groups, identify unusual values, and communicate findings clearly.

But visualization is not only about producing charts.

It is also about learning how to read evidence from patterns.

In this lesson, we focus on foundational plot types that appear in real analytical workflows. We use the wrangled Iris dataset from Chapter 04, so the visualization step builds directly on the output of the wrangling step.


Lesson overview

By the end of this lesson, you will be able to:

  • create histograms, boxplots, scatter plots, and pairplots
  • compare numeric distributions across groups
  • use color to represent categories clearly
  • interpret patterns, spread, overlap, and separation
  • save figures as reusable analysis outputs
  • run a reusable plotting script from the command line

Chapter workflow

This chapter introduces the fourth reusable Python script in the system:

05-visualization-basics.qmd
        ↓
scripts/python/plot_example_data.py
        ↓
data/iris_wrangled.csv
results/figures/
results/figures/figure-index.tsv

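The script reads data/iris_wrangled.csv and writes the figures and their index to results/figures/. The figures produced here support interpretation and reporting in later chapters.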


Load the wrangled dataset

We use the wrangled dataset prepared in Chapter 04.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/iris_wrangled.csv")
df.head()

Inspect the variables

Before plotting, confirm the structure of the dataset.

print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)

Interpretation

Before making plots, ask:

  • which variables are numeric?
  • which variable defines the groups?
  • which comparisons are likely to be meaningful?
  • which derived features are available from the wrangling step?

For the Iris dataset, the numeric flower measurements, the derived petal_area, and the categorical species variable support both distribution plots and grouped comparisons.
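
The first two questions can also be answered programmatically by splitting the columns by dtype. The sketch below is one way to do that; it uses a small hand-made stand-in table (hypothetical values, not the real file) so that it runs on its own, and the helper name split_columns is ours. With the lesson's data you would pass df instead of sample.

```python
import pandas as pd

# Hypothetical stand-in rows that reuse some of the lesson's column names.
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_area": [0.28, 6.58, 15.0],
    "species": ["setosa", "versicolor", "virginica"],
})


def split_columns(table: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (numeric column names, non-numeric column names)."""
    numeric = table.select_dtypes(include="number").columns.tolist()
    other = table.select_dtypes(exclude="number").columns.tolist()
    return numeric, other


numeric_cols, group_cols = split_columns(sample)
print("Numeric:", numeric_cols)
print("Candidate grouping columns:", group_cols)
```

Numeric columns are candidates for histograms and scatter plots; non-numeric columns are candidates for the x axis of a boxplot or for hue.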


Histogram

Histograms help us understand the distribution of a single numeric variable.

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.histplot(
    data=df,
    x="sepal_length",
    bins=12,
    kde=True,
    ax=ax
)

ax.set_title("Distribution of Sepal Length")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Count")

plt.show()

Interpretation

When reading a histogram, look for:

  • the center of the distribution
  • the spread of values
  • skewness
  • possible multiple peaks

If a variable shows multiple peaks, this may suggest the presence of meaningful subgroups.
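
It can help to put numbers next to what the eye sees. The following is a minimal sketch, using only the standard library, that summarizes center, spread, and a moment-based skewness for a list of values. The function name describe_distribution and the sample values are illustrative assumptions, not part of the lesson's scripts.

```python
import statistics


def describe_distribution(values: list[float]) -> dict[str, float]:
    """Summarize center, spread, and skewness of a numeric sample."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    # Moment-based skewness: third central moment divided by stdev cubed.
    third_moment = statistics.fmean([(v - mean) ** 3 for v in values])
    skew = third_moment / stdev ** 3 if stdev > 0 else 0.0
    return {"mean": mean, "median": median, "stdev": stdev, "skew": skew}


# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = describe_distribution([4.0, 5.0, 6.0])
right_skewed = describe_distribution([1.0, 1.0, 2.0, 10.0])
print(symmetric)
print(right_skewed)
```

A skew near zero suggests a roughly symmetric shape, while a positive value suggests a longer right tail. With the lesson's table, df["sepal_length"].tolist() could be passed in the same way. Note that a single summary like this cannot reveal multiple peaks, which is one reason the histogram itself remains necessary.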


Boxplot

Boxplots summarize a distribution using the median, quartiles, and possible outliers.

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.boxplot(
    data=df,
    x="species",
    y="sepal_length",
    ax=ax
)

ax.set_title("Sepal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Sepal Length")

plt.show()

Interpretation

A boxplot helps compare groups by showing:

  • differences in typical values through the median
  • variation within each group
  • possible outliers
  • overlap across groups

If two groups overlap strongly, that variable alone may not separate them well.
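
To see where the box and the outlier points come from, the quantities can be computed directly. The sketch below assumes the common 1.5 × IQR whisker rule, which is the default in seaborn and matplotlib boxplots. The helper name box_summary and the demo values are hypothetical.

```python
import pandas as pd


def box_summary(values: pd.Series, whis: float = 1.5) -> dict:
    """Compute the median, quartiles, and points beyond the whisker fences."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "outliers": outliers.tolist(),
    }


# Hypothetical values with one obvious extreme point.
demo = pd.Series([1, 2, 3, 4, 5, 100])
summary = box_summary(demo)
print(summary)
```

With the lesson's table, the same helper could be applied per species in a loop, for example `for name, part in df.groupby("species"): print(name, box_summary(part["sepal_length"]))`.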


Scatter plot

Scatter plots show the relationship between two numeric variables.

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.scatterplot(
    data=df,
    x="sepal_length",
    y="petal_length",
    hue="species",
    s=70,
    alpha=0.8,
    ax=ax
)

ax.set_title("Sepal Length vs Petal Length")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Petal Length")

plt.show()

Interpretation

Scatter plots help you evaluate:

  • whether two variables move together
  • whether groups form distinct clusters
  • whether the pattern appears linear or non-linear
  • whether any observations appear unusual

In many datasets, scatter plots are among the fastest ways to detect group structure.
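
A correlation coefficient can complement the visual check of whether two variables move together, but it is worth computing it both overall and within each group. In the sketch below (hypothetical data and column names), the overall correlation is high mainly because the two groups sit in different regions, while one group shows only a weak relationship internally.

```python
import pandas as pd

# Hypothetical points: group "a" rises steadily, group "b" is nearly flat.
points = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4,
    "x": [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0],
    "y": [1.1, 1.9, 3.2, 3.8, 6.5, 6.4, 6.6, 6.5],
})

# Pearson correlation across all rows.
overall_r = points["x"].corr(points["y"])

# Pearson correlation computed separately inside each group.
per_group_r = {
    name: part["x"].corr(part["y"])
    for name, part in points.groupby("group")
}

print("Overall r:", round(overall_r, 3))
for name, r in per_group_r.items():
    print(f"Group {name} r:", round(r, 3))
```

This is why coloring by group with hue matters: a single cloud of points can hide the fact that the overall trend is driven by differences between groups rather than by a relationship within them.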


Grouped histogram

A grouped histogram helps compare distributions across categories.

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.histplot(
    data=df,
    x="petal_length",
    hue="species",
    bins=15,
    kde=True,
    ax=ax
)

ax.set_title("Petal Length Distribution by Species")
ax.set_xlabel("Petal Length")
ax.set_ylabel("Count")

plt.show()

Interpretation

This plot helps compare whether groups differ in:

  • location
  • spread
  • overlap
  • shape of the distribution

If one group’s values occupy a clearly different range, that variable may be useful for distinguishing between groups.
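
Whether one group's values occupy a different range can be checked with a simple per-group minimum and maximum. This is only a rough sketch with hypothetical data and helper names (group_ranges, ranges_overlap); comparing full ranges is sensitive to single extreme values, so treat it as a companion to the plot rather than a replacement.

```python
import pandas as pd


def group_ranges(table: pd.DataFrame, value_col: str, group_col: str) -> pd.DataFrame:
    """Return the minimum and maximum of value_col within each group."""
    return table.groupby(group_col)[value_col].agg(["min", "max"])


def ranges_overlap(ranges: pd.DataFrame, first: str, second: str) -> bool:
    """True if the [min, max] intervals of two groups share any values."""
    return bool(
        ranges.loc[first, "min"] <= ranges.loc[second, "max"]
        and ranges.loc[second, "min"] <= ranges.loc[first, "max"]
    )


# Hypothetical measurements for three groups.
demo = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "value": [1.0, 1.5, 1.9, 3.0, 4.5, 5.1, 4.8, 5.5, 6.9],
})

ranges = group_ranges(demo, "value", "group")
print(ranges)
print("a vs b overlap:", ranges_overlap(ranges, "a", "b"))
print("b vs c overlap:", ranges_overlap(ranges, "b", "c"))
```

With the lesson's table, the equivalent call would be group_ranges(df, "petal_length", "species").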


Derived feature plot

Because Chapter 04 created petal_area, we can now visualize it directly.

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.boxplot(
    data=df,
    x="species",
    y="petal_area",
    ax=ax
)

sns.stripplot(
    data=df,
    x="species",
    y="petal_area",
    color="black",
    alpha=0.55,
    size=4,
    jitter=0.22,
    ax=ax
)

ax.set_title("Petal Area by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Area")

plt.show()

Interpretation

This plot connects wrangling to visualization.

A derived feature is only useful if it helps answer a question or clarify a pattern. Here, petal_area gives a compact petal-size measure that separates species more clearly than many sepal-based comparisons.


Pairplot overview

Pairplots provide a compact view of multiple variable relationships at once.

g = sns.pairplot(
    df[[
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
        "petal_area",
        "species"
    ]],
    hue="species",
    corner=True,
    plot_kws={"alpha": 0.7}
)

g.figure.suptitle("Iris: Pairwise Relationships by Species", y=1.02)

plt.show()

Interpretation

Pairplots help answer broader questions such as:

  • which variables best separate species?
  • which features appear strongly related?
  • which measurements appear redundant?
  • where groups overlap and where they separate clearly?

This is often one of the most useful first multivariate views of a dataset.
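
The question of which measurements appear redundant can be cross-checked against a correlation matrix. The sketch below lists pairs of numeric columns whose absolute Pearson correlation reaches a threshold. The helper name strongly_related_pairs, the 0.9 threshold, and the demo columns are assumptions chosen for illustration.

```python
import pandas as pd


def strongly_related_pairs(table: pd.DataFrame, threshold: float = 0.9) -> list[tuple]:
    """List numeric column pairs whose absolute correlation meets the threshold."""
    corr = table.select_dtypes(include="number").corr()
    cols = corr.columns.tolist()
    pairs = []
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:  # visit each unordered pair once
            r = float(corr.loc[first, second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs


# Hypothetical columns: v tracks u closely, w does not.
demo = pd.DataFrame({
    "u": [1.0, 2.0, 3.0, 4.0, 5.0],
    "v": [2.1, 3.9, 6.2, 8.1, 9.8],
    "w": [3.0, 1.0, 4.0, 1.0, 5.0],
})

pairs = strongly_related_pairs(demo)
print(pairs)
```

A pair flagged here corresponds to a tight diagonal band in the matching pairplot panel. High correlation suggests, but does not prove, that one of the two columns adds little new information.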


Reading visual evidence carefully

A plot is only useful if it is interpreted carefully.

When reading a figure, ask:

  • what question does this plot help answer?
  • what pattern is visible?
  • how strong is the pattern?
  • is there overlap, uncertainty, or ambiguity?
  • does this align with earlier summaries?

Visualization should support reasoning, not replace it.


Visualization principles

Strong foundational plots share a few key qualities:

  • clear titles and labels
  • readable axes
  • purposeful use of color
  • minimal clutter
  • plot choice matched to the question

A histogram is useful for one-variable distributions.
A boxplot is useful for comparing distributions across groups.
A scatter plot is useful for relationships between two variables.
A pairplot is useful for quick multivariate exploration.


Validation through visualization

Plots can also function as validation tools.

Use them to check:

  • whether distributions look plausible
  • whether outliers need review
  • whether differences between groups are clear or the groups mostly overlap
  • whether patterns are consistent with earlier cleaning and wrangling steps

Visualization often reveals issues that summary tables alone can miss.
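
Some of these checks can be paired with a small table of counts so that anything suspicious seen in a plot is easy to confirm. The sketch below assumes that physical measurements should be present and strictly positive; the helper name plausibility_report and the demo values are hypothetical.

```python
import pandas as pd


def plausibility_report(table: pd.DataFrame, numeric_cols: list[str]) -> dict:
    """Count missing and non-positive values and record the observed range."""
    report = {}
    for col in numeric_cols:
        values = table[col]
        report[col] = {
            "missing": int(values.isna().sum()),
            # Assumption: lengths, widths, and areas should be greater than zero.
            "non_positive": int((values <= 0).sum()),
            "min": float(values.min()),
            "max": float(values.max()),
        }
    return report


# Hypothetical table with one missing value and one impossible negative area.
demo = pd.DataFrame({
    "sepal_length": [5.1, None, 6.3, 5.8],
    "petal_area": [0.28, 6.58, -1.0, 15.0],
})

report = plausibility_report(demo, ["sepal_length", "petal_area"])
for col, checks in report.items():
    print(col, checks)
```

A non-zero count in either column is a prompt to return to the cleaning or wrangling step before interpreting the figures.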


Run the reusable plotting script

The manual plotting examples above explain the logic. The reusable script creates and saves a standard figure set.

Run this from the project root:

python scripts/python/plot_example_data.py data/iris_wrangled.csv results/figures

Expected outputs:

results/figures/
├── histogram-sepal-length.png
├── boxplot-sepal-length-by-species.png
├── scatter-sepal-length-vs-petal-length.png
├── histogram-petal-length-by-species.png
├── boxplot-petal-area-by-species.png
├── pairplot-iris-by-species.png
└── figure-index.tsv

What the plotting script does

The script:

  • reads the wrangled input table
  • validates required columns
  • creates a standard set of exploratory figures
  • saves figures as .png files
  • writes a figure index table with filenames and descriptions

The figure index helps later reporting chapters refer to plots consistently.
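
The source of plot_example_data.py is not shown in this lesson, so the following is only a minimal sketch of how two of the listed steps (validating required columns and writing the figure index) might look. The names REQUIRED_COLUMNS, missing_columns, and write_figure_index, the required column set, and the index header (filename, description) are assumptions for illustration, not the script's actual code.

```python
import csv
import tempfile
from pathlib import Path

# Assumed column set; the real script may require different columns.
REQUIRED_COLUMNS = {"sepal_length", "petal_length", "petal_area", "species"}


def missing_columns(header: list[str]) -> list[str]:
    """Return required column names that are absent from the header."""
    return sorted(REQUIRED_COLUMNS - set(header))


def write_figure_index(out_dir: Path, entries: list[tuple[str, str]]) -> Path:
    """Write a tab-separated index of (filename, description) rows."""
    out_dir.mkdir(parents=True, exist_ok=True)
    index_path = out_dir / "figure-index.tsv"
    with index_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["filename", "description"])  # assumed header
        writer.writerows(entries)
    return index_path


# Demo: a header that lacks petal_area, and an index written to a temp folder.
absent = missing_columns(["sepal_length", "petal_length", "species"])
print("Missing columns:", absent)

demo_dir = Path(tempfile.mkdtemp())
index_path = write_figure_index(
    demo_dir,
    [("histogram-sepal-length.png", "Distribution of sepal length")],
)
print(index_path.read_text(encoding="utf-8"))
```

Failing early on missing columns gives a clearer error than a plotting call that breaks halfway through the figure set.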


Exercise

Try the following:

  1. Open results/figures/figure-index.tsv.
  2. Open results/figures/boxplot-petal-area-by-species.png.
  3. Create a scatter plot of sepal_width versus petal_width.
  4. Plot a histogram of petal_width.
  5. Write one sentence describing which feature seems most useful for separating species.

One possible solution for steps 3 to 5:

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.scatterplot(
    data=df,
    x="sepal_width",
    y="petal_width",
    hue="species",
    s=70,
    alpha=0.8,
    ax=ax
)

ax.set_title("Sepal Width vs Petal Width")
ax.set_xlabel("Sepal Width")
ax.set_ylabel("Petal Width")

plt.show()

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.histplot(
    data=df,
    x="petal_width",
    bins=12,
    kde=True,
    ax=ax
)

ax.set_title("Distribution of Petal Width")
ax.set_xlabel("Petal Width")
ax.set_ylabel("Count")

plt.show()

print("Petal-related features appear more useful than sepal-related features for separating species because their group differences are more distinct.")

CDI Insight

Visualization is not about producing more plots.

It is about choosing the right view of the data to support understanding.

A clear plot reduces uncertainty. A poor plot can introduce it.

In CDI systems, figures should be reproducible outputs, not temporary screenshots. A saved figure, a figure index, and the script that produced them make visual evidence easier to review and reuse.
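
The manual examples in this lesson display figures with plt.show() but do not write them to disk. As a minimal sketch of turning a figure into a reproducible output, the helper below saves a figure with matplotlib's savefig. The name save_figure, the 300 dpi default, and the demo values are assumptions; the temporary folder stands in for results/figures.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def save_figure(fig, out_dir: Path, filename: str, dpi: int = 300) -> Path:
    """Save a figure under out_dir and return the path that was written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    fig.savefig(path, dpi=dpi, bbox_inches="tight")
    plt.close(fig)  # release memory once the file is written
    return path


# Hypothetical values; with the lesson's data, build the figure from df instead.
fig, ax = plt.subplots(figsize=(8, 5.5))
ax.hist([5.1, 4.9, 6.3, 5.8, 7.1, 6.4], bins=4)
ax.set_title("Distribution of Sepal Length")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Count")

saved = save_figure(fig, Path(tempfile.mkdtemp()), "histogram-sepal-length.png")
print(saved)
```

In a notebook the matplotlib.use("Agg") line is not needed. When a figure should be both saved and displayed, it is generally safest to call fig.savefig(...) before plt.show().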


Summary

In this lesson, you:

  • used foundational plot types to explore the Iris dataset
  • compared distributions within and across species
  • examined relationships between numeric variables
  • visualized the derived petal_area feature
  • used visual patterns to support interpretation
  • saved reusable figures with plot_example_data.py
  • created results/figures/figure-index.tsv

Looking Ahead

In the next chapter, we summarize the analysis more formally. The wrangled tables and saved figures produced here become evidence for written interpretation.