Visualization Foundations

Published: Mar 2026

ID: DS-L05
Type: Lesson
Audience: Beginner / Intermediate
Theme: Foundational data visualization using clear, reproducible workflows

Visualization is one of the most important tools in data science.

A good plot helps reveal structure, compare groups, identify unusual values, and communicate findings clearly.

But visualization is not only about producing charts.

It is also about learning how to read evidence from patterns.

In this lesson, we focus on foundational plot types that appear in real analytical workflows.

We use the cleaned Iris dataset to practice both creating figures and interpreting what they show.

Lesson Overview

By the end of this lesson, you will be able to:

create histograms, boxplots, scatter plots, and pairplots
compare numeric distributions across groups
use color to represent categories clearly
interpret patterns, spread, overlap, and separation
build plots that are readable and analytically useful

Load the Dataset

We use the cleaned dataset prepared in the previous lesson.

Code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/iris_clean.csv")
df.head()

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Inspect the Variables

Before plotting, confirm the structure of the dataset.

Code

print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)

Shape: (149, 5)

Columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Data types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species             str
dtype: object

Interpretation

Before making plots, ask:

which variables are numeric?
which variable defines the groups?
which comparisons are likely to be meaningful?

For the Iris dataset, the numeric flower measurements and the categorical species variable support both distribution plots and grouped comparisons.
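
The first two questions can also be answered programmatically by splitting the columns by dtype. The following is a minimal sketch on a small hand-made frame that mirrors the column layout of the lesson dataset; the name `toy` and its values are illustrative assumptions, not rows taken from data/iris_clean.csv.

```python
import pandas as pd

# Small hand-made frame with the same column layout as the lesson dataset.
# The values are illustrative only.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.4, 6.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.5, 4.9, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.5, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Numeric columns are candidates for histogram and scatter plot axes.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Non-numeric columns are candidates for grouping (hue, or x in a boxplot).
group_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Group sizes show whether comparisons across groups are balanced.
group_sizes = toy["species"].value_counts().to_dict()

print("Numeric:", numeric_cols)
print("Grouping:", group_cols)
print("Group sizes:", group_sizes)
```

Running the same three statements on `df` instead of `toy` gives the corresponding answer for the lesson dataset.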

Histogram

Histograms help us understand the distribution of a single numeric variable.

Code

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.histplot(
    data=df,
    x="sepal_length",
    bins=12,
    kde=True,
    ax=ax
)

ax.set_title("Distribution of Sepal Length")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Count")

plt.show()

Interpretation

When reading a histogram, look for:

the center of the distribution
the spread of values
skewness
possible multiple peaks

If a variable shows multiple peaks, this may suggest the presence of meaningful subgroups.
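
It can help to see the numbers behind the bars. The sketch below uses NumPy to bin a handful of made-up measurements and to compute the center and spread that the list above asks you to read from the plot. The array `values` and the choice of four bins are assumptions made for illustration only.

```python
import numpy as np

# Illustrative measurements (made-up values, not the lesson dataset).
values = np.array([4.6, 4.9, 5.0, 5.1, 5.4, 5.8,
                   6.1, 6.3, 6.4, 6.7, 6.9, 7.2])

# np.histogram returns the count per bin and the bin edges,
# which is the information a histogram draws as bars.
counts, edges = np.histogram(values, bins=4)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {n}")

# Numeric companions to the visual reading: center and spread.
print("mean:", round(float(values.mean()), 2))
print("median:", round(float(np.median(values)), 2))
print("std:", round(float(values.std(ddof=1)), 2))
```

Changing `bins` and re-running shows how strongly the apparent shape depends on the bin width, which is worth keeping in mind before reading peaks as subgroups.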

Boxplot

Boxplots summarize a distribution using the median, quartiles, and possible outliers.

Code

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.boxplot(
    data=df,
    x="species",
    y="sepal_length",
    ax=ax
)

ax.set_title("Sepal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Sepal Length")

plt.show()

Interpretation

A boxplot helps compare groups by showing:

differences in typical values through the median
variation within each group
possible outliers
overlap across groups

If two groups overlap strongly, that variable alone may not separate them well.
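
The quantities a boxplot draws can be computed directly. The sketch below derives the quartiles, the interquartile range (IQR), and the conventional 1.5 × IQR fences for each group, then counts the points outside the fences. The frame `toy`, the group names, and the helper `box_summary` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up values for two hypothetical groups.
toy = pd.DataFrame({
    "species": ["group_a"] * 5 + ["group_b"] * 5,
    "sepal_length": [4.0, 5.0, 6.0, 7.0, 8.0,
                     6.0, 7.0, 8.0, 9.0, 20.0],
})


def box_summary(values: pd.Series) -> pd.Series:
    """Return the statistics a boxplot is built from."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the box are drawn as individual outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    n_outliers = int(((values < lower_fence) | (values > upper_fence)).sum())
    return pd.Series({
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "n_outliers": n_outliers,
    })


# One row of statistics per group.
summary = pd.DataFrame({
    name: box_summary(group["sepal_length"])
    for name, group in toy.groupby("species")
}).T

print(summary)
```

Comparing the `q1` to `q3` intervals across rows gives a quick numeric check of the overlap you see between boxes.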

Scatter Plot

Scatter plots show the relationship between two numeric variables.

Code

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.scatterplot(
    data=df,
    x="sepal_length",
    y="petal_length",
    hue="species",
    s=70,
    alpha=0.8,
    ax=ax
)

ax.set_title("Sepal Length vs Petal Length")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Petal Length")

plt.show()

Interpretation

Scatter plots help you evaluate:

whether two variables move together
whether groups form distinct clusters
whether the pattern appears linear or non-linear
whether any observations appear unusual

In many datasets, scatter plots are among the fastest ways to detect group structure.
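
A correlation coefficient is a useful numeric companion to a scatter plot, but it should be computed within groups as well as overall. The sketch below uses made-up values for two hypothetical groups to show how a strong overall correlation can come mostly from the separation between groups rather than from a trend inside each one.

```python
import pandas as pd

# Made-up values for two hypothetical groups.
toy = pd.DataFrame({
    "species": ["group_a"] * 4 + ["group_b"] * 4,
    "sepal_length": [4.8, 5.0, 5.2, 5.4, 6.0, 6.4, 6.8, 7.2],
    "petal_length": [1.5, 1.4, 1.5, 1.4, 4.5, 5.0, 5.5, 6.0],
})

# Pearson correlation across all rows, ignoring the grouping.
overall = toy["sepal_length"].corr(toy["petal_length"])

# Pearson correlation computed separately inside each group.
within = {
    name: group["sepal_length"].corr(group["petal_length"])
    for name, group in toy.groupby("species")
}

print("overall:", round(overall, 2))
for name, r in within.items():
    print(name, round(r, 2))
```

When the overall and within-group values disagree, coloring the scatter plot by group (the `hue` argument used above) is what makes the reason visible.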

Grouped Histogram

A grouped histogram helps compare distributions across categories.

Code

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.histplot(
    data=df,
    x="petal_length",
    hue="species",
    bins=15,
    kde=True,
    ax=ax
)

ax.set_title("Petal Length Distribution by Species")
ax.set_xlabel("Petal Length")
ax.set_ylabel("Count")

plt.show()

Figure: Iris petal length distribution by species

Interpretation

This plot helps compare whether groups differ in:

location
spread
overlap
shape of the distribution

If one group’s values occupy a clearly different range, that variable may be useful for distinguishing between groups.
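
A quick numeric check of range overlap is to compare the minimum and maximum of each group. The sketch below does this for three hypothetical groups with made-up values; the helper `ranges_overlap` is an illustrative name, not a pandas function.

```python
import pandas as pd

# Made-up values for three hypothetical groups.
toy = pd.DataFrame({
    "species": ["group_a"] * 3 + ["group_b"] * 3 + ["group_c"] * 3,
    "petal_length": [1.3, 1.5, 1.7, 3.9, 4.5, 5.0, 4.8, 5.6, 6.3],
})

# Observed range of the variable inside each group.
ranges = toy.groupby("species")["petal_length"].agg(["min", "max"])


def ranges_overlap(table: pd.DataFrame, a: str, b: str) -> bool:
    """True if the [min, max] intervals of groups a and b intersect."""
    return bool(
        table.loc[a, "min"] <= table.loc[b, "max"]
        and table.loc[b, "min"] <= table.loc[a, "max"]
    )


print(ranges)
print("a vs b overlap:", ranges_overlap(ranges, "group_a", "group_b"))
print("b vs c overlap:", ranges_overlap(ranges, "group_b", "group_c"))
```

Minimum and maximum are sensitive to single extreme values, so this check complements the histogram rather than replacing it.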

Pairplot Overview

Pairplots provide a compact view of multiple variable relationships at once.

Code

g = sns.pairplot(
    df,
    hue="species",
    corner=True,
    plot_kws={"alpha": 0.7}
)

# Use the grid's figure attribute; recent seaborn versions deprecate g.fig.
g.figure.suptitle("Iris: Pairplot by Species", y=1.02)

plt.show()

Interpretation

Pairplots help answer broader questions such as:

which variables best separate species?
which features appear strongly related?
which measurements appear redundant?
where groups overlap and where they separate clearly?

This is often one of the most useful first multivariate views of a dataset.
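
The question of which measurements appear redundant can be cross-checked with a correlation matrix. The sketch below is one possible approach on made-up values: it lists every pair of numeric columns whose absolute Pearson correlation reaches a threshold. The frame `toy` and the threshold of 0.9 are arbitrary assumptions for illustration.

```python
import pandas as pd

# Made-up values; the column names mirror the lesson dataset.
toy = pd.DataFrame({
    "sepal_width": [3.5, 3.0, 3.2, 2.8, 3.3, 2.7],
    "petal_length": [1.4, 1.5, 4.5, 4.9, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.5, 1.5, 2.5, 1.9],
    "species": ["a", "a", "b", "b", "c", "c"],
})

# Pairwise Pearson correlations between the numeric columns only.
corr = toy.select_dtypes(include="number").corr()

# Arbitrary cut-off chosen for this illustration.
threshold = 0.9

# Visit each unordered pair once and keep the strongly related ones.
strong_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]

print(corr.round(2))
print("Strongly related pairs:", strong_pairs)
```

A pair that appears in this list should show up in the pairplot as a tight, nearly linear band of points.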

Reading Visual Evidence Carefully

A plot is only useful if it is interpreted carefully.

When reading a figure, ask:

what question does this plot help answer?
what pattern is visible?
how strong is the pattern?
is there overlap, uncertainty, or ambiguity?
does this align with earlier summaries?

Visualization should support reasoning, not replace it.

Visualization Principles

Strong foundational plots share a few key qualities:

clear titles and labels
readable axes
purposeful use of color
minimal clutter
plot choice matched to the question

A histogram is useful for one-variable distributions.
A boxplot is useful for comparing distributions across groups.
A scatter plot is useful for relationships between two variables.
A pairplot is useful for quick multivariate exploration.

Validation Through Visualization

Plots can also function as validation tools.

Use them to check:

whether distributions look plausible
whether outliers need review
whether differences between groups are substantial or the groups mostly overlap
whether patterns are consistent with earlier cleaning and wrangling steps

Visualization often reveals issues that summary tables alone can miss.
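
Some of these checks can be paired with simple counts, so that anything a plot flags can be confirmed in the data. The sketch below is a minimal example on a made-up frame with three deliberately planted problems: a repeated row, a negative measurement, and a missing value. The names `toy` and `checks` are illustrative.

```python
import pandas as pd

# Made-up frame with planted problems: rows 0 and 1 are identical,
# row 2 has a negative length, and row 4 has a missing length.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.4, -4.5, 4.9, None, 5.1],
    "petal_width": [0.2, 0.2, 1.5, 1.5, 2.5, 1.9],
})

checks = {
    # Missing cells would silently drop points from a plot.
    "missing_values": int(toy.isna().sum().sum()),
    # Physical measurements should be strictly positive.
    "non_positive_values": int((toy <= 0).sum().sum()),
    # Repeated rows inflate histogram counts.
    "duplicate_rows": int(toy.duplicated().sum()),
}

for name, count in checks.items():
    print(f"{name}: {count}")
```

A nonzero count is a prompt to review the affected rows, not automatically a reason to delete them.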

Summary

In this lesson:

you used foundational plot types to explore the Iris dataset
you compared distributions within and across species
you examined relationships between numeric variables
you used visual patterns to support interpretation
you practiced moving from chart creation to analytical reasoning

These are core visualization habits that carry forward into more advanced analysis.

Exercise

Try the following:

Create a scatter plot of sepal_width versus petal_width
Plot a histogram of petal_width
Create a boxplot of petal_width by species
Write one sentence describing which feature seems most useful for separating species

Solution

Code

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.scatterplot(
    data=df,
    x="sepal_width",
    y="petal_width",
    hue="species",
    s=70,
    alpha=0.8,
    ax=ax
)

ax.set_title("Sepal Width vs Petal Width")
ax.set_xlabel("Sepal Width")
ax.set_ylabel("Petal Width")

plt.show()

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.histplot(
    data=df,
    x="petal_width",
    bins=12,
    kde=True,
    ax=ax
)

ax.set_title("Distribution of Petal Width")
ax.set_xlabel("Petal Width")
ax.set_ylabel("Count")

plt.show()

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.boxplot(
    data=df,
    x="species",
    y="petal_width",
    ax=ax
)

ax.set_title("Petal Width by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Width")

plt.show()

print("Your interpretation:")
print("Petal-related features appear more useful than sepal-related features for separating species because their group differences are more distinct.")

Your interpretation:
Petal-related features appear more useful than sepal-related features for separating species because their group differences are more distinct.

CDI Insight

Visualization is not about producing more plots.

It is about choosing the right view of the data to support understanding.

A clear plot reduces uncertainty. A poor plot can introduce it.