Visualization Foundations

Published: Mar 2026

  • ID: DS-L05
  • Type: Lesson
  • Audience: Beginner / Intermediate
  • Theme: Foundational data visualization using clear, reproducible workflows

Visualization is one of the most important tools in data science.

A good plot helps reveal structure, compare groups, identify unusual values, and communicate findings clearly.

But visualization is not only about producing charts.

It is also about learning how to read evidence from patterns.

In this lesson, we focus on foundational plot types that appear in real analytical workflows.

We use the cleaned Iris dataset to practice both creating figures and interpreting what they show.


Lesson Overview

By the end of this lesson, you will be able to:

  • create histograms, boxplots, scatter plots, and pairplots
  • compare numeric distributions across groups
  • use color to represent categories clearly
  • interpret patterns, spread, overlap, and separation
  • build plots that are readable and analytically useful

Load the Dataset

We use the cleaned dataset prepared in the previous lesson.

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/iris_clean.csv")
df.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Inspect the Variables

Before plotting, confirm the structure of the dataset.

Code
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)

Shape: (149, 5)

Columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Data types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species             str
dtype: object

Interpretation

Before making plots, ask:

  • which variables are numeric?
  • which variable defines the groups?
  • which comparisons are likely to be meaningful?

For the Iris dataset, the numeric flower measurements and the categorical species variable support both distribution plots and grouped comparisons.
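The first two questions can also be answered in code. The sketch below is one way to do it with pandas' select_dtypes. It assumes a small hand-typed sample that reuses the lesson's column names; the values are illustrative and are not loaded from data/iris_clean.csv.

```python
import pandas as pd

# Small illustrative sample with the lesson's column names (not the full dataset).
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Numeric columns are candidates for histograms, boxplots, and scatter plots.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Non-numeric columns are candidates for grouping (hue, x-axis categories).
group_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Grouping:", group_cols)
```

The same two calls should work on the full df from the lesson, as long as species is stored as a text or categorical column.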


Histogram

Histograms help us understand the distribution of a single numeric variable.

Code
fig, ax = plt.subplots(figsize=(8, 5.5))

sns.histplot(
    data=df,
    x="sepal_length",
    bins=12,
    kde=True,
    ax=ax
)

ax.set_title("Distribution of Sepal Length")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Count")

plt.show()

Iris — Distribution of Sepal Length

Interpretation

When reading a histogram, look for:

  • the center of the distribution
  • the spread of values
  • skewness
  • possible multiple peaks

If a variable shows multiple peaks, this may suggest the presence of meaningful subgroups.
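Center, spread, and skewness each have a numeric counterpart that can be read alongside the histogram. The following is a minimal sketch using a short, illustrative list of sepal lengths rather than the lesson's full column.

```python
import pandas as pd

# Illustrative values only; in the lesson you would use df["sepal_length"].
values = pd.Series([4.6, 4.9, 5.0, 5.1, 5.8, 6.3, 6.4, 7.0], name="sepal_length")

summary = {
    "mean": values.mean(),      # center (sensitive to extreme values)
    "median": values.median(),  # center (robust to extreme values)
    "std": values.std(),        # spread
    "skew": values.skew(),      # > 0 suggests a longer right tail
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

These numbers do not reveal multiple peaks, which is one reason the histogram itself is still worth reading.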


Boxplot

Boxplots summarize a distribution using the median, quartiles, and possible outliers.

Code
fig, ax = plt.subplots(figsize=(8, 5.5))

sns.boxplot(
    data=df,
    x="species",
    y="sepal_length",
    ax=ax
)

ax.set_title("Sepal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Sepal Length")

plt.show()

Iris — Sepal Length by Species

Interpretation

A boxplot helps compare groups by showing:

  • differences in typical values through the median
  • variation within each group
  • possible outliers
  • overlap across groups

If two groups overlap strongly, that variable alone may not separate them well.
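To see where the box and the outlier points come from, the quantities can be computed by hand. The sketch below assumes the common 1.5 × IQR rule (the default whisker setting in seaborn and matplotlib) and uses a small illustrative sample; box_stats is a hypothetical helper name, not part of any library.

```python
import pandas as pd

# Illustrative sample: five sepal lengths per species (not the full dataset).
sample = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 5,
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 7.0, 6.4, 6.9, 5.5, 6.5],
})


def box_stats(values: pd.Series) -> dict:
    """Return the quartiles, IQR, and outlier count for one group."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((values < lower) | (values > upper)).sum())
    return {"q1": q1, "median": median, "q3": q3,
            "iqr": iqr, "n_outliers": n_outliers}


# One row of statistics per species.
stats = pd.DataFrame(
    {name: box_stats(group)
     for name, group in sample.groupby("species")["sepal_length"]}
).T

print(stats)
```

In this sample the versicolor value 5.5 falls below its lower fence, so it would be drawn as a separate point rather than inside the whisker.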


Scatter Plot

Scatter plots show the relationship between two numeric variables.

Code
fig, ax = plt.subplots(figsize=(8, 5.5))

sns.scatterplot(
    data=df,
    x="sepal_length",
    y="petal_length",
    hue="species",
    s=70,
    alpha=0.8,
    ax=ax
)

ax.set_title("Sepal Length vs Petal Length")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Petal Length")

plt.show()

Iris — Sepal Length vs Petal Length

Interpretation

Scatter plots help you evaluate:

  • whether two variables move together
  • whether groups form distinct clusters
  • whether the pattern appears linear or non-linear
  • whether any observations appear unusual

In many datasets, scatter plots are among the fastest ways to detect group structure.
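Whether two variables move together can be checked numerically with a correlation coefficient. The sketch below computes the Pearson correlation overall and within each species, using an illustrative nine-row sample rather than the full dataset.

```python
import pandas as pd

# Illustrative sample: three rows per species (not the full dataset).
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# Correlation across all rows, ignoring species.
overall = sample["sepal_length"].corr(sample["petal_length"])

# Correlation within each species.
by_species = {
    name: group["sepal_length"].corr(group["petal_length"])
    for name, group in sample.groupby("species")
}

print(f"overall: {overall:.2f}")
for name, value in by_species.items():
    print(f"{name}: {value:.2f}")
```

Comparing the overall value with the per-species values is a useful habit: a strong overall correlation can be driven mostly by differences between groups rather than by a relationship within each group.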


Grouped Histogram

A grouped histogram helps compare distributions across categories.

Code
fig, ax = plt.subplots(figsize=(8, 5.5))

sns.histplot(
    data=df,
    x="petal_length",
    hue="species",
    bins=15,
    kde=True,
    ax=ax
)

ax.set_title("Petal Length Distribution by Species")
ax.set_xlabel("Petal Length")
ax.set_ylabel("Count")

plt.show()

Iris — Petal Length Distribution by Species

Interpretation

This plot helps compare whether groups differ in:

  • location
  • spread
  • overlap
  • shape of the distribution

If one group’s values occupy a clearly different range, that variable may be useful for distinguishing between groups.
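The idea of groups occupying different ranges can be made concrete by comparing each group's minimum and maximum. The following is a rough sketch on an illustrative sample; ranges_overlap is a hypothetical helper, and a full analysis would also look at how many points fall in the overlapping region.

```python
from itertools import combinations

import pandas as pd

# Illustrative sample: five petal lengths per species (not the full dataset).
sample = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 5 + ["virginica"] * 5,
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4,
                     4.7, 4.5, 4.9, 4.0, 4.6,
                     6.0, 5.1, 5.9, 5.6, 4.5],
})

# Observed range of petal_length for each species.
ranges = sample.groupby("species")["petal_length"].agg(["min", "max"])


def ranges_overlap(a: pd.Series, b: pd.Series) -> bool:
    """Two closed intervals overlap when each starts before the other ends."""
    return bool(a["min"] <= b["max"] and b["min"] <= a["max"])


for first, second in combinations(ranges.index, 2):
    overlap = ranges_overlap(ranges.loc[first], ranges.loc[second])
    print(f"{first} vs {second}: overlap={overlap}")
```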


Pairplot Overview

Pairplots provide a compact view of multiple variable relationships at once.

Code
g = sns.pairplot(
    df,
    hue="species",
    corner=True,
    plot_kws={"alpha": 0.7}
)

g.figure.suptitle("Iris — Pairplot by Species", y=1.02)

plt.show()

Iris — Pairplot by Species

Interpretation

Pairplots help answer broader questions such as:

  • which variables best separate species?
  • which features appear strongly related?
  • which measurements appear redundant?
  • where groups overlap and where they separate clearly?

This is often one of the most useful first multivariate views of a dataset.
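The questions about related and redundant features can be cross-checked against a correlation matrix. The sketch below uses an illustrative six-row sample; with so few rows the exact values mean little, so treat it as a pattern to apply to the full df.

```python
from itertools import combinations

import pandas as pd

# Illustrative sample with the lesson's column names (not the full dataset).
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Pairwise Pearson correlations between the numeric columns only.
corr = sample.corr(numeric_only=True)

# Collect each unordered pair once and find the strongest relationship.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
strongest = max(pairs, key=lambda pair: abs(pairs[pair]))

print(corr.round(2))
print("Most strongly related pair:", strongest)
```

A pair with a correlation close to 1 or -1 carries largely overlapping information, which is what "redundant" means in this context.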


Reading Visual Evidence Carefully

A plot is only useful if it is interpreted carefully.

When reading a figure, ask:

  • what question does this plot help answer?
  • what pattern is visible?
  • how strong is the pattern?
  • is there overlap, uncertainty, or ambiguity?
  • does this align with earlier summaries?

Visualization should support reasoning, not replace it.


Visualization Principles

Strong foundational plots share a few key qualities:

  • clear titles and labels
  • readable axes
  • purposeful use of color
  • minimal clutter
  • plot choice matched to the question

A histogram is useful for one-variable distributions.
A boxplot is useful for comparing distributions across groups.
A scatter plot is useful for relationships between two variables.
A pairplot is useful for quick multivariate exploration.
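One way to apply the first few principles consistently is to wrap the labeling steps in a small function. The sketch below is an assumption about how that could look: label_axes is a hypothetical helper, not part of matplotlib or seaborn, and the plotted values are illustrative.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt


def label_axes(ax, title, xlabel, ylabel):
    """Apply a title and axis labels, and hide the top and right borders."""
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)  # less clutter around the data
    return ax


# Illustrative values only.
fig, ax = plt.subplots(figsize=(8, 5.5))
ax.hist([4.6, 4.9, 5.0, 5.1, 5.8, 6.3, 6.4, 7.0], bins=5)
label_axes(ax, "Distribution of Sepal Length", "Sepal Length", "Count")
```

Calling the same helper after every plot keeps titles and labels from being forgotten, without changing how the plot itself is drawn.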


Validation Through Visualization

Plots can also function as validation tools.

Use them to check:

  • whether distributions look plausible
  • whether outliers need review
  • whether differences between groups are clear or the groups mostly overlap
  • whether patterns are consistent with earlier cleaning and wrangling steps

Visualization often reveals issues that summary tables alone can miss.
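Visual checks pair well with a few programmatic ones. The sketch below shows some simple plausibility checks on an illustrative sample. The specific rules (no missing values, strictly positive measurements) are assumptions that suit flower measurements and may not suit other datasets.

```python
import pandas as pd

# Illustrative sample with the lesson's column names (not the full dataset).
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

numeric_cols = sample.select_dtypes(include="number").columns

checks = {
    # Missing values would silently drop points from most plots.
    "no_missing": not sample.isna().any().any(),
    # Physical measurements should be strictly positive.
    "all_positive": bool((sample[numeric_cols] > 0).all().all()),
    # Very uneven group sizes make grouped plots harder to compare.
    "rows_per_species": sample["species"].value_counts().to_dict(),
}

for name, result in checks.items():
    print(f"{name}: {result}")
```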


Summary

  • you used foundational plot types to explore the Iris dataset
  • you compared distributions within and across species
  • you examined relationships between numeric variables
  • you used visual patterns to support interpretation
  • you practiced moving from chart creation to analytical reasoning

These are core visualization habits that carry forward into more advanced analysis.


Exercise

Try the following:

  1. Create a scatter plot of sepal_width versus petal_width
  2. Plot a histogram of petal_width
  3. Create a boxplot of petal_width by species
  4. Write one sentence describing which feature seems most useful for separating species

Code
fig, ax = plt.subplots(figsize=(8, 5.5))

sns.scatterplot(
    data=df,
    x="sepal_width",
    y="petal_width",
    hue="species",
    s=70,
    alpha=0.8,
    ax=ax
)

ax.set_title("Sepal Width vs Petal Width")
ax.set_xlabel("Sepal Width")
ax.set_ylabel("Petal Width")

plt.show()

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.histplot(
    data=df,
    x="petal_width",
    bins=12,
    kde=True,
    ax=ax
)

ax.set_title("Distribution of Petal Width")
ax.set_xlabel("Petal Width")
ax.set_ylabel("Count")

plt.show()

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.boxplot(
    data=df,
    x="species",
    y="petal_width",
    ax=ax
)

ax.set_title("Petal Width by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Width")

plt.show()

print("Your interpretation:")
print("Petal-related features appear more useful than sepal-related features for separating species because their group differences are more distinct.")

Your interpretation:
Petal-related features appear more useful than sepal-related features for separating species because their group differences are more distinct.

CDI Insight

Visualization is not about producing more plots.

It is about choosing the right view of the data to support understanding.

A clear plot reduces uncertainty. A poor plot can introduce it.