Load and Explore a Dataset

Published

Jun 2026

  • ID: DS-L02
  • Type: Lesson
  • Audience: Beginner / Intermediate
  • Theme: Exploration before interpretation

Analytical work does not begin with modeling.

It begins with looking carefully at the data.

Before fitting models or drawing conclusions, we need to understand what the table contains, how it is structured, and whether anything in it looks unusual.

In CDI, this stage is not a formality. It is where analytical judgment begins.

In this lesson, we use the classic Iris dataset (Fisher 1936) as a small example table. The goal is not to study flowers. The goal is to learn a reusable pattern for loading, inspecting, saving, and exploring a tidy dataset.

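You will:

  • create and save the Iris dataset as data/iris.csv
  • inspect structure, data types, and summary statistics
  • generate reusable inspection outputs with inspect_table.py
  • use exploratory visualizations to examine distributions and group differences
  • create a simple derived feature for further exploration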


Why this step matters

A dataset can appear clean at first glance and still contain issues that affect later analysis.

Early exploration helps answer practical questions:

  • how many rows and columns are present?
  • which variables are numeric and which are categorical?
  • do any values look unusual?
  • are there visible group differences?
  • what should be examined more carefully next?

This is where defensible analysis begins.


Chapter workflow

This chapter introduces the first reusable Python script in the system:

02-load-and-explore-dataset.qmd
        ↓
scripts/python/inspect_table.py
        ↓
data/iris.csv
results/inspection/table-inspection-summary.txt
results/inspection/table-column-summary.tsv
results/inspection/table-missing-values.tsv

The chapter explains the workflow. The script makes it reusable.


Create and save the example dataset

We will use the Iris dataset from scikit-learn, convert it to a pandas DataFrame, standardize the column names, and save it as a CSV file.

Run this from the project root using Python:

from pathlib import Path

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

df.columns = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "species"
]

Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris.csv", index=False)

print("Saved dataset to: data/iris.csv")
print(df.head())

What happened here?

  • iris.data contains the numeric measurements.
  • iris.target contains the species labels encoded as integers.
  • pd.Categorical.from_codes() converts the encoded values into readable species names.
  • the columns are renamed to snake_case so later code stays clean and consistent.
  • the dataset is saved to data/iris.csv so downstream lessons can reuse the same input.
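To make the decoding step concrete, here is a minimal sketch of how pd.Categorical.from_codes() maps integer codes to labels. The codes and names lists are toy stand-ins (assumptions for illustration) shaped like iris.target and iris.target_names.

```python
import pandas as pd

# Toy stand-ins for iris.target and iris.target_names (illustrative only).
codes = [0, 0, 1, 2, 1]
names = ["setosa", "versicolor", "virginica"]

# Each integer code is used as a position in `names`.
species = pd.Categorical.from_codes(codes, categories=names)

print(list(species))             # decoded labels, in the original row order
print(list(species.categories))  # the full set of allowed labels
```

Because the result is categorical, the set of allowed labels is stored once and each row only keeps its code.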

Inspect the dataset manually

Before plotting or summarizing, inspect the shape and variable types.

import pandas as pd

df = pd.read_csv("data/iris.csv")

print("Columns:")
print(df.columns.tolist())

print("\nShape (rows, columns):")
print(df.shape)

print("\nData types:")
print(df.dtypes)

Interpretation

At this stage, you should notice:

  • the dataset has 150 rows
  • there are 4 numeric measurement columns
  • species is a categorical grouping variable (after read_csv, pandas reports it as a text dtype such as object, not as category)

That already tells us this is a small, tidy dataset that is well suited for learning exploratory analysis.
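Reading the dtypes by eye works for five columns. For wider tables, one possible approach is to let pandas split the columns by type. The sketch below uses a three-row excerpt (df_small, a name introduced here) so that it runs on its own; with the lesson's df the same two calls apply.

```python
import pandas as pd

# Small excerpt (one row per species) so the example is self-contained.
df_small = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

# select_dtypes filters columns by dtype family.
numeric_cols = df_small.select_dtypes(include="number").columns.tolist()
other_cols = df_small.select_dtypes(exclude="number").columns.tolist()

print("Numeric columns:", numeric_cols)
print("Non-numeric columns:", other_cols)
```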


Preview the data more carefully

df.sample(8, random_state=42)

Looking at a few random rows is often more informative than checking only the first five rows.

It helps confirm that values look realistic across the dataset instead of only at the top.


Run the reusable inspection script

The manual checks above are useful for learning. For a reusable CDI system, we also want a script that can inspect any tidy table.

Run:

python scripts/python/inspect_table.py data/iris.csv results/inspection

This creates:

results/inspection/
├── table-inspection-summary.txt
├── table-column-summary.tsv
└── table-missing-values.tsv

These outputs document the structure of the table and can be reused later for reporting or quality checks.


What the inspection script does

The script checks:

  • file path
  • table dimensions
  • column names
  • data types
  • missing values
  • numeric columns
  • categorical columns

This turns early exploration into a repeatable step.

In CDI systems, manual inspection teaches judgment.
Reusable scripts make that judgment repeatable across projects.
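The source of inspect_table.py is not shown in this lesson. As a rough illustration of the kind of per-column check listed above, the sketch below defines a hypothetical helper, summarize_columns, and runs it on a toy table with one deliberately missing value. It is an assumption about how such a check could be written, not the script itself.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and broad kind."""
    numeric = set(df.select_dtypes(include="number").columns)
    return pd.DataFrame({
        "column": df.columns.tolist(),
        "dtype": [str(t) for t in df.dtypes],
        "n_missing": df.isna().sum().tolist(),
        "kind": ["numeric" if c in numeric else "categorical" for c in df.columns],
    })


# Toy table (hypothetical) with one missing measurement.
toy = pd.DataFrame({
    "petal_length": [1.4, None, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

summary = summarize_columns(toy)
print("Shape (rows, columns):", toy.shape)
print(summary.to_string(index=False))
```

Returning a DataFrame rather than printing directly keeps the result easy to save, for example as a tab-separated file.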


Summary statistics

Use pandas to generate a first summary of the numeric columns:

df.describe()

Interpretation

These summary statistics help answer early questions:

  • What is the average size of each measurement?
  • Which variables have wider spread?
  • Are minimum and maximum values plausible?
  • Do some variables appear more variable than others?

For Iris, petal measurements separate the species more clearly than sepal measurements, although df.describe() pools all species and cannot show this on its own. We will confirm that visually next.
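A numeric way to look at group separation is to summarize each measurement per species. The sketch below uses a nine-row excerpt (the first three rows of each species) so that it runs on its own; the names excerpt, stats, and gap are introduced here for illustration. On the full table, replace excerpt with df.

```python
import pandas as pd

# Excerpt: first three rows of each species, two measurements only.
excerpt = pd.DataFrame({
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
})

cols = ["sepal_length", "petal_length"]

# Mean and standard deviation of each measurement within each species.
stats = excerpt.groupby("species")[cols].agg(["mean", "std"])
print(stats)

# Distance between the largest and smallest species mean, per measurement.
means = excerpt.groupby("species")[cols].mean()
gap = means.max() - means.min()
print(gap)
```

A large gap between species means, relative to the within-species standard deviation, is a first hint that a measurement separates the groups.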


Species counts

Before comparing groups, verify that each group is represented.

df["species"].value_counts().sort_index()

Interpretation

This dataset is balanced across species, which makes visual comparison easier.
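Balance can also be checked programmatically instead of by reading the counts. This is a minimal sketch on a toy label column (hypothetical, three rows per species); with the lesson's data, use df["species"] in its place.

```python
import pandas as pd

# Toy label column: three rows per species (illustrative only).
species = pd.Series(
    ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
    name="species",
)

counts = species.value_counts().sort_index()
shares = species.value_counts(normalize=True).sort_index()

# Balanced means every group has the same number of rows.
is_balanced = counts.nunique() == 1

print(counts)
print(shares.round(3))
print("Balanced:", is_balanced)
```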


Exploratory visualizations

We will use seaborn and matplotlib for exploratory plotting.

These plots are not final presentation graphics. Their purpose is to help us inspect the data and notice patterns worth explaining.

Distribution of numeric features

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", context="notebook")

df_long = df.melt(
    value_vars=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    var_name="feature",
    value_name="value"
)

g = sns.displot(
    data=df_long,
    x="value",
    col="feature",
    col_wrap=2,
    bins=12,
    height=3.6,
    aspect=1.15,
    facet_kws={"sharex": False, "sharey": False}
)

g.set_titles("{col_name}")
g.set_axis_labels("Value", "Count")
g.figure.suptitle("Iris — Distribution of Numeric Features", y=1.03)

plt.show()

Interpretation

These histograms help us see:

  • the overall spread of each feature
  • whether values are concentrated or dispersed
  • whether some variables may contain overlapping or separated groups

A single histogram does not separate species, but it gives a first impression of the measurement ranges.


Boxplot by species

A boxplot gives a compact summary of group differences using the median, quartiles, and overall spread.

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.boxplot(
    data=df,
    x="species",
    y="petal_length",
    width=0.5,
    fliersize=0,  # hide outlier markers
    ax=ax
)

ax.set_title("Petal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Length")

plt.show()

Interpretation

This plot makes it easy to compare:

  • the typical petal length in each species
  • the spread within each group
  • whether groups overlap

However, a boxplot hides the individual observations. That means we cannot directly see how densely values are clustered or how points are distributed within each group.
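The values a boxplot draws can also be computed directly. The sketch below calculates the quartiles and interquartile range of petal_length per species on a nine-row excerpt (first three rows of each species; the names excerpt, q, and iqr are introduced here). On the full table, replace excerpt with df.

```python
import pandas as pd

# Excerpt: first three rows of each species.
excerpt = pd.DataFrame({
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
})

# One row per species, one column per quantile (0.25, 0.5, 0.75).
q = (
    excerpt.groupby("species")["petal_length"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Interquartile range: the height of each box.
iqr = q[0.75] - q[0.25]

print(q)
print(iqr)
```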


Boxplot with observed points

To make the distribution more visible, we can overlay the individual observations.

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.boxplot(
    data=df,
    x="species",
    y="petal_length",
    width=0.5,
    fliersize=0,  # hide outlier markers; the strip plot below draws every point
    ax=ax
)

sns.stripplot(
    data=df,
    x="species",
    y="petal_length",
    color="black",
    alpha=0.6,
    size=4,
    jitter=0.22,
    ax=ax
)

ax.set_title("Petal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Length")

plt.show()

Interpretation

Adding the observed points makes the plot more informative.

Now we can see:

  • how values are distributed within each species
  • whether observations are tightly clustered or more dispersed
  • where overlap is limited or more substantial

Petal length differs strongly by species. This makes it a strong candidate for distinguishing between groups in later analysis.
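Overlap between groups can be checked numerically by comparing each species' observed range. The sketch below is one simple way to do that; ranges_overlap is a hypothetical helper, and the nine-row excerpt is far too small to represent the full table. On all 150 rows, the versicolor and virginica ranges for petal_length do overlap, so run this on df before drawing conclusions.

```python
from itertools import combinations

import pandas as pd

# Excerpt: first three rows of each species.
excerpt = pd.DataFrame({
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
})

# Observed minimum and maximum per species.
ranges = excerpt.groupby("species")["petal_length"].agg(["min", "max"])


def ranges_overlap(a: str, b: str) -> bool:
    """True when the [min, max] intervals of species a and b intersect."""
    return bool(
        ranges.loc[a, "min"] <= ranges.loc[b, "max"]
        and ranges.loc[b, "min"] <= ranges.loc[a, "max"]
    )


overlap = {(a, b): ranges_overlap(a, b) for a, b in combinations(ranges.index, 2)}

print(ranges)
for pair, flag in overlap.items():
    print(pair, "overlap" if flag else "no overlap")
```

A range comparison is sensitive to single extreme values, so it complements the strip plot rather than replacing it.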


Scatter plot: sepal length vs petal length

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.scatterplot(
    data=df,
    x="sepal_length",
    y="petal_length",
    hue="species",
    s=70,
    alpha=0.8,
    ax=ax
)

ax.set_title("Sepal Length vs Petal Length")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Petal Length")

plt.show()

Interpretation

Scatter plots help us inspect relationships between two variables.

Here, the species begin to separate into visible clusters, especially when petal length is involved.

This is a strong exploratory signal that some measurements carry more discriminatory information than others.


Pairwise relationships

g = sns.pairplot(
    df,
    hue="species",
    corner=True,
    diag_kind="hist",
    plot_kws={"alpha": 0.7, "s": 45}
)

g.figure.suptitle("Iris — Pairwise Relationships by Species", y=1.02)

plt.show()

Interpretation

The pairplot provides a broader view of the dataset:

  • which variable pairs show clear separation
  • where clusters overlap
  • which variables appear more useful for distinguishing species

This type of overview is valuable before moving to formal modeling.
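A correlation matrix is a compact numeric companion to the pairplot. The sketch below computes it on a nine-row excerpt (first three rows of each species) so that it runs on its own; on the full table, replace excerpt with df. Note that these are pooled correlations: they mix within-species and between-species structure, so they should be read alongside the colored plots, not instead of them.

```python
import pandas as pd

# Excerpt: first three rows of each species.
excerpt = pd.DataFrame(
    [
        (5.1, 3.5, 1.4, 0.2, "setosa"),
        (4.9, 3.0, 1.4, 0.2, "setosa"),
        (4.7, 3.2, 1.3, 0.2, "setosa"),
        (7.0, 3.2, 4.7, 1.4, "versicolor"),
        (6.4, 3.2, 4.5, 1.5, "versicolor"),
        (6.9, 3.1, 4.9, 1.5, "versicolor"),
        (6.3, 3.3, 6.0, 2.5, "virginica"),
        (5.8, 2.7, 5.1, 1.9, "virginica"),
        (7.1, 3.0, 5.9, 2.1, "virginica"),
    ],
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"],
)

# numeric_only=True skips the species column.
corr = excerpt.corr(numeric_only=True)

print(corr.round(2))
```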


Create a simple derived feature

Feature creation is often part of exploration.

Here, we define a new variable, petal_area, as a simple combination of petal length and petal width.

df["petal_area"] = df["petal_length"] * df["petal_width"]

df.head()

This does not mean the new feature is automatically better.

It illustrates how exploratory analysis can lead to new candidate variables.


Exercise

Try the following:

  1. Print the last five rows of the dataset.
  2. Compute the mean of petal_area for each species.
  3. Create a scatter plot of sepal_width versus petal_width.
  4. Write one sentence describing which feature seems most useful for separating species.

print("Last 5 rows:")
print(df.tail())

print("\nMean petal_area by species:")
print(df.groupby("species", observed=False)["petal_area"].mean())

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.scatterplot(
    data=df,
    x="sepal_width",
    y="petal_width",
    hue="species",
    s=70,
    alpha=0.8,
    ax=ax
)

ax.set_title("Sepal Width vs Petal Width")
ax.set_xlabel("Sepal Width")
ax.set_ylabel("Petal Width")

plt.show()

Interpretation

Petal-based features provide clearer separation between species than sepal-based features.


CDI Insight

Exploration is not just about making plots.

It is about understanding how the dataset behaves before it is used to support conclusions.

A responsible analyst does not move directly from data loading to modeling.

They pause, inspect, compare, and question — and only then move forward.

That habit is a foundation of reliable analysis.


Summary

In this lesson, you:

  • created and saved the Iris dataset as data/iris.csv
  • inspected structure, data types, and summary statistics
  • generated reusable inspection outputs with inspect_table.py
  • used exploratory visualizations to examine distributions and group differences
  • created a simple derived feature for further exploration

Looking Ahead

In the next chapter, we clean and prepare the dataset for analysis. The cleaned table will become the starting point for downstream wrangling, visualization, and summary statistics.