Load and Explore a Dataset

Published: Mar 2026

ID: DS-L02
Type: Lesson
Audience: Beginner / Intermediate
Theme: Exploration before interpretation

Analytical work does not begin with modeling.

It begins with looking carefully at the data.

Before fitting models or drawing conclusions, we need to understand:

what the dataset contains
how variables are structured
whether values look plausible
whether early patterns deserve closer attention

In CDI, this stage is not a formality. It is where analytical judgment begins.

In this lesson, we work with the classic Iris dataset (Fisher 1936), which contains measurements of flower characteristics across three species.

You will:

load the dataset into a pandas DataFrame
inspect its structure and column types
generate summary statistics
create exploratory visualizations
save the dataset for reuse in later lessons

Why this step matters

A dataset can appear clean at first glance and still contain issues that affect later analysis.

Early exploration helps answer practical questions:

how many rows and columns are present?
which variables are numeric and which are categorical?
do any values look unusual?
are there visible group differences?
what should be examined more carefully next?

This is where defensible analysis begins.

Load the dataset

We will use the Iris dataset from scikit-learn, convert it to a pandas DataFrame, and standardize the column names.

Code

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

df.head()

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

What happened here?

iris.data contains the numeric measurements
iris.target contains the species labels encoded as integers
pd.Categorical.from_codes() converts those encoded values into readable species names
the columns are renamed to snake_case so later code stays clean and consistent
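
To make the label-decoding step concrete, here is a minimal sketch of pd.Categorical.from_codes on a tiny hand-made example. The codes and names lists below are illustrative stand-ins for iris.target and iris.target_names, not the real arrays.

```python
import pandas as pd

# Illustrative stand-ins for iris.target (integer codes) and iris.target_names.
codes = [0, 0, 1, 2, 1]
names = ["setosa", "versicolor", "virginica"]

# Each code is an index into `names`; the result is a Categorical of labels.
species = pd.Categorical.from_codes(codes, categories=names)

print(list(species))             # decoded labels, one per code
print(list(species.categories))  # the full set of allowed labels
print(species.codes.tolist())    # the integer codes are still stored underneath
```

Because the integer codes are kept alongside the labels, the column stays compact while remaining readable in tables and plots.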

Save the dataset for future lessons

We will keep reusable datasets inside a local data/ folder.

Code

from pathlib import Path

Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris.csv", index=False)

print("Saved dataset to: data/iris.csv")

Saved dataset to: data/iris.csv

Saving the dataset at this point helps later lessons stay reproducible and consistent.
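
One detail worth knowing when the file is reused: CSV stores only text, so the category type of species is not preserved on disk. The sketch below illustrates this with three rows copied from tables shown in this lesson and a temporary folder; the file name iris_sample.csv is hypothetical and is used only so the example does not touch data/iris.csv.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Three rows copied from tables in this lesson, used as a small stand-in dataset.
sample = pd.DataFrame({
    "sepal_length": [5.1, 6.1, 7.7],
    "sepal_width": [3.5, 2.8, 2.6],
    "petal_length": [1.4, 4.7, 6.9],
    "petal_width": [0.2, 1.2, 2.3],
    "species": pd.Categorical(["setosa", "versicolor", "virginica"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "iris_sample.csv"  # hypothetical file name
    sample.to_csv(path, index=False)

    # Default read: species comes back as plain text.
    plain = pd.read_csv(path)
    # Passing dtype restores species as a categorical column.
    typed = pd.read_csv(path, dtype={"species": "category"})

print(plain["species"].dtype)
print(typed["species"].dtype)
```

If later lessons rely on species being categorical, passing dtype when reading the file is one way to restore it.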

Inspect the structure

Before plotting or summarizing, inspect the shape and variable types.

Code

print("Columns:")
print(df.columns.tolist())

print("\nShape (rows, columns):")
print(df.shape)

print("\nData types:")
print(df.dtypes)

Columns:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Shape (rows, columns):
(150, 5)

Data types:
sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object

Interpretation

At this stage, you should notice:

the dataset has 150 rows
there are 4 numeric measurement columns
species is a categorical grouping variable

That already tells us this is a small, tidy dataset that is well suited for learning exploratory analysis.
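
Shape and types do not reveal missing values or repeated rows, which are common reasons a dataset only appears clean. The sketch below shows one possible check. The helper name quality_report and the toy data are assumptions made for illustration; they are not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Return simple counts that flag common data-quality problems."""
    return {
        "rows": len(frame),
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy data with one missing value and one repeated row, for illustration only.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.4, np.nan, 4.7],
    "species": ["setosa", "setosa", "setosa", "versicolor"],
})

report = quality_report(toy)
print(report)
```

Calling quality_report(df) on the lesson's DataFrame would apply the same checks to the full dataset. A repeated row is not necessarily an error, since two flowers can share the same measurements, but it is worth knowing about before modeling.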

Preview the data more carefully

Code

df.sample(8, random_state=42)

	sepal_length	sepal_width	petal_length	petal_width	species
73	6.1	2.8	4.7	1.2	versicolor
18	5.7	3.8	1.7	0.3	setosa
118	7.7	2.6	6.9	2.3	virginica
78	6.0	2.9	4.5	1.5	versicolor
76	6.8	2.8	4.8	1.4	versicolor
31	5.4	3.4	1.5	0.4	setosa
64	5.6	2.9	3.6	1.3	versicolor
141	6.9	3.1	5.1	2.3	virginica

Looking at a few random rows is often more informative than only checking the first five rows.

It helps confirm that values look realistic across the dataset instead of only at the top.

Summary statistics

Code

df.describe()

	sepal_length	sepal_width	petal_length	petal_width
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.057333	3.758000	1.199333
std	0.828066	0.435866	1.765298	0.762238
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

Interpretation

These summary statistics help answer early questions:

What is the average size of each measurement?
Which variables have wider spread?
Are minimum and maximum values plausible?
Do some variables appear more variable than others?

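A summary table pools all three species, so it cannot show group differences directly. Still, the petal columns vary much more relative to their means than the sepal columns, which hints that petal measurements may separate the species more strongly. We will check that visually next.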
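
To compare spread across variables measured on different scales, one option is the coefficient of variation (standard deviation divided by mean). The sketch below applies it to the means and standard deviations copied from the df.describe() output above; the choice of this particular statistic is an assumption of the example, not something the lesson prescribes.

```python
import pandas as pd

# Means and standard deviations copied from the df.describe() output above.
stats = pd.DataFrame(
    {
        "mean": [5.843333, 3.057333, 3.758000, 1.199333],
        "std": [0.828066, 0.435866, 1.765298, 0.762238],
    },
    index=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)

# Coefficient of variation: spread expressed as a fraction of the mean.
stats["cv"] = stats["std"] / stats["mean"]

print(stats.sort_values("cv", ascending=False).round(3))
```

On these numbers, the two petal features have the largest relative spread. With the full DataFrame, the same quantity could be computed as df.std(numeric_only=True) / df.mean(numeric_only=True).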

Species counts

Before comparing groups, verify that each group is represented.

Code

df["species"].value_counts().sort_index()

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

Interpretation

This dataset is balanced across species, which makes visual comparison easier.
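
A categorical column also makes it possible to spot groups that are declared but never observed. The sketch below uses toy labels (not the Iris data) in which one category has no rows, and shows counts, shares, and a list of empty groups.

```python
import pandas as pd

# Toy labels: "virginica" is a declared category but never occurs.
labels = pd.Series(
    pd.Categorical(
        ["setosa", "setosa", "versicolor"],
        categories=["setosa", "versicolor", "virginica"],
    ),
    name="species",
)

# For categorical data, value_counts also reports categories with zero rows.
counts = labels.value_counts().sort_index()
shares = labels.value_counts(normalize=True).sort_index()
missing_groups = counts[counts == 0].index.tolist()

print(counts)
print(shares.round(3))
print("Groups with no rows:", missing_groups)
```

For the lesson's DataFrame, the same calls on df["species"] would be expected to show equal shares and no empty groups, matching the counts above.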

Exploratory visualizations

We will plot with seaborn and matplotlib, using seaborn's built-in theme so the code runs without any CDI theme dependencies.

These plots are not yet final presentation graphics. Their purpose is to help us inspect the data and notice patterns worth explaining.

Distribution of numeric features

Code

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid", context="notebook")

df_long = df.melt(
    value_vars=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    var_name="feature",
    value_name="value"
)

g = sns.displot(
    data=df_long,
    x="value",
    col="feature",
    col_wrap=2,
    bins=12,
    height=3.6,
    aspect=1.15,
    common_bins=False,  # bin each feature over its own range, not one shared grid
    facet_kws={"sharex": False, "sharey": False}
)

g.set_titles("{col_name}")
g.set_axis_labels("Value", "Count")
g.figure.suptitle("Iris: Distribution of Numeric Features", y=1.03)

plt.show()

Interpretation

These histograms help us see:

the overall spread of each feature
whether values are concentrated or dispersed
whether some variables may contain overlapping or separated groups

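These histograms pool all three species, so they cannot show which species produced which values. They do give a first impression of the measurement ranges, and the two-peaked shape of the petal features is an early hint that distinct groups are present.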

Boxplot by species

A boxplot gives a compact summary of group differences using the median, quartiles, and overall spread.

Code

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.boxplot(
    data=df,
    x="species",
    y="petal_length",
    width=0.5,
    fliersize=0,  # note: this hides outlier markers, so extreme points are not drawn
    ax=ax
)

ax.set_title("Petal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Length")

plt.show()

Figure: Petal Length by Species (boxplot only)

Interpretation

This plot makes it easy to compare:

the typical petal length in each species
the spread within each group
whether groups overlap

However, a boxplot hides the individual observations.

That means we cannot directly see how densely values are clustered or how points are distributed within each group.
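
To see what the box and whiskers summarize, the quartiles and outlier limits can be computed by hand. The sketch below uses the seven setosa petal lengths that appear in the tables shown earlier in this lesson, not the full group of 50, so the numbers are only illustrative. It follows the common 1.5 × IQR rule, which is matplotlib's default whisker setting (whis=1.5).

```python
import numpy as np

# Setosa petal lengths taken from rows shown in this lesson (not the full group).
values = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.5])

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are the ones drawn as outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print("Outliers:", outliers.tolist())
```

In this small subset the value 1.7 falls above the upper limit, which is the kind of point a boxplot would draw as an outlier marker unless those markers are suppressed.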

Boxplot with observed points

To make the distribution more visible, we can overlay the individual observations.

Code

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.boxplot(
    data=df,
    x="species",
    y="petal_length",
    width=0.5,
    fliersize=0,  # hide outlier markers; the strip plot below draws every observation
    ax=ax
)

sns.stripplot(
    data=df,
    x="species",
    y="petal_length",
    color="black",
    alpha=0.6,
    size=4,
    jitter=0.22,
    ax=ax
)

ax.set_title("Petal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Length")

plt.show()

Figure: Petal Length by Species (boxplot with observed points)

Interpretation

Adding the observed points makes the plot more informative.

Now we can see:

how values are distributed within each species
whether observations are tightly clustered or more dispersed
where overlap is limited or more substantial

This provides a fuller view than the boxplot alone.

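Petal length differs strongly by species: setosa sits well below the other two, while versicolor and virginica are closer together and overlap slightly.

This makes it a strong candidate for distinguishing between groups in later analysis.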
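
The visual impression of overlap can be backed with a small numeric check of each group's range. The sketch below uses only the petal lengths from rows displayed earlier in this lesson (13 of the 150 rows), and the helper ranges_overlap is a hypothetical name introduced for illustration. Results on this subset may differ from the full dataset.

```python
import pandas as pd

# Petal lengths from rows displayed earlier in this lesson (a small subset).
subset = pd.DataFrame({
    "species": ["setosa"] * 7 + ["versicolor"] * 4 + ["virginica"] * 2,
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.5,
                     4.7, 4.5, 4.8, 3.6,
                     6.9, 5.1],
})

# Per-species minimum, median, and maximum.
ranges = subset.groupby("species")["petal_length"].agg(["min", "median", "max"])


def ranges_overlap(a: str, b: str) -> bool:
    """Two groups overlap if each one's minimum is at or below the other's maximum."""
    a_min, a_max = ranges.loc[a, "min"], ranges.loc[a, "max"]
    b_min, b_max = ranges.loc[b, "min"], ranges.loc[b, "max"]
    return bool(a_min <= b_max and b_min <= a_max)


print(ranges)
print("setosa vs versicolor overlap:", ranges_overlap("setosa", "versicolor"))
```

Running the same groupby on the full DataFrame gives the complete ranges, which is the fairer basis for judging overlap between versicolor and virginica.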

Scatter plot: sepal length vs petal length

Code

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.scatterplot(
    data=df,
    x="sepal_length",
    y="petal_length",
    hue="species",
    s=70,
    alpha=0.8,
    ax=ax
)

ax.set_title("Sepal Length vs Petal Length")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Petal Length")

plt.show()

Figure: Sepal Length vs Petal Length by Species

Interpretation

Scatter plots help us inspect relationships between two variables.

Here, the species separate into visible clusters, mostly along the petal length axis: setosa forms its own group at the bottom, while versicolor and virginica lie along a shared upward trend.

This is a strong exploratory signal that some measurements carry more discriminatory information than others.

Pairwise relationships

Code

g = sns.pairplot(
    df,
    hue="species",
    corner=True,
    diag_kind="hist",
    plot_kws={"alpha": 0.7, "s": 45}
)

g.figure.suptitle("Iris: Pairwise Relationships by Species", y=1.02)

plt.show()

Figure: Pairwise Relationships by Species

Interpretation

The pairplot provides a broader view of the dataset:

which variable pairs show clear separation
where clusters overlap
which variables appear more useful for distinguishing species

This type of overview is valuable before moving to formal modeling.
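
A correlation matrix is a compact numeric companion to the pairplot. The sketch below computes Pearson correlations on the eight rows from the random sample shown earlier, so the values are illustrative only. Because the rows mix all three species, a high correlation here partly reflects differences between species rather than a relationship within one species.

```python
import pandas as pd

# The eight rows from the df.sample(8, random_state=42) output shown earlier.
sample = pd.DataFrame({
    "sepal_length": [6.1, 5.7, 7.7, 6.0, 6.8, 5.4, 5.6, 6.9],
    "sepal_width": [2.8, 3.8, 2.6, 2.9, 2.8, 3.4, 2.9, 3.1],
    "petal_length": [4.7, 1.7, 6.9, 4.5, 4.8, 1.5, 3.6, 5.1],
    "petal_width": [1.2, 0.3, 2.3, 1.5, 1.4, 0.4, 1.3, 2.3],
})

# Pearson correlation between every pair of numeric columns.
corr = sample.corr()

print(corr.round(2))
```

On the full DataFrame, df.corr(numeric_only=True) would give the same kind of table while skipping the species column.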

Create a simple derived feature

Feature creation is often part of exploration.

Here, we define a new variable, petal_area, as the product of petal length and petal width. It is a rough size proxy (the area of the bounding rectangle), not a measured petal area.

Code

df["petal_area"] = df["petal_length"] * df["petal_width"]

df.head()

	sepal_length	sepal_width	petal_length	petal_width	species	petal_area
0	5.1	3.5	1.4	0.2	setosa	0.28
1	4.9	3.0	1.4	0.2	setosa	0.28
2	4.7	3.2	1.3	0.2	setosa	0.26
3	4.6	3.1	1.5	0.2	setosa	0.30
4	5.0	3.6	1.4	0.2	setosa	0.28

This does not mean the new feature is automatically better.

It illustrates how exploratory analysis can lead to new candidate variables.
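
When trying out candidate features, it can help to leave the original DataFrame unchanged until a feature proves useful. The sketch below shows one way to do that with DataFrame.assign, using the petal values from the first five rows shown above. The variable names head and with_area are illustrative.

```python
import pandas as pd

# Petal measurements from the first five rows shown above.
head = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
})

# assign returns a new DataFrame; `head` itself is left untouched.
with_area = head.assign(
    petal_area=lambda d: d["petal_length"] * d["petal_width"]
)

print(with_area.round(2))
print("petal_area" in head.columns)  # the original frame has no new column
```

This keeps exploratory features separate from the saved dataset, at the cost of holding a second copy in memory.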

Exercise

Try the following:

Print the last five rows of the dataset.
Compute the mean of petal_area for each species.
Create a scatter plot of sepal_width versus petal_width.
Write one sentence describing which feature seems most useful for separating species.

Solution

Code

print("Last 5 rows:")
print(df.tail())

print("\nMean petal_area by species:")
print(df.groupby("species", observed=False)["petal_area"].mean())

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.scatterplot(
    data=df,
    x="sepal_width",
    y="petal_width",
    hue="species",
    s=70,
    alpha=0.8,
    ax=ax
)

ax.set_title("Sepal Width vs Petal Width")
ax.set_xlabel("Sepal Width")
ax.set_ylabel("Petal Width")

plt.show()

Last 5 rows:
     sepal_length  sepal_width  petal_length  petal_width    species  \
145           6.7          3.0           5.2          2.3  virginica   
146           6.3          2.5           5.0          1.9  virginica   
147           6.5          3.0           5.2          2.0  virginica   
148           6.2          3.4           5.4          2.3  virginica   
149           5.9          3.0           5.1          1.8  virginica   

     petal_area  
145       11.96  
146        9.50  
147       10.40  
148       12.42  
149        9.18  

Mean petal_area by species:
species
setosa         0.3656
versicolor     5.7204
virginica     11.2962
Name: petal_area, dtype: float64

Interpretation:

Petal-based features provide clearer separation between species than sepal-based features.

CDI Insight

Exploration is not just about making plots.

It is about understanding how the dataset behaves before it is used to support conclusions.

A responsible analyst does not move directly from data loading to modeling.

They pause, inspect, compare, and question — and only then move forward.

That habit is a foundation of reliable analysis.

Summary

In this lesson, you:

loaded the Iris dataset into a pandas DataFrame
standardized column names
saved the dataset to data/iris.csv
inspected structure, data types, and summary statistics
used exploratory visualizations to examine distributions and group differences
created a simple derived feature for further exploration

Next Step

Next, we clean and prepare the dataset for analysis.