Lesson 2 Load and Explore a Dataset

Every data science project begins with understanding your data. Before you clean it, visualize it, or build any model, you must first know what the dataset contains, how it is structured, and what patterns already exist.

In this lesson, you will load a dataset into Python, inspect its features, explore basic statistics, and create initial visualizations. We’ll work with the classic Iris dataset, a simple but powerful dataset used in many introductory machine-learning courses.

2.1 Lesson Overview

By the end of this lesson, you will be able to:

  • Load a dataset into a pandas DataFrame
  • Inspect its structure, shape, and datatypes
  • Generate summary statistics
  • Visualize initial feature patterns
  • Save the dataset for later lessons

2.2 Prerequisites

  • Python environment installed (Lesson 01)
  • Jupyter Notebook working

2.3 About the Dataset

The Iris dataset contains 150 flower samples, each described by:

  • sepal_length
  • sepal_width
  • petal_length
  • petal_width
  • species (setosa, versicolor, virginica)

It is widely used for teaching data exploration and modeling because it is small, has no missing values, and shows clear differences between species in even the simplest plots.

2.3.1 Ways to Load This Dataset

  • Built-in loader → sklearn.datasets.load_iris()
  • Local CSV → pd.read_csv()
  • Public datasets → covered later

For this lesson, we will use the built-in loader and then save a clean CSV to data/iris.csv for use in later lessons.

2.4 Notebook Setup

This course uses the CDI publishing helpers so that figures are:

  • saved incrementally to figures/
  • embedded into the notebook output as PNGs
  • safe for the pipeline ipynb → md → Rmd → GitBook

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

# Lesson ID drives figure naming (e.g., figures/02_001.png)
cdi_notebook_init(chapter="02", title_x=0)
'cdi'
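
If cdi_viz is not available in your environment, a plain Matplotlib helper can cover the "save to figures/" part of the workflow. The sketch below is only a stand-in: the name save_figure and its parameters are hypothetical, and it does not reproduce the cdi_viz API or its automatic figure numbering.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt


def save_figure(fig, name, out_dir="figures", dpi=150):
    """Hypothetical stand-in: write a Matplotlib figure to <out_dir>/<name>.png."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    path = out / f"{name}.png"
    fig.savefig(path, dpi=dpi, bbox_inches="tight")
    return path


# Small smoke test. A temporary folder is used here so the example leaves
# nothing behind; in a lesson notebook you would keep out_dir="figures".
demo_dir = Path(tempfile.mkdtemp()) / "figures"

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("Smoke test")

saved = save_figure(fig, "02_demo", out_dir=demo_dir)
plt.close(fig)  # release the figure once it is on disk
print(f"Saved PNG to {saved}")
```

The rest of this lesson keeps using show_and_save_mpl, so the stand-in is only needed if the import above fails.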

2.5 Load the Dataset

We will load Iris using scikit-learn, convert it into a DataFrame, rename columns to snake_case, and save it to disk.

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Rename columns to snake_case for consistency across lessons
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

print(df.head())
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
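
As an alternative sketch, recent scikit-learn releases (0.23 and later, as far as I know) can return the data as a DataFrame directly through as_frame=True. The variable names bunch and df_alt are illustrative and are not used elsewhere in this course.

```python
from sklearn.datasets import load_iris

# as_frame=True makes bunch.frame a DataFrame with the four feature
# columns plus an integer "target" column
bunch = load_iris(as_frame=True)
df_alt = bunch.frame.copy()

# Map the integer codes 0, 1, 2 to the species names
code_to_name = {i: str(name) for i, name in enumerate(bunch.target_names)}
df_alt["species"] = df_alt["target"].map(code_to_name)
df_alt = df_alt.drop(columns="target")

# Derive snake_case names: drop the " (cm)" suffix, replace spaces
df_alt.columns = [c.replace(" (cm)", "").replace(" ", "_") for c in df_alt.columns]

print(df_alt.columns.tolist())
print(df_alt.shape)
```

Deriving the names from the originals, rather than typing the list by hand, avoids mislabeling a column if the column order ever differs from what you expect.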

2.6 Save the Dataset for Future Lessons

We store datasets in data/ throughout this course.

from pathlib import Path

Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris.csv", index=False)

print("Saved dataset to: data/iris.csv")
Saved dataset to: data/iris.csv
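
One thing to keep in mind when the file is read back later: CSV stores only text, so the category dtype of species is not preserved. The sketch below shows this with a few rows copied from this lesson's output, written to a temporary folder so that it runs on its own; the names sample, reloaded, and restored are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Four rows copied from the head() and tail() output in this lesson
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 6.7, 6.3],
        "sepal_width": [3.5, 3.0, 3.0, 2.5],
        "petal_length": [1.4, 1.4, 5.2, 5.0],
        "petal_width": [0.2, 0.2, 2.3, 1.9],
        "species": ["setosa", "setosa", "virginica", "virginica"],
    }
)
sample["species"] = sample["species"].astype("category")

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "iris_sample.csv"
    sample.to_csv(csv_path, index=False)

    # A plain read brings species back as ordinary strings
    reloaded = pd.read_csv(csv_path)

    # Asking for the dtype explicitly restores the categorical column
    restored = pd.read_csv(csv_path, dtype={"species": "category"})

was_category_after_reload = isinstance(reloaded["species"].dtype, pd.CategoricalDtype)
print("Categorical after plain reload:", was_category_after_reload)
print("Categorical with dtype=...:", isinstance(restored["species"].dtype, pd.CategoricalDtype))
```

In later lessons the same dtype argument can be passed when reading data/iris.csv if you want species to stay categorical.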

2.7 Inspect the Dataset Structure

Check the dataset shape, column names, and data types.

print("Columns:")
print(df.columns.tolist())

print("\nShape (rows, cols):")
print(df.shape)

print("\nData types:")
print(df.dtypes)
Columns:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Shape (rows, cols):
(150, 5)

Data types:
sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object
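
Two further checks are often worth running at this stage: missing values and duplicated rows. The following is a minimal sketch on a hand-typed sample of rows from this lesson's output, so that it runs on its own; in the notebook you would call the same methods on df.

```python
import pandas as pd

# Ten rows copied from the head() and tail() output in this lesson
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 6.7, 6.3, 6.5, 6.2, 5.9],
        "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6, 3.0, 2.5, 3.0, 3.4, 3.0],
        "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 5.2, 5.0, 5.2, 5.4, 5.1],
        "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2, 2.3, 1.9, 2.0, 2.3, 1.8],
        "species": ["setosa"] * 5 + ["virginica"] * 5,
    }
)

# One-call overview: column names, non-null counts, dtypes, memory use
sample.info()

# Missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of rows that exactly repeat an earlier row
n_duplicates = int(sample.duplicated().sum())
print("Duplicate rows:", n_duplicates)

# Names of the numeric columns, handy for plotting later
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```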

2.8 Summary Statistics

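A quick way to see the typical range of values is describe(). For each numeric column it reports the count, mean, standard deviation, minimum, quartiles, and maximum. The categorical species column is left out by default.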

print(df.describe())
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
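
Pooled statistics can hide differences between groups, so it often helps to repeat the summary per species. The sketch below uses groupby on a hand-typed sample (five setosa and five virginica rows copied from this lesson's output), so the numbers describe only that sample, not the full dataset.

```python
import pandas as pd

# Petal measurements for ten rows copied from this lesson's output
sample = pd.DataFrame(
    {
        "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 5.2, 5.0, 5.2, 5.4, 5.1],
        "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2, 2.3, 1.9, 2.0, 2.3, 1.8],
        "species": ["setosa"] * 5 + ["virginica"] * 5,
    }
)

# Mean of each petal feature within each species
means = sample.groupby("species")[["petal_length", "petal_width"]].mean()
print(means)

# Rows per species, to confirm the groups are comparable in size
counts = sample["species"].value_counts()
print(counts)
```

On the full DataFrame, the same call with df in place of sample gives one row per species for all three classes.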

2.9 Basic Exploratory Visualizations

We will create a few starter plots to understand patterns quickly.

Note:

  • These are not “final” charts yet
  • At this stage, visuals are used to inspect and spot-check data

2.9.1 Histograms of Numeric Features

Histograms show the distribution of each numeric column.

import matplotlib.pyplot as plt

df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].hist(
    figsize=(10, 7),
    bins=12
)
plt.suptitle("Iris — Histograms of Numeric Features", y=1.02)
plt.tight_layout()

show_and_save_mpl()
Saved PNG → figures/02_001.png

2.9.2 Scatter: sepal_length vs petal_length

A scatter plot helps you see relationships between two features.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df["sepal_length"], df["petal_length"], alpha=0.7)
ax.set_title("Iris — Sepal Length vs Petal Length")
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.grid(True, alpha=0.2)

show_and_save_mpl(fig)
Saved PNG → figures/02_002.png
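
To put a number on what the scatter plot shows, you can compute the Pearson correlation between the two features. This is a minimal sketch on a hand-typed sample of ten rows from this lesson's output; the values it prints describe that sample only.

```python
import pandas as pd

# Ten rows copied from the head() and tail() output in this lesson
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 6.7, 6.3, 6.5, 6.2, 5.9],
        "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 5.2, 5.0, 5.2, 5.4, 5.1],
        "species": ["setosa"] * 5 + ["virginica"] * 5,
    }
)

# Correlation across all rows, ignoring species
r_all = sample["sepal_length"].corr(sample["petal_length"])
print(f"All rows: r = {r_all:.2f}")

# Correlation within each species
per_species = {}
for name, group in sample.groupby("species"):
    per_species[name] = group["sepal_length"].corr(group["petal_length"])
    print(f"{name}: r = {per_species[name]:.2f}")
```

Comparing the pooled value with the per-species values shows how much of the overall relationship comes from differences between species rather than from a trend inside each one.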

2.9.3 Pairplot by Species

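A pairplot draws a scatter plot for every pair of numeric features, with each feature's distribution on the diagonal. Coloring the points by species makes clusters easy to spot, and corner=True keeps only the lower triangle so that no pair is drawn twice.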

import seaborn as sns

g = sns.pairplot(df, hue="species", corner=True, plot_kws={"alpha": 0.7})
g.fig.suptitle("Iris — Pairplot by Species", y=1.02)

show_and_save_mpl(g.fig)
Saved PNG → figures/02_003.png

2.10 Exercise

  • Print the last five rows of the dataset
  • Count how many rows exist for each species
  • Create a new feature called petal_area (petal_length × petal_width) and print the first five rows

One possible solution:

print("Last 5 rows:")
print(df.tail())

print("\nRows per species:")
print(df["species"].value_counts())

df["petal_area"] = df["petal_length"] * df["petal_width"]

print("\nWith petal_area:")
print(df.head())
Last 5 rows:
     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

Rows per species:
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

With petal_area:
   sepal_length  sepal_width  petal_length  petal_width species  petal_area
0           5.1          3.5           1.4          0.2  setosa        0.28
1           4.9          3.0           1.4          0.2  setosa        0.28
2           4.7          3.2           1.3          0.2  setosa        0.26
3           4.6          3.1           1.5          0.2  setosa        0.30
4           5.0          3.6           1.4          0.2  setosa        0.28
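
Assigning to df["petal_area"] changes df in place, so the extra column stays in the DataFrame for the rest of the notebook session. If you would rather leave the original untouched, DataFrame.assign returns a new DataFrame instead. The sketch below uses the first five rows from this lesson's output; the name with_area is illustrative.

```python
import pandas as pd

# Petal measurements of the first five rows shown in this lesson
sample = pd.DataFrame(
    {
        "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
        "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    }
)

# assign() returns a copy with the new column; sample itself is unchanged
with_area = sample.assign(petal_area=sample["petal_length"] * sample["petal_width"])

print(with_area)
print("Columns in the original:", sample.columns.tolist())
```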

2.11 Summary

  • You loaded Iris using scikit-learn and converted it to a pandas DataFrame
  • You standardized column names to snake_case
  • You saved the dataset to data/iris.csv for future lessons
  • You inspected the dataset's structure, reviewed summary statistics, and created starter visuals