Load and Explore a Dataset

  • ID: DS-L02
  • Type: Lesson
  • Audience: Public
  • Theme: Data loading, structure inspection, and initial exploration

Every data science project begins with understanding the dataset.

In this guide, code is executed inside Quarto chapter files (.qmd). When you render the book, Quarto runs the Python chunks and embeds the results directly into the page.

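In this lesson, you will:

  • Load the Iris dataset with scikit-learn and convert it to a pandas DataFrame
  • Standardize the column names to snake_case
  • Save the dataset to data/iris.csv for later lessons
  • Inspect the dataset's structure and summary statistics
  • Create a few exploratory plots
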
Chapter Initialization

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

cdi_notebook_init(chapter="02", title_x=0.5)

Load the Dataset

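We load Iris with scikit-learn, convert it to a pandas DataFrame, and rename the columns to snake_case. scikit-learn's original feature names contain spaces and units (for example, "sepal length (cm)"), which are awkward to type in code. The next section saves the result to disk.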

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

df.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
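
The hard-coded column list above works because Iris has only five columns. For wider datasets, the same renaming can be derived from the original names. The following is a minimal sketch, not part of the course code: `to_snake_case` is a hypothetical helper, and the raw names are the ones scikit-learn returns in `iris.feature_names`.

```python
import re

# Feature names as returned by scikit-learn's load_iris().feature_names,
# plus the "species" column added in the lesson.
raw_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
    "species",
]


def to_snake_case(name: str) -> str:
    """Hypothetical helper: drop a trailing unit such as ' (cm)' and join words with underscores."""
    without_unit = re.sub(r"\s*\(.*?\)\s*$", "", name)
    return re.sub(r"\s+", "_", without_unit.strip()).lower()


clean_names = [to_snake_case(name) for name in raw_names]
print(clean_names)
```

Assigning `clean_names` to `df.columns` would give the same result as the manual list, assuming the columns are in the order shown.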

Save the Dataset for Future Lessons

We store datasets in data/ throughout this course.

from pathlib import Path

Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris.csv", index=False)

print("Saved dataset to: data/iris.csv")
Saved dataset to: data/iris.csv
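
One detail matters when later lessons read this file back: CSV stores plain text, so the category dtype of species does not survive the round trip. The sketch below illustrates this with a small in-memory stand-in (`sample`, with illustrative values) instead of the real file, and restores the dtype with `astype("category")`.

```python
import io

import pandas as pd

# Small stand-in with the same kind of columns as the lesson's df (values are illustrative).
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "species": pd.Categorical(["setosa", "versicolor", "virginica"]),
})

# Write to an in-memory buffer instead of data/iris.csv so the sketch leaves no files behind.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
is_categorical_after_reload = isinstance(reloaded["species"].dtype, pd.CategoricalDtype)
print("Categorical after reload:", is_categorical_after_reload)

# Restore the categorical dtype explicitly.
reloaded["species"] = reloaded["species"].astype("category")
is_categorical_after_cast = isinstance(reloaded["species"].dtype, pd.CategoricalDtype)
print("Categorical after cast:", is_categorical_after_cast)
```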

Inspect the Dataset Structure

print("Columns:")
print(df.columns.tolist())

print("\nShape (rows, cols):")
print(df.shape)

print("\nData types:")
print(df.dtypes)
Columns:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Shape (rows, cols):
(150, 5)

Data types:
sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object
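
Two further structure checks are often run at this point: counting missing values per column and counting fully duplicated rows. The sketch below uses a deliberately flawed stand-in frame (`sample`, an assumption for illustration) so that both checks have something to report. The same two calls can be run on `df`.

```python
import pandas as pd

# Illustrative stand-in for df; the missing value and the repeated row are planted on purpose.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, None, 4.9],
    "species": ["setosa", "setosa", "virginica", "setosa"],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print("Missing values per column:")
print(missing_per_column)
print("\nDuplicate rows:", duplicate_rows)
```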

Summary Statistics

df.describe()
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
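
`df.describe()` pools all three species into one set of numbers. A per-species breakdown is a common follow-up. The sketch below shows the pattern on a four-row stand-in (`sample`) built from petal_length values that appear in this lesson's printed rows. Applied to `df`, the same `groupby` call would summarize all 150 rows.

```python
import pandas as pd

# Stand-in frame using petal_length values from rows printed in this lesson.
sample = pd.DataFrame({
    "petal_length": [1.4, 1.3, 5.2, 5.0],
    "species": pd.Categorical(["setosa", "setosa", "virginica", "virginica"]),
})

# observed=True restricts the result to categories that actually occur in the data.
per_species = (
    sample.groupby("species", observed=True)["petal_length"]
    .agg(["mean", "min", "max"])
)
print(per_species)
```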

Basic Exploratory Visualizations

At this stage, plots serve as a quick sanity check: they show each feature's range and shape and help flag unexpected values before any modeling.

Histograms of Numeric Features

import matplotlib.pyplot as plt

df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].hist(
    figsize=(10, 7),
    bins=12
)
plt.suptitle("Iris — Histograms of Numeric Features", y=1.02)
plt.tight_layout()

show_and_save_mpl()

Iris — Histograms of Numeric Features
'figures/02_001.png'
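
When `bins` is an integer, matplotlib splits the range between a column's minimum and maximum into that many equal-width bins. As a worked example of what `bins=12` means for sepal_length, the sketch below takes the minimum and maximum from the `df.describe()` output and computes the bin width and edges by hand.

```python
# Minimum and maximum of sepal_length, taken from the df.describe() output.
low, high = 4.3, 7.9
n_bins = 12

bin_width = (high - low) / n_bins

# 12 bins need 13 edges; rounding hides floating-point noise.
edges = [round(low + i * bin_width, 2) for i in range(n_bins + 1)]

print("Bin width:", round(bin_width, 2))
print("Edges:", edges)
```

Each sepal_length bar therefore covers about 0.3 cm. The other columns have different ranges, so their bins have different widths.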

Scatter: sepal_length vs petal_length

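fig, ax = plt.subplots(figsize=(8, 5))
# Semi-transparent markers keep overlapping points visible.
ax.scatter(df["sepal_length"], df["petal_length"], alpha=0.7)
ax.set_title("Iris — Sepal Length vs Petal Length")
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.grid(True, alpha=0.2)

show_and_save_mpl(fig)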

Iris — Sepal Length vs Petal Length
'figures/02_002.png'

Pairplot by Species

import seaborn as sns

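g = sns.pairplot(df, hue="species", corner=True, plot_kws={"alpha": 0.7})
# The grid exposes its matplotlib Figure as .figure (seaborn 0.11.2+); .fig is the older alias.
g.figure.suptitle("Iris — Pairplot by Species", y=1.02)

show_and_save_mpl(g.figure)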

Iris — Pairplot by Species
'figures/02_003.png'

Exercise

  • Print the last five rows of the dataset
  • Count how many rows exist for each species
  • Create a new feature called petal_area (petal_length × petal_width) and print the first five rows

print("Last 5 rows:")
print(df.tail())

print("\nRows per species:")
print(df["species"].value_counts())

df["petal_area"] = df["petal_length"] * df["petal_width"]

print("\nWith petal_area:")
print(df.head())
Last 5 rows:
     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

Rows per species:
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

With petal_area:
   sepal_length  sepal_width  petal_length  petal_width species  petal_area
0           5.1          3.5           1.4          0.2  setosa        0.28
1           4.9          3.0           1.4          0.2  setosa        0.28
2           4.7          3.2           1.3          0.2  setosa        0.26
3           4.6          3.1           1.5          0.2  setosa        0.30
4           5.0          3.6           1.4          0.2  setosa        0.28

Summary

  • You loaded Iris using scikit-learn and converted it to a pandas DataFrame
  • You standardized column names to snake_case
  • You saved the dataset to data/iris.csv for future lessons
  • You inspected the dataset's structure and summary statistics, then created starter visuals