Load and Explore a Dataset
Every data science project begins with understanding the dataset.
In this guide, code is executed inside Quarto chapter files (.qmd). When you render the book, Quarto runs the Python chunks and embeds the results directly into the page.
In this lesson, you will:
- Load a dataset into a pandas DataFrame
- Inspect its structure and data types
- Generate summary statistics
- Create initial visualizations
- Save the dataset for reuse in later lessons
Chapter Initialization
from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl
cdi_notebook_init(chapter="02", title_x=0.5)
Load the Dataset
We load Iris using scikit-learn, convert it into a DataFrame, rename columns to snake_case, then save it to disk.
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df.head()
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
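The rename step above hard-codes the new column names. As a rough sketch, the same names could be derived from the raw scikit-learn feature labels instead. The labels are copied in as literals so the snippet runs without scikit-learn, and `to_snake_case` is a hypothetical helper written for this illustration, not part of pandas or scikit-learn.

```python
import re

# Raw feature labels as returned by load_iris().feature_names (copied as literals).
raw_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]


def to_snake_case(label: str) -> str:
    """Hypothetical helper: drop a trailing unit in parentheses, then snake_case."""
    without_unit = re.sub(r"\s*\(.*?\)\s*$", "", label)
    return re.sub(r"\W+", "_", without_unit.strip()).lower()


snake_names = [to_snake_case(name) for name in raw_names]
print(snake_names)
```

Deriving the names avoids typos when a dataset has many columns, but a hard-coded list, as used in this lesson, is easier to read for four features.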
Save the Dataset for Future Lessons
We store datasets in data/ throughout this course.
from pathlib import Path
Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris.csv", index=False)
print("Saved dataset to: data/iris.csv")
Saved dataset to: data/iris.csv
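One detail worth knowing before reusing the file: CSV stores plain text, so the `category` dtype of `species` is not preserved on disk. The sketch below uses a small stand-in DataFrame and a temporary directory (both are assumptions for illustration, not the course's `data/iris.csv`) to show how the dtype can be requested again when reading.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in frame with the same kinds of columns as the Iris DataFrame.
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "species": pd.Categorical(["setosa", "versicolor", "virginica"]),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "iris_sample.csv"  # hypothetical file name
    sample.to_csv(csv_path, index=False)

    # Default read: species comes back as a plain string column.
    plain = pd.read_csv(csv_path)
    # Explicit dtype: species is parsed as a categorical column again.
    typed = pd.read_csv(csv_path, dtype={"species": "category"})

print(plain.dtypes)
print(typed.dtypes)
```

Later lessons that depend on `species` being categorical could pass the same `dtype` argument when loading the saved file.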
Inspect the Dataset Structure
print("Columns:")
print(df.columns.tolist())
print("\nShape (rows, cols):")
print(df.shape)
print("\nData types:")
print(df.dtypes)
Columns:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
Shape (rows, cols):
(150, 5)
Data types:
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species category
dtype: object
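Alongside column names, shape, and dtypes, a structure check often includes counts of missing values and duplicated rows. The following is a minimal sketch on a made-up stand-in frame (the values are invented for illustration, not taken from Iris); the same two calls can be run on `df`.

```python
import pandas as pd

# Stand-in frame with one missing value and one duplicated row (made-up values).
sample = pd.DataFrame({
    "petal_length": [1.4, 1.4, None, 5.1],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

# Number of missing (NaN) values in each column.
missing_per_column = sample.isna().sum()

# Number of rows that are identical to an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```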
Summary Statistics
df.describe()
| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
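By default, `describe()` on a mixed DataFrame reports only the numeric columns, so `species` does not appear in the table above. One possible follow-up is a per-species summary with `groupby`. The sketch below uses a small stand-in frame with made-up values; on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Stand-in frame: two made-up measurements per species.
sample = pd.DataFrame({
    "petal_length": [1.4, 1.6, 4.5, 4.7, 6.0, 5.0],
    "species": pd.Categorical(
        ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
    ),
})

# observed=True limits the result to categories that actually appear in the data.
per_species = (
    sample.groupby("species", observed=True)["petal_length"]
    .agg(["mean", "min", "max"])
)
print(per_species)
```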
Basic Exploratory Visualizations
At this stage, plots are a quick way to inspect and spot-check the dataset, not finished figures.
Histograms of Numeric Features
import matplotlib.pyplot as plt
df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].hist(
    figsize=(10, 7),
    bins=12
)
plt.suptitle("Iris — Histograms of Numeric Features", y=1.02)
plt.tight_layout()
show_and_save_mpl()
[Figure: histograms of the four numeric Iris features; cell output: 'figures/02_001.png']
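To make the `bins=12` argument concrete: an integer bin count is interpreted as that many equal-width intervals spanning the column's minimum to maximum. The sketch below reproduces the counting step with `np.histogram` on made-up values (an illustration of the binning idea, not the plotting call itself).

```python
import numpy as np

# Made-up measurements standing in for one numeric column.
values = np.array([4.3, 4.9, 5.0, 5.1, 5.8, 5.8, 6.1, 6.4, 6.7, 7.9])

# bins=12 splits the range [min, max] into 12 equal-width intervals.
counts, edges = np.histogram(values, bins=12)

print("Bin width:", edges[1] - edges[0])
print("Counts per bin:", counts)
```

Twelve bins produce thirteen edges, and every value falls into exactly one bin, so the counts sum to the number of observations.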
Scatter: sepal_length vs petal_length
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df["sepal_length"], df["petal_length"], alpha=0.7)
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.grid(True, alpha=0.2)
show_and_save_mpl(fig)
[Figure: scatter plot of sepal_length against petal_length; cell output: 'figures/02_002.png']
Pairplot by Species
import seaborn as sns
g = sns.pairplot(df, hue="species", corner=True, plot_kws={"alpha": 0.7})
g.fig.suptitle("Iris — Pairplot by Species", y=1.02)
show_and_save_mpl(g.fig)
[Figure: pairplot of the Iris features colored by species; cell output: 'figures/02_003.png']
Exercise
- Print the last five rows of the dataset
- Count how many rows exist for each species
- Create a new feature called petal_area (petal_length × petal_width) and print the first five rows
print("Last 5 rows:")
print(df.tail())
print("\nRows per species:")
print(df["species"].value_counts())
df["petal_area"] = df["petal_length"] * df["petal_width"]
print("\nWith petal_area:")
print(df.head())
Last 5 rows:
sepal_length sepal_width petal_length petal_width species
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
Rows per species:
species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
With petal_area:
sepal_length sepal_width petal_length petal_width species petal_area
0 5.1 3.5 1.4 0.2 setosa 0.28
1 4.9 3.0 1.4 0.2 setosa 0.28
2 4.7 3.2 1.3 0.2 setosa 0.26
3 4.6 3.1 1.5 0.2 setosa 0.30
4 5.0 3.6 1.4 0.2 setosa 0.28
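The solution above adds `petal_area` to `df` in place, after `data/iris.csv` was written, so the saved file does not contain the new column. If you would rather leave the original DataFrame untouched, `DataFrame.assign` returns a copy with the extra column. A minimal sketch on a stand-in frame with made-up values:

```python
import pandas as pd

# Stand-in frame with made-up petal measurements.
sample = pd.DataFrame({
    "petal_length": [1.4, 4.5, 6.0],
    "petal_width": [0.2, 1.5, 2.5],
})

# assign() returns a new DataFrame; `sample` keeps its original columns.
with_area = sample.assign(
    petal_area=sample["petal_length"] * sample["petal_width"]
)

print(list(sample.columns))
print(with_area)
```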
Summary
- You loaded Iris using scikit-learn and converted it to a pandas DataFrame
- You standardized column names to snake_case
- You saved the dataset to data/iris.csv for future lessons
- You inspected structure, summary statistics, and created starter visuals