Lesson 2 Load and Explore a Dataset
Every data science project begins with understanding your data. Before you clean it, visualize it, or build any model, you must first know what the dataset contains, how it is structured, and what patterns already exist.
In this lesson, you will load a dataset into Python, inspect its features, explore basic statistics, and create initial visualizations. We’ll work with the classic Iris dataset, a small, clean dataset used in many introductory machine-learning courses.
2.1 Lesson Overview
By the end of this lesson, you will be able to:
- Load a dataset into a pandas DataFrame
- Inspect its structure, shape, and datatypes
- Generate summary statistics
- Visualize initial feature patterns
- Save the dataset for later lessons
2.3 About the Dataset
The Iris dataset contains 150 flower samples, each described by:
- sepal_length
- sepal_width
- petal_length
- petal_width
- species (setosa, versicolor, virginica)
It is widely used for teaching data exploration and modeling because it is clean, simple, and highly visual.
2.4 Notebook Setup
This course uses the CDI publishing helpers so that figures are:
- saved incrementally to figures/
- embedded into the notebook output as PNGs
- safe for the pipeline ipynb → md → Rmd → GitBook
from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl
# Lesson ID drives figure naming (e.g., figures/02_001.png)
cdi_notebook_init(chapter="02", title_x=0)

'cdi'
2.5 Load the Dataset
We will load Iris using scikit-learn, convert it into a DataFrame, rename columns to snake_case, and save it to disk.
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)
# Rename columns to snake_case for consistency across lessons
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
print(df.head())

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
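The species column above is built with pd.Categorical.from_codes, which turns integer class codes into labels by looking each code up in a list of category names. A minimal sketch of that mapping, using a short hypothetical list of codes in place of iris.target and a hand-written name list in place of iris.target_names:

```python
import pandas as pd

# Hypothetical stand-ins for iris.target and iris.target_names
codes = [0, 0, 1, 2, 1]
names = ["setosa", "versicolor", "virginica"]

# Each integer code is replaced by the name at that position in `names`
species = pd.Categorical.from_codes(codes, categories=names)

print(list(species))       # one label per code
print(species.categories)  # the lookup table of names
print(species.codes)       # the original integers are still stored
```

Storing labels this way keeps the readable names while the underlying data stays compact integers, which is one reason the species column reports a category dtype.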
2.6 Save the Dataset for Future Lessons
We store datasets in data/ throughout this course.
from pathlib import Path
Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris.csv", index=False)
print("Saved dataset to: data/iris.csv")

Saved dataset to: data/iris.csv
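One thing to keep in mind when reloading a saved CSV: the file stores only text, so the category dtype of the species column does not survive the round trip. The sketch below illustrates this with a small stand-in DataFrame and a temporary directory (both are assumptions for the example, not the course's data/ folder), then restores the dtype after loading.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in frame; the lesson's df has 150 rows and 5 columns
sample = pd.DataFrame({
    "petal_length": [1.4, 1.4, 5.2, 5.0],
    "species": pd.Categorical(["setosa", "setosa", "virginica", "virginica"]),
})

# Write to a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "iris_sample.csv"
    sample.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# After reloading, species is a plain text column, not a category
dtype_after_reload = reloaded["species"].dtype
print("dtype after reload:", dtype_after_reload)

# Restore the categorical dtype explicitly
reloaded["species"] = reloaded["species"].astype("category")
print("dtype after astype:", reloaded["species"].dtype)
```

An alternative is to request the dtype at load time with pd.read_csv(csv_path, dtype={"species": "category"}).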
2.7 Inspect the Dataset Structure
Check the dataset shape, column names, and data types.
print("Columns:")
print(df.columns.tolist())
print("\nShape (rows, cols):")
print(df.shape)
print("\nData types:")
print(df.dtypes)

Columns:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
Shape (rows, cols):
(150, 5)
Data types:
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species category
dtype: object
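Alongside shape and dtypes, a structural check often includes looking for missing values and duplicated rows. The sketch below is one possible way to do that; it uses the first five Iris rows printed earlier as a stand-in DataFrame, so its results say nothing about the full 150-row dataset.

```python
import pandas as pd

# Stand-in data: the first five Iris rows printed earlier in this lesson
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": pd.Categorical(["setosa"] * 5),
})

# Count missing values in each column
missing_per_column = sample.isna().sum()

# Count rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())

# List the columns pandas treats as numeric (species is excluded)
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Numeric columns:", numeric_columns)
```

Running the same three calls on the lesson's df would show whether the full dataset needs any cleaning before analysis.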
2.8 Summary Statistics
A quick way to see the typical range of values for each numeric feature is describe().
print(df.describe())
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
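The table above pools all three species together, which can hide differences between them. One way to look at this is to group by species before summarizing. The sketch below uses the ten rows printed in this lesson (the first five and last five) as a stand-in, so it covers only setosa and virginica; treat it as an illustration of the groupby pattern rather than a result for the full dataset.

```python
import pandas as pd

# Stand-in data: rows 0-4 (setosa) and 145-149 (virginica) from this lesson
sample = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 5.2, 5.0, 5.2, 5.4, 5.1],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2, 2.3, 1.9, 2.0, 2.3, 1.8],
    "species": pd.Categorical(["setosa"] * 5 + ["virginica"] * 5),
})

# Mean of each petal measurement within each species.
# observed=True keeps only the categories that actually appear in the data.
per_species = sample.groupby("species", observed=True)[
    ["petal_length", "petal_width"]
].mean()

print(per_species)
```

Replacing .mean() with .describe() gives the full set of summary statistics per group.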
2.9 Basic Exploratory Visualizations
We will create a few starter plots to understand patterns quickly.
Note:
- These are not “final” charts yet
- At this stage, visuals are used to inspect and spot-check data
2.9.1 Histograms of Numeric Features
Histograms show the distribution of each numeric column.
import matplotlib.pyplot as plt
df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].hist(
figsize=(10, 7),
bins=12
)
plt.suptitle("Iris — Histograms of Numeric Features", y=1.02)
plt.tight_layout()
show_and_save_mpl()

Saved PNG → figures/02_001.png

2.9.2 Scatter: sepal_length vs petal_length
A scatter plot helps you see relationships between two features.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df["sepal_length"], df["petal_length"], alpha=0.7)
ax.set_title("Iris — Sepal Length vs Petal Length")
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.grid(True, alpha=0.2)
show_and_save_mpl(fig)

Saved PNG → figures/02_002.png
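A single-color scatter shows the overall relationship but not which species each point belongs to. A possible next step is to draw one scatter call per species so that each group gets its own color and legend entry. The sketch below uses the ten rows printed in this lesson as stand-in data and plain matplotlib; it does not call the course's show_and_save_mpl helper.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: rows 0-4 (setosa) and 145-149 (virginica) from this lesson
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 6.7, 6.3, 6.5, 6.2, 5.9],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 5.2, 5.0, 5.2, 5.4, 5.1],
    "species": ["setosa"] * 5 + ["virginica"] * 5,
})

fig, ax = plt.subplots(figsize=(8, 5))

# One scatter call per species: matplotlib assigns a new color each time
for name, group in sample.groupby("species"):
    ax.scatter(group["sepal_length"], group["petal_length"], alpha=0.7, label=name)

ax.set_title("Sepal Length vs Petal Length by Species")
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.grid(True, alpha=0.2)
ax.legend(title="species")
```

In the course notebook, the figure could then be passed to show_and_save_mpl(fig) in the same way as the scatter plot above.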

2.10 Exercise
- Print the last five rows of the dataset
- Count how many rows exist for each species
- Create a new feature called petal_area (petal_length × petal_width) and print the first five rows
print("Last 5 rows:")
print(df.tail())
print("\nRows per species:")
print(df["species"].value_counts())
df["petal_area"] = df["petal_length"] * df["petal_width"]
print("\nWith petal_area:")
print(df.head())

Last 5 rows:
sepal_length sepal_width petal_length petal_width species
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
Rows per species:
species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
With petal_area:
sepal_length sepal_width petal_length petal_width species petal_area
0 5.1 3.5 1.4 0.2 setosa 0.28
1 4.9 3.0 1.4 0.2 setosa 0.28
2 4.7 3.2 1.3 0.2 setosa 0.26
3 4.6 3.1 1.5 0.2 setosa 0.30
4 5.0 3.6 1.4 0.2 setosa 0.28
2.11 Summary
- You loaded Iris using scikit-learn and converted it to a pandas DataFrame
- You standardized column names to snake_case
- You saved the dataset to data/iris.csv for future lessons
- You inspected the structure, reviewed summary statistics, and created starter visuals
Continue to Lesson 03 — Data Cleaning and Preparation.
