Lesson 4 Data Wrangling Basics
In this lesson, you will take the cleaned Iris dataset from Lesson 03 and transform it to prepare for deeper analysis.
Data wrangling (also called data transformation) means reshaping and enriching your dataset so it better supports the questions you want to answer.
4.1 Lesson Overview
By the end of this lesson, you will be able to:
- Select specific columns and rows
- Filter data using multiple conditions
- Create engineered features
- Create conditional and categorical features
- Summarize data using
groupby()andagg()
- Save a wrangled dataset for later use
4.2 Prerequisites
- Completed Lesson 03 — Data Cleaning and Preparation
data/iris_clean.csvavailable in thedata/folder
4.3 Notebook Setup
We use CDI publishing helpers so that figures are:
- saved incrementally to
figures/
- embedded into the notebook output as PNGs
- safe for the pipeline
ipynb → md → Rmd → GitBook
4.4 Load the Cleaned Dataset
We load the cleaned dataset saved in Lesson 03.
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
(149, 5)
4.5 Selecting Columns
Selecting a subset of columns is useful when you want a smaller working view of the dataset.
sepal_length sepal_width species
0 5.1 3.5 setosa
1 4.9 3.0 setosa
2 4.7 3.2 setosa
3 4.6 3.1 setosa
4 5.0 3.6 setosa
4.6 Filtering Rows
Filtering helps you focus on a subset of observations that match a question.
Example question:
- Among virginica flowers, which ones have
sepal_length > 6.0?
filtered = df[(df["sepal_length"] > 6.0) & (df["species"] == "virginica")]
print(filtered.head())
print(filtered.shape) sepal_length sepal_width petal_length petal_width species
100 6.3 3.3 6.0 2.5 virginica
102 7.1 3.0 5.9 2.1 virginica
103 6.3 2.9 5.6 1.8 virginica
104 6.5 3.0 5.8 2.2 virginica
105 7.6 3.0 6.6 2.1 virginica
(41, 5)
4.7 Feature Engineering
Feature engineering creates new columns that capture useful patterns.
We will create:
sepal_area=sepal_length × sepal_width
petal_ratio=petal_length ÷ petal_width
Note:
- Always protect against division by zero when creating ratios.
import numpy as np
df["sepal_area"] = df["sepal_length"] * df["sepal_width"]
# Protect against division by zero
df["petal_ratio"] = np.where(
df["petal_width"] == 0,
np.nan,
df["petal_length"] / df["petal_width"]
)
print(df[["sepal_length", "sepal_width", "sepal_area", "petal_length", "petal_width", "petal_ratio"]].head()) sepal_length sepal_width sepal_area petal_length petal_width \
0 5.1 3.5 17.85 1.4 0.2
1 4.9 3.0 14.70 1.4 0.2
2 4.7 3.2 15.04 1.3 0.2
3 4.6 3.1 14.26 1.5 0.2
4 5.0 3.6 18.00 1.4 0.2
petal_ratio
0 7.0
1 7.0
2 6.5
3 7.5
4 7.0
4.8 Conditional Columns (Categorical Features)
Sometimes you want a category label derived from a numeric value.
Here we create:
petal_size="large"ifpetal_length ≥ 4.5, else"small"
df["petal_size"] = np.where(df["petal_length"] >= 4.5, "large", "small")
print(df["petal_size"].value_counts())petal_size
small 79
large 70
Name: count, dtype: int64
4.9 Grouping and Aggregation
Grouping summarises the dataset by a category.
We will compute mean measurements per species and count how many rows are in each group.
summary = (
df.groupby("species", as_index=False)
.agg(
sepal_length_mean=("sepal_length", "mean"),
sepal_width_mean=("sepal_width", "mean"),
petal_length_mean=("petal_length", "mean"),
petal_width_mean=("petal_width", "mean"),
n=("species", "count"),
)
.sort_values("sepal_length_mean")
)
print(summary) species sepal_length_mean sepal_width_mean petal_length_mean \
0 setosa 5.006000 3.428000 1.462000
1 versicolor 5.936000 2.770000 4.260000
2 virginica 6.604082 2.979592 5.561224
petal_width_mean n
0 0.246000 50
1 1.326000 50
2 2.028571 49
4.10 (Optional) Quick Plot: Mean Sepal Length by Species
This is a preview of how wrangling outputs feed directly into visualization.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(summary["species"], summary["sepal_length_mean"])
ax.set_title("Mean Sepal Length by Species")
ax.set_xlabel("species")
ax.set_ylabel("mean sepal_length")
ax.grid(True, axis="y", alpha=0.2)
show_and_save_mpl(fig)Saved PNG → figures/04_001.png

4.11 Save the Wrangled Dataset
We will save the updated dataset for later lessons.
from pathlib import Path
Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_wrangled.csv", index=False)
print("Saved wrangled dataset to: data/iris_wrangled.csv")Saved wrangled dataset to: data/iris_wrangled.csv
4.12 Exercise
- Create a new feature called
sepal_ratio(sepal_length ÷ sepal_width)
- Create a new category column called
sepal_sizewith values"wide"or"narrow"based onsepal_width ≥ 3.0
- Create a grouped table that shows the mean
sepal_areaby species
df_ex = df.copy()
df_ex["sepal_ratio"] = np.where(
df_ex["sepal_width"] == 0,
np.nan,
df_ex["sepal_length"] / df_ex["sepal_width"]
)
df_ex["sepal_size"] = np.where(df_ex["sepal_width"] >= 3.0, "wide", "narrow")
grouped = (
df_ex.groupby("species", as_index=False)
.agg(mean_sepal_area=("sepal_area", "mean"))
)
print(grouped) species mean_sepal_area
0 setosa 17.257800
1 versicolor 16.526200
2 virginica 19.766735
4.13 Summary
- You selected columns and filtered rows to answer specific questions
- You engineered new numeric and categorical features
- You built grouped summary tables with
groupby()andagg()
- You saved a wrangled dataset for use in later lessons
Continue to Lesson 05 — Visualization Basics.