Lesson 4 Data Wrangling Basics

In this lesson, you will take the cleaned Iris dataset from Lesson 03 and transform it to prepare for deeper analysis.

Data wrangling (also called data transformation) means reshaping and enriching your dataset so it better supports the questions you want to answer.

4.1 Lesson Overview

By the end of this lesson, you will be able to:

  • Select specific columns and rows
  • Filter data using multiple conditions
  • Create engineered features
  • Create conditional and categorical features
  • Summarize data using groupby() and agg()
  • Save a wrangled dataset for later use

4.2 Prerequisites

  • Completed Lesson 03 — Data Cleaning and Preparation
  • data/iris_clean.csv available in the data/ folder

4.3 Notebook Setup

We use CDI publishing helpers so that figures are:

  • saved incrementally to figures/
  • embedded into the notebook output as PNGs
  • safe for the pipeline ipynb → md → Rmd → GitBook
from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

# Lesson ID drives figure naming (e.g., figures/04_001.png)
_ = cdi_notebook_init(chapter="04", title_x=0)

4.4 Load the Cleaned Dataset

We load the cleaned dataset saved in Lesson 03.

import pandas as pd

df = pd.read_csv("data/iris_clean.csv")

print(df.head())
print(df.shape)
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
(149, 5)

4.5 Selecting Columns

Selecting a subset of columns is useful when you want a smaller working view of the dataset.

selected = df[["sepal_length", "sepal_width", "species"]]
print(selected.head())
   sepal_length  sepal_width species
0           5.1          3.5  setosa
1           4.9          3.0  setosa
2           4.7          3.2  setosa
3           4.6          3.1  setosa
4           5.0          3.6  setosa

4.6 Filtering Rows

Filtering helps you focus on a subset of observations that match a question.

Example question:

  • Among virginica flowers, which ones have sepal_length > 6.0?
filtered = df[(df["sepal_length"] > 6.0) & (df["species"] == "virginica")]

print(filtered.head())
print(filtered.shape)
     sepal_length  sepal_width  petal_length  petal_width    species
100           6.3          3.3           6.0          2.5  virginica
102           7.1          3.0           5.9          2.1  virginica
103           6.3          2.9           5.6          1.8  virginica
104           6.5          3.0           5.8          2.2  virginica
105           7.6          3.0           6.6          2.1  virginica
(41, 5)

4.7 Feature Engineering

Feature engineering creates new columns that capture useful patterns.

We will create:

  • sepal_area = sepal_length × sepal_width
  • petal_ratio = petal_length ÷ petal_width

Note:

  • Always protect against division by zero when creating ratios.
import numpy as np

df["sepal_area"] = df["sepal_length"] * df["sepal_width"]

# Protect against division by zero
df["petal_ratio"] = np.where(
    df["petal_width"] == 0,
    np.nan,
    df["petal_length"] / df["petal_width"]
)

print(df[["sepal_length", "sepal_width", "sepal_area", "petal_length", "petal_width", "petal_ratio"]].head())
   sepal_length  sepal_width  sepal_area  petal_length  petal_width  \
0           5.1          3.5       17.85           1.4          0.2   
1           4.9          3.0       14.70           1.4          0.2   
2           4.7          3.2       15.04           1.3          0.2   
3           4.6          3.1       14.26           1.5          0.2   
4           5.0          3.6       18.00           1.4          0.2   

   petal_ratio  
0          7.0  
1          7.0  
2          6.5  
3          7.5  
4          7.0  

4.8 Conditional Columns (Categorical Features)

Sometimes you want a category label derived from a numeric value.

Here we create:

  • petal_size = "large" if petal_length ≥ 4.5, else "small"
df["petal_size"] = np.where(df["petal_length"] >= 4.5, "large", "small")

print(df["petal_size"].value_counts())
petal_size
small    79
large    70
Name: count, dtype: int64

4.9 Grouping and Aggregation

Grouping summarises the dataset by a category.

We will compute mean measurements per species and count how many rows are in each group.

summary = (
    df.groupby("species", as_index=False)
    .agg(
        sepal_length_mean=("sepal_length", "mean"),
        sepal_width_mean=("sepal_width", "mean"),
        petal_length_mean=("petal_length", "mean"),
        petal_width_mean=("petal_width", "mean"),
        n=("species", "count"),
    )
    .sort_values("sepal_length_mean")
)

print(summary)
      species  sepal_length_mean  sepal_width_mean  petal_length_mean  \
0      setosa           5.006000          3.428000           1.462000   
1  versicolor           5.936000          2.770000           4.260000   
2   virginica           6.604082          2.979592           5.561224   

   petal_width_mean   n  
0          0.246000  50  
1          1.326000  50  
2          2.028571  49  

4.10 (Optional) Quick Plot: Mean Sepal Length by Species

This is a preview of how wrangling outputs feed directly into visualization.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(summary["species"], summary["sepal_length_mean"])
ax.set_title("Mean Sepal Length by Species")
ax.set_xlabel("species")
ax.set_ylabel("mean sepal_length")
ax.grid(True, axis="y", alpha=0.2)

show_and_save_mpl(fig)
Saved PNG → figures/04_001.png

4.11 Save the Wrangled Dataset

We will save the updated dataset for later lessons.

from pathlib import Path

Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_wrangled.csv", index=False)

print("Saved wrangled dataset to: data/iris_wrangled.csv")
Saved wrangled dataset to: data/iris_wrangled.csv

4.12 Exercise

  • Create a new feature called sepal_ratio (sepal_length ÷ sepal_width)
  • Create a new category column called sepal_size with values "wide" or "narrow" based on sepal_width ≥ 3.0
  • Create a grouped table that shows the mean sepal_area by species
df_ex = df.copy()

df_ex["sepal_ratio"] = np.where(
    df_ex["sepal_width"] == 0,
    np.nan,
    df_ex["sepal_length"] / df_ex["sepal_width"]
)
df_ex["sepal_size"] = np.where(df_ex["sepal_width"] >= 3.0, "wide", "narrow")

grouped = (
    df_ex.groupby("species", as_index=False)
    .agg(mean_sepal_area=("sepal_area", "mean"))
)

print(grouped)
      species  mean_sepal_area
0      setosa        17.257800
1  versicolor        16.526200
2   virginica        19.766735

4.13 Summary

  • You selected columns and filtered rows to answer specific questions
  • You engineered new numeric and categorical features
  • You built grouped summary tables with groupby() and agg()
  • You saved a wrangled dataset for use in later lessons