from cdi_viz.theme import cdi_notebook_init
cdi_notebook_init(chapter="04", title_x=0.5)Data Wrangling Basics
Data rarely arrives in the exact format you need for analysis.
In this lesson, you will learn how to transform, filter, group, and reshape data using pandas.
As in previous chapters, all code runs inside this Quarto (.qmd) file and executes when you render the book.
Lesson Overview
By the end of this lesson, you will be able to:
- Select and filter rows and columns
- Create new derived features
- Sort and reorder data
- Group and aggregate values
- Apply conditional logic
- Prepare structured outputs for analysis
Chapter Initialization
Load the Clean Dataset
We use the cleaned dataset created in Lesson 03.
import pandas as pd
df = pd.read_csv("data/iris_clean.csv")
df.head()| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Selecting Columns
df[["sepal_length", "petal_length"]].head()| sepal_length | petal_length | |
|---|---|---|
| 0 | 5.1 | 1.4 |
| 1 | 4.9 | 1.4 |
| 2 | 4.7 | 1.3 |
| 3 | 4.6 | 1.5 |
| 4 | 5.0 | 1.4 |
Filtering Rows
Example: filter rows where petal_length > 4.
df_filtered = df[df["petal_length"] > 4]
df_filtered.head()| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
| 55 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
Creating New Features
We create a new feature: petal_area.
df["petal_area"] = df["petal_length"] * df["petal_width"]
df.head()| sepal_length | sepal_width | petal_length | petal_width | species | petal_area | |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.28 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.28 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.26 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.30 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.28 |
Sorting Data
df_sorted = df.sort_values(by="petal_area", ascending=False)
df_sorted.head()| sepal_length | sepal_width | petal_length | petal_width | species | petal_area | |
|---|---|---|---|---|---|---|
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica | 15.87 |
| 109 | 7.2 | 3.6 | 6.1 | 2.5 | virginica | 15.25 |
| 100 | 6.3 | 3.3 | 6.0 | 2.5 | virginica | 15.00 |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica | 14.74 |
| 143 | 6.7 | 3.3 | 5.7 | 2.5 | virginica | 14.25 |
Grouping and Aggregation
Compute the mean petal_area per species.
grouped = (
df.groupby("species")
.agg(mean_petal_area=("petal_area", "mean"),
mean_sepal_length=("sepal_length", "mean"))
.reset_index()
)
grouped| species | mean_petal_area | mean_sepal_length | |
|---|---|---|---|
| 0 | setosa | 0.36560 | 5.006000 |
| 1 | versicolor | 5.72040 | 5.936000 |
| 2 | virginica | 11.32898 | 6.604082 |
Conditional Logic
Create a simple size category based on petal_area.
df["size_category"] = df["petal_area"].apply(
lambda x: "large" if x > df["petal_area"].median() else "small"
)
df.head()| sepal_length | sepal_width | petal_length | petal_width | species | petal_area | size_category | |
|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.28 | small |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.28 | small |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.26 | small |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.30 | small |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.28 | small |
Reshaping Example (Long Format)
Convert selected columns to long format.
df_long = df.melt(
id_vars="species",
value_vars=["sepal_length", "sepal_width", "petal_length", "petal_width"],
var_name="measurement",
value_name="value"
)
df_long.head()| species | measurement | value | |
|---|---|---|---|
| 0 | setosa | sepal_length | 5.1 |
| 1 | setosa | sepal_length | 4.9 |
| 2 | setosa | sepal_length | 4.7 |
| 3 | setosa | sepal_length | 4.6 |
| 4 | setosa | sepal_length | 5.0 |
Validation Checks
print("Shape:", df.shape)
print("Missing values:", int(df.isna().sum().sum()))Shape: (149, 7)
Missing values: 0
Summary
- You selected and filtered data
- You created derived features
- You grouped and aggregated values
- You applied conditional logic
- You reshaped data into long format
These transformations form the foundation of real-world data preparation workflows.