- ID: DS-L04
- Type: Lesson
- Audience: Beginner / Intermediate
- Theme: Transforming and reshaping data using reproducible workflows
Data rarely arrives in the exact format required for analysis.
In this lesson, you will learn how to transform, filter, group, and reshape data using pandas.
As in previous chapters, all code runs inside this Quarto (.qmd) file.
When the book is rendered, Python code is executed and results are embedded directly into the page.
## Lesson Overview
By the end of this lesson, you will be able to:
- select and filter rows and columns
- create derived features
- sort and reorder data
- group and aggregate values
- apply conditional logic
- prepare structured outputs for analysis
## Load the Clean Dataset
We use the cleaned dataset created in Lesson 03.
```python
import pandas as pd

df = pd.read_csv("data/iris_clean.csv")
df.head()
```
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
## Selecting Columns
Select a subset of columns for focused analysis.
```python
df[["sepal_length", "petal_length"]].head()
```
|   | sepal_length | petal_length |
|---|---|---|
| 0 | 5.1 | 1.4 |
| 1 | 4.9 | 1.4 |
| 2 | 4.7 | 1.3 |
| 3 | 4.6 | 1.5 |
| 4 | 5.0 | 1.4 |
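Note that `df[["sepal_length"]]` (double brackets) returns a DataFrame, while `df["sepal_length"]` (single brackets) returns a Series. A minimal sketch of the difference, using a small stand-in frame with the same column names rather than the iris file:

```python
import pandas as pd

# small stand-in frame; values are illustrative
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "petal_length": [1.4, 1.5],
})

series = df["sepal_length"]    # single brackets -> Series
frame = df[["sepal_length"]]   # double brackets -> DataFrame
print(type(series).__name__)   # → Series
print(type(frame).__name__)    # → DataFrame
```

This distinction matters because many DataFrame methods (such as those used later in this lesson) are not available on a Series, and vice versa.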
## Filtering Rows
Filter rows based on a condition.
Example: select rows where `petal_length` is greater than 4.
```python
df_filtered = df[df["petal_length"] > 4]
df_filtered.head()
```
|    | sepal_length | sepal_width | petal_length | petal_width | species |
|----|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
| 55 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
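Conditions can be combined with `&` (and) and `|` (or); each condition must be wrapped in parentheses because of operator precedence. A minimal sketch with a small stand-in frame (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# combine conditions with & (and) / | (or); parentheses are required
mask = (df["petal_length"] > 4) & (df["species"] == "versicolor")
print(df[mask].shape[0])  # → 1
```

Using the Python keywords `and`/`or` here raises an error, since pandas needs the element-wise `&`/`|` operators.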
## Creating New Features
Feature creation extends the dataset with new variables.
Here, we define a new feature, petal_area.
```python
df["petal_area"] = df["petal_length"] * df["petal_width"]
df.head()
```
|   | sepal_length | sepal_width | petal_length | petal_width | species | petal_area |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.28 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.28 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.26 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.30 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.28 |
## Sorting Data
Sort the dataset to examine extreme or ordered values.
```python
df_sorted = df.sort_values(by="petal_area", ascending=False)
df_sorted.head()
```
|     | sepal_length | sepal_width | petal_length | petal_width | species | petal_area |
|-----|---|---|---|---|---|---|
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica | 15.87 |
| 109 | 7.2 | 3.6 | 6.1 | 2.5 | virginica | 15.25 |
| 100 | 6.3 | 3.3 | 6.0 | 2.5 | virginica | 15.00 |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica | 14.74 |
| 143 | 6.7 | 3.3 | 5.7 | 2.5 | virginica | 14.25 |
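When you only need the top few rows, `nlargest` is a concise alternative to sorting and then calling `head`. A minimal sketch with a stand-in frame (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"petal_area": [15.87, 0.28, 15.25, 14.25]})

# equivalent to df.sort_values("petal_area", ascending=False).head(2)
top = df.nlargest(2, "petal_area")
print(top["petal_area"].tolist())  # → [15.87, 15.25]
```

`nsmallest` works the same way for the opposite end of the ordering.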
## Grouping and Aggregation
Summarize values across groups.
Here, we compute the mean petal_area per species.
```python
grouped = (
    df.groupby("species")
    .agg(mean_petal_area=("petal_area", "mean"),
         mean_sepal_length=("sepal_length", "mean"))
    .reset_index()
)
grouped
```
|   | species | mean_petal_area | mean_sepal_length |
|---|---|---|---|
| 0 | setosa | 0.36560 | 5.006000 |
| 1 | versicolor | 5.72040 | 5.936000 |
| 2 | virginica | 11.32898 | 6.604082 |
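Besides aggregating values, it is often useful to count how many rows fall into each group; `size` handles this directly. A minimal sketch with a stand-in frame (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "petal_area": [0.28, 0.30, 15.87],
})

# one row per group, with the group's row count in column "n"
counts = df.groupby("species").size().reset_index(name="n")
print(counts["n"].tolist())  # → [2, 1]
```

Group keys are sorted alphabetically by default, so the counts appear in species order.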
## Conditional Logic
Create categorical variables based on defined rules.
Here, we assign a size category based on petal_area.
```python
# compute the median once rather than inside the lambda for every row
median_area = df["petal_area"].median()

df["size_category"] = df["petal_area"].apply(
    lambda x: "large" if x > median_area else "small"
)
df.head()
```
|   | sepal_length | sepal_width | petal_length | petal_width | species | petal_area | size_category |
|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.28 | small |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.28 | small |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.26 | small |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.30 | small |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.28 | small |
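For a simple two-way split like this, `numpy.where` is a vectorized alternative to `apply` that avoids calling a Python function per row. A minimal sketch with a stand-in frame (values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"petal_area": [0.28, 15.87, 5.72]})
median_area = df["petal_area"].median()  # 5.72 for this tiny frame

# vectorized equivalent of apply(lambda x: "large" if x > median_area else "small")
df["size_category"] = np.where(df["petal_area"] > median_area, "large", "small")
print(df["size_category"].tolist())  # → ['small', 'large', 'small']
```

For more than two categories with numeric thresholds, `pd.cut` generalizes the same idea.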
## Validation Checks
After transformation, confirm that the dataset remains consistent and complete.
```python
print("Shape:", df.shape)
print("Missing values:", int(df.isna().sum().sum()))
```
```
Shape: (149, 7)
Missing values: 0
```
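A common final reshaping step is converting the wide measurement columns into long format with `pd.melt`, which produces one row per (identifier, measurement) pair. A minimal sketch with a stand-in frame (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "versicolor"],
    "sepal_length": [5.1, 7.0],
    "petal_length": [1.4, 4.7],
})

# wide -> long: 2 rows x 2 measurement columns become 4 rows
long_df = df.melt(id_vars="species", var_name="measurement", value_name="value")
print(long_df.shape)  # → (4, 3)
```

Long format is what many plotting and modeling tools expect, which is why this reshape shows up so often in preparation workflows.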
## Summary
- You selected and filtered data
- You created derived features
- You grouped and aggregated values
- You applied conditional logic
- You reshaped data into long format
These transformations form the foundation of real-world data preparation workflows.