Data Wrangling Basics

Published

Mar 2026

ID: DS-L04
Type: Lesson
Audience: Beginner / Intermediate
Theme: Transforming and reshaping data using reproducible workflows

Data rarely arrives in the exact format required for analysis.

In this lesson, you will learn how to transform, filter, group, and reshape data using pandas.

As in previous chapters, all code runs inside this Quarto (.qmd) file.

When the book is rendered, Python code is executed and results are embedded directly into the page.

Lesson Overview

By the end of this lesson, you will be able to:

select and filter rows and columns
create derived features
sort and reorder data
group and aggregate values
apply conditional logic
prepare structured outputs for analysis

Load the Clean Dataset

We use the cleaned dataset created in Lesson 03.

Code

import pandas as pd

df = pd.read_csv("data/iris_clean.csv")
df.head()

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Selecting Columns

Select a subset of columns for focused analysis.

Code

df[["sepal_length", "petal_length"]].head()

	sepal_length	petal_length
0	5.1	1.4
1	4.9	1.4
2	4.7	1.3
3	4.6	1.5
4	5.0	1.4

Filtering Rows

Filter rows based on a condition.

Example: select rows where petal_length > 4.

Code

df_filtered = df[df["petal_length"] > 4]
df_filtered.head()

	sepal_length	sepal_width	petal_length	petal_width	species
50	7.0	3.2	4.7	1.4	versicolor
51	6.4	3.2	4.5	1.5	versicolor
52	6.9	3.1	4.9	1.5	versicolor
54	6.5	2.8	4.6	1.5	versicolor
55	5.7	2.8	4.5	1.3	versicolor

Creating New Features

Feature creation extends the dataset with new variables.

Here, we define a new feature, petal_area.

Code

df["petal_area"] = df["petal_length"] * df["petal_width"]
df.head()

	sepal_length	sepal_width	petal_length	petal_width	species	petal_area
0	5.1	3.5	1.4	0.2	setosa	0.28
1	4.9	3.0	1.4	0.2	setosa	0.28
2	4.7	3.2	1.3	0.2	setosa	0.26
3	4.6	3.1	1.5	0.2	setosa	0.30
4	5.0	3.6	1.4	0.2	setosa	0.28

Sorting Data

Sort the dataset to examine extreme or ordered values.

Code

df_sorted = df.sort_values(by="petal_area", ascending=False)
df_sorted.head()

	sepal_length	sepal_width	petal_length	petal_width	species	petal_area
118	7.7	2.6	6.9	2.3	virginica	15.87
109	7.2	3.6	6.1	2.5	virginica	15.25
100	6.3	3.3	6.0	2.5	virginica	15.00
117	7.7	3.8	6.7	2.2	virginica	14.74
143	6.7	3.3	5.7	2.5	virginica	14.25

Grouping and Aggregation

Summarize values across groups.

Here, we compute the mean petal_area per species.

Code

grouped = (
    df.groupby("species")
      .agg(mean_petal_area=("petal_area", "mean"),
           mean_sepal_length=("sepal_length", "mean"))
      .reset_index()
)

grouped

	species	mean_petal_area	mean_sepal_length
0	setosa	0.36560	5.006000
1	versicolor	5.72040	5.936000
2	virginica	11.32898	6.604082

Conditional Logic

Create categorical variables based on defined rules.

Here, we assign a size category based on petal_area.

Code

df["size_category"] = df["petal_area"].apply(
    lambda x: "large" if x > df["petal_area"].median() else "small"
)

df.head()

	sepal_length	sepal_width	petal_length	petal_width	species	petal_area	size_category
0	5.1	3.5	1.4	0.2	setosa	0.28	small
1	4.9	3.0	1.4	0.2	setosa	0.28	small
2	4.7	3.2	1.3	0.2	setosa	0.26	small
3	4.6	3.1	1.5	0.2	setosa	0.30	small
4	5.0	3.6	1.4	0.2	setosa	0.28	small

Reshaping Example (Long Format)

Reshape the dataset into long format for flexible analysis and visualization.

Code

df_long = df.melt(
    id_vars="species",
    value_vars=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    var_name="measurement",
    value_name="value"
)

df_long.head()

	species	measurement	value
0	setosa	sepal_length	5.1
1	setosa	sepal_length	4.9
2	setosa	sepal_length	4.7
3	setosa	sepal_length	4.6
4	setosa	sepal_length	5.0

Validation Checks

After transformation, confirm that the dataset remains consistent and complete.

Code

print("Shape:", df.shape)
print("Missing values:", int(df.isna().sum().sum()))

Shape: (149, 7)
Missing values: 0

Summary

You selected and filtered data
You created derived features
You grouped and aggregated values
You applied conditional logic
You reshaped data into long format

These transformations form the foundation of real-world data preparation workflows.