Data Wrangling Basics

Published

Mar 2026

  • ID: DS-L04
  • Type: Lesson
  • Audience: Beginner / Intermediate
  • Theme: Transforming and reshaping data using reproducible workflows

Data rarely arrives in the exact format required for analysis.

In this lesson, you will learn how to transform, filter, group, and reshape data using pandas.

As in previous chapters, all code runs inside this Quarto (.qmd) file.

When the book is rendered, Python code is executed and results are embedded directly into the page.


Lesson Overview

By the end of this lesson, you will be able to:

  • select and filter rows and columns
  • create derived features
  • sort and reorder data
  • group and aggregate values
  • apply conditional logic
  • prepare structured outputs for analysis

Load the Clean Dataset

We use the cleaned dataset created in Lesson 03.

Code
import pandas as pd

df = pd.read_csv("data/iris_clean.csv")
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Selecting Columns

Select a subset of columns for focused analysis.

Code
df[["sepal_length", "petal_length"]].head()
sepal_length petal_length
0 5.1 1.4
1 4.9 1.4
2 4.7 1.3
3 4.6 1.5
4 5.0 1.4

Filtering Rows

Filter rows based on a condition.

Example: select rows where petal_length > 4.

Code
df_filtered = df[df["petal_length"] > 4]
df_filtered.head()
sepal_length sepal_width petal_length petal_width species
50 7.0 3.2 4.7 1.4 versicolor
51 6.4 3.2 4.5 1.5 versicolor
52 6.9 3.1 4.9 1.5 versicolor
54 6.5 2.8 4.6 1.5 versicolor
55 5.7 2.8 4.5 1.3 versicolor

Creating New Features

Feature creation extends the dataset with new variables.

Here, we define a new feature, petal_area.

Code
df["petal_area"] = df["petal_length"] * df["petal_width"]
df.head()
sepal_length sepal_width petal_length petal_width species petal_area
0 5.1 3.5 1.4 0.2 setosa 0.28
1 4.9 3.0 1.4 0.2 setosa 0.28
2 4.7 3.2 1.3 0.2 setosa 0.26
3 4.6 3.1 1.5 0.2 setosa 0.30
4 5.0 3.6 1.4 0.2 setosa 0.28

Sorting Data

Sort the dataset to examine extreme or ordered values.

Code
df_sorted = df.sort_values(by="petal_area", ascending=False)
df_sorted.head()
sepal_length sepal_width petal_length petal_width species petal_area
118 7.7 2.6 6.9 2.3 virginica 15.87
109 7.2 3.6 6.1 2.5 virginica 15.25
100 6.3 3.3 6.0 2.5 virginica 15.00
117 7.7 3.8 6.7 2.2 virginica 14.74
143 6.7 3.3 5.7 2.5 virginica 14.25

Grouping and Aggregation

Summarize values across groups.

Here, we compute the mean petal_area per species.

Code
grouped = (
    df.groupby("species")
      .agg(mean_petal_area=("petal_area", "mean"),
           mean_sepal_length=("sepal_length", "mean"))
      .reset_index()
)

grouped
species mean_petal_area mean_sepal_length
0 setosa 0.36560 5.006000
1 versicolor 5.72040 5.936000
2 virginica 11.32898 6.604082

Conditional Logic

Create categorical variables based on defined rules.

Here, we assign a size category based on petal_area.

Code
df["size_category"] = df["petal_area"].apply(
    lambda x: "large" if x > df["petal_area"].median() else "small"
)

df.head()
sepal_length sepal_width petal_length petal_width species petal_area size_category
0 5.1 3.5 1.4 0.2 setosa 0.28 small
1 4.9 3.0 1.4 0.2 setosa 0.28 small
2 4.7 3.2 1.3 0.2 setosa 0.26 small
3 4.6 3.1 1.5 0.2 setosa 0.30 small
4 5.0 3.6 1.4 0.2 setosa 0.28 small

Reshaping Example (Long Format)

Reshape the dataset into long format for flexible analysis and visualization.

Code
df_long = df.melt(
    id_vars="species",
    value_vars=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    var_name="measurement",
    value_name="value"
)

df_long.head()
species measurement value
0 setosa sepal_length 5.1
1 setosa sepal_length 4.9
2 setosa sepal_length 4.7
3 setosa sepal_length 4.6
4 setosa sepal_length 5.0

Validation Checks

After transformation, confirm that the dataset remains consistent and complete.

Code
print("Shape:", df.shape)
print("Missing values:", int(df.isna().sum().sum()))
Shape: (149, 7)
Missing values: 0

Summary

  • You selected and filtered data
  • You created derived features
  • You grouped and aggregated values
  • You applied conditional logic
  • You reshaped data into long format

These transformations form the foundation of real-world data preparation workflows.