Data Wrangling Basics

  • ID: DS-L04
  • Type: Lesson
  • Audience: Public
  • Theme: Transforming and reshaping data using reproducible workflows

Data rarely arrives in the exact format you need for analysis.

In this lesson, you will learn how to transform, filter, group, and reshape data using pandas.
As in previous chapters, all code runs inside this Quarto (.qmd) file and executes when you render the book.


Lesson Overview

By the end of this lesson, you will be able to:

  • Select and filter rows and columns
  • Create new derived features
  • Sort and reorder data
  • Group and aggregate values
  • Apply conditional logic
  • Prepare structured outputs for analysis

Chapter Initialization

from cdi_viz.theme import cdi_notebook_init

cdi_notebook_init(chapter="04", title_x=0.5)

Load the Clean Dataset

We use the cleaned dataset created in Lesson 03.

import pandas as pd

df = pd.read_csv("data/iris_clean.csv")
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Selecting Columns

df[["sepal_length", "petal_length"]].head()
sepal_length petal_length
0 5.1 1.4
1 4.9 1.4
2 4.7 1.3
3 4.6 1.5
4 5.0 1.4

Filtering Rows

Example: filter rows where petal_length > 4.

df_filtered = df[df["petal_length"] > 4]
df_filtered.head()
sepal_length sepal_width petal_length petal_width species
50 7.0 3.2 4.7 1.4 versicolor
51 6.4 3.2 4.5 1.5 versicolor
52 6.9 3.1 4.9 1.5 versicolor
54 6.5 2.8 4.6 1.5 versicolor
55 5.7 2.8 4.5 1.3 versicolor

Creating New Features

We create a new feature: petal_area.

df["petal_area"] = df["petal_length"] * df["petal_width"]
df.head()
sepal_length sepal_width petal_length petal_width species petal_area
0 5.1 3.5 1.4 0.2 setosa 0.28
1 4.9 3.0 1.4 0.2 setosa 0.28
2 4.7 3.2 1.3 0.2 setosa 0.26
3 4.6 3.1 1.5 0.2 setosa 0.30
4 5.0 3.6 1.4 0.2 setosa 0.28

Sorting Data

df_sorted = df.sort_values(by="petal_area", ascending=False)
df_sorted.head()
sepal_length sepal_width petal_length petal_width species petal_area
118 7.7 2.6 6.9 2.3 virginica 15.87
109 7.2 3.6 6.1 2.5 virginica 15.25
100 6.3 3.3 6.0 2.5 virginica 15.00
117 7.7 3.8 6.7 2.2 virginica 14.74
143 6.7 3.3 5.7 2.5 virginica 14.25

Grouping and Aggregation

Compute the mean petal_area per species.

grouped = (
    df.groupby("species")
      .agg(mean_petal_area=("petal_area", "mean"),
           mean_sepal_length=("sepal_length", "mean"))
      .reset_index()
)

grouped
species mean_petal_area mean_sepal_length
0 setosa 0.36560 5.006000
1 versicolor 5.72040 5.936000
2 virginica 11.32898 6.604082

Conditional Logic

Create a simple size category based on petal_area.

df["size_category"] = df["petal_area"].apply(
    lambda x: "large" if x > df["petal_area"].median() else "small"
)

df.head()
sepal_length sepal_width petal_length petal_width species petal_area size_category
0 5.1 3.5 1.4 0.2 setosa 0.28 small
1 4.9 3.0 1.4 0.2 setosa 0.28 small
2 4.7 3.2 1.3 0.2 setosa 0.26 small
3 4.6 3.1 1.5 0.2 setosa 0.30 small
4 5.0 3.6 1.4 0.2 setosa 0.28 small

Reshaping Example (Long Format)

Convert selected columns to long format.

df_long = df.melt(
    id_vars="species",
    value_vars=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    var_name="measurement",
    value_name="value"
)

df_long.head()
species measurement value
0 setosa sepal_length 5.1
1 setosa sepal_length 4.9
2 setosa sepal_length 4.7
3 setosa sepal_length 4.6
4 setosa sepal_length 5.0

Validation Checks

print("Shape:", df.shape)
print("Missing values:", int(df.isna().sum().sum()))
Shape: (149, 7)
Missing values: 0

Summary

  • You selected and filtered data
  • You created derived features
  • You grouped and aggregated values
  • You applied conditional logic
  • You reshaped data into long format

These transformations form the foundation of real-world data preparation workflows.