Data Wrangling Basics

Published

Jun 2026

  • ID: DS-L04
  • Type: Lesson
  • Audience: Beginner / Intermediate
  • Theme: Transforming and reshaping data using reproducible workflows

Data rarely arrives in the exact format required for analysis.

After a dataset has been cleaned, the next step is often to reshape, filter, summarize, and create analysis-ready outputs.

In CDI systems, wrangling is the bridge between:

cleaned table
        ↓
analysis-ready tables
        ↓
visualization, summaries, modeling, and reporting

This lesson continues from Chapter 03, where we created:

data/iris_clean.csv

In this chapter, we transform that cleaned table and create reusable downstream outputs.


Lesson overview

By the end of this lesson, you will be able to:

  • select and filter rows and columns
  • create derived features
  • sort and reorder data
  • group and aggregate values
  • apply conditional logic
  • reshape data into long format
  • save structured outputs for downstream analysis
  • run a reusable wrangling script from the command line

Chapter workflow

This chapter introduces the third reusable Python script in the system:

04-data-wrangling-basics.qmd
        ↓
scripts/python/wrangle_example_data.py
        ↓
data/iris_wrangled.csv
results/wrangling/species-summary.tsv
results/wrangling/iris-long.tsv
results/wrangling/wrangling-report.txt

The wrangled outputs support the next chapters on visualization and summary statistics.


Load the cleaned dataset

We use the cleaned dataset created in Chapter 03.

import pandas as pd

df = pd.read_csv("data/iris_clean.csv")
df.head()

Select columns

Selecting columns helps focus the analysis on variables of interest.

df[["sepal_length", "petal_length"]].head()

Column selection is often used to simplify a table before plotting, summarizing, or exporting results.


Filter rows

Filtering keeps rows that meet a condition.

Example: select rows where petal_length is greater than 4.

df_filtered = df[df["petal_length"] > 4]
df_filtered.head()

Filtering should always be tied to an analysis question or quality-control rule.


Create derived features

Feature creation extends the dataset with new variables.

Here, we define a new feature, petal_area.

df["petal_area"] = df["petal_length"] * df["petal_width"]
df.head()

A derived feature should be easy to explain and reproducible from existing columns.


Sort data

Sorting helps examine extreme or ordered values.

df_sorted = df.sort_values(by="petal_area", ascending=False)
df_sorted.head()

This makes it easier to inspect the largest values, smallest values, or top-ranked observations.


Group and aggregate values

Grouping summarizes values across categories.

Here, we compute summary measurements by species.

grouped = (
    df.groupby("species", observed=False)
      .agg(
          n=("species", "size"),
          mean_petal_area=("petal_area", "mean"),
          median_petal_area=("petal_area", "median"),
          mean_sepal_length=("sepal_length", "mean"),
          mean_petal_length=("petal_length", "mean")
      )
      .reset_index()
)

grouped

This type of grouped table is one of the most common outputs in tidy-table analysis.


Apply conditional logic

Conditional logic creates categories based on defined rules.

Here, we assign a size category based on whether petal_area is above or below the median.

median_petal_area = df["petal_area"].median()

df["size_category"] = df["petal_area"].apply(
    lambda x: "large" if x > median_petal_area else "small"
)

df.head()

The rule should be stated clearly so another analyst can reproduce or challenge it.


Reshape into long format

Long format is especially useful for visualization and grouped summaries.

df_long = df.melt(
    id_vars=["species", "size_category"],
    value_vars=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    var_name="measurement",
    value_name="value"
)

df_long.head()

Long format is useful because many plotting and analysis workflows expect:

one row per observation-measurement pair

Validation checks

After transformation, confirm that the dataset remains consistent and complete.

print("Shape:", df.shape)
print("Missing values:", int(df.isna().sum().sum()))
print("Columns:", df.columns.tolist())

assert int(df.isna().sum().sum()) == 0, "Missing values detected after wrangling."
assert "petal_area" in df.columns, "petal_area was not created."
assert "size_category" in df.columns, "size_category was not created."

Validation checks help prevent downstream chapters from using incomplete or incorrectly transformed data.


Save wrangled outputs

Save each output with a clear purpose.

from pathlib import Path

Path("data").mkdir(exist_ok=True)
Path("results/wrangling").mkdir(parents=True, exist_ok=True)

df.to_csv("data/iris_wrangled.csv", index=False)
grouped.to_csv("results/wrangling/species-summary.tsv", sep="\t", index=False)
df_long.to_csv("results/wrangling/iris-long.tsv", sep="\t", index=False)

print("Saved wrangled table to: data/iris_wrangled.csv")
print("Saved species summary to: results/wrangling/species-summary.tsv")
print("Saved long-format table to: results/wrangling/iris-long.tsv")

These files make the workflow progressive and testable.


Run the reusable wrangling script

The manual steps above explain the logic. The reusable script applies the wrangling workflow from the command line.

Run this from the project root:

python scripts/python/wrangle_example_data.py data/iris_clean.csv data/iris_wrangled.csv results/wrangling

Expected outputs:

data/iris_wrangled.csv
results/wrangling/species-summary.tsv
results/wrangling/iris-long.tsv
results/wrangling/wrangling-report.txt

What the wrangling script does

The script:

  • reads the cleaned input table
  • creates petal_area
  • creates size_category
  • sorts the table by petal_area
  • creates a species-level summary table
  • reshapes measurements into long format
  • validates the transformed table
  • saves downstream outputs
  • writes a wrangling report

Exercise

Try the following:

  1. Open results/wrangling/species-summary.tsv.
  2. Identify the species with the largest mean petal_area.
  3. Open results/wrangling/iris-long.tsv.
  4. Confirm that the long-format table has one row per species-measurement value.
  5. Rerun the Chapter 02 inspection script on the wrangled table.

Use:

python scripts/python/inspect_table.py data/iris_wrangled.csv results/inspection-wrangled

Inspect the wrangled table with:

python scripts/python/inspect_table.py data/iris_wrangled.csv results/inspection-wrangled

Expected outputs:

results/inspection-wrangled/
├── table-inspection-summary.txt
├── table-column-summary.tsv
└── table-missing-values.tsv

The species summary table should show that petal-based measurements differ strongly by species.


CDI Insight

Wrangling turns a cleaned table into analysis-ready outputs.

This is where data begins to serve specific analytical purposes.

A good wrangling step should produce outputs that are:

  • clear
  • documented
  • reusable
  • easy to validate
  • ready for downstream visualization and interpretation

In CDI systems, wrangling is not hidden inside a notebook. It is captured as a repeatable script.


Summary

In this lesson, you:

  • selected and filtered data
  • created petal_area
  • sorted rows by a derived feature
  • grouped and summarized values by species
  • created size_category
  • reshaped the table into long format
  • saved wrangled outputs
  • created a wrangling report with wrangle_example_data.py

Looking Ahead

In the next chapter, we use the wrangled outputs to create exploratory visualizations. The files produced here become inputs for plotting.