Data Wrangling Basics
Data rarely arrives in the exact format required for analysis.
After a dataset has been cleaned, the next step is often to reshape, filter, summarize, and create analysis-ready outputs.
In CDI systems, wrangling is the bridge between:
cleaned table
↓
analysis-ready tables
↓
visualization, summaries, modeling, and reporting
This lesson continues from Chapter 03, where we created:
data/iris_clean.csv
In this chapter, we transform that cleaned table and create reusable downstream outputs.
Lesson overview
By the end of this lesson, you will be able to:
- select and filter rows and columns
- create derived features
- sort and reorder data
- group and aggregate values
- apply conditional logic
- reshape data into long format
- save structured outputs for downstream analysis
- run a reusable wrangling script from the command line
Chapter workflow
This chapter introduces the third reusable Python script in the system:
04-data-wrangling-basics.qmd
↓
scripts/python/wrangle_example_data.py
↓
data/iris_wrangled.csv
results/wrangling/species-summary.tsv
results/wrangling/iris-long.tsv
results/wrangling/wrangling-report.txt
The wrangled outputs support the next chapters on visualization and summary statistics.
Load the cleaned dataset
We use the cleaned dataset created in Chapter 03.
import pandas as pd
df = pd.read_csv("data/iris_clean.csv")
df.head()Select columns
Selecting columns helps focus the analysis on variables of interest.
df[["sepal_length", "petal_length"]].head()Column selection is often used to simplify a table before plotting, summarizing, or exporting results.
Filter rows
Filtering keeps rows that meet a condition.
Example: select rows where petal_length is greater than 4.
df_filtered = df[df["petal_length"] > 4]
df_filtered.head()Filtering should always be tied to an analysis question or quality-control rule.
Create derived features
Feature creation extends the dataset with new variables.
Here, we define a new feature, petal_area.
df["petal_area"] = df["petal_length"] * df["petal_width"]
df.head()A derived feature should be easy to explain and reproducible from existing columns.
Sort data
Sorting helps examine extreme or ordered values.
df_sorted = df.sort_values(by="petal_area", ascending=False)
df_sorted.head()This makes it easier to inspect the largest values, smallest values, or top-ranked observations.
Group and aggregate values
Grouping summarizes values across categories.
Here, we compute summary measurements by species.
grouped = (
df.groupby("species", observed=False)
.agg(
n=("species", "size"),
mean_petal_area=("petal_area", "mean"),
median_petal_area=("petal_area", "median"),
mean_sepal_length=("sepal_length", "mean"),
mean_petal_length=("petal_length", "mean")
)
.reset_index()
)
groupedThis type of grouped table is one of the most common outputs in tidy-table analysis.
Apply conditional logic
Conditional logic creates categories based on defined rules.
Here, we assign a size category based on whether petal_area is above or below the median.
median_petal_area = df["petal_area"].median()
df["size_category"] = df["petal_area"].apply(
lambda x: "large" if x > median_petal_area else "small"
)
df.head()The rule should be stated clearly so another analyst can reproduce or challenge it.
Reshape into long format
Long format is especially useful for visualization and grouped summaries.
df_long = df.melt(
id_vars=["species", "size_category"],
value_vars=["sepal_length", "sepal_width", "petal_length", "petal_width"],
var_name="measurement",
value_name="value"
)
df_long.head()Long format is useful because many plotting and analysis workflows expect:
one row per observation-measurement pair
Validation checks
After transformation, confirm that the dataset remains consistent and complete.
print("Shape:", df.shape)
print("Missing values:", int(df.isna().sum().sum()))
print("Columns:", df.columns.tolist())
assert int(df.isna().sum().sum()) == 0, "Missing values detected after wrangling."
assert "petal_area" in df.columns, "petal_area was not created."
assert "size_category" in df.columns, "size_category was not created."Validation checks help prevent downstream chapters from using incomplete or incorrectly transformed data.
Save wrangled outputs
Save each output with a clear purpose.
from pathlib import Path
Path("data").mkdir(exist_ok=True)
Path("results/wrangling").mkdir(parents=True, exist_ok=True)
df.to_csv("data/iris_wrangled.csv", index=False)
grouped.to_csv("results/wrangling/species-summary.tsv", sep="\t", index=False)
df_long.to_csv("results/wrangling/iris-long.tsv", sep="\t", index=False)
print("Saved wrangled table to: data/iris_wrangled.csv")
print("Saved species summary to: results/wrangling/species-summary.tsv")
print("Saved long-format table to: results/wrangling/iris-long.tsv")These files make the workflow progressive and testable.
Run the reusable wrangling script
The manual steps above explain the logic. The reusable script applies the wrangling workflow from the command line.
Run this from the project root:
python scripts/python/wrangle_example_data.py data/iris_clean.csv data/iris_wrangled.csv results/wranglingExpected outputs:
data/iris_wrangled.csv
results/wrangling/species-summary.tsv
results/wrangling/iris-long.tsv
results/wrangling/wrangling-report.txt
What the wrangling script does
The script:
- reads the cleaned input table
- creates
petal_area - creates
size_category - sorts the table by
petal_area - creates a species-level summary table
- reshapes measurements into long format
- validates the transformed table
- saves downstream outputs
- writes a wrangling report
Exercise
Try the following:
- Open
results/wrangling/species-summary.tsv. - Identify the species with the largest mean
petal_area. - Open
results/wrangling/iris-long.tsv. - Confirm that the long-format table has one row per species-measurement value.
- Rerun the Chapter 02 inspection script on the wrangled table.
Use:
python scripts/python/inspect_table.py data/iris_wrangled.csv results/inspection-wrangledInspect the wrangled table with:
python scripts/python/inspect_table.py data/iris_wrangled.csv results/inspection-wrangledExpected outputs:
results/inspection-wrangled/
├── table-inspection-summary.txt
├── table-column-summary.tsv
└── table-missing-values.tsv
The species summary table should show that petal-based measurements differ strongly by species.
CDI Insight
Wrangling turns a cleaned table into analysis-ready outputs.
This is where data begins to serve specific analytical purposes.
A good wrangling step should produce outputs that are:
- clear
- documented
- reusable
- easy to validate
- ready for downstream visualization and interpretation
In CDI systems, wrangling is not hidden inside a notebook. It is captured as a repeatable script.
Summary
In this lesson, you:
- selected and filtered data
- created
petal_area - sorted rows by a derived feature
- grouped and summarized values by species
- created
size_category - reshaped the table into long format
- saved wrangled outputs
- created a wrangling report with
wrangle_example_data.py
Looking Ahead
In the next chapter, we use the wrangled outputs to create exploratory visualizations. The files produced here become inputs for plotting.