Data Cleaning and Preparation
Data cleaning is not just a technical step.
It is a decision-making process.
Before a dataset is used for summaries, visualizations, modeling, or reporting, it should be checked for common quality issues and prepared in a reproducible way.
In CDI systems, cleaning should be:
- measured before modification
- minimal and justified
- reproducible
- validated with explicit checks
- saved as a new output rather than silently overwriting the original table
This lesson continues from Chapter 02, where we created and inspected:
data/iris.csv
In this chapter, we clean that dataset and save a downstream-ready version:
data/iris_clean.csv
Lesson overview
By the end of this lesson, you will be able to:
- inspect a dataset for common quality issues
- detect and handle duplicate rows
- standardize column names
- handle missing values using clear rules
- correct inconsistent data types
- validate and save a cleaned dataset
- run a reusable cleaning script from the command line
Chapter workflow
This chapter introduces the second reusable Python script in the system:
03-data-cleaning-and-preparation.qmd
↓
scripts/python/clean_example_data.py
↓
data/iris_clean.csv
results/cleaning/cleaning-report.txt
The cleaned dataset becomes the input for downstream wrangling, visualization, and summary statistics.
Cleaning philosophy
A responsible cleaning workflow follows a simple rule:
inspect first
clean second
validate third
save last
Never modify data without first understanding what is being changed.
Always validate the results of cleaning steps with explicit checks.
Load the dataset
Start from the table created in Chapter 02.
import pandas as pd
df = pd.read_csv("data/iris.csv")
df.head()Initial inspection
Check the table structure before making changes.
print("Shape:", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)Then check summary statistics and missing values.
print("Summary statistics:")
print(df.describe(include="all"))
print("\nMissing values per column:")
print(df.isna().sum())Detect column types
Separating numeric and categorical columns helps us apply appropriate cleaning rules.
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
print("Numeric columns:", num_cols)
print("Categorical columns:", cat_cols)For this dataset, the measurement columns are numeric and species is categorical.
Detect and remove duplicates
Duplicate rows can distort counts, summaries, visualizations, and model training.
dup_count = int(df.duplicated().sum())
print("Duplicate rows found:", dup_count)
if dup_count > 0:
print("\nInspect duplicates carefully before removal.")
print(df[df.duplicated()].head())If duplicates exist, remove them only after inspection.
if dup_count > 0:
df = df.drop_duplicates().reset_index(drop=True)
print("Shape after duplicate handling:", df.shape)For a real project, duplicate handling should always be explained in the analysis notes.
Column name hygiene
Clean column names make downstream code easier to read and less error-prone.
df.columns = [
str(c)
.strip()
.lower()
.replace(" ", "_")
.replace("(", "")
.replace(")", "")
for c in df.columns
]
print("Updated columns:", df.columns.tolist())This keeps column names consistent with the tidy-table style used across CDI systems.
Handle missing values
First, measure missingness.
print(df.isna().sum())Then apply a clear rule.
For this lesson, we use a simple starter pattern:
- numeric columns: fill missing values with the median
- categorical columns: fill missing values with the most frequent value
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
for c in cat_cols:
if df[c].isna().any():
df[c] = df[c].fillna(df[c].mode().iloc[0])Fix data types
After cleaning, confirm that columns have appropriate types.
for c in num_cols:
df[c] = pd.to_numeric(df[c], errors="coerce")
if "species" in df.columns:
df["species"] = df["species"].astype("category")
print(df.dtypes)This step protects later analysis from hidden type problems, such as numeric values stored as text.
Validation checks
Validation confirms that cleaning produced the expected result.
print("Final shape:", df.shape)
print("Total missing values:", int(df.isna().sum().sum()))
print("Duplicate rows:", int(df.duplicated().sum()))
assert int(df.isna().sum().sum()) == 0, "Missing values remain."
assert int(df.duplicated().sum()) == 0, "Duplicates remain."Assertions are useful because they stop the workflow when expected conditions are not met.
Save the cleaned dataset
Save the cleaned table as a new file.
from pathlib import Path
Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_clean.csv", index=False)
print("Saved cleaned dataset to: data/iris_clean.csv")The original input remains available as:
data/iris.csv
The cleaned downstream version is:
data/iris_clean.csv
Run the reusable cleaning script
The manual steps above explain the logic. The reusable script applies the same cleaning pattern from the command line.
Run this from the project root:
python scripts/python/clean_example_data.py data/iris.csv data/iris_clean.csv results/cleaningExpected output:
data/iris_clean.csv
results/cleaning/cleaning-report.txt
This makes the cleaning step repeatable and easy to test as the system grows.
What the cleaning script does
The script:
- reads the input table
- standardizes column names
- detects duplicate rows
- removes duplicate rows
- fills missing numeric values with medians
- fills missing categorical values with modes
- converts numeric-like columns where appropriate
- converts
speciesto a categorical-style column - validates that no missing values or duplicates remain
- writes a cleaned table
- writes a plain-text cleaning report
Exercise
Try the following:
- Open
results/cleaning/cleaning-report.txt. - Confirm the input and output row counts.
- Confirm the number of missing values after cleaning.
- Rerun the Chapter 02 inspection script on the cleaned table.
Use:
python scripts/python/inspect_table.py data/iris_clean.csv results/inspection-cleanedThen compare:
results/inspection/
results/inspection-cleaned/
A cleaned table can be inspected using the same reusable inspection script:
python scripts/python/inspect_table.py data/iris_clean.csv results/inspection-cleanedExpected outputs:
results/inspection-cleaned/
├── table-inspection-summary.txt
├── table-column-summary.tsv
└── table-missing-values.tsv
The key idea is that cleaning does not end with saving a file.
It should be followed by inspection and validation.
CDI Insight
Cleaning is not about making data look perfect.
It is about making preparation decisions explicit, reproducible, and testable.
A cleaned dataset should be easier to analyze, but it should also preserve the logic of how it was produced.
That is why CDI systems keep both:
input table
cleaning script
cleaned output table
cleaning report
Together, these make the workflow easier to trust.
Summary
In this lesson, you:
- inspected dataset quality
- checked missing values and duplicates
- standardized column names
- applied simple missing value handling rules
- fixed data types
- validated the cleaned table
- saved
data/iris_clean.csv - created a cleaning report with
clean_example_data.py
Looking Ahead
In the next chapter, we transform and organize the cleaned dataset for analysis. The cleaned table produced here becomes the input for wrangling.