Data Cleaning and Preparation

  • ID: DS-L03
  • Type: Lesson
  • Audience: Public
  • Theme: Cleaning discipline, validation checks, and reproducible preparation

In this lesson, you will practice essential data cleaning steps used in real data science workflows.

As in previous chapters, all code runs inside this Quarto (.qmd) file. When you render the book, Quarto executes the Python chunks and embeds results directly into the page.


Lesson Overview

By the end of this lesson, you will be able to:

  • Inspect a dataset for common quality issues
  • Detect and handle duplicate rows
  • Standardize column names
  • Handle missing values using clear rules
  • Fix inconsistent data types
  • Validate and save a cleaned dataset

Cleaning Philosophy

Data cleaning should be:

  • Measured before modification
  • Minimal and justified
  • Reproducible
  • Verified with explicit checks

Never modify data without first understanding what you are changing.
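The measure–modify–verify pattern above can be sketched in a few lines. This is an illustrative example on a small hypothetical frame, not part of the lesson dataset:

```python
import pandas as pd

# Hypothetical toy frame for illustration only
df = pd.DataFrame({"x": [1.0, 1.0, 3.0], "y": ["a", "a", "b"]})

# 1. Measure before modification
n_before = len(df)
n_dup = int(df.duplicated().sum())
print(f"rows={n_before}, duplicates={n_dup}")

# 2. Minimal, justified change: drop duplicates only if any exist
if n_dup > 0:
    df = df.drop_duplicates().reset_index(drop=True)

# 3. Verify with explicit checks
assert len(df) == n_before - n_dup, "Row count does not match measurement."
assert int(df.duplicated().sum()) == 0, "Duplicates remain."
```

Because each step records a number before acting and asserts on it afterward, the same script run twice on the same input produces the same result, which is what "reproducible" means in practice.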


Chapter Initialization

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

cdi_notebook_init(chapter="03", title_x=0.5)

Load the Dataset

import pandas as pd

df = pd.read_csv("data/iris.csv")
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Initial Inspection

print("Shape:", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
Shape: (150, 5)

Column names: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Data types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
print("Summary statistics:")
print(df.describe(include="all"))

print("\nMissing values per column:")
print(df.isna().sum())
Summary statistics:
        sepal_length  sepal_width  petal_length  petal_width species
count     150.000000   150.000000    150.000000   150.000000     150
unique           NaN          NaN           NaN          NaN       3
top              NaN          NaN           NaN          NaN  setosa
freq             NaN          NaN           NaN          NaN      50
mean        5.843333     3.057333      3.758000     1.199333     NaN
std         0.828066     0.435866      1.765298     0.762238     NaN
min         4.300000     2.000000      1.000000     0.100000     NaN
25%         5.100000     2.800000      1.600000     0.300000     NaN
50%         5.800000     3.000000      4.350000     1.300000     NaN
75%         6.400000     3.300000      5.100000     1.800000     NaN
max         7.900000     4.400000      6.900000     2.500000     NaN

Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

Detect Column Types

num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric columns:", num_cols)
print("Categorical columns:", cat_cols)
Numeric columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
Categorical columns: ['species']

Detect and Remove Duplicates

dup_count = int(df.duplicated().sum())
print("Duplicate rows found:", dup_count)

if dup_count > 0:
    print("\nInspect duplicates carefully before removal.")
    print(df[df.duplicated()].head())
Duplicate rows found: 1

Inspect duplicates carefully before removal.
     sepal_length  sepal_width  petal_length  petal_width    species
142           5.8          2.7           5.1          1.9  virginica
if dup_count > 0:
    df = df.drop_duplicates().reset_index(drop=True)

print("Shape after duplicate handling:", df.shape)
Shape after duplicate handling: (149, 5)

Column Name Hygiene

df.columns = [
    str(c).strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
    for c in df.columns
]

print("Updated columns:", df.columns.tolist())
Updated columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
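The iris columns were already clean, so the rename above changed nothing. To see what the rule actually does, here is the same expression applied to some hypothetical messy names (these names are invented for illustration):

```python
# Hypothetical messy column names, invented to demonstrate the hygiene rule
messy = ["Sepal Length (cm)", " Petal Width ", "SPECIES"]

cleaned = [
    str(c).strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
    for c in messy
]
print(cleaned)  # ['sepal_length_cm', 'petal_width', 'species']
```

Note the order matters: stripping whitespace happens before replacing spaces with underscores, so leading and trailing spaces do not become stray underscores.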

Handle Missing Values

print(df.isna().sum())
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

df[num_cols] = df[num_cols].fillna(df[num_cols].median())

for c in cat_cols:
    if df[c].isna().any():
        df[c] = df[c].fillna(df[c].mode().iloc[0])

Fix Data Types

for c in num_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

if "species" in df.columns:
    df["species"] = df["species"].astype("category")

print(df.dtypes)
sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object
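The `errors="coerce"` argument is what makes `pd.to_numeric` safe on dirty input: unparseable entries become `NaN` instead of raising an exception. The iris columns were already numeric, so here is a small illustration on hypothetical dirty values:

```python
import pandas as pd

# Hypothetical dirty column: two valid numbers, two unparseable entries
s = pd.Series(["5.1", "4.9", "n/a", ""])

# "coerce" turns values that cannot be parsed into NaN
print(pd.to_numeric(s, errors="coerce"))
```

Any `NaN` values produced this way would then be caught and filled by the missing-value rules from the previous section, which is why the coercion runs before the final validation checks.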

Validation Checks

print("Final shape:", df.shape)
print("Total missing values:", int(df.isna().sum().sum()))
print("Duplicate rows:", int(df.duplicated().sum()))

assert int(df.isna().sum().sum()) == 0, "Missing values remain."
assert int(df.duplicated().sum()) == 0, "Duplicates remain."
Final shape: (149, 5)
Total missing values: 0
Duplicate rows: 0

Save Clean Dataset

from pathlib import Path

Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_clean.csv", index=False)

print("Saved cleaned dataset to: data/iris_clean.csv")
Saved cleaned dataset to: data/iris_clean.csv

Summary

  • You inspected dataset quality
  • You applied disciplined cleaning steps
  • You validated the dataset explicitly
  • You saved a reproducible cleaned version for later lessons