Data Cleaning and Preparation

Published

Mar 2026

  • ID: DS-L03
  • Type: Lesson
  • Audience: Beginner / Intermediate
  • Theme: Cleaning discipline, validation checks, and reproducible preparation

In this lesson, you will apply essential data cleaning steps used in real analytical workflows.

As in previous chapters, all code runs inside this Quarto (.qmd) file.

When the book is rendered, Python code is executed and results are embedded directly into the page.


Lesson Overview

By the end of this lesson, you will be able to:

  • inspect a dataset for common quality issues
  • detect and handle duplicate rows
  • standardize column names
  • handle missing values using clear rules
  • correct inconsistent data types
  • validate and save a cleaned dataset

Cleaning Philosophy

Data cleaning should be:

  • measured before modification
  • minimal and justified
  • reproducible
  • verified with explicit checks

Never modify data without first understanding what is being changed.

Always validate the results of cleaning steps with explicit checks.
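One way to make "measured before modification" concrete is a small wrapper that reports the shape change around each step. This is a sketch; `checked_step` is an illustrative helper name, not part of pandas:

```python
import pandas as pd

def checked_step(df, fn, label):
    """Apply one cleaning function and report the shape change.

    `checked_step` is an illustrative helper, not a pandas API:
    the point is to measure before and after every modification.
    """
    before = df.shape
    out = fn(df)
    print(f"{label}: {before} -> {out.shape}")
    return out

toy = pd.DataFrame({"x": [1, 1, 2]})
toy = checked_step(toy, lambda d: d.drop_duplicates(), "drop duplicates")
```

Wrapping each step this way forces every modification to leave an auditable trace in the rendered output.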


Load the Dataset

Code
import pandas as pd

df = pd.read_csv("data/iris.csv")
df.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Initial Inspection

Code
print("Shape:", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
Shape: (150, 5)

Column names: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Data types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species             str
dtype: object
Code
print("Summary statistics:")
print(df.describe(include="all"))

print("\nMissing values per column:")
print(df.isna().sum())
Summary statistics:
        sepal_length  sepal_width  petal_length  petal_width species
count     150.000000   150.000000    150.000000   150.000000     150
unique           NaN          NaN           NaN          NaN       3
top              NaN          NaN           NaN          NaN  setosa
freq             NaN          NaN           NaN          NaN      50
mean        5.843333     3.057333      3.758000     1.199333     NaN
std         0.828066     0.435866      1.765298     0.762238     NaN
min         4.300000     2.000000      1.000000     0.100000     NaN
25%         5.100000     2.800000      1.600000     0.300000     NaN
50%         5.800000     3.000000      4.350000     1.300000     NaN
75%         6.400000     3.300000      5.100000     1.800000     NaN
max         7.900000     4.400000      6.900000     2.500000     NaN

Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

Detect Column Types

Code
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric columns:", num_cols)
print("Categorical columns:", cat_cols)
Numeric columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
Categorical columns: ['species']

Detect and Remove Duplicates

Code
dup_count = int(df.duplicated().sum())
print("Duplicate rows found:", dup_count)

if dup_count > 0:
    print("\nInspect duplicates carefully before removal.")
    print(df[df.duplicated()].head())
Duplicate rows found: 1

Inspect duplicates carefully before removal.
     sepal_length  sepal_width  petal_length  petal_width    species
142           5.8          2.7           5.1          1.9  virginica
Code
if dup_count > 0:
    df = df.drop_duplicates().reset_index(drop=True)

print("Shape after duplicate handling:", df.shape)
Shape after duplicate handling: (149, 5)
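A toy example clarifies the default semantics: `duplicated()` uses `keep="first"`, so only the second and later copies are flagged and the first occurrence always survives `drop_duplicates()`. The values below are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# By default duplicated() uses keep="first": only later copies are flagged.
print(toy.duplicated().tolist())            # [False, True, False]

# keep=False flags every member of a duplicated group, useful for inspection.
print(toy.duplicated(keep=False).tolist())  # [True, True, False]

deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)                        # (2, 2)
```

Passing `keep=False` is handy at the inspection stage, since it shows both halves of each duplicated pair rather than only the copy slated for removal.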

Column Name Hygiene

Code
# Normalize names: trim whitespace, lowercase, snake_case, drop parentheses.
df.columns = [
    str(c).strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
    for c in df.columns
]

print("Updated columns:", df.columns.tolist())
Updated columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
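The iris columns are already clean, so the comprehension above changes nothing here. On messier names it does real work; the column names below are made up for illustration:

```python
import pandas as pd

# Made-up messy names for illustration; the iris columns need no changes.
messy = pd.DataFrame(columns=["Sepal Length (cm)", "  Petal Width "])
messy.columns = [
    str(c).strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
    for c in messy.columns
]
print(messy.columns.tolist())  # ['sepal_length_cm', 'petal_width']
```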

Handle Missing Values

Code
print(df.isna().sum())
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
Code
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

# Numeric columns: fill with the column median (robust to outliers).
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: fill with the most frequent value (mode).
for c in cat_cols:
    if df[c].isna().any():
        df[c] = df[c].fillna(df[c].mode().iloc[0])
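This dataset has no missing values, so the fills above are no-ops here. On a toy frame with gaps the same rules behave as follows (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are made up for illustration.
toy = pd.DataFrame({
    "length": [5.1, np.nan, 4.7, 5.0],
    "species": ["setosa", "setosa", None, "virginica"],
})

num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

for c in toy.select_dtypes(exclude="number").columns:
    if toy[c].isna().any():
        toy[c] = toy[c].fillna(toy[c].mode().iloc[0])

print(toy)  # NaN -> 5.0 (median), None -> "setosa" (mode)
```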

Fix Data Types

Code
# Coerce numeric columns; unparseable entries become NaN rather than raising.
for c in num_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# A category dtype documents the fixed label set and saves memory.
if "species" in df.columns:
    df["species"] = df["species"].astype("category")

print(df.dtypes)
sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object
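Note that `errors="coerce"` turns unparseable entries into NaN, so type fixes can introduce new missing values; it is worth re-running the `isna()` check afterwards. A sketch on made-up raw strings:

```python
import pandas as pd

# Made-up raw strings; "n/a" cannot be parsed as a number.
raw = pd.Series(["5.1", "4.9", "n/a", "6.3"])
clean = pd.to_numeric(raw, errors="coerce")
print(int(clean.isna().sum()))  # 1 -- coercion introduced a missing value
```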

Validation Checks

Code
print("Final shape:", df.shape)
print("Total missing values:", int(df.isna().sum().sum()))
print("Duplicate rows:", int(df.duplicated().sum()))

assert int(df.isna().sum().sum()) == 0, "Missing values remain."
assert int(df.duplicated().sum()) == 0, "Duplicates remain."
Final shape: (149, 5)
Total missing values: 0
Duplicate rows: 0
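Shape and missing-value counts are necessary but not sufficient: a value can be present, unique, and still implausible. Domain checks catch those cases. A sketch on a toy frame; the positivity rule and the species list are assumptions about this dataset, not library defaults:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "species": pd.Categorical(["setosa", "virginica"]),
})

# Domain checks: implausible values would slip past isna() and duplicated().
assert (df["sepal_length"] > 0).all(), "Non-positive measurement found."
assert df["species"].isin(["setosa", "versicolor", "virginica"]).all(), \
    "Unexpected species label."
print("Domain checks passed.")
```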

Save Clean Dataset

Code
from pathlib import Path

Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_clean.csv", index=False)

print("Saved cleaned dataset to: data/iris_clean.csv")
Saved cleaned dataset to: data/iris_clean.csv
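One caveat: CSV stores no dtype information, so the `category` dtype will come back as plain strings on reload. A read-back check (sketched here on a toy frame) can still confirm that shape and column names survived the round trip:

```python
import pandas as pd
from pathlib import Path

df = pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})
Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_clean.csv", index=False)

# Read back and confirm shape and column names survived the round trip.
# Note: dtypes such as "category" are NOT preserved by CSV.
check = pd.read_csv("data/iris_clean.csv")
assert check.shape == df.shape
assert check.columns.tolist() == df.columns.tolist()
print("Round trip OK")
```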

Summary

  • You inspected dataset quality
  • You applied disciplined cleaning steps
  • You validated the dataset explicitly
  • You saved a reproducible cleaned version for later lessons

Next Step

Transform and organize the cleaned dataset for analysis.