Data Cleaning and Preparation

  • ID: DS-L03
  • Type: Lesson
  • Audience: Public
  • Theme: Cleaning discipline, validation checks, and reproducible preparation

In this lesson, you will practice essential data cleaning steps used in real data science workflows.

As in previous chapters, all code runs inside this Quarto (.qmd) file. When you render the book, Quarto executes the Python chunks and embeds results directly into the page.


Lesson Overview

By the end of this lesson, you will be able to:

  • Inspect a dataset for common quality issues
  • Detect and handle duplicate rows
  • Standardize column names
  • Handle missing values using clear rules
  • Fix inconsistent data types
  • Validate and save a cleaned dataset

Cleaning Philosophy

Data cleaning should be:

  • Measured before modification
  • Minimal and justified
  • Reproducible
  • Verified with explicit checks

Never modify data without first understanding what you are changing.
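The measure–modify–verify pattern above can be sketched in a few lines. This is an illustrative example on a small hypothetical frame, not part of the lesson dataset:

```python
import pandas as pd

# Hypothetical toy frame for illustration only
df = pd.DataFrame({"x": [1.0, 1.0, 3.0], "y": ["a", "a", "b"]})

# 1. Measure before modification
n_before = len(df)
n_dup = int(df.duplicated().sum())
print(f"rows={n_before}, duplicates={n_dup}")

# 2. Minimal, justified change: drop duplicates only if any exist
if n_dup > 0:
    df = df.drop_duplicates().reset_index(drop=True)

# 3. Verify with explicit checks
assert len(df) == n_before - n_dup, "Row count does not match measurement."
assert int(df.duplicated().sum()) == 0, "Duplicates remain."
```

Because each step records a number before acting and asserts on it afterward, the same script run twice on the same input produces the same result, which is what "reproducible" means in practice.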


Chapter Initialization

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

cdi_notebook_init(chapter="03", title_x=0.5)

Load the Dataset

import pandas as pd

df = pd.read_csv("data/iris.csv")
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Initial Inspection

print("Shape:", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
Shape: (150, 5)

Column names: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Data types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
print("Summary statistics:")
print(df.describe(include="all"))

print("\nMissing values per column:")
print(df.isna().sum())
Summary statistics:
        sepal_length  sepal_width  petal_length  petal_width species
count     150.000000   150.000000    150.000000   150.000000     150
unique           NaN          NaN           NaN          NaN       3
top              NaN          NaN           NaN          NaN  setosa
freq             NaN          NaN           NaN          NaN      50
mean        5.843333     3.057333      3.758000     1.199333     NaN
std         0.828066     0.435866      1.765298     0.762238     NaN
min         4.300000     2.000000      1.000000     0.100000     NaN
25%         5.100000     2.800000      1.600000     0.300000     NaN
50%         5.800000     3.000000      4.350000     1.300000     NaN
75%         6.400000     3.300000      5.100000     1.800000     NaN
max         7.900000     4.400000      6.900000     2.500000     NaN

Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

Detect Column Types

num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric columns:", num_cols)
print("Categorical columns:", cat_cols)
Numeric columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
Categorical columns: ['species']

Detect and Remove Duplicates

dup_count = int(df.duplicated().sum())
print("Duplicate rows found:", dup_count)

if dup_count > 0:
    print("\nInspect duplicates carefully before removal.")
    print(df[df.duplicated()].head())
Duplicate rows found: 1

Inspect duplicates carefully before removal.
     sepal_length  sepal_width  petal_length  petal_width    species
142           5.8          2.7           5.1          1.9  virginica
if dup_count > 0:
    df = df.drop_duplicates().reset_index(drop=True)

print("Shape after duplicate handling:", df.shape)
Shape after duplicate handling: (149, 5)

Column Name Hygiene

df.columns = [
    str(c).strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
    for c in df.columns
]

print("Updated columns:", df.columns.tolist())
Updated columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
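The iris columns were already clean, so the rename above changed nothing. To see what the rule actually does, here is the same expression applied to some hypothetical messy names (these names are invented for illustration):

```python
# Hypothetical messy column names, invented to demonstrate the hygiene rule
messy = ["Sepal Length (cm)", " Petal Width ", "SPECIES"]

cleaned = [
    str(c).strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
    for c in messy
]
print(cleaned)  # ['sepal_length_cm', 'petal_width', 'species']
```

Note the order matters: stripping whitespace happens before replacing spaces with underscores, so leading and trailing spaces do not become stray underscores.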

Handle Missing Values

print(df.isna().sum())
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

df[num_cols] = df[num_cols].fillna(df[num_cols].median())

for c in cat_cols:
    if df[c].isna().any():
        df[c] = df[c].fillna(df[c].mode().iloc[0])

Fix Data Types

for c in num_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

if "species" in df.columns:
    df["species"] = df["species"].astype("category")

print(df.dtypes)
sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object
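The `errors="coerce"` argument is what makes `pd.to_numeric` safe on dirty input: unparseable entries become `NaN` instead of raising an exception. The iris columns were already numeric, so here is a small illustration on hypothetical dirty values:

```python
import pandas as pd

# Hypothetical dirty column: two valid numbers, two unparseable entries
s = pd.Series(["5.1", "4.9", "n/a", ""])

# "coerce" turns values that cannot be parsed into NaN
print(pd.to_numeric(s, errors="coerce"))
```

Any `NaN` values produced this way would then be caught and filled by the missing-value rules from the previous section, which is why the coercion runs before the final validation checks.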

Validation Checks

print("Final shape:", df.shape)
print("Total missing values:", int(df.isna().sum().sum()))
print("Duplicate rows:", int(df.duplicated().sum()))

assert int(df.isna().sum().sum()) == 0, "Missing values remain."
assert int(df.duplicated().sum()) == 0, "Duplicates remain."
Final shape: (149, 5)
Total missing values: 0
Duplicate rows: 0

Save Clean Dataset

from pathlib import Path

Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_clean.csv", index=False)

print("Saved cleaned dataset to: data/iris_clean.csv")
Saved cleaned dataset to: data/iris_clean.csv

Summary

  • You inspected dataset quality
  • You applied disciplined cleaning steps
  • You validated the dataset explicitly
  • You saved a reproducible cleaned version for later lessons