from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl
cdi_notebook_init(chapter="03", title_x=0.5)Data Cleaning and Preparation
In this lesson, you will practice essential data cleaning steps used in real data science workflows.
As in previous chapters, all code runs inside this Quarto (.qmd) file. When you render the book, Quarto executes the Python chunks and embeds results directly into the page.
Lesson Overview
By the end of this lesson, you will be able to:
- Inspect a dataset for common quality issues
- Detect and handle duplicate rows
- Standardize column names
- Handle missing values using clear rules
- Fix inconsistent data types
- Validate and save a cleaned dataset
Cleaning Philosophy
Data cleaning should be:
- Measured before modification
- Minimal and justified
- Reproducible
- Verified with explicit checks
Never modify data without first understanding what you are changing.
Chapter Initialization
Load the Dataset
import pandas as pd
df = pd.read_csv("data/iris.csv")
df.head()| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Initial Inspection
print("Shape:", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)Shape: (150, 5)
Column names: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
Data types:
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species str
dtype: object
print("Summary statistics:")
print(df.describe(include="all"))
print("\nMissing values per column:")
print(df.isna().sum())Summary statistics:
sepal_length sepal_width petal_length petal_width species
count 150.000000 150.000000 150.000000 150.000000 150
unique NaN NaN NaN NaN 3
top NaN NaN NaN NaN setosa
freq NaN NaN NaN NaN 50
mean 5.843333 3.057333 3.758000 1.199333 NaN
std 0.828066 0.435866 1.765298 0.762238 NaN
min 4.300000 2.000000 1.000000 0.100000 NaN
25% 5.100000 2.800000 1.600000 0.300000 NaN
50% 5.800000 3.000000 4.350000 1.300000 NaN
75% 6.400000 3.300000 5.100000 1.800000 NaN
max 7.900000 4.400000 6.900000 2.500000 NaN
Missing values per column:
sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
species 0
dtype: int64
Detect Column Types
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
print("Numeric columns:", num_cols)
print("Categorical columns:", cat_cols)Numeric columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
Categorical columns: ['species']
Detect and Remove Duplicates
dup_count = int(df.duplicated().sum())
print("Duplicate rows found:", dup_count)
if dup_count > 0:
print("\nInspect duplicates carefully before removal.")
print(df[df.duplicated()].head())Duplicate rows found: 1
Inspect duplicates carefully before removal.
sepal_length sepal_width petal_length petal_width species
142 5.8 2.7 5.1 1.9 virginica
if dup_count > 0:
df = df.drop_duplicates().reset_index(drop=True)
print("Shape after duplicate handling:", df.shape)Shape after duplicate handling: (149, 5)
Column Name Hygiene
df.columns = [
str(c).strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
for c in df.columns
]
print("Updated columns:", df.columns.tolist())Updated columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
Handle Missing Values
print(df.isna().sum())sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
species 0
dtype: int64
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
for c in cat_cols:
if df[c].isna().any():
df[c] = df[c].fillna(df[c].mode().iloc[0])Fix Data Types
for c in num_cols:
df[c] = pd.to_numeric(df[c], errors="coerce")
if "species" in df.columns:
df["species"] = df["species"].astype("category")
print(df.dtypes)sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species category
dtype: object
Validation Checks
print("Final shape:", df.shape)
print("Total missing values:", int(df.isna().sum().sum()))
print("Duplicate rows:", int(df.duplicated().sum()))
assert int(df.isna().sum().sum()) == 0, "Missing values remain."
assert int(df.duplicated().sum()) == 0, "Duplicates remain."Final shape: (149, 5)
Total missing values: 0
Duplicate rows: 0
Save Clean Dataset
from pathlib import Path
Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_clean.csv", index=False)
print("Saved cleaned dataset to: data/iris_clean.csv")Saved cleaned dataset to: data/iris_clean.csv
Summary
- You inspected dataset quality
- You applied disciplined cleaning steps
- You validated the dataset explicitly
- You saved a reproducible cleaned version for later lessons