Lesson 3: Data Cleaning and Preparation

In this lesson, you will practice essential data cleaning steps used in real data science workflows. You will check for quality issues, apply fixes, and save a cleaned dataset for use in later lessons.

3.1 Lesson Overview

By the end of this lesson, you will be able to:

  • Inspect a dataset for common quality issues
  • Standardize and clean column names
  • Detect and remove duplicate rows
  • Handle missing values
  • Fix inconsistent data types
  • Validate and save a clean dataset

3.2 Notebook Setup

We use CDI publishing helpers so that figures are:

  • saved incrementally to figures/
  • embedded into the notebook output as PNGs
  • safe for the ipynb → md → Rmd → GitBook pipeline
from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

# Lesson ID drives figure naming (e.g., figures/03_001.png)
_ = cdi_notebook_init(chapter="03", title_x=0)

3.3 Load the Dataset

We begin by loading the dataset you created in Lesson 02.

import pandas as pd

df = pd.read_csv("data/iris.csv")
print(df.head())
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
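When loading a CSV you have not cleaned yet, it can help to tell pandas up front which placeholder strings should count as missing. The sketch below uses a small in-memory CSV (a hypothetical stand-in for `data/iris.csv`, so it runs on its own); in the lesson you would pass the file path instead.

```python
import io

import pandas as pd

# In-memory CSV standing in for data/iris.csv; note the blank field in row 2.
raw = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,,1.4,0.2,setosa\n"
)

# Defensive load: treat common placeholder strings as missing values
df_demo = pd.read_csv(raw, na_values=["NA", "null", "?"])

print(df_demo.shape)               # (2, 5)
print(int(df_demo.isna().sum().sum()))  # 1 missing value detected
```

The `na_values` list shown here is an assumption about what messy exports might contain; adjust it to whatever placeholders your source system actually uses.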

3.4 Inspect the Dataset for Common Issues

A quick inspection checklist includes:

  • Shape (rows, columns)
  • Column names
  • Data types
  • Summary statistics
  • Missing values
print("Shape (rows, cols):")
print(df.shape)

print("\nColumn names:")
print(df.columns.tolist())

print("\nData types:")
print(df.dtypes)
Shape (rows, cols):
(150, 5)

Column names:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Data types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
print("Summary statistics (numeric + categorical):")
print(df.describe(include="all"))

print("\nMissing values per column:")
print(df.isna().sum())
Summary statistics (numeric + categorical):
        sepal_length  sepal_width  petal_length  petal_width species
count     150.000000   150.000000    150.000000   150.000000     150
unique           NaN          NaN           NaN          NaN       3
top              NaN          NaN           NaN          NaN  setosa
freq             NaN          NaN           NaN          NaN      50
mean        5.843333     3.057333      3.758000     1.199333     NaN
std         0.828066     0.435866      1.765298     0.762238     NaN
min         4.300000     2.000000      1.000000     0.100000     NaN
25%         5.100000     2.800000      1.600000     0.300000     NaN
50%         5.800000     3.000000      4.350000     1.300000     NaN
75%         6.400000     3.300000      5.100000     1.800000     NaN
max         7.900000     4.400000      6.900000     2.500000     NaN

Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
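The checklist above can be bundled into one reusable call so you run the same inspection on every dataset. `quick_inspect` is a hypothetical helper (not part of the lesson's codebase), sketched here with a tiny demo frame:

```python
import pandas as pd

def quick_inspect(frame: pd.DataFrame) -> dict:
    """Collect the section's checklist items (shape, names, dtypes,
    missingness, duplicates) in a single dict. Illustrative helper."""
    return {
        "shape": frame.shape,
        "columns": frame.columns.tolist(),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "missing": frame.isna().sum().to_dict(),
        "duplicates": int(frame.duplicated().sum()),
    }

demo = pd.DataFrame({"a": [1, 2, 2], "b": ["x", None, "y"]})
report = quick_inspect(demo)
print(report["shape"])    # (3, 2)
print(report["missing"])  # {'a': 0, 'b': 1}
```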

3.5 Detect and Remove Duplicates

Duplicates can appear due to data entry issues, merges, or repeated exports.

dup_count = int(df.duplicated().sum())
print("Duplicate rows found:")
print(dup_count)

if dup_count > 0:
    print("\nDuplicate rows preview:")
    print(df[df.duplicated()].head())
Duplicate rows found:
1

Duplicate rows preview:
     sepal_length  sepal_width  petal_length  petal_width    species
142           5.8          2.7           5.1          1.9  virginica
# Remove duplicates (if any) and reset index
df = df.drop_duplicates().reset_index(drop=True)

print("New shape after removing duplicates:")
print(df.shape)
New shape after removing duplicates:
(149, 5)

Common pitfall: Some datasets contain partial duplicates that look similar but differ in important columns. Always review duplicates before dropping them.
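To catch such partial duplicates, `duplicated()` accepts a `subset` argument that restricts the comparison to the columns you name. A minimal sketch with made-up rows:

```python
import pandas as pd

# Two rows share the measurement but differ in 'species' — a partial duplicate.
demo = pd.DataFrame({
    "sepal_length": [5.8, 5.8, 5.1],
    "species": ["virginica", "versicolor", "setosa"],
})

full_dups = int(demo.duplicated().sum())                            # 0
partial_dups = int(demo.duplicated(subset=["sepal_length"]).sum())  # 1

print(full_dups, partial_dups)
```

Reviewing the rows flagged by the `subset` check before dropping anything is exactly the caution the pitfall above calls for.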

3.6 Column Name Hygiene

We standardize names to lower_snake_case so that:

  • code is consistent across lessons
  • column references are predictable
  • it is easier to join/merge later
df.columns = [
    str(c).strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
    for c in df.columns
]

print("Updated columns:")
print(df.columns.tolist())
Updated columns:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
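The chained `replace` calls above handle spaces and parentheses; messier exports (dashes, CamelCase, repeated separators) are easier to cover with a small regex-based helper. `to_snake` is an illustrative name, sketched under the assumption of typical messy export headers:

```python
import re

import pandas as pd

def to_snake(name: str) -> str:
    """Lower_snake_case a column name; handles spaces, parentheses,
    dashes, CamelCase, and repeated separators. Illustrative helper."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", str(name).strip())
    name = re.sub(r"[^\w]+", "_", name)  # spaces, parens, dashes -> _
    return re.sub(r"_+", "_", name).strip("_").lower()

messy = pd.DataFrame(columns=["Sepal Length (cm)", "petalWidth", " species "])
messy.columns = [to_snake(c) for c in messy.columns]
print(messy.columns.tolist())  # ['sepal_length_cm', 'petal_width', 'species']
```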

3.7 Handle Missing Values

First, always measure missingness.

print("Missing values per column:")
print(df.isna().sum())
Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

3.7.1 Example Imputation Pattern

The Iris dataset is usually complete, but most real datasets are not.

Below is a safe, reusable pattern you can apply when missing values exist.

# Recompute after any prior transformations
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

# Numeric: fill with median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical: fill with mode
for c in cat_cols:
    if df[c].isna().any():
        df[c] = df[c].fillna(df[c].mode().iloc[0])

Warning: Imputation changes data. Always consider whether missingness has meaning (for example, “not recorded” vs “zero”).
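One way to impute without destroying that meaning is to record where values were missing before filling them. A minimal sketch with a made-up column (the indicator column name is an illustrative choice):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"sepal_length": [5.1, np.nan, 4.7]})

# Keep a flag of where the value was missing, then impute with the median.
demo["sepal_length_was_missing"] = demo["sepal_length"].isna()
demo["sepal_length"] = demo["sepal_length"].fillna(demo["sepal_length"].median())

print(demo)
```

The indicator column preserves the "not recorded" signal, so a later model or analysis can still distinguish imputed values from observed ones.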

3.8 Fix Data Types

CSV imports sometimes load numbers as strings. We enforce numeric types and convert species into a categorical column.

# Convert numeric-looking columns to numeric
for c in num_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Convert species column to category if present
if "species" in df.columns:
    df["species"] = df["species"].astype("category")

print("Updated dtypes:")
print(df.dtypes)
Updated dtypes:
sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object

Tip: Always verify dtypes before modeling. Incorrect dtypes can silently break your analysis later.
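Verification can be made explicit with assertions, so a wrong dtype fails loudly instead of breaking later. A sketch with a small demo frame where numbers arrived as strings:

```python
import pandas as pd

demo = pd.DataFrame({
    "petal_width": ["0.2", "1.3"],  # numbers loaded as strings
    "species": ["setosa", "virginica"],
})

# Coerce, then assert the result rather than trusting it.
demo["petal_width"] = pd.to_numeric(demo["petal_width"], errors="coerce")
demo["species"] = demo["species"].astype("category")

assert pd.api.types.is_numeric_dtype(demo["petal_width"])
assert isinstance(demo["species"].dtype, pd.CategoricalDtype)
print(demo.dtypes)
```

Placed at the end of a cleaning notebook, such assertions act as a cheap regression test every time the notebook is re-run.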

3.9 Validate and Save the Clean Dataset

Validation checks ensure that your saved dataset is safe to reuse.

print("Final shape (rows, cols):")
print(df.shape)

print("\nTotal missing values remaining:")
print(int(df.isna().sum().sum()))
Final shape (rows, cols):
(149, 5)

Total missing values remaining:
0
from pathlib import Path

Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_clean.csv", index=False)

print("Saved cleaned dataset to: data/iris_clean.csv")
Saved cleaned dataset to: data/iris_clean.csv
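A final validation step you can add is a round-trip check: write the file, read it back, and confirm nothing changed in transit. The sketch below writes to a temporary directory so it runs anywhere; in the lesson you would point it at `data/iris_clean.csv` instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame({
    "petal_width": [0.2, 1.3],
    "species": ["setosa", "virginica"],
})

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "iris_clean.csv"
    demo.to_csv(out, index=False)
    back = pd.read_csv(out)

# Raises if any value or dtype changed during the write/read round trip.
pd.testing.assert_frame_equal(demo, back)
print("Round-trip OK:", back.shape)
```

Note that CSV does not preserve the `category` dtype; if you need dtypes to survive the round trip exactly, a format such as Parquet is a common alternative.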

3.10 Exercise

  • Confirm whether duplicates exist in your dataset and report the count
  • Introduce one missing value in a numeric column and then fill it using the median
  • Convert species to category and confirm the dtype
# Exercise workspace (edit and run)

# 1) Duplicates
print("Duplicate rows found:")
print(int(df.duplicated().sum()))

# 2) Introduce a missing value + impute
df_ex = df.copy()
df_ex.loc[0, "sepal_length"] = None

print("\nMissing values after insertion:")
print(df_ex.isna().sum())

df_ex[num_cols] = df_ex[num_cols].fillna(df_ex[num_cols].median())

print("\nMissing values after median fill:")
print(df_ex.isna().sum())

# 3) Category dtype
df_ex["species"] = df_ex["species"].astype("category")
print("\nDtypes after conversion:")
print(df_ex.dtypes)
Duplicate rows found:
0

Missing values after insertion:
sepal_length    1
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

Missing values after median fill:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

Dtypes after conversion:
sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object

3.11 Summary

  • You inspected the dataset for common quality issues
  • You checked duplicates, missing values, and data types
  • You standardized column names for consistent coding
  • You validated and saved a cleaned dataset for future lessons