Lesson 3: Data Cleaning and Preparation

In this lesson, you will practice essential data cleaning steps used in real data science workflows. You will check for quality issues, apply fixes, and save a cleaned dataset for use in later lessons.

3.1 Lesson Overview

By the end of this lesson, you will be able to:

  • Inspect a dataset for common quality issues
  • Standardize and clean column names
  • Detect and remove duplicate rows
  • Handle missing values
  • Fix inconsistent data types
  • Validate and save a clean dataset

3.2 Notebook Setup

We use CDI publishing helpers so that figures are:

  • saved incrementally to figures/
  • embedded into the notebook output as PNGs
  • safe for the ipynb → md → Rmd → GitBook pipeline
from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

# Lesson ID drives figure naming (e.g., figures/03_001.png)
_ = cdi_notebook_init(chapter="03", title_x=0)

3.3 Load the Dataset

We begin by loading the dataset you created in Lesson 02.

import pandas as pd

df = pd.read_csv("data/iris.csv")
print(df.head())
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
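When loading a CSV you have not cleaned yet, it can help to tell pandas up front which placeholder strings should count as missing. The sketch below uses a small in-memory CSV (a hypothetical stand-in for `data/iris.csv`, so it runs on its own); in the lesson you would pass the file path instead.

```python
import io

import pandas as pd

# In-memory CSV standing in for data/iris.csv; note the blank field in row 2.
raw = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,,1.4,0.2,setosa\n"
)

# Defensive load: treat common placeholder strings as missing values
df_demo = pd.read_csv(raw, na_values=["NA", "null", "?"])

print(df_demo.shape)               # (2, 5)
print(int(df_demo.isna().sum().sum()))  # 1 missing value detected
```

The `na_values` list shown here is an assumption about what messy exports might contain; adjust it to whatever placeholders your source system actually uses.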

3.4 Inspect the Dataset for Common Issues

A quick inspection checklist includes:

  • Shape (rows, columns)
  • Column names
  • Data types
  • Summary statistics
  • Missing values
print("Shape (rows, cols):")
print(df.shape)

print("\nColumn names:")
print(df.columns.tolist())

print("\nData types:")
print(df.dtypes)
Shape (rows, cols):
(150, 5)

Column names:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Data types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
print("Summary statistics (numeric + categorical):")
print(df.describe(include="all"))

print("\nMissing values per column:")
print(df.isna().sum())
Summary statistics (numeric + categorical):
        sepal_length  sepal_width  petal_length  petal_width species
count     150.000000   150.000000    150.000000   150.000000     150
unique           NaN          NaN           NaN          NaN       3
top              NaN          NaN           NaN          NaN  setosa
freq             NaN          NaN           NaN          NaN      50
mean        5.843333     3.057333      3.758000     1.199333     NaN
std         0.828066     0.435866      1.765298     0.762238     NaN
min         4.300000     2.000000      1.000000     0.100000     NaN
25%         5.100000     2.800000      1.600000     0.300000     NaN
50%         5.800000     3.000000      4.350000     1.300000     NaN
75%         6.400000     3.300000      5.100000     1.800000     NaN
max         7.900000     4.400000      6.900000     2.500000     NaN

Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
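The checklist above can be bundled into one reusable call so you run the same inspection on every dataset. `quick_inspect` is a hypothetical helper (not part of the lesson's codebase), sketched here with a tiny demo frame:

```python
import pandas as pd

def quick_inspect(frame: pd.DataFrame) -> dict:
    """Collect the section's checklist items (shape, names, dtypes,
    missingness, duplicates) in a single dict. Illustrative helper."""
    return {
        "shape": frame.shape,
        "columns": frame.columns.tolist(),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "missing": frame.isna().sum().to_dict(),
        "duplicates": int(frame.duplicated().sum()),
    }

demo = pd.DataFrame({"a": [1, 2, 2], "b": ["x", None, "y"]})
report = quick_inspect(demo)
print(report["shape"])    # (3, 2)
print(report["missing"])  # {'a': 0, 'b': 1}
```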

3.5 Detect and Remove Duplicates

Duplicates can appear due to data entry issues, merges, or repeated exports.

dup_count = int(df.duplicated().sum())
print("Duplicate rows found:")
print(dup_count)

if dup_count > 0:
    print("\nDuplicate rows preview:")
    print(df[df.duplicated()].head())
Duplicate rows found:
1

Duplicate rows preview:
     sepal_length  sepal_width  petal_length  petal_width    species
142           5.8          2.7           5.1          1.9  virginica
# Remove duplicates (if any) and reset index
df = df.drop_duplicates().reset_index(drop=True)

print("New shape after removing duplicates:")
print(df.shape)
New shape after removing duplicates:
(149, 5)

Common pitfall: Some datasets contain partial duplicates that look similar but differ in important columns. Always review duplicates before dropping them.
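To catch such partial duplicates, `duplicated()` accepts a `subset` argument that restricts the comparison to the columns you name. A minimal sketch with made-up rows:

```python
import pandas as pd

# Two rows share the measurement but differ in 'species' — a partial duplicate.
demo = pd.DataFrame({
    "sepal_length": [5.8, 5.8, 5.1],
    "species": ["virginica", "versicolor", "setosa"],
})

full_dups = int(demo.duplicated().sum())                            # 0
partial_dups = int(demo.duplicated(subset=["sepal_length"]).sum())  # 1

print(full_dups, partial_dups)
```

Reviewing the rows flagged by the `subset` check before dropping anything is exactly the caution the pitfall above calls for.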

3.6 Column Name Hygiene

We standardize names to lower_snake_case so that:

  • code is consistent across lessons
  • column references are predictable
  • it is easier to join/merge later
df.columns = [
    str(c).strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
    for c in df.columns
]

print("Updated columns:")
print(df.columns.tolist())
Updated columns:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
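The chained `replace` calls above handle spaces and parentheses; messier exports (dashes, CamelCase, repeated separators) are easier to cover with a small regex-based helper. `to_snake` is an illustrative name, sketched under the assumption of typical messy export headers:

```python
import re

import pandas as pd

def to_snake(name: str) -> str:
    """Lower_snake_case a column name; handles spaces, parentheses,
    dashes, CamelCase, and repeated separators. Illustrative helper."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", str(name).strip())
    name = re.sub(r"[^\w]+", "_", name)  # spaces, parens, dashes -> _
    return re.sub(r"_+", "_", name).strip("_").lower()

messy = pd.DataFrame(columns=["Sepal Length (cm)", "petalWidth", " species "])
messy.columns = [to_snake(c) for c in messy.columns]
print(messy.columns.tolist())  # ['sepal_length_cm', 'petal_width', 'species']
```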

3.7 Handle Missing Values

First, always measure missingness.

print("Missing values per column:")
print(df.isna().sum())
Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

3.7.1 Example Imputation Pattern

The Iris dataset is usually complete, but most real datasets are not.

Below is a safe, reusable pattern you can apply when missing values exist.

# Recompute after any prior transformations
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

# Numeric: fill with median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical: fill with mode
for c in cat_cols:
    if df[c].isna().any():
        df[c] = df[c].fillna(df[c].mode().iloc[0])

Warning: Imputation changes data. Always consider whether missingness has meaning (for example, “not recorded” vs “zero”).
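One way to impute without destroying that meaning is to record where values were missing before filling them. A minimal sketch with a made-up column (the indicator column name is an illustrative choice):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"sepal_length": [5.1, np.nan, 4.7]})

# Keep a flag of where the value was missing, then impute with the median.
demo["sepal_length_was_missing"] = demo["sepal_length"].isna()
demo["sepal_length"] = demo["sepal_length"].fillna(demo["sepal_length"].median())

print(demo)
```

The indicator column preserves the "not recorded" signal, so a later model or analysis can still distinguish imputed values from observed ones.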

3.8 Fix Data Types

CSV imports sometimes load numbers as strings. We enforce numeric types and convert species into a categorical column.

# Convert numeric-looking columns to numeric
for c in num_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Convert species column to category if present
if "species" in df.columns:
    df["species"] = df["species"].astype("category")

print("Updated dtypes:")
print(df.dtypes)
Updated dtypes:
sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object

Tip: Always verify dtypes before modeling. Incorrect dtypes can silently break your analysis later.
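Verification can be made explicit with assertions, so a wrong dtype fails loudly instead of breaking later. A sketch with a small demo frame where numbers arrived as strings:

```python
import pandas as pd

demo = pd.DataFrame({
    "petal_width": ["0.2", "1.3"],  # numbers loaded as strings
    "species": ["setosa", "virginica"],
})

# Coerce, then assert the result rather than trusting it.
demo["petal_width"] = pd.to_numeric(demo["petal_width"], errors="coerce")
demo["species"] = demo["species"].astype("category")

assert pd.api.types.is_numeric_dtype(demo["petal_width"])
assert isinstance(demo["species"].dtype, pd.CategoricalDtype)
print(demo.dtypes)
```

Placed at the end of a cleaning notebook, such assertions act as a cheap regression test every time the notebook is re-run.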

3.9 Validate and Save the Clean Dataset

Validation checks ensure that your saved dataset is safe to reuse.

print("Final shape (rows, cols):")
print(df.shape)

print("\nTotal missing values remaining:")
print(int(df.isna().sum().sum()))
Final shape (rows, cols):
(149, 5)

Total missing values remaining:
0
from pathlib import Path

Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_clean.csv", index=False)

print("Saved cleaned dataset to: data/iris_clean.csv")
Saved cleaned dataset to: data/iris_clean.csv
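A final validation step you can add is a round-trip check: write the file, read it back, and confirm nothing changed in transit. The sketch below writes to a temporary directory so it runs anywhere; in the lesson you would point it at `data/iris_clean.csv` instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame({
    "petal_width": [0.2, 1.3],
    "species": ["setosa", "virginica"],
})

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "iris_clean.csv"
    demo.to_csv(out, index=False)
    back = pd.read_csv(out)

# Raises if any value or dtype changed during the write/read round trip.
pd.testing.assert_frame_equal(demo, back)
print("Round-trip OK:", back.shape)
```

Note that CSV does not preserve the `category` dtype; if you need dtypes to survive the round trip exactly, a format such as Parquet is a common alternative.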

3.10 Exercise

  • Confirm whether duplicates exist in your dataset and report the count
  • Introduce one missing value in a numeric column and then fill it using the median
  • Convert species to category and confirm the dtype
# Exercise workspace (edit and run)

# 1) Duplicates
print("Duplicate rows found:")
print(int(df.duplicated().sum()))

# 2) Introduce a missing value + impute
df_ex = df.copy()
df_ex.loc[0, "sepal_length"] = None

print("\nMissing values after insertion:")
print(df_ex.isna().sum())

df_ex[num_cols] = df_ex[num_cols].fillna(df_ex[num_cols].median())

print("\nMissing values after median fill:")
print(df_ex.isna().sum())

# 3) Category dtype
df_ex["species"] = df_ex["species"].astype("category")
print("\nDtypes after conversion:")
print(df_ex.dtypes)
Duplicate rows found:
0

Missing values after insertion:
sepal_length    1
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

Missing values after median fill:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

Dtypes after conversion:
sepal_length     float64
sepal_width      float64
petal_length     float64
petal_width      float64
species         category
dtype: object

3.11 Summary

  • You inspected the dataset for common quality issues
  • You checked duplicates, missing values, and data types
  • You standardized column names for consistent coding
  • You validated and saved a cleaned dataset for future lessons