Lesson 3: Data Cleaning and Preparation
In this lesson, you will practice essential data cleaning steps used in real data science workflows. You will check for quality issues, apply fixes, and save a cleaned dataset for use in later lessons.
3.1 Lesson Overview
By the end of this lesson, you will be able to:
- Inspect a dataset for common quality issues
- Standardize and clean column names
- Detect and remove duplicate rows
- Handle missing values
- Fix inconsistent data types
- Validate and save a clean dataset
3.2 Notebook Setup
We use CDI publishing helpers so that figures are:
- saved incrementally to figures/
- embedded into the notebook output as PNGs
- safe for the pipeline ipynb → md → Rmd → GitBook
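The helper show_and_save_mpl is used throughout this lesson but defined elsewhere in the CDI setup. As a rough sketch only, assuming the helper simply numbers figures and writes them under figures/ before displaying them (the actual CDI implementation may differ):

```python
from pathlib import Path
import matplotlib.pyplot as plt

FIG_DIR = Path("figures")
_counter = {"n": 0}  # incrementing figure number, shared across calls

def show_and_save_mpl(fig=None, prefix="03"):
    """Save the current (or given) figure as an incrementally numbered PNG, then display it."""
    _counter["n"] += 1
    FIG_DIR.mkdir(exist_ok=True)
    fig = fig if fig is not None else plt.gcf()
    out = FIG_DIR / f"{prefix}_{_counter['n']:03d}.png"
    fig.savefig(out, dpi=150, bbox_inches="tight")
    print(f"Saved PNG → {out}")
    plt.show()
```

The output paths seen later in the lesson (figures/03_001.png, figures/03_002.png) match this numbering scheme.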
3.3 Load the Dataset
We begin by loading the dataset you created in Lesson 02.
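The loading cell is not shown in this chunk; a sketch of what it might look like follows. The path data/iris.csv is an assumption about where Lesson 02 saved its output, and the inline fallback frame exists only so the snippet runs on its own:

```python
from pathlib import Path
import pandas as pd

path = Path("data/iris.csv")  # assumed output location from Lesson 02; adjust if needed
if path.exists():
    df = pd.read_csv(path)
else:
    # Tiny stand-in so the cell still runs without the Lesson 02 file
    df = pd.DataFrame({
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "petal_length": [1.4, 1.4, 1.3],
        "petal_width": [0.2, 0.2, 0.2],
        "species": ["setosa", "setosa", "setosa"],
    })
print(df.head())
```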
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
3.4 Inspect the Dataset for Common Issues
A quick inspection checklist includes:
- Shape (rows, columns)
- Column names
- Data types
- Summary statistics
- Missing values
print("Shape (rows, cols):")
print(df.shape)
print("\nColumn names:")
print(df.columns.tolist())
print("\nData types:")
print(df.dtypes)
Shape (rows, cols):
(150, 5)
Column names:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
Data types:
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species object
dtype: object
print("Summary statistics (numeric + categorical):")
print(df.describe(include="all"))
print("\nMissing values per column:")
print(df.isna().sum())
Summary statistics (numeric + categorical):
sepal_length sepal_width petal_length petal_width species
count 150.000000 150.000000 150.000000 150.000000 150
unique NaN NaN NaN NaN 3
top NaN NaN NaN NaN setosa
freq NaN NaN NaN NaN 50
mean 5.843333 3.057333 3.758000 1.199333 NaN
std 0.828066 0.435866 1.765298 0.762238 NaN
min 4.300000 2.000000 1.000000 0.100000 NaN
25% 5.100000 2.800000 1.600000 0.300000 NaN
50% 5.800000 3.000000 4.350000 1.300000 NaN
75% 6.400000 3.300000 5.100000 1.800000 NaN
max 7.900000 4.400000 6.900000 2.500000 NaN
Missing values per column:
sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
species 0
dtype: int64
3.5 Visual Inspection (Optional but Recommended)
Visual checks can reveal:
- skewed distributions
- outliers
- unusual values that deserve attention
import matplotlib.pyplot as plt
import seaborn as sns
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
print("Numeric columns:")
print(num_cols)
print("\nCategorical columns:")
print(cat_cols)
Numeric columns:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
Categorical columns:
['species']
# Histograms for numeric features
df[num_cols].hist(figsize=(10, 7), bins=12)
plt.suptitle("Iris — Distributions of Numeric Features", y=1.02)
plt.tight_layout()
show_and_save_mpl()
Saved PNG → figures/03_001.png

# Boxplots for numeric features (quick outlier scan)
fig, ax = plt.subplots(figsize=(9, 4))
sns.boxplot(data=df[num_cols], ax=ax)
ax.set_title("Iris — Boxplots of Numeric Features")
ax.set_xlabel("")
ax.tick_params(axis="x", rotation=30)
ax.grid(True, axis="y", alpha=0.2)
show_and_save_mpl(fig)
Saved PNG → figures/03_002.png

3.6 Detect and Remove Duplicates
Duplicates can appear due to data entry issues, merges, or repeated exports.
dup_count = int(df.duplicated().sum())
print("Duplicate rows found:")
print(dup_count)
if dup_count > 0:
    print("\nDuplicate rows preview:")
    print(df[df.duplicated()].head())
Duplicate rows found:
1
Duplicate rows preview:
sepal_length sepal_width petal_length petal_width species
142 5.8 2.7 5.1 1.9 virginica
# Remove duplicates (if any) and reset index
df = df.drop_duplicates().reset_index(drop=True)
print("New shape after removing duplicates:")
print(df.shape)
New shape after removing duplicates:
(149, 5)
3.7 Column Name Hygiene
We standardize names to lower_snake_case so that:
- code is consistent across lessons
- column references are predictable
- it is easier to join/merge later
df.columns = [
    str(c).strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
    for c in df.columns
]
print("Updated columns:")
print(df.columns.tolist())
Updated columns:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
3.8 Handle Missing Values
First, always measure missingness: count the NaNs in each column with df.isna().sum(), as in the output below.
Missing values per column:
sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
species 0
dtype: int64
3.8.1 Example Imputation Pattern
The Iris dataset is usually complete, but most real datasets are not.
Below is a safe, reusable pattern you can apply when missing values exist.
# Recompute after any prior transformations
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
# Numeric: fill with median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
# Categorical: fill with mode
for c in cat_cols:
    if df[c].isna().any():
        df[c] = df[c].fillna(df[c].mode().iloc[0])
3.9 Fix Data Types
CSV imports sometimes load numbers as strings. We enforce numeric types and convert species into a categorical column.
# Convert numeric-looking columns to numeric
for c in num_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")
# Convert species column to category if present
if "species" in df.columns:
    df["species"] = df["species"].astype("category")
print("Updated dtypes:")
print(df.dtypes)
Updated dtypes:
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species category
dtype: object
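The errors="coerce" option is what makes this conversion safe when a column contains stray text. A quick illustration with made-up values:

```python
import pandas as pd

# A numeric-looking column polluted by one bad entry
s = pd.Series(["5.1", "4.9", "oops"])
coerced = pd.to_numeric(s, errors="coerce")
print(coerced)
# The bad entry becomes NaN instead of raising, so it can be imputed afterwards
```

With errors="raise" (the default), the same call would fail on "oops"; coercing to NaN lets the imputation pattern from 3.8.1 handle it.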
3.10 Validate and Save the Clean Dataset
Validation checks ensure that your saved dataset is safe to reuse.
print("Final shape (rows, cols):")
print(df.shape)
print("\nTotal missing values remaining:")
print(int(df.isna().sum().sum()))
Final shape (rows, cols):
(149, 5)
Total missing values remaining:
0
from pathlib import Path
Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_clean.csv", index=False)
print("Saved cleaned dataset to: data/iris_clean.csv")
Saved cleaned dataset to: data/iris_clean.csv
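One extra check worth doing, sketched here on a throwaway file rather than the lesson's real dataset: read the saved CSV back and confirm it survived. Note that CSV stores no dtype information, so a category column reloads as object and needs re-conversion.

```python
from pathlib import Path
import pandas as pd

# Round-trip check on a small throwaway frame (not the lesson's real file)
Path("data").mkdir(exist_ok=True)
demo = pd.DataFrame({
    "petal_width": [0.2, 1.3],
    "species": pd.Categorical(["setosa", "versicolor"]),
})
demo.to_csv("data/_roundtrip_demo.csv", index=False)
back = pd.read_csv("data/_roundtrip_demo.csv")
print(back.shape)              # same shape as what was written
print(back["species"].dtype)   # object: CSV drops the 'category' dtype
```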
3.11 Exercise
- Confirm whether duplicates exist in your dataset and report the count
- Introduce one missing value in a numeric column and then fill it using the median
- Convert species to category and confirm the dtype
# Exercise workspace (edit and run)
# 1) Duplicates
print("Duplicate rows found:")
print(int(df.duplicated().sum()))
# 2) Introduce a missing value + impute
df_ex = df.copy()
df_ex.loc[0, "sepal_length"] = None
print("\nMissing values after insertion:")
print(df_ex.isna().sum())
df_ex[num_cols] = df_ex[num_cols].fillna(df_ex[num_cols].median())
print("\nMissing values after median fill:")
print(df_ex.isna().sum())
# 3) Category dtype
df_ex["species"] = df_ex["species"].astype("category")
print("\nDtypes after conversion:")
print(df_ex.dtypes)
Duplicate rows found:
0
Missing values after insertion:
sepal_length 1
sepal_width 0
petal_length 0
petal_width 0
species 0
dtype: int64
Missing values after median fill:
sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
species 0
dtype: int64
Dtypes after conversion:
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species category
dtype: object
3.12 Summary
- You inspected the dataset for common quality issues
- You checked duplicates, missing values, and data types
- You standardized column names for consistent coding
- You validated and saved a cleaned dataset for future lessons
Continue to Lesson 04 — Data Wrangling Basics.