print("Summary statistics:")
print(df.describe(include="all"))
print("\nMissing values per column:")
print(df.isna().sum())
Summary statistics:
        sepal_length  sepal_width  petal_length  petal_width  species
count     150.000000   150.000000    150.000000   150.000000      150
unique           NaN          NaN           NaN          NaN        3
top              NaN          NaN           NaN          NaN   setosa
freq             NaN          NaN           NaN          NaN       50
mean        5.843333     3.057333      3.758000     1.199333      NaN
std         0.828066     0.435866      1.765298     0.762238      NaN
min         4.300000     2.000000      1.000000     0.100000      NaN
25%         5.100000     2.800000      1.600000     0.300000      NaN
50%         5.800000     3.000000      4.350000     1.300000      NaN
75%         6.400000     3.300000      5.100000     1.800000      NaN
max         7.900000     4.400000      6.900000     2.500000      NaN

Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
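# Split columns by dtype so each group gets an appropriate fill strategy.
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

# Numeric columns: fill gaps with the column median.
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Non-numeric columns: fill gaps with the most frequent value.
for c in cat_cols:
    if df[c].isna().any():
        df[c] = df[c].fillna(df[c].mode().iloc[0])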
Fix Data Types
Code
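# Coerce numeric columns; values that cannot be parsed become NaN.
for c in num_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Store the label column as a categorical.
if "species" in df.columns:
    df["species"] = df["species"].astype("category")

print(df.dtypes)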
Final shape: (149, 5)
Total missing values: 0
Duplicate rows: 0
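A report like the one above can be produced with a few pandas calls. The sketch below is one possible way to do it, not necessarily the check used in this lesson: the helper name `validate_clean` and the toy DataFrame are hypothetical, and the toy values are illustrative rather than taken from the Iris data.

```python
import pandas as pd


def validate_clean(df: pd.DataFrame) -> dict:
    """Collect shape, total missing values, and duplicate-row count."""
    return {
        "shape": df.shape,
        "missing": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }


# Toy stand-in for the cleaned dataset (hypothetical values).
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": pd.Categorical(["setosa", "setosa", "virginica"]),
})

report = validate_clean(toy)
print(f"Final shape: {report['shape']}")
print(f"Total missing values: {report['missing']}")
print(f"Duplicate rows: {report['duplicates']}")

# Fail loudly instead of relying on someone reading the printout.
assert report["missing"] == 0, "unexpected missing values"
assert report["duplicates"] == 0, "unexpected duplicate rows"
```

Turning the checks into assertions means a later change that reintroduces missing values or duplicates stops the notebook instead of passing silently.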
Save Clean Dataset
Code
from pathlib import Path

Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris_clean.csv", index=False)
print("Saved cleaned dataset to: data/iris_clean.csv")
Saved cleaned dataset to: data/iris_clean.csv
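One detail worth checking after saving: CSV is a plain-text format, so the `category` dtype set earlier is not stored in the file and has to be reapplied on load. The sketch below illustrates this with a hypothetical toy DataFrame written to a temporary directory (so it does not touch `data/iris_clean.csv`); the variable names are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned dataset (hypothetical values).
toy = pd.DataFrame({
    "petal_length": [1.5, 4.5, 6.0],
    "species": pd.Categorical(["setosa", "versicolor", "virginica"]),
})

# Write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "iris_clean.csv"
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The category dtype does not survive the round trip through CSV.
was_category = isinstance(reloaded["species"].dtype, pd.CategoricalDtype)
print("species is categorical after reload:", was_category)

# Reapply the dtype so later steps see the same schema as before saving.
reloaded["species"] = reloaded["species"].astype("category")
print(reloaded.dtypes)
```

If preserving dtypes matters for later lessons, passing `dtype={"species": "category"}` to `pd.read_csv` is another way to restore it at load time.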
Summary
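You inspected dataset quality with summary statistics and per-column missing-value counts
You filled missing values (median for numeric columns, mode for categorical ones) and fixed column data types
You validated the final shape, total missing values, and duplicate rows explicitly
You saved a reproducible cleaned version for later lessons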
Next Step
Transform and organize the cleaned dataset for analysis.