Lesson 6 Summary Statistics and Insights
In this final lesson of the Free CDI Python Data Science Track, you will compute descriptive statistics, explore grouped summaries, and extract insights from your wrangled dataset.
6.1 Lesson Overview
By the end of this lesson, you will be able to:
- Compute descriptive statistics for numeric and categorical variables
- Summarize data grouped by category (species)
- Detect feature differences and patterns
- Generate a simple insights report
- Visualize summary statistics using clean, publication-ready plots
6.2 Notebook Setup
We initialize the CDI notebook utilities so that all figures:
- are saved incrementally to figures/
- render safely in GitBook and PDF outputs
from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl
# Lesson ID drives figure naming (e.g., figures/06_001.png)
cdi_notebook_init(chapter="06", title_x=0)
'cdi'
6.3 Load the Dataset
We use the wrangled dataset produced in Lesson 04.
sepal_length sepal_width petal_length petal_width species sepal_area \
0 5.1 3.5 1.4 0.2 setosa 17.85
1 4.9 3.0 1.4 0.2 setosa 14.70
2 4.7 3.2 1.3 0.2 setosa 15.04
3 4.6 3.1 1.5 0.2 setosa 14.26
4 5.0 3.6 1.4 0.2 setosa 18.00
petal_ratio petal_size
0 7.0 small
1 7.0 small
2 6.5 small
3 7.5 small
4 7.0 small
(149, 8)
6.4 Descriptive Statistics
We begin by summarizing numeric variables.
sepal_length sepal_width petal_length petal_width sepal_area \
count 149.000000 149.000000 149.000000 149.000000 149.000000
mean 5.843624 3.059732 3.748993 1.194631 17.837383
std 0.830851 0.436342 1.767791 0.762622 3.368472
min 4.300000 2.000000 1.000000 0.100000 10.000000
25% 5.100000 2.800000 1.600000 0.300000 15.660000
50% 5.800000 3.000000 4.300000 1.300000 17.680000
75% 6.400000 3.300000 5.100000 1.800000 20.400000
max 7.900000 4.400000 6.900000 2.500000 30.020000
petal_ratio
count 149.000000
mean 4.321414
std 2.494442
min 2.125000
25% 2.809524
50% 3.300000
75% 4.666667
max 15.000000
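Output with this shape typically comes from pandas `describe()`. The sketch below is a minimal illustration on a small stand-in frame named `toy` (a hypothetical name; its values are the first five rows printed in section 6.3), not the lesson's full `df`. It also shows that calling `describe()` on a text column returns a categorical summary instead of quartiles.

```python
import pandas as pd

# Stand-in data: the first five rows printed in section 6.3 (not the full dataset).
toy = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
        "species": ["setosa"] * 5,
    }
)

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_stats = toy.describe()
print(numeric_stats)

# Text column: count, number of unique values, most frequent value, and its frequency.
categorical_stats = toy["species"].describe()
print(categorical_stats)
```

On a mixed frame, `describe()` reports only the numeric columns by default, which is why `species` is missing from the table above.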
6.4.1 Visual: Distributions of Numeric Features
These histograms help you see spread, skew, and outliers at a glance.
import matplotlib.pyplot as plt
num_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
fig, axes = plt.subplots(2, 2, figsize=(10, 7))
axes = axes.ravel()
for ax, col in zip(axes, num_cols):
    ax.hist(df[col], bins=12)
    ax.set_title(col)
    ax.grid(True, alpha=0.2)
plt.suptitle("Iris — Numeric Feature Distributions", y=1.02)
plt.tight_layout()
show_and_save_mpl(fig)
Saved PNG → figures/06_001.png
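The visual impression of skew from a histogram can be cross-checked with a number. This is a minimal sketch using pandas' `Series.skew()`; the two series and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up values for illustration: one series with a long right tail, one symmetric.
right_tailed = pd.Series([1.0, 1.1, 1.2, 1.3, 1.4, 5.0])
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Positive skew: the tail extends to the right. Near zero: roughly symmetric.
skew_right = right_tailed.skew()
skew_sym = symmetric.skew()
print(skew_right, skew_sym)
```

The same call works on a column of the lesson's frame, for example `df["petal_length"].skew()`, assuming `df` is loaded as in section 6.3.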

6.5 Categorical Summary
We examine how observations are distributed across species.
counts = df["species"].value_counts()
print(counts)
print("\nProportions:")
print(df["species"].value_counts(normalize=True))
species
setosa 50
versicolor 50
virginica 49
Name: count, dtype: int64
Proportions:
species
setosa 0.335570
versicolor 0.335570
virginica 0.328859
Name: proportion, dtype: float64
6.5.1 Visual: Species Distribution
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(counts.index.astype(str), counts.values)
ax.set_title("Iris — Species Distribution")
ax.set_xlabel("species")
ax.set_ylabel("count")
ax.grid(True, axis="y", alpha=0.2)
show_and_save_mpl(fig)
Saved PNG → figures/06_002.png

6.6 Grouped Summary Statistics
Grouped summaries reveal systematic differences between species.
summary = (
    df.groupby("species", as_index=False)
    .agg(
        sepal_length_mean=("sepal_length", "mean"),
        sepal_width_mean=("sepal_width", "mean"),
        petal_length_mean=("petal_length", "mean"),
        petal_width_mean=("petal_width", "mean"),
        sepal_area_mean=("sepal_area", "mean"),
        petal_ratio_mean=("petal_ratio", "mean"),
        n=("species", "count"),
    )
)
print(summary)
species sepal_length_mean sepal_width_mean petal_length_mean \
0 setosa 5.006000 3.428000 1.462000
1 versicolor 5.936000 2.770000 4.260000
2 virginica 6.604082 2.979592 5.561224
petal_width_mean sepal_area_mean petal_ratio_mean n
0 0.246000 17.257800 6.908000 50
1 1.326000 16.526200 3.242837 50
2 2.028571 19.766735 2.782631 49
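Means alone hide how much each species varies around its average. As a hedged extension of the table above, the sketch below adds a standard deviation next to the mean using the same named-aggregation pattern. The frame `toy` and its values are made up for illustration and are not the lesson dataset.

```python
import pandas as pd

# Made-up measurements for two species (illustrative only).
toy = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "setosa", "virginica", "virginica", "virginica"],
        "petal_length": [1.4, 1.3, 1.5, 5.5, 6.0, 5.0],
    }
)

# Named aggregation: output_column=(input_column, function).
spread = (
    toy.groupby("species", as_index=False)
    .agg(
        petal_length_mean=("petal_length", "mean"),
        petal_length_std=("petal_length", "std"),
        n=("petal_length", "count"),
    )
)
print(spread)
```

A group difference is easier to trust when the gap between means is large relative to the within-group standard deviations.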
6.6.1 Visual: Mean Petal Length by Species
fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(summary["species"].astype(str), summary["petal_length_mean"])
ax.set_title("Mean Petal Length by Species")
ax.set_xlabel("species")
ax.set_ylabel("mean petal_length")
ax.grid(True, axis="y", alpha=0.2)
show_and_save_mpl(fig)
Saved PNG → figures/06_003.png

6.7 Key Feature Differences
A quick ranking helps you turn summary tables into insights.
print("Petal length (descending):")
print(summary.sort_values("petal_length_mean", ascending=False)[["species", "petal_length_mean"]])
print("\nSepal area (descending):")
print(summary.sort_values("sepal_area_mean", ascending=False)[["species", "sepal_area_mean"]])
Petal length (descending):
species petal_length_mean
2 virginica 5.561224
1 versicolor 4.260000
0 setosa 1.462000
Sepal area (descending):
species sepal_area_mean
2 virginica 19.766735
0 setosa 17.257800
1 versicolor 16.526200
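With many features, sorting one column at a time gets tedious. One possible shortcut is to index the summary by species and let `idxmax()` report which species tops each column. The sketch below uses a hypothetical frame `summary_subset` whose values are copied from the table in section 6.6, rounded to three decimals.

```python
import pandas as pd

# A subset of the grouped summary from section 6.6 (values rounded to three decimals).
summary_subset = pd.DataFrame(
    {
        "species": ["setosa", "versicolor", "virginica"],
        "petal_length_mean": [1.462, 4.260, 5.561],
        "sepal_width_mean": [3.428, 2.770, 2.980],
        "sepal_area_mean": [17.258, 16.526, 19.767],
    }
)

# For each feature column, report the species with the largest mean.
top_species = summary_subset.set_index("species").idxmax()
print(top_species)
```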
6.8 Correlation Heatmap (Numeric Features)
Correlation helps you detect strong linear relationships between features.
import numpy as np
import matplotlib.pyplot as plt
corr = df.select_dtypes(include="number").corr()
fig, ax = plt.subplots(figsize=(7, 6))
im = ax.imshow(corr.values)
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticklabels(corr.columns)
ax.set_title("Correlation Heatmap (Numeric Features)")
# Add correlation values on cells (optional but helpful)
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.values[i, j]:.2f}", ha="center", va="center", fontsize=8)
plt.tight_layout()
show_and_save_mpl(fig)
Saved PNG → figures/06_004.png
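Each cell of the heatmap is a Pearson correlation coefficient, r = cov(x, y) / (std(x) * std(y)). The sketch below computes it by hand and compares the result with `Series.corr()`. It uses the five `sepal_length` and `sepal_width` values printed in section 6.3 as a small stand-in sample, so the number it prints is not the full-dataset correlation.

```python
import numpy as np
import pandas as pd

# First five rows shown in section 6.3 (a stand-in sample, not the full dataset).
x = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])
y = pd.Series([3.5, 3.0, 3.2, 3.1, 3.6])

# Pearson r: sum of co-deviations divided by the product of the deviation norms.
# The (n - 1) factors in the covariance and standard deviations cancel out.
dx = x - x.mean()
dy = y - y.mean()
r_manual = float((dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum()))

# pandas uses the Pearson method by default.
r_pandas = x.corr(y)
print(round(r_manual, 4), round(r_pandas, 4))
```

Pearson r only measures linear association, so a value near zero does not rule out a curved relationship.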

6.9 Generate a Simple Insights Report
We create a small text report summarizing our findings.
from pathlib import Path
report_lines = []
report_lines.append("CDI — Summary Insights Report")
report_lines.append("----------------------------------")
report_lines.append(f"Total rows: {df.shape[0]}")
report_lines.append("")
report_lines.append("Species Distribution:")
report_lines.append(str(df["species"].value_counts()))
report_lines.append("")
report_lines.append("Summary by Species:")
report_lines.append(str(summary))
report_lines.append("")
Path("reports").mkdir(exist_ok=True)
report_path = "reports/iris-summary-report.txt"
with open(report_path, "w") as f:
    f.write("\n".join(report_lines))
print("Saved summary report to:")
print(report_path)
Saved summary report to:
reports/iris-summary-report.txt
6.10 Exercise
- Print the correlation matrix (df.corr(numeric_only=True) is fine)
- Rank species by average petal_ratio_mean using the summary table
- Write 3–5 insights based on the grouped summary (in plain text, as comments)
- Re-run the correlation heatmap cell and confirm it saved to figures/
6.11 Summary
- You computed descriptive statistics for numeric and categorical data
- You built grouped summaries to compare species
- You visualized key differences using simple, clear plots
- You created a reusable text report for future reference
Congratulations!
You have completed all lessons in the Free CDI Python Data Science Track.