from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl
cdi_notebook_init(chapter="06", title_x=0.5)Summary Statistics and Insights
In this final lesson of the Foundations Track, you will compute descriptive statistics, explore grouped summaries, and extract defensible insights from your cleaned dataset.
As in previous chapters, all code runs inside this Quarto (.qmd) file. When you render the book, Quarto executes the Python chunks and embeds results directly into the page.
Lesson Overview
By the end of this lesson, you will be able to:
- Compute descriptive statistics for numeric and categorical variables
- Summarize data grouped by category (species)
- Detect feature differences and patterns
- Create simple visuals that support your summaries
- Write short, careful interpretations grounded in evidence
Chapter Initialization
Load the Dataset
We use the cleaned dataset created in Lesson 03.
import pandas as pd
df = pd.read_csv("data/iris_clean.csv")
df.head()| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Descriptive Statistics
Numeric summary
describe() gives a compact overview of central tendency and spread.
df.describe()| sepal_length | sepal_width | petal_length | petal_width | |
|---|---|---|---|---|
| count | 149.000000 | 149.000000 | 149.000000 | 149.000000 |
| mean | 5.843624 | 3.059732 | 3.748993 | 1.194631 |
| std | 0.830851 | 0.436342 | 1.767791 | 0.762622 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.300000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
Visual: Distributions of numeric features
import matplotlib.pyplot as plt
num_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df[num_cols].hist(figsize=(10, 7), bins=12)
plt.suptitle("Iris — Distributions of Numeric Features", y=1.02)
plt.tight_layout()
show_and_save_mpl()
'figures/06_001.png'
Categorical Summary
For categorical variables, counts are often the simplest and most useful summary.
species_counts = df["species"].value_counts()
species_countsspecies
setosa 50
versicolor 50
virginica 49
Name: count, dtype: int64
Visual: Species distribution
fig, ax = plt.subplots(figsize=(7, 4))
species_counts.plot(kind="bar", ax=ax)
ax.set_title("Iris — Count of Samples by Species")
ax.set_xlabel("species")
ax.set_ylabel("count")
ax.grid(True, axis="y", alpha=0.2)
show_and_save_mpl(fig)
'figures/06_002.png'
Grouped Summary Statistics
A common workflow is to summarize numeric variables by a grouping column.
group_summary = (
df.groupby("species")
.agg(
sepal_length_mean=("sepal_length", "mean"),
sepal_length_sd=("sepal_length", "std"),
petal_length_mean=("petal_length", "mean"),
petal_length_sd=("petal_length", "std"),
petal_width_mean=("petal_width", "mean"),
petal_width_sd=("petal_width", "std"),
)
.reset_index()
)
group_summary| species | sepal_length_mean | sepal_length_sd | petal_length_mean | petal_length_sd | petal_width_mean | petal_width_sd | |
|---|---|---|---|---|---|---|---|
| 0 | setosa | 5.006000 | 0.352490 | 1.462000 | 0.173664 | 0.246000 | 0.105386 |
| 1 | versicolor | 5.936000 | 0.516171 | 4.260000 | 0.469911 | 1.326000 | 0.197753 |
| 2 | virginica | 6.604082 | 0.632113 | 5.561224 | 0.553706 | 2.028571 | 0.276887 |
Visual: Mean petal length by species
import seaborn as sns
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(data=df, x="species", y="petal_length", errorbar="sd", ax=ax)
ax.set_title("Iris — Mean Petal Length by Species")
ax.set_xlabel("species")
ax.set_ylabel("petal_length")
ax.grid(True, axis="y", alpha=0.2)
show_and_save_mpl(fig)
'figures/06_003.png'
Key Feature Differences
A simple way to detect the strongest separation is to compare group means across features.
sep = {
col: group_summary[f"{col}_mean"].max() - group_summary[f"{col}_mean"].min()
for col in ["sepal_length", "petal_length", "petal_width"]
if f"{col}_mean" in group_summary.columns
}
sep{'sepal_length': np.float64(1.5980816326530611),
'petal_length': np.float64(4.099224489795919),
'petal_width': np.float64(1.7825714285714285)}
Correlation Heatmap (Numeric Features)
Correlations summarize linear relationships between variables.
corr = df[num_cols].corr()
corr| sepal_length | sepal_width | petal_length | petal_width | |
|---|---|---|---|---|
| sepal_length | 1.000000 | -0.118129 | 0.873738 | 0.820620 |
| sepal_width | -0.118129 | 1.000000 | -0.426028 | -0.362894 |
| petal_length | 0.873738 | -0.426028 | 1.000000 | 0.962772 |
| petal_width | 0.820620 | -0.362894 | 0.962772 | 1.000000 |
fig, ax = plt.subplots(figsize=(7, 5))
sns.heatmap(corr, annot=True, fmt=".2f", ax=ax)
ax.set_title("Iris — Correlation Heatmap")
show_and_save_mpl(fig)
'figures/06_004.png'
Generate a Simple Insights Report
An insights report links:
- evidence (tables and plots)
- interpretation (what the evidence suggests)
- caution (what you cannot claim)
top_sep = group_summary.sort_values("petal_length_mean", ascending=False)[["species", "petal_length_mean"]]
report = f"""Insights Report (Iris)
1) Dataset size
- Rows: {df.shape[0]}
- Columns: {df.shape[1]}
2) Group separation
- Petal length shows strong separation by species.
- Mean petal length by species:
{top_sep.to_string(index=False)}
3) Consistency with visuals
- The group means align with the clustering seen in Lesson 05 plots.
4) Caution
- These are descriptive patterns. This dataset does not justify causal claims.
"""
print(report)Insights Report (Iris)
1) Dataset size
- Rows: 149
- Columns: 5
2) Group separation
- Petal length shows strong separation by species.
- Mean petal length by species:
species petal_length_mean
virginica 5.561224
versicolor 4.260000
setosa 1.462000
3) Consistency with visuals
- The group means align with the clustering seen in Lesson 05 plots.
4) Caution
- These are descriptive patterns. This dataset does not justify causal claims.
Exercise
- Compute the mean and standard deviation of
sepal_widthby species
- Create a table with the median of each numeric feature by species
- Write one paragraph interpreting your results without causal language
sepal_width_stats = (
df.groupby("species")
.agg(
sepal_width_mean=("sepal_width", "mean"),
sepal_width_sd=("sepal_width", "std"),
)
.reset_index()
)
sepal_width_stats| species | sepal_width_mean | sepal_width_sd | |
|---|---|---|---|
| 0 | setosa | 3.428000 | 0.379064 |
| 1 | versicolor | 2.770000 | 0.313798 |
| 2 | virginica | 2.979592 | 0.323380 |
median_table = (
df.groupby("species")[num_cols]
.median()
.reset_index()
)
median_table| species | sepal_length | sepal_width | petal_length | petal_width | |
|---|---|---|---|---|---|
| 0 | setosa | 5.0 | 3.4 | 1.50 | 0.2 |
| 1 | versicolor | 5.9 | 2.8 | 4.35 | 1.3 |
| 2 | virginica | 6.5 | 3.0 | 5.60 | 2.0 |
Summary
- You computed descriptive statistics for numeric and categorical variables
- You compared groups using aggregation and simple plots
- You examined correlations between numeric features
- You produced a short insights report grounded in evidence
- You practiced writing careful interpretations