Summary Statistics and Insights

  • ID: DS-L06
  • Type: Lesson
  • Audience: Public
  • Theme: Using summary statistics to move from numbers to interpretation

In this final lesson of the Foundations Track, you will compute descriptive statistics, explore grouped summaries, and extract defensible insights from your cleaned dataset.

As in previous chapters, all code runs inside this Quarto (.qmd) file. When you render the book, Quarto executes the Python chunks and embeds results directly into the page.


Lesson Overview

By the end of this lesson, you will be able to:

  • Compute descriptive statistics for numeric and categorical variables
  • Summarize data grouped by category (species)
  • Identify which features differ most between species
  • Create simple visuals that support your summaries
  • Write short, careful interpretations grounded in evidence

Chapter Initialization

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

cdi_notebook_init(chapter="06", title_x=0.5)

Load the Dataset

We use the cleaned dataset created in Lesson 03.

import pandas as pd

df = pd.read_csv("data/iris_clean.csv")
df.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Descriptive Statistics

Numeric summary

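describe() reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column, giving a compact overview of central tendency and spread.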

df.describe()
       sepal_length  sepal_width  petal_length  petal_width
count    149.000000   149.000000    149.000000   149.000000
mean       5.843624     3.059732      3.748993     1.194631
std        0.830851     0.436342      1.767791     0.762622
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.300000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
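
To see what each row of the describe() output means, the sketch below recomputes the same statistics one at a time on a small toy column. The toy DataFrame and its values are invented for illustration and are not part of the Iris data; only standard pandas methods are used.

```python
import pandas as pd

# Toy values for illustration only; they are not taken from the Iris data.
toy = pd.DataFrame({"length": [1.0, 2.0, 3.0, 4.0, 10.0]})

# The compact summary, as used in the lesson.
summary = toy["length"].describe()

# The same numbers, computed one at a time.
manual = pd.Series(
    {
        "count": toy["length"].count(),
        "mean": toy["length"].mean(),
        "std": toy["length"].std(),  # sample SD (ddof=1), the pandas default
        "min": toy["length"].min(),
        "25%": toy["length"].quantile(0.25),
        "50%": toy["length"].median(),
        "75%": toy["length"].quantile(0.75),
        "max": toy["length"].max(),
    }
)

# Side-by-side comparison of the two routes.
print(pd.DataFrame({"describe": summary, "manual": manual}))
```

In the toy column a single large value pulls the mean (4.0) above the median (3.0). The petal_length column above shows a gap in the other direction (mean 3.75, median 4.30), which is a hint that its distribution is not symmetric and is worth checking in the histogram.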

Visual: Distributions of numeric features

import matplotlib.pyplot as plt

num_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

df[num_cols].hist(figsize=(10, 7), bins=12)
plt.suptitle("Iris — Distributions of Numeric Features", y=1.02)
plt.tight_layout()

show_and_save_mpl()

Iris — Distributions of Numeric Features
'figures/06_001.png'

Categorical Summary

For categorical variables, counts are often the simplest and most useful summary.

species_counts = df["species"].value_counts()
species_counts
species
setosa        50
versicolor    50
virginica     49
Name: count, dtype: int64
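
Counts are easier to compare when they are also expressed as shares of the total. The following sketch rebuilds the counts shown above by hand, so that it runs on its own, and converts them to percentages. The name species_share is introduced here for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
species_counts = pd.Series(
    {"setosa": 50, "versicolor": 50, "virginica": 49}, name="count"
)

# Share of each species, as a percentage of all rows.
species_share = (species_counts / species_counts.sum() * 100).round(1)
print(species_share)
```

In the lesson itself, df["species"].value_counts(normalize=True) should give the same proportions directly. The three species are close to evenly represented, so the overall statistics from describe() are not dominated by any single species.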

Visual: Species distribution

fig, ax = plt.subplots(figsize=(7, 4))
species_counts.plot(kind="bar", ax=ax)
ax.set_title("Iris — Count of Samples by Species")
ax.set_xlabel("species")
ax.set_ylabel("count")
ax.grid(True, axis="y", alpha=0.2)

show_and_save_mpl(fig)

Iris — Count of Samples by Species
'figures/06_002.png'

Grouped Summary Statistics

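A common workflow is to summarize numeric variables by a grouping column. Here, named aggregation in agg() produces one mean column and one standard deviation column per feature, with one row per species.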

group_summary = (
    df.groupby("species")
      .agg(
          sepal_length_mean=("sepal_length", "mean"),
          sepal_length_sd=("sepal_length", "std"),
          petal_length_mean=("petal_length", "mean"),
          petal_length_sd=("petal_length", "std"),
          petal_width_mean=("petal_width", "mean"),
          petal_width_sd=("petal_width", "std"),
      )
      .reset_index()
)

group_summary
      species  sepal_length_mean  sepal_length_sd  petal_length_mean  petal_length_sd  petal_width_mean  petal_width_sd
0      setosa           5.006000         0.352490           1.462000         0.173664          0.246000        0.105386
1  versicolor           5.936000         0.516171           4.260000         0.469911          1.326000        0.197753
2   virginica           6.604082         0.632113           5.561224         0.553706          2.028571        0.276887

Visual: Mean petal length by species

import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(data=df, x="species", y="petal_length", errorbar="sd", ax=ax)
ax.set_title("Iris — Mean Petal Length by Species")
ax.set_xlabel("species")
ax.set_ylabel("petal_length")
ax.grid(True, axis="y", alpha=0.2)

show_and_save_mpl(fig)

Iris — Mean Petal Length by Species (SD error bars)
'figures/06_003.png'

A table like this becomes more meaningful when you connect it to patterns you saw in Lesson 05. Numbers should support interpretation, not replace it.


Key Feature Differences

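A simple first look at separation is the range of the group means for each feature: the largest species mean minus the smallest. This range is in the feature's own units and ignores within-group spread, so treat it as a rough ranking rather than a formal measure.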

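# Gap between the largest and smallest species mean for each feature.
# The `if` guard skips any feature that group_summary does not contain.
sep = {
    col: group_summary[f"{col}_mean"].max() - group_summary[f"{col}_mean"].min()
    for col in ["sepal_length", "petal_length", "petal_width"]
    if f"{col}_mean" in group_summary.columns
}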

sep
{'sepal_length': np.float64(1.5980816326530611),
 'petal_length': np.float64(4.099224489795919),
 'petal_width': np.float64(1.7825714285714285)}
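
The raw gap between group means depends on each feature's scale and ignores how much values vary within a species. One possible refinement, not part of the original lesson, is to divide each gap by a pooled within-group standard deviation. The sketch below does this with the means and standard deviations printed in the grouped summary and the group sizes from the species counts. The names n, means, sds, pooled_sd, and scaled_sep are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Group sizes, means, and SDs copied from the tables earlier in this lesson.
n = pd.Series({"setosa": 50, "versicolor": 50, "virginica": 49})
means = pd.DataFrame(
    {
        "sepal_length": [5.006000, 5.936000, 6.604082],
        "petal_length": [1.462000, 4.260000, 5.561224],
        "petal_width": [0.246000, 1.326000, 2.028571],
    },
    index=n.index,
)
sds = pd.DataFrame(
    {
        "sepal_length": [0.352490, 0.516171, 0.632113],
        "petal_length": [0.173664, 0.469911, 0.553706],
        "petal_width": [0.105386, 0.197753, 0.276887],
    },
    index=n.index,
)

# Pooled within-group SD: weight each group's variance by its degrees of freedom.
dof = n - 1
pooled_sd = np.sqrt(sds.pow(2).mul(dof, axis=0).sum() / dof.sum())

# Range of group means, expressed in pooled-SD units.
scaled_sep = (means.max() - means.min()) / pooled_sd
print(scaled_sep.round(2).sort_values(ascending=False))
```

Under these assumptions petal_length still ranks first, while petal_width pulls well ahead of sepal_length even though their raw gaps are similar. This is a descriptive ranking, not a statistical test.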

Correlation Heatmap (Numeric Features)

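By default, corr() computes pairwise Pearson correlation coefficients, which summarize the strength and direction of linear relationships between variables. Values range from -1 to 1.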

corr = df[num_cols].corr()
corr
              sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.118129      0.873738     0.820620
sepal_width      -0.118129     1.000000     -0.426028    -0.362894
petal_length      0.873738    -0.426028      1.000000     0.962772
petal_width       0.820620    -0.362894      0.962772     1.000000

fig, ax = plt.subplots(figsize=(7, 5))
sns.heatmap(corr, annot=True, fmt=".2f", ax=ax)
ax.set_title("Iris — Correlation Heatmap")

show_and_save_mpl(fig)

Iris — Correlation Heatmap (Numeric Features)
'figures/06_004.png'
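
A correlation computed over all rows can differ from the correlation inside each group when the groups sit at different baselines. The toy example below, with invented values and hypothetical column names, shows an overall negative Pearson correlation even though x and y rise together within each group.

```python
import pandas as pd

# Toy data, invented for illustration: two groups with different baselines.
toy = pd.DataFrame(
    {
        "group": ["a"] * 4 + ["b"] * 4,
        "x": [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0],
        "y": [5.0, 6.0, 7.0, 8.0, 1.0, 2.0, 3.0, 4.0],
    }
)

# Pearson correlation over all rows, ignoring the grouping.
overall_r = toy["x"].corr(toy["y"])

# Pearson correlation computed separately inside each group.
within_r = {name: g["x"].corr(g["y"]) for name, g in toy.groupby("group")}

print(f"overall: {overall_r:.2f}")
for name, r in within_r.items():
    print(f"group {name}: {r:.2f}")
```

Whether something similar happens in the Iris data is not shown here. One way to check would be df.groupby("species")[num_cols].corr(), which computes a separate correlation matrix per species.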

Generate a Simple Insights Report

An insights report links:

  • evidence (tables and plots)
  • interpretation (what the evidence suggests)
  • caution (what you cannot claim)

top_sep = group_summary.sort_values("petal_length_mean", ascending=False)[["species", "petal_length_mean"]]

report = f"""Insights Report (Iris)

1) Dataset size
- Rows: {df.shape[0]}
- Columns: {df.shape[1]}

2) Group separation
- Petal length shows strong separation by species.
- Mean petal length by species:
{top_sep.to_string(index=False)}

3) Consistency with visuals
- The group means align with the clustering seen in Lesson 05 plots.

4) Caution
- These are descriptive patterns. This dataset does not justify causal claims.
"""

print(report)
Insights Report (Iris)

1) Dataset size
- Rows: 149
- Columns: 5

2) Group separation
- Petal length shows strong separation by species.
- Mean petal length by species:
   species  petal_length_mean
 virginica           5.561224
versicolor           4.260000
    setosa           1.462000

3) Consistency with visuals
- The group means align with the clustering seen in Lesson 05 plots.

4) Caution
- These are descriptive patterns. This dataset does not justify causal claims.
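
The report above states the separation finding as fixed text. A small variation, sketched here as an assumption rather than the lesson's method, is to derive that sentence from the computed gaps so the wording follows the numbers. The sep values are copied from the output earlier in this lesson; top_feature and finding are illustrative names.

```python
# Gaps between the largest and smallest species means, copied from `sep`.
sep = {
    "sepal_length": 1.5980816326530611,
    "petal_length": 4.099224489795919,
    "petal_width": 1.7825714285714285,
}

# Pick the feature with the largest gap and describe it in one line.
top_feature = max(sep, key=sep.get)
finding = (
    f"- {top_feature} has the largest gap between species means "
    f"({sep[top_feature]:.2f})."
)
print(finding)
```

A line built this way can be dropped into the f-string in place of the hard-coded sentence, so the report stays consistent if the data changes.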

Exercise

  • Compute the mean and standard deviation of sepal_width by species
  • Create a table with the median of each numeric feature by species
  • Write one paragraph interpreting your results without causal language

sepal_width_stats = (
    df.groupby("species")
      .agg(
          sepal_width_mean=("sepal_width", "mean"),
          sepal_width_sd=("sepal_width", "std"),
      )
      .reset_index()
)

sepal_width_stats
      species  sepal_width_mean  sepal_width_sd
0      setosa          3.428000        0.379064
1  versicolor          2.770000        0.313798
2   virginica          2.979592        0.323380

median_table = (
    df.groupby("species")[num_cols]
      .median()
      .reset_index()
)

median_table
      species  sepal_length  sepal_width  petal_length  petal_width
0      setosa           5.0          3.4          1.50          0.2
1  versicolor           5.9          2.8          4.35          1.3
2   virginica           6.5          3.0          5.60          2.0

Summary

  • You computed descriptive statistics for numeric and categorical variables
  • You compared groups using aggregation and simple plots
  • You examined correlations between numeric features
  • You produced a short insights report grounded in evidence
  • You practiced writing careful interpretations