Lesson 6 Summary Statistics and Insights

In this final lesson of the Free CDI Python Data Science Track, you will compute descriptive statistics, explore grouped summaries, and extract insights from your wrangled dataset.

6.1 Lesson Overview

By the end of this lesson, you will be able to:

  • Compute descriptive statistics for numeric and categorical variables
  • Summarize data grouped by category (species)
  • Detect feature differences and patterns
  • Generate a simple insights report
  • Visualize summary statistics using clean, publication-ready plots

Summary statistics are the backbone of exploratory data analysis.
Before modeling or advanced visualization, analysts profile their datasets numerically to understand structure and variation.

6.2 Notebook Setup

We initialize the CDI notebook utilities so that all figures:

  • are saved incrementally to figures/
  • render safely in GitBook and PDF outputs

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

# Lesson ID drives figure naming (e.g., figures/06_001.png)
cdi_notebook_init(chapter="06", title_x=0)
'cdi'

6.3 Load the Dataset

We use the wrangled dataset produced in Lesson 04.

import pandas as pd

df = pd.read_csv("data/iris_wrangled.csv")

print(df.head())
print(df.shape)
   sepal_length  sepal_width  petal_length  petal_width species  sepal_area  \
0           5.1          3.5           1.4          0.2  setosa       17.85   
1           4.9          3.0           1.4          0.2  setosa       14.70   
2           4.7          3.2           1.3          0.2  setosa       15.04   
3           4.6          3.1           1.5          0.2  setosa       14.26   
4           5.0          3.6           1.4          0.2  setosa       18.00   

   petal_ratio petal_size  
0          7.0      small  
1          7.0      small  
2          6.5      small  
3          7.5      small  
4          7.0      small  
(149, 8)

6.4 Descriptive Statistics

We begin by summarizing numeric variables.

# Numeric summary
print(df.describe())
       sepal_length  sepal_width  petal_length  petal_width  sepal_area  \
count    149.000000   149.000000    149.000000   149.000000  149.000000   
mean       5.843624     3.059732      3.748993     1.194631   17.837383   
std        0.830851     0.436342      1.767791     0.762622    3.368472   
min        4.300000     2.000000      1.000000     0.100000   10.000000   
25%        5.100000     2.800000      1.600000     0.300000   15.660000   
50%        5.800000     3.000000      4.300000     1.300000   17.680000   
75%        6.400000     3.300000      5.100000     1.800000   20.400000   
max        7.900000     4.400000      6.900000     2.500000   30.020000   

       petal_ratio  
count   149.000000  
mean      4.321414  
std       2.494442  
min       2.125000  
25%       2.809524  
50%       3.300000  
75%       4.666667  
max      15.000000  
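By default, describe() reports only numeric columns, so text columns such as species and petal_size are missing from the table above. The sketch below shows one way to summarize non-numeric columns. It uses a small hand-made stand-in table (the name toy and its values are illustrative assumptions, not the lesson's dataset) so that it runs on its own.

```python
import pandas as pd

# Toy stand-in for the wrangled dataset (values are illustrative only)
toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
    "petal_size": ["small", "small", "large", "large", "large", "large"],
})

# Keep only the non-numeric columns, then describe them.
# For text columns, describe() reports count, unique, top, and freq.
cat_summary = toy.select_dtypes(exclude="number").describe()
print(cat_summary)
```

On the lesson's data, the same pattern would be applied to df instead of toy.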

6.4.1 Visual: Distributions of Numeric Features

These histograms help you see spread, skew, and outliers at a glance.

import matplotlib.pyplot as plt

num_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

fig, axes = plt.subplots(2, 2, figsize=(10, 7))
axes = axes.ravel()

for ax, col in zip(axes, num_cols):
    ax.hist(df[col], bins=12)
    ax.set_title(col)
    ax.grid(True, alpha=0.2)

plt.suptitle("Iris — Numeric Feature Distributions", y=1.02)
plt.tight_layout()

show_and_save_mpl(fig)
Saved PNG → figures/06_001.png
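If you want numbers to back up what the histograms suggest, a minimal sketch like the one below may help. It computes sample skewness and counts values outside the 1.5 × IQR fences, a common rule of thumb rather than a strict definition of an outlier. The toy table and its values are assumptions made for illustration only.

```python
import pandas as pd

# Toy data: one symmetric column and one with a long right tail
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_tailed": [1.0, 1.0, 1.0, 2.0, 10.0],
})

# Sample skewness: near 0 for symmetric data, positive for a long right tail
skewness = toy.skew()

# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in each column
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
outside = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))
outlier_counts = outside.sum()

print(skewness)
print(outlier_counts)
```

With the lesson's data, you could replace toy with df[num_cols].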

6.5 Categorical Summary

We examine how observations are distributed across species.

counts = df["species"].value_counts()
print(counts)

print("\nProportions:")
print(df["species"].value_counts(normalize=True))
species
setosa        50
versicolor    50
virginica     49
Name: count, dtype: int64

Proportions:
species
setosa        0.335570
versicolor    0.335570
virginica     0.328859
Name: proportion, dtype: float64

6.5.1 Visual: Species Distribution

fig, ax = plt.subplots(figsize=(6, 4))

ax.bar(counts.index.astype(str), counts.values)
ax.set_title("Iris — Species Distribution")
ax.set_xlabel("species")
ax.set_ylabel("count")
ax.grid(True, axis="y", alpha=0.2)

show_and_save_mpl(fig)
Saved PNG → figures/06_002.png

6.6 Grouped Summary Statistics

Grouped summaries reveal systematic differences between species.

summary = (
    df.groupby("species", as_index=False)
    .agg(
        sepal_length_mean=("sepal_length", "mean"),
        sepal_width_mean=("sepal_width", "mean"),
        petal_length_mean=("petal_length", "mean"),
        petal_width_mean=("petal_width", "mean"),
        sepal_area_mean=("sepal_area", "mean"),
        petal_ratio_mean=("petal_ratio", "mean"),
        n=("species", "count"),
    )
)

print(summary)
      species  sepal_length_mean  sepal_width_mean  petal_length_mean  \
0      setosa           5.006000          3.428000           1.462000   
1  versicolor           5.936000          2.770000           4.260000   
2   virginica           6.604082          2.979592           5.561224   

   petal_width_mean  sepal_area_mean  petal_ratio_mean   n  
0          0.246000        17.257800          6.908000  50  
1          1.326000        16.526200          3.242837  50  
2          2.028571        19.766735          2.782631  49  
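Group means alone do not show how much each species varies around its mean. As a rough sketch, you can pass several statistics to agg() at once to see spread alongside the mean. The toy table below is a hypothetical stand-in with made-up values so the example runs on its own.

```python
import pandas as pd

# Toy stand-in: two species, three flowers each (values are illustrative)
toy = pd.DataFrame({
    "species": ["setosa"] * 3 + ["virginica"] * 3,
    "petal_length": [1.4, 1.5, 1.3, 5.0, 6.0, 5.5],
})

# One row per species, one column per statistic
spread = toy.groupby("species")["petal_length"].agg(["mean", "std", "min", "max"])
print(spread)
```

If two groups have similar means but very different standard deviations, a bar chart of means alone would hide that difference.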

6.6.1 Visual: Mean Petal Length by Species

fig, ax = plt.subplots(figsize=(7, 4))

ax.bar(summary["species"].astype(str), summary["petal_length_mean"])
ax.set_title("Mean Petal Length by Species")
ax.set_xlabel("species")
ax.set_ylabel("mean petal_length")
ax.grid(True, axis="y", alpha=0.2)

show_and_save_mpl(fig)
Saved PNG → figures/06_003.png

6.7 Key Feature Differences

A quick ranking helps you turn summary tables into insights.

print("Petal length (descending):")
print(summary.sort_values("petal_length_mean", ascending=False)[["species", "petal_length_mean"]])

print("\nSepal area (descending):")
print(summary.sort_values("sepal_area_mean", ascending=False)[["species", "sepal_area_mean"]])
Petal length (descending):
      species  petal_length_mean
2   virginica           5.561224
1  versicolor           4.260000
0      setosa           1.462000

Sepal area (descending):
      species  sepal_area_mean
2   virginica        19.766735
0      setosa        17.257800
1  versicolor        16.526200
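A ranking shows the order but not the size of the gap. One possible way to quantify it is the ratio between the largest and smallest group mean, sketched below. The means are copied from the grouped summary printed in this lesson; the names means and ratios are introduced here for illustration.

```python
import pandas as pd

# Means copied from the grouped summary printed earlier in this lesson
means = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica"],
    "petal_length_mean": [1.462000, 4.260000, 5.561224],
    "sepal_area_mean": [17.257800, 16.526200, 19.766735],
})

ratios = {}
for col in ["petal_length_mean", "sepal_area_mean"]:
    # Species with the largest and smallest mean for this feature
    top = means.loc[means[col].idxmax(), "species"]
    bottom = means.loc[means[col].idxmin(), "species"]
    ratios[col] = (top, bottom, means[col].max() / means[col].min())
    print(f"{col}: {top} is {ratios[col][2]:.2f}x {bottom}")
```

A ratio close to 1 suggests the feature separates the species only weakly on average, while a large ratio points to a feature worth a closer look.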

6.8 Correlation Heatmap (Numeric Features)

Correlation helps you detect strong linear relationships between features. DataFrame.corr() computes the Pearson correlation by default, so each value lies between -1 and 1.

import matplotlib.pyplot as plt

corr = df.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(7, 6))
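# Fix the color scale to the full correlation range and add a colorbar
im = ax.imshow(corr.values, vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label="correlation")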

ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticklabels(corr.columns)

ax.set_title("Correlation Heatmap (Numeric Features)")

# Add correlation values on cells (optional but helpful)
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.values[i, j]:.2f}", ha="center", va="center", fontsize=8)

plt.tight_layout()
show_and_save_mpl(fig)
Saved PNG → figures/06_004.png
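A heatmap is easy to scan, but a sorted list of pairs can be easier to quote in a report. The sketch below lists each pair of columns once, ordered by absolute correlation. The toy table, its column names (a, b, c), and its values are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd

# Toy data: b rises with a, c tends to fall as a rises
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr_toy = toy.corr()

# combinations() yields each unordered pair once and skips the diagonal
pairs = [
    (left, right, corr_toy.loc[left, right])
    for left, right in combinations(corr_toy.columns, 2)
]

# Strongest relationships first, whether positive or negative
pairs.sort(key=lambda item: abs(item[2]), reverse=True)

for left, right, value in pairs:
    print(f"{left} vs {right}: {value:+.2f}")
```

With the lesson's data, corr_toy would be replaced by the corr matrix computed above.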

6.9 Generate a Simple Insights Report

We create a small text report summarizing our findings.

from pathlib import Path

report_lines = []
report_lines.append("CDI — Summary Insights Report")
report_lines.append("----------------------------------")
report_lines.append(f"Total rows: {df.shape[0]}")
report_lines.append("")

report_lines.append("Species Distribution:")
report_lines.append(str(df["species"].value_counts()))
report_lines.append("")

report_lines.append("Summary by Species:")
report_lines.append(str(summary))
report_lines.append("")

Path("reports").mkdir(exist_ok=True)
report_path = "reports/iris-summary-report.txt"

# Set the encoding explicitly because the report title contains a non-ASCII dash
with open(report_path, "w", encoding="utf-8") as f:
    f.write("\n".join(report_lines))

print("Saved summary report to:")
print(report_path)
Saved summary report to:
reports/iris-summary-report.txt
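Note that str() on a wide DataFrame can wrap columns across several blocks, as in the printed tables earlier in this lesson. If you prefer one line per row in the report, DataFrame.to_string() is one option. The sketch below uses a hypothetical summary_demo table (three columns, with means copied from the grouped summary) and writes to a temporary folder so it does not touch reports/.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the lesson's summary table, using means printed earlier
summary_demo = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica"],
    "petal_length_mean": [1.462, 4.26, 5.561224],
    "n": [50, 50, 49],
})

# to_string() renders the full table without wrapping wide rows
table_text = summary_demo.to_string(index=False)

# Write to a temporary folder (an assumption made to keep the sketch isolated)
out_path = Path(tempfile.mkdtemp()) / "demo-report.txt"
out_path.write_text("Summary by Species:\n" + table_text + "\n", encoding="utf-8")

print(out_path.read_text(encoding="utf-8"))
```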

6.10 Exercise

  • Print the correlation matrix (df.corr(numeric_only=True) is fine)
  • Rank species by petal_ratio_mean using the summary table
  • Write 3–5 insights based on the grouped summary (in plain text, as comments)
  • Re-run the correlation heatmap cell and confirm it saved to figures/

6.11 Summary

  • You computed descriptive statistics for numeric and categorical data
  • You built grouped summaries to compare species
  • You visualized key differences using simple, clear plots
  • You created a reusable text report for future reference