Summary Statistics and Insights

Published

Jun 2026

  • ID: DS-L06
  • Type: Lesson
  • Audience: Beginner / Intermediate
  • Theme: Using summary statistics to move from numbers to interpretation

Summary statistics help move analysis from individual rows to interpretable evidence.

They describe the center, spread, balance, and relationships in a dataset without requiring us to inspect every observation individually.

However, descriptive summaries are only useful when they are connected to careful interpretation.

In this lesson, we compute descriptive statistics, compare grouped summaries, examine simple relationships, and write short insights grounded in evidence.

This chapter closes the core tidy-table workflow introduced in the Data Science Foundations System:

inspect
  ↓
clean
  ↓
wrangle
  ↓
visualize
  ↓
summarize and interpret

Lesson overview

By the end of this lesson, you will be able to:

  • compute descriptive statistics for numeric variables
  • summarize categorical variables with counts
  • compare groups using means, medians, and standard deviations
  • examine simple relationships between numeric features
  • save reusable summary tables
  • write careful interpretations based on descriptive evidence
  • run a reusable summarization script from the command line

Chapter workflow

This chapter introduces the fifth reusable Python script in the system:

06-summary-statistics-and-insights.qmd
        ↓
scripts/python/summarize_table.py
        ↓
data/iris_wrangled.csv
results/summary/

Expected outputs:

results/summary/
├── numeric-summary.tsv
├── species-counts.tsv
├── grouped-summary.tsv
├── median-summary.tsv
├── feature-separation.tsv
├── correlation-matrix.tsv
└── analysis-insights.md

These outputs provide a compact evidence package for reporting and interpretation.


Load the wrangled dataset

We use the wrangled dataset created in Chapter 04.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/iris_wrangled.csv")
df.head()

Inspect the structure

Before summarizing, confirm the structure of the dataset.

print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)

Interpretation

Before computing summaries, ask:

  • which variables are numeric?
  • which variable defines the groups?
  • which derived features are available?
  • which summaries are likely to be most informative?

For the Iris dataset, the flower measurements and petal_area are numeric, while species provides a natural grouping variable.
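The first two questions can also be answered programmatically by splitting columns on their dtype. The sketch below is a minimal illustration using a small made-up table rather than the real `data/iris_wrangled.csv`; the column names mirror the lesson, but the values and the names `toy`, `numeric_cols`, and `other_cols` are assumptions for the example.

```python
import pandas as pd

# Toy stand-in for the wrangled table; values are made up for illustration.
toy = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3],
        "petal_length": [1.4, 4.7, 6.0],
        "petal_area": [0.28, 6.58, 15.0],
        "species": ["setosa", "versicolor", "virginica"],
    }
)

# Numeric columns are candidates for describe(), means, and correlations.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is a candidate grouping or label column.
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Non-numeric:", other_cols)
```

Selecting by dtype is only a starting point: an integer ID column would also count as numeric, so the resulting list still deserves a quick manual review.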


Descriptive statistics for numeric variables

describe() gives a compact summary of central tendency and spread.

num_cols = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "petal_area"
]

numeric_summary = df[num_cols].describe().transpose()
numeric_summary

Interpretation

This table helps you review:

  • the typical range of each feature
  • the approximate center using the mean
  • variability using the standard deviation
  • minimum and maximum values
  • quartiles that describe spread

A numeric summary is often the first checkpoint before deeper comparison.
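The columns returned by describe() can be combined into further spread measures. The sketch below, on made-up values, derives the interquartile range and the coefficient of variation from the summary table; the names `toy`, `summary`, `iqr`, and `cv` are illustrative and not part of the lesson's outputs.

```python
import pandas as pd

# Toy measurements; values are made up for illustration.
toy = pd.DataFrame(
    {
        "petal_length": [1.0, 2.0, 3.0, 4.0, 5.0],
        "petal_width": [0.2, 0.2, 0.4, 1.4, 1.8],
    }
)

summary = toy.describe().transpose()

# Interquartile range: spread of the middle half of the data.
summary["iqr"] = summary["75%"] - summary["25%"]

# Coefficient of variation: standard deviation relative to the mean,
# which makes spread easier to compare across features on different scales.
summary["cv"] = summary["std"] / summary["mean"]

print(summary[["mean", "std", "iqr", "cv"]].round(3))
```

The coefficient of variation is only meaningful for measurements with a true zero and a mean well away from zero, which is a reasonable assumption for lengths and widths.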


Visual check of numeric distributions

A simple set of histograms helps connect the summary table to visible distributions.

df_long = df.melt(
    value_vars=num_cols,
    var_name="feature",
    value_name="value"
)

g = sns.displot(
    data=df_long,
    x="value",
    col="feature",
    col_wrap=3,
    bins=12,
    kde=True,
    height=3.3,
    aspect=1.15
)

g.figure.suptitle("Iris: Distributions of Numeric Features", y=1.03)

plt.show()

Interpretation

These histograms help you assess:

  • whether distributions are narrow or wide
  • whether some features appear more variable than others
  • whether there may be multiple peaks suggesting subgroup structure

A summary table alone may not reveal distribution shape.
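To see why shape matters, consider two made-up samples that share the same mean and median but are distributed very differently. The sketch below uses NumPy bin counts as a text-only histogram; the sample values and the names `two_peaks` and `one_peak` are invented for the example.

```python
import numpy as np
import pandas as pd

# Two made-up samples with the same mean and median but different shapes.
two_peaks = pd.Series([1, 1, 1, 1, 5, 5, 5, 5])
one_peak = pd.Series([1, 2, 3, 3, 3, 3, 4, 5])

for name, s in [("two_peaks", two_peaks), ("one_peak", one_peak)]:
    # Bin counts act as a text-only histogram over the shared range 1..5.
    counts, _ = np.histogram(s, bins=3, range=(1, 5))
    print(name, "mean:", s.mean(), "median:", s.median(), "bins:", counts.tolist())
```

Both samples report a center of 3, yet one has no observations near 3 at all. A table of centers would not show this; a histogram does.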


Categorical summary

For a categorical variable, counts are often the simplest descriptive summary.

species_counts = (
    df["species"]
    .value_counts()
    .sort_index()
    .reset_index()
)

species_counts.columns = ["species", "n"]
species_counts

Visual: species distribution

fig, ax = plt.subplots(figsize=(7, 4))

sns.countplot(
    data=df,
    x="species",
    ax=ax
)

ax.set_title("Count of Samples by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Count")

plt.show()

Interpretation

Category counts help you assess whether groups are balanced.

Balanced group sizes make comparisons easier to interpret, while strong imbalance can affect how patterns appear in tables and plots.
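Balance can be expressed as numbers as well as read from a plot. The sketch below computes proportions and a simple largest-to-smallest ratio on deliberately imbalanced made-up labels; the ratio is an informal diagnostic chosen for this example, not a standard defined in the lesson.

```python
import pandas as pd

# Made-up labels with deliberate imbalance; the real Iris table may differ.
labels = pd.Series(["a"] * 6 + ["b"] * 3 + ["c"] * 1, name="group")

counts = labels.value_counts().sort_index()
proportions = labels.value_counts(normalize=True).sort_index()

# Ratio of largest to smallest group; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

balance = pd.DataFrame({"n": counts, "proportion": proportions})
print(balance)
print("Imbalance ratio:", imbalance_ratio)
```

With a ratio this far above 1, means for the smallest group rest on very few observations and should be interpreted with extra caution.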


Grouped summary statistics

A common workflow is to summarize numeric variables by a grouping column.

group_summary = (
    df.groupby("species", observed=False)
      .agg(
          n=("species", "size"),
          sepal_length_mean=("sepal_length", "mean"),
          sepal_length_sd=("sepal_length", "std"),
          sepal_width_mean=("sepal_width", "mean"),
          sepal_width_sd=("sepal_width", "std"),
          petal_length_mean=("petal_length", "mean"),
          petal_length_sd=("petal_length", "std"),
          petal_width_mean=("petal_width", "mean"),
          petal_width_sd=("petal_width", "std"),
          petal_area_mean=("petal_area", "mean"),
          petal_area_sd=("petal_area", "std")
      )
      .reset_index()
)

group_summary

Interpretation

Grouped summaries help answer questions such as:

  • which species has the largest average petals?
  • which variable differs most across species?
  • which groups appear similar and which appear distinct?

This is where descriptive statistics begin to support insight.
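When many columns need the same statistics, writing one named aggregation per column becomes repetitive. One possible alternative, sketched on made-up data, is to pass a list of functions and then flatten the resulting two-level column labels. Note that this produces `_std` suffixes rather than the `_sd` suffixes used in the lesson's table.

```python
import pandas as pd

# Toy table; values are made up for illustration.
toy = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b"],
        "petal_length": [1.0, 3.0, 5.0, 7.0],
        "petal_width": [0.2, 0.4, 1.0, 2.0],
    }
)

value_cols = ["petal_length", "petal_width"]

# Applying a list of functions yields two-level column labels,
# e.g. ("petal_length", "mean").
wide = toy.groupby("species", observed=False)[value_cols].agg(["mean", "std"])

# Flatten to single-level names such as "petal_length_mean".
wide.columns = [f"{col}_{stat}" for col, stat in wide.columns]
wide = wide.reset_index()

print(wide)
```

The explicit named-aggregation form in the lesson is longer but makes every output column visible in the code, which can be easier to review.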


Median table by species

Means are useful, but medians can provide a more robust summary when you want a center less influenced by unusual values.

median_table = (
    df.groupby("species", observed=False)[num_cols]
      .median()
      .reset_index()
)

median_table

Interpretation

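Comparing means and medians helps you assess whether a group’s center is stable or pulled by a few extreme values. When the two are close, the group is roughly symmetric; when they diverge, the mean is being drawn toward a tail.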
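The comparison between mean and median can be made explicit by placing both in one table and taking their difference. The sketch below uses made-up values in which one group contains a single unusually large measurement; the column name `mean_minus_median` is an illustrative choice.

```python
import pandas as pd

# Made-up values: group "b" contains one unusually large value.
toy = pd.DataFrame(
    {
        "species": ["a", "a", "a", "b", "b", "b"],
        "petal_length": [1.0, 2.0, 3.0, 1.0, 2.0, 12.0],
    }
)

center = (
    toy.groupby("species", observed=False)["petal_length"]
    .agg(["mean", "median"])
)

# A gap near zero suggests a symmetric group; a large gap suggests
# skew or a few extreme values pulling the mean.
center["mean_minus_median"] = center["mean"] - center["median"]

print(center)
```

Both groups have a median of 2.0, but the mean of group "b" is 5.0 because of the single value of 12.0.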


Visual comparison of group means

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.barplot(
    data=df,
    x="species",
    y="petal_length",
    errorbar="sd",
    ax=ax
)

ax.set_title("Mean Petal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Length")

plt.show()

Interpretation

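This plot combines a summary with variation: bar height shows the group mean, and the error bars show one standard deviation on either side of it.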

It helps you compare:

  • average petal length across species
  • within-group spread using standard deviation
  • whether group differences appear large or modest

When this agrees with earlier visualizations, confidence in the pattern increases.


Key feature differences

A simple way to identify strong descriptive separation is to compare the range of group means across variables.

feature_separation = []

for col in num_cols:
    group_means = df.groupby("species", observed=False)[col].mean()
    feature_separation.append(
        {
            "feature": col,
            "min_group_mean": group_means.min(),
            "max_group_mean": group_means.max(),
            "range_of_group_means": group_means.max() - group_means.min()
        }
    )

feature_separation = (
    pd.DataFrame(feature_separation)
    .sort_values("range_of_group_means", ascending=False)
)

feature_separation

Interpretation

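A larger range of group means suggests that a feature may separate species more clearly.

Because each range is expressed in the feature's own units, raw ranges are not directly comparable across features on different scales; petal_area, as a derived feature, is likely on a different scale from the length and width measurements. The ranking does not establish classification performance, but it provides a useful descriptive signal about which variables may be more informative.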
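One possible refinement, which the lesson text does not indicate is part of its script, is to divide each range of group means by the pooled within-group standard deviation so that features on different scales become comparable. The sketch below uses made-up data in which `big_units` is simply `small_units` multiplied by 10; the names `scaled_range` and `scaled_separation` are hypothetical, and the calculation assumes at least two rows per group.

```python
import numpy as np
import pandas as pd

# Made-up data: "big_units" has the same group structure as "small_units"
# but is measured on a scale 10 times larger.
toy = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b"],
        "small_units": [1.0, 2.0, 3.0, 4.0],
    }
)
toy["big_units"] = toy["small_units"] * 10

rows = []
for col in ["small_units", "big_units"]:
    grouped = toy.groupby("species", observed=False)[col]
    means = grouped.mean()
    # Pooled within-group standard deviation, weighted by degrees of freedom.
    dof = grouped.count() - 1
    pooled_sd = np.sqrt((grouped.var() * dof).sum() / dof.sum())
    raw_range = means.max() - means.min()
    rows.append(
        {
            "feature": col,
            "range_of_group_means": raw_range,
            "scaled_range": raw_range / pooled_sd,
        }
    )

scaled_separation = pd.DataFrame(rows)
print(scaled_separation)
```

The raw ranges differ by a factor of 10, while the scaled ranges are identical, which reflects the fact that both columns carry the same group structure.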


Correlation matrix

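Correlations summarize linear relationships among numeric variables. By default, corr() computes Pearson correlation coefficients, here across all rows regardless of species.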

corr = df[num_cols].corr()
corr

Correlation heatmap

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.heatmap(
    corr,
    annot=True,
    fmt=".2f",
    ax=ax
)

ax.set_title("Correlation Heatmap")

plt.show()

Interpretation

A correlation heatmap helps you see:

  • which measurements tend to increase together
  • which relationships appear weak
  • where redundancy may exist between variables

This can guide later feature selection and interpretation.
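A correlation computed over all rows can differ sharply from the correlation inside each group. The sketch below uses made-up data in which two variables fall together within every group but rise together across groups; the column names `x` and `y` and all values are invented for the example.

```python
import pandas as pd

# Made-up data: within each group y falls as x rises,
# but group "b" sits higher than group "a" on both variables.
toy = pd.DataFrame(
    {
        "species": ["a", "a", "a", "b", "b", "b"],
        "x": [1.0, 2.0, 3.0, 11.0, 12.0, 13.0],
        "y": [3.0, 2.0, 1.0, 13.0, 12.0, 11.0],
    }
)

# Correlation over all rows, ignoring the grouping.
pooled = toy["x"].corr(toy["y"])

# Correlation computed separately inside each group.
within = {
    name: group["x"].corr(group["y"])
    for name, group in toy.groupby("species", observed=False)
}

print("Pooled:", round(pooled, 3))
print("Within groups:", {k: round(v, 3) for k, v in within.items()})
```

Here the pooled correlation is strongly positive while each within-group correlation is negative, so it is worth checking both before describing how two measurements relate.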


Generate a simple insights report

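An insights report should link evidence to interpretation without overstating what the data can support. The report below calls DataFrame.to_markdown(), which requires the optional tabulate package to be installed.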

top_sep = feature_separation.iloc[0]

top_petal_length = (
    df.groupby("species", observed=False)["petal_length"]
      .mean()
      .sort_values(ascending=False)
      .reset_index(name="petal_length_mean")
)

report = f"""# Insights Report: Iris Dataset

## 1. Dataset size

- Rows: {df.shape[0]}
- Columns: {df.shape[1]}

## 2. Group separation

Petal-related features show strong descriptive differences across species.

Mean petal length by species:

{top_petal_length.to_markdown(index=False)}

## 3. Strongest descriptive separation

The feature with the largest range of group means is `{top_sep["feature"]}`.

- Minimum group mean: {top_sep["min_group_mean"]:.3f}
- Maximum group mean: {top_sep["max_group_mean"]:.3f}
- Range of group means: {top_sep["range_of_group_means"]:.3f}

## 4. Consistency with earlier plots

These grouped summaries are consistent with the separation seen in visualization-based exploration.

## 5. Caution

These are descriptive patterns only. They support comparison, but they do not justify causal claims.
"""

print(report)

Interpretation

A good insights report:

  • states what the evidence shows
  • links evidence to careful interpretation
  • avoids claims that go beyond the analysis

This habit is central to CDI-style reasoning.
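If the optional tabulate package is unavailable, DataFrame.to_markdown() cannot be used. One possible fallback is a small helper that builds a pipe-delimited table by hand. The function `simple_markdown_table` below is hypothetical, written for this example only, and handles just small flat tables.

```python
import pandas as pd


def simple_markdown_table(frame: pd.DataFrame, float_format: str = "{:.3f}") -> str:
    """Render a small DataFrame as a pipe-delimited markdown table."""
    header = "| " + " | ".join(str(c) for c in frame.columns) + " |"
    divider = "| " + " | ".join("---" for _ in frame.columns) + " |"
    lines = [header, divider]
    for row in frame.itertuples(index=False):
        # Format floats consistently; leave other values as plain strings.
        cells = [
            float_format.format(v) if isinstance(v, float) else str(v)
            for v in row
        ]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Made-up values for illustration.
toy = pd.DataFrame({"species": ["a", "b"], "petal_length_mean": [1.5, 4.25]})
table_text = simple_markdown_table(toy)
print(table_text)
```

Fixing the number of decimals in the helper also keeps the report from implying more precision than the measurements support.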


Validation checks

print("Total missing values:", int(df.isna().sum().sum()))
print("Numeric columns used:", num_cols)
print("Grouped summary shape:", group_summary.shape)
print("Median table shape:", median_table.shape)

assert int(df.isna().sum().sum()) == 0, "Missing values remain in the dataset."
assert "species" in df.columns, "Grouping variable species is missing."
assert len(num_cols) > 0, "No numeric columns selected."

Interpretation

Even in a summary-focused chapter, validation remains important.

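It confirms that:

  • the dataset contains no missing values
  • the grouping variable is present
  • at least one numeric column was selected

The printed shapes of the grouped and median tables are a quick manual check: each table should have one row per species.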
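Checks on the summary tables themselves can be automated in the same spirit. The sketch below, on made-up data, verifies that a grouped summary has one row per group and that the group sizes add up to the number of rows; `check_group_summary` is a hypothetical helper, not a function from the lesson's script.

```python
import pandas as pd

# Toy table; values are made up for illustration.
toy = pd.DataFrame(
    {"species": ["a", "a", "b"], "petal_length": [1.0, 2.0, 5.0]}
)

summary = (
    toy.groupby("species", observed=False)
    .agg(n=("petal_length", "size"), petal_length_mean=("petal_length", "mean"))
    .reset_index()
)


def check_group_summary(data, summary, group_col="species", n_col="n"):
    """Raise AssertionError if the summary is inconsistent with the data."""
    assert summary[group_col].is_unique, "Duplicate groups in summary."
    assert set(summary[group_col]) == set(data[group_col].unique()), (
        "Groups in summary differ from groups in data."
    )
    assert int(summary[n_col].sum()) == len(data), (
        "Group sizes do not add up to the row count."
    )
    return True


checks_passed = check_group_summary(toy, summary)
print("Grouped summary checks passed:", checks_passed)
```

A failed check points to a grouping problem, such as rows dropped because of missing group labels, before the table is saved or reported.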

Save summary outputs

Save the tables and report so they can be reused later.

from pathlib import Path

Path("results/summary").mkdir(parents=True, exist_ok=True)

numeric_summary.to_csv("results/summary/numeric-summary.tsv", sep="\t")
species_counts.to_csv("results/summary/species-counts.tsv", sep="\t", index=False)
group_summary.to_csv("results/summary/grouped-summary.tsv", sep="\t", index=False)
median_table.to_csv("results/summary/median-summary.tsv", sep="\t", index=False)
feature_separation.to_csv("results/summary/feature-separation.tsv", sep="\t", index=False)
corr.to_csv("results/summary/correlation-matrix.tsv", sep="\t")
Path("results/summary/analysis-insights.md").write_text(report, encoding="utf-8")
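A saved table is only reusable if it reads back as expected. The sketch below writes a made-up summary to a temporary directory and reloads it with the same tab separator; the file name matches one of the lesson's outputs, but the contents and the temporary location are assumptions for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up summary table for illustration.
toy_summary = pd.DataFrame(
    {"species": ["a", "b"], "n": [3, 2], "petal_length_mean": [1.5, 4.25]}
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "grouped-summary.tsv"
    toy_summary.to_csv(out_path, sep="\t", index=False)
    # Read back with the same separator used for writing.
    reloaded = pd.read_csv(out_path, sep="\t")

# Raises AssertionError if values, column names, or dtypes differ.
pd.testing.assert_frame_equal(reloaded, toy_summary)
print("Round trip OK:", list(reloaded.columns))
```

Tables that were saved with their index, such as the numeric summary and the correlation matrix above, would need index_col=0 when read back so that the feature names become the index again.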

Run the reusable summary script

The manual steps above explain the logic. The reusable script creates the summary evidence package from the command line.

Run this from the project root:

python scripts/python/summarize_table.py data/iris_wrangled.csv results/summary

Expected outputs:

results/summary/
├── numeric-summary.tsv
├── species-counts.tsv
├── grouped-summary.tsv
├── median-summary.tsv
├── feature-separation.tsv
├── correlation-matrix.tsv
└── analysis-insights.md

What the summary script does

The script:

  • reads the wrangled input table
  • validates required columns
  • creates numeric descriptive summaries
  • creates categorical counts
  • creates grouped mean and standard deviation summaries
  • creates median summaries
  • estimates descriptive feature separation
  • computes a correlation matrix
  • writes a short markdown insights report
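The source of scripts/python/summarize_table.py is not shown in this lesson, so the following is only a hypothetical skeleton of how a script with that command-line shape could be organized. It covers just the first few steps in the list above; the function names, the `REQUIRED_COLUMNS` list, and the internal structure are all assumptions.

```python
import argparse
from pathlib import Path

import pandas as pd

# Assumption: the real script may require more columns than this.
REQUIRED_COLUMNS = ["species"]


def parse_args(argv=None):
    """Parse the two positional arguments used in the lesson's command."""
    parser = argparse.ArgumentParser(description="Summarize a tidy table.")
    parser.add_argument("input_csv", type=Path, help="Path to the wrangled CSV.")
    parser.add_argument("output_dir", type=Path, help="Directory for outputs.")
    return parser.parse_args(argv)


def summarize(df: pd.DataFrame, output_dir: Path) -> list:
    """Validate columns, then write a numeric summary and category counts."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    output_dir.mkdir(parents=True, exist_ok=True)
    written = []

    numeric_path = output_dir / "numeric-summary.tsv"
    numeric = df.select_dtypes(include="number")
    numeric.describe().transpose().to_csv(numeric_path, sep="\t")
    written.append(numeric_path)

    counts_path = output_dir / "species-counts.tsv"
    counts = (
        df["species"].value_counts().sort_index()
        .rename_axis("species").reset_index(name="n")
    )
    counts.to_csv(counts_path, sep="\t", index=False)
    written.append(counts_path)
    return written


def main(argv=None):
    args = parse_args(argv)
    return summarize(pd.read_csv(args.input_csv), args.output_dir)


# In a standalone script, finish with:
# if __name__ == "__main__":
#     main()
```

Keeping argument parsing separate from the summarization logic means `summarize` can be called directly from a notebook or a test without going through the command line.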

Exercise

Try the following:

  1. Compute the mean and standard deviation of sepal_width by species.
  2. Create a table with the median of each numeric feature by species.
  3. Open results/summary/analysis-insights.md.
  4. Write one short paragraph interpreting your results without causal language.

sepal_width_stats = (
    df.groupby("species", observed=False)
      .agg(
          sepal_width_mean=("sepal_width", "mean"),
          sepal_width_sd=("sepal_width", "std"),
      )
      .reset_index()
)

print("Sepal width mean and standard deviation by species:")
print(sepal_width_stats)

median_table_ex = (
    df.groupby("species", observed=False)[num_cols]
      .median()
      .reset_index()
)

print("\nMedian table by species:")
print(median_table_ex)

print("\nInterpretation:")
print("The grouped summaries suggest that species differ descriptively across several measurements, with petal-related variables showing clearer separation than sepal-related variables.")

CDI Insight

Summary statistics are not the end of analysis.

They are a way to organize evidence so that interpretation becomes clearer and more defensible.

A responsible analyst does not report numbers alone.

They explain what those numbers mean — and what they do not.

In CDI systems, summary tables and insights reports become reusable evidence, not just temporary output.


Summary

In this lesson, you:

  • computed descriptive statistics for numeric and categorical variables
  • compared group means, standard deviations, and medians
  • used simple visuals to support statistical summaries
  • examined linear relationships using a correlation matrix
  • created reusable summary tables
  • wrote an insights report grounded in descriptive evidence
  • completed the core Data Science Foundations tidy-table workflow

Looking Ahead

The Foundations System has now produced a complete tidy-table analysis package: inspected data, cleaned data, wrangled outputs, figures, summary tables, and a written insights report. In the next chapter, we connect this foundation to modeling and the Applied Data Science System.