Summary Statistics and Insights

Published

Mar 2026

ID: DS-L06
Type: Lesson
Audience: Public
Theme: Using summary statistics to move from numbers to interpretation

In this final lesson of the Foundations Track, you use summary statistics to move from raw measurements to interpretable evidence.

Summary statistics help describe the center, spread, and differences in a dataset without inspecting every row individually.

However, descriptive summaries are only useful when they are connected to clear interpretation.

In this lesson, you compute descriptive statistics, compare grouped summaries, examine simple relationships, and write careful insights grounded in evidence.

Lesson Overview

By the end of this lesson, you will be able to:

Compute descriptive statistics for numeric variables
Summarize categorical variables with counts
Compare groups using grouped means, medians, and standard deviations
Examine simple relationships between numeric features
Write short, careful interpretations based on descriptive evidence

Load the Dataset

We use the cleaned dataset created earlier in the guide.

Code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/iris_clean.csv")
df.head()

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Inspect the Structure

Before summarizing, confirm the structure of the dataset.

Code

print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)

Shape: (149, 5)

Columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Data types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species             str
dtype: object

Interpretation

Before computing summaries, ask:

which variables are numeric?
which variable defines the groups?
which summaries are likely to be most informative?

For the Iris dataset, the flower measurements are numeric and species provides a natural grouping variable.

Descriptive Statistics for Numeric Variables

describe() gives a compact summary of central tendency and spread.

Code

df.describe()

	sepal_length	sepal_width	petal_length	petal_width
count	149.000000	149.000000	149.000000	149.000000
mean	5.843624	3.059732	3.748993	1.194631
std	0.830851	0.436342	1.767791	0.762622
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.300000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

Interpretation

This table helps you review:

the typical range of each feature
the approximate center using the mean and the median (the 50% row)
variability using the sample standard deviation (the std row)
minimum and maximum values
quartiles that describe spread

A numeric summary is often the first checkpoint before deeper comparison.
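
The quartile rows can also be condensed into a single spread measure. As a minimal sketch, the block below computes the interquartile range (IQR) from the 25% and 75% rows. The `quartiles` frame is a hand-typed copy of the table above, used here only so the example runs on its own; it is not part of the lesson's code.

```python
import pandas as pd

# Quartiles copied by hand from the describe() output above.
quartiles = pd.DataFrame(
    {
        "sepal_length": [5.1, 5.8, 6.4],
        "sepal_width": [2.8, 3.0, 3.3],
        "petal_length": [1.6, 4.3, 5.1],
        "petal_width": [0.3, 1.3, 1.8],
    },
    index=["25%", "50%", "75%"],
)

# Interquartile range: the width of the middle half of the values.
iqr = (quartiles.loc["75%"] - quartiles.loc["25%"]).round(2)

print(iqr.sort_values(ascending=False))
```

With these values, petal_length has the widest middle half (3.5) and sepal_width the narrowest (0.5). On the loaded data, `df.quantile(0.75, numeric_only=True) - df.quantile(0.25, numeric_only=True)` should give the same figures without retyping anything.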

Visual Check of Numeric Distributions

A simple set of histograms helps connect the summary table to visible distributions.

Code

num_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

df_long = df.melt(
    value_vars=num_cols,
    var_name="feature",
    value_name="value"
)

g = sns.displot(
    data=df_long,
    x="value",
    col="feature",
    col_wrap=2,
    bins=12,
    kde=True,
    height=3.5,
    aspect=1.2
)

g.figure.suptitle("Iris — Distributions of Numeric Features", y=1.02)

plt.show()

Iris — Distributions of Numeric Features

Interpretation

These histograms help you assess:

whether distributions are narrow or wide
whether some features appear more variable than others
whether there may be multiple peaks suggesting subgroup structure

A summary table alone may not reveal distribution shape.
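
To make that last point concrete, here is a small sketch with made-up numbers (not taken from the Iris data). The sample forms two clusters, so its mean lands in a gap where no observation exists, which a table of means and standard deviations would not show.

```python
import pandas as pd

# Made-up values for illustration only: one cluster near 1.5, one near 5.
values = pd.Series([1.3, 1.4, 1.5, 1.6, 1.7, 4.8, 4.9, 5.0, 5.1, 5.2])

mean = values.mean()

# Count how many observations lie within 1 unit of the mean.
near_mean = int(((values - mean).abs() < 1.0).sum())

print("Mean:", round(mean, 2))
print("Observations within 1 unit of the mean:", near_mean)
```

The mean is 3.25, yet no value falls within one unit of it. A histogram shows this gap immediately; a summary table does not.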

Categorical Summary

For a categorical variable, counts are often the simplest descriptive summary.

Code

species_counts = df["species"].value_counts()
species_counts

species
setosa        50
versicolor    50
virginica     49
Name: count, dtype: int64

Visual: Species Distribution

Code

fig, ax = plt.subplots(figsize=(7, 4))

species_counts.plot(kind="bar", ax=ax)

ax.set_title("Count of Samples by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Count")

plt.show()

Interpretation

Category counts help you assess whether groups are balanced. Here the three species are nearly balanced, with 50, 50, and 49 samples.

Balanced group sizes make comparisons easier to interpret, while strong imbalance can affect how patterns appear in tables and plots.
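
Balance is often easier to judge from proportions than from raw counts. The sketch below converts the counts shown above into shares and a simple imbalance ratio. The counts are retyped by hand so the block is self-contained, and `imbalance_ratio` is an illustrative name rather than a standard pandas quantity.

```python
import pandas as pd

# Counts copied by hand from the value_counts() output above.
species_counts = pd.Series(
    {"setosa": 50, "versicolor": 50, "virginica": 49}, name="count"
)

# Share of all samples in each species.
proportions = (species_counts / species_counts.sum()).round(3)

# Largest group size divided by smallest; 1.0 means perfectly balanced.
imbalance_ratio = species_counts.max() / species_counts.min()

print(proportions)
print("Imbalance ratio:", round(imbalance_ratio, 3))
```

On the loaded data, `df["species"].value_counts(normalize=True)` returns the proportions directly.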

Grouped Summary Statistics

A common workflow is to summarize numeric variables by a grouping column.

Code

group_summary = (
    df.groupby("species", observed=False)
      .agg(
          sepal_length_mean=("sepal_length", "mean"),
          sepal_length_sd=("sepal_length", "std"),
          sepal_width_mean=("sepal_width", "mean"),
          sepal_width_sd=("sepal_width", "std"),
          petal_length_mean=("petal_length", "mean"),
          petal_length_sd=("petal_length", "std"),
          petal_width_mean=("petal_width", "mean"),
          petal_width_sd=("petal_width", "std"),
      )
      .reset_index()
)

group_summary

	species	sepal_length_mean	sepal_length_sd	sepal_width_mean	sepal_width_sd	petal_length_mean	petal_length_sd	petal_width_mean	petal_width_sd
0	setosa	5.006000	0.352490	3.428000	0.379064	1.462000	0.173664	0.246000	0.105386
1	versicolor	5.936000	0.516171	2.770000	0.313798	4.260000	0.469911	1.326000	0.197753
2	virginica	6.604082	0.632113	2.979592	0.323380	5.561224	0.553706	2.028571	0.276887

Interpretation

Grouped summaries help answer questions such as:

which species has the largest average petals?
which variable differs most across species?
which groups appear similar and which appear distinct?

This is where descriptive statistics begin to support insight.
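
The first of those questions can be answered mechanically. As a rough sketch, the block below takes the group means from the table above (retyped by hand and rounded to three decimals) and asks pandas which species has the highest and lowest mean for each feature.

```python
import pandas as pd

# Group means copied by hand from the grouped summary above (3 decimals).
group_means = pd.DataFrame(
    {
        "sepal_length": [5.006, 5.936, 6.604],
        "sepal_width": [3.428, 2.770, 2.980],
        "petal_length": [1.462, 4.260, 5.561],
        "petal_width": [0.246, 1.326, 2.029],
    },
    index=["setosa", "versicolor", "virginica"],
)

# For each feature, the species with the highest and the lowest mean.
highest = group_means.idxmax()
lowest = group_means.idxmin()

print(pd.DataFrame({"highest_mean": highest, "lowest_mean": lowest}))
```

Virginica has the highest mean for every feature except sepal_width, where setosa is highest and versicolor lowest. That reversal is easy to miss when scanning a wide table by eye.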

Median Table by Species

Means are useful, but medians can provide a more robust summary when you want a center less influenced by unusual values.

Code

median_table = (
    df.groupby("species", observed=False)[num_cols]
      .median()
      .reset_index()
)

median_table

	species	sepal_length	sepal_width	petal_length	petal_width
0	setosa	5.0	3.4	1.50	0.2
1	versicolor	5.9	2.8	4.35	1.3
2	virginica	6.5	3.0	5.60	2.0

Interpretation

Comparing means and medians helps you assess whether a group’s center is stable or influenced by a few values.
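
One way to make this comparison explicit is to compute the gap between mean and median and express it relative to the group's standard deviation. The sketch below does this for petal_length, using values retyped from the two tables above. The column names `gap` and `gap_in_sd` are illustrative choices.

```python
import pandas as pd

# petal_length values copied by hand from the grouped summary
# and the median table above.
petal_length = pd.DataFrame(
    {
        "mean": [1.462, 4.260, 5.561224],
        "median": [1.50, 4.35, 5.60],
        "sd": [0.173664, 0.469911, 0.553706],
    },
    index=["setosa", "versicolor", "virginica"],
)

# Gap between mean and median, in raw units and as a fraction of the SD.
petal_length["gap"] = petal_length["mean"] - petal_length["median"]
petal_length["gap_in_sd"] = petal_length["gap"] / petal_length["sd"]

print(petal_length.round(3))
```

Each mean sits slightly below its median, but by less than a quarter of a standard deviation in every species. That is descriptive evidence that no small set of unusual values is pulling the petal_length centers far from the bulk of the data.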

Visual Comparison of Group Means

Code

fig, ax = plt.subplots(figsize=(8, 5.5))

sns.barplot(
    data=df,
    x="species",
    y="petal_length",
    errorbar="sd",
    ax=ax
)

ax.set_title("Mean Petal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Length")

plt.show()

Interpretation

This plot combines a summary (bar height is the group mean) with variation (error bars show one standard deviation).

It helps you compare:

average petal length across species
within-group spread using standard deviation
whether group differences appear large or modest

When this agrees with earlier visualizations, confidence in the pattern increases.
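
Error bars based on the standard deviation describe how spread out the individual flowers are, not how precisely the mean is estimated. For contrast, the sketch below computes the standard error of the mean from the standard deviations and group sizes shown earlier, retyped by hand so the block runs on its own.

```python
import pandas as pd

# petal_length standard deviations and group sizes copied by hand
# from the grouped summary and species counts above.
petal_length = pd.DataFrame(
    {"sd": [0.173664, 0.469911, 0.553706], "n": [50, 50, 49]},
    index=["setosa", "versicolor", "virginica"],
)

# Standard error of the mean: SD divided by the square root of the group size.
petal_length["se"] = petal_length["sd"] / petal_length["n"] ** 0.5

print(petal_length.round(4))
```

The standard errors are several times smaller than the standard deviations, so a plot drawn with `errorbar="se"` would show much shorter bars for the same data. Stating which quantity the bars represent avoids misreading the plot.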

Key Feature Differences

A simple way to identify strong separation is to compare the range of group means across variables.

Code

feature_separation = {
    col: df.groupby("species", observed=False)[col].mean().max()
         - df.groupby("species", observed=False)[col].mean().min()
    for col in num_cols
}

feature_separation

{'sepal_length': np.float64(1.5980816326530611),
 'sepal_width': np.float64(0.6579999999999999),
 'petal_length': np.float64(4.099224489795919),
 'petal_width': np.float64(1.7825714285714285)}

Interpretation

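A larger difference in group means suggests that a feature may separate species more clearly. Because each range is expressed in the feature's own units and ignores within-species spread, compare it against the standard deviations in the grouped summary before ranking features.

This does not establish classification performance, but it provides a useful descriptive signal about which variables are more informative.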
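
A raw range of means depends on the scale of each feature. As a hedged sketch, the block below divides each range by a rough measure of within-species spread, here the plain average of the three group standard deviations. This scaling is an illustrative choice, not a formal effect size, and the input values are retyped from the grouped summary.

```python
import pandas as pd

species = ["setosa", "versicolor", "virginica"]

# Group means and SDs copied by hand from the grouped summary above.
means = pd.DataFrame(
    {
        "sepal_length": [5.006, 5.936, 6.604082],
        "sepal_width": [3.428, 2.770, 2.979592],
        "petal_length": [1.462, 4.260, 5.561224],
        "petal_width": [0.246, 1.326, 2.028571],
    },
    index=species,
)
sds = pd.DataFrame(
    {
        "sepal_length": [0.352490, 0.516171, 0.632113],
        "sepal_width": [0.379064, 0.313798, 0.323380],
        "petal_length": [0.173664, 0.469911, 0.553706],
        "petal_width": [0.105386, 0.197753, 0.276887],
    },
    index=species,
)

# Range of group means, in each feature's own units.
raw_range = means.max() - means.min()

# Rough within-species scale: the plain average of the three SDs.
typical_sd = sds.mean()

# Range of means expressed in units of typical within-species spread.
scaled_range = raw_range / typical_sd

comparison = pd.DataFrame({"raw_range": raw_range, "scaled_range": scaled_range})
print(comparison.sort_values("scaled_range", ascending=False).round(2))
```

The ranking of features is unchanged, but the picture shifts: petal_width has a raw range close to that of sepal_length (about 1.78 versus 1.60), yet its scaled range is nearly three times larger because petal widths vary little within a species.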

Correlation Matrix

Correlations summarize linear relationships among numeric variables. By default, corr() computes the Pearson coefficient, which ranges from -1 to 1.

Code

corr = df[num_cols].corr()
corr

	sepal_length	sepal_width	petal_length	petal_width
sepal_length	1.000000	-0.118129	0.873738	0.820620
sepal_width	-0.118129	1.000000	-0.426028	-0.362894
petal_length	0.873738	-0.426028	1.000000	0.962772
petal_width	0.820620	-0.362894	0.962772	1.000000

Correlation Heatmap

Code

fig, ax = plt.subplots(figsize=(7, 5.5))

sns.heatmap(
    corr,
    annot=True,
    fmt=".2f",
    ax=ax
)

ax.set_title("Correlation Heatmap")

plt.show()

Interpretation

A correlation heatmap helps you see:

which measurements tend to increase together
which relationships appear weak
where redundancy may exist between variables

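Here, petal_length and petal_width are strongly correlated (0.96), while sepal_length and sepal_width are only weakly related (-0.12). These coefficients pool all three species, so they can differ from the relationship within any single species.

This can guide later feature selection and interpretation.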
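
A correlation computed over the whole dataset mixes within-group and between-group patterns. The toy example below uses made-up numbers (not Iris measurements) to show how the two can disagree: within each group x and y rise together, but the pooled coefficient is negative because the groups are offset from each other.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b", "b"],
        "x": [1, 2, 3, 5, 6, 7],
        "y": [5, 6, 7, 1, 2, 3],
    }
)

# Pearson correlation ignoring the grouping column.
pooled_r = toy["x"].corr(toy["y"])

# Pearson correlation computed separately inside each group.
within_r = toy.groupby("group")[["x", "y"]].apply(
    lambda g: g["x"].corr(g["y"])
)

print("Pooled correlation:", round(pooled_r, 3))
print(within_r.round(3))
```

On the lesson's data, `df.groupby("species")[num_cols].corr()` would give one correlation matrix per species. A weak pooled value such as the one between sepal_length and sepal_width is a natural candidate for this kind of check.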

Generate a Simple Insights Report

An insights report should link evidence to interpretation without overstating what the data can support.

Code

top_sep = (
    df.groupby("species", observed=False)["petal_length"]
      .mean()
      .sort_values(ascending=False)
      .reset_index(name="petal_length_mean")
)

report = f"""Insights Report (Iris)

1) Dataset size
- Rows: {df.shape[0]}
- Columns: {df.shape[1]}

2) Group separation
- Petal-related features show strong descriptive differences across species.
- Mean petal length by species:
{top_sep.to_string(index=False)}

3) Consistency with earlier plots
- These grouped summaries are consistent with the separation seen in visualization-based exploration.

4) Caution
- These are descriptive patterns only.
- They support comparison, but they do not justify causal claims.
"""

print(report)

Insights Report (Iris)

1) Dataset size
- Rows: 149
- Columns: 5

2) Group separation
- Petal-related features show strong descriptive differences across species.
- Mean petal length by species:
   species  petal_length_mean
 virginica           5.561224
versicolor           4.260000
    setosa           1.462000

3) Consistency with earlier plots
- These grouped summaries are consistent with the separation seen in visualization-based exploration.

4) Caution
- These are descriptive patterns only.
- They support comparison, but they do not justify causal claims.

Interpretation

A good insights report:

states what the evidence shows
links evidence to careful interpretation
avoids claims that go beyond the analysis

This habit is central to CDI-style reasoning.

Validation Checks

Code

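# Count missing values once and reuse the result.
n_missing = int(df.isna().sum().sum())

print("Total missing values:", n_missing)
print("Numeric columns used:", num_cols)
print("Grouped summary shape:", group_summary.shape)
print("Median table shape:", median_table.shape)

assert n_missing == 0, "Missing values remain in the dataset."
assert "species" in df.columns, "Grouping variable species is missing."
# Three species; one label column plus 8 (or 4) summary columns.
assert group_summary.shape == (3, 9), "Grouped summary has an unexpected shape."
assert median_table.shape == (3, 5), "Median table has an unexpected shape."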

Total missing values: 0
Numeric columns used: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
Grouped summary shape: (3, 9)
Median table shape: (3, 5)

Interpretation

Even in a summary-focused lesson, validation remains important.

It confirms that:

the dataset is still clean
expected variables are present
grouped summaries were constructed correctly
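
Shape checks catch missing rows or columns, but not bad values inside the table. As one possible extension, the sketch below wraps a few content checks in a helper. The function `check_grouped_summary` is hypothetical (it does not come from the lesson or from pandas), and the demo table is retyped from the sepal_width summary shown in this lesson.

```python
import pandas as pd


def check_grouped_summary(summary, group_col, expected_groups, value_cols):
    """Raise AssertionError if a grouped summary table looks malformed."""
    # Every expected group appears, and no unexpected group is present.
    assert set(summary[group_col]) == set(expected_groups), "Unexpected groups."
    # Each group appears exactly once.
    assert summary[group_col].is_unique, "A group appears more than once."
    # No summary value is missing.
    assert not summary[value_cols].isna().any().any(), "Summary contains NaN."
    # Standard deviations (columns ending in _sd) cannot be negative.
    sd_cols = [c for c in value_cols if c.endswith("_sd")]
    assert (summary[sd_cols] >= 0).all().all(), "Negative standard deviation."
    return True


# Demo table copied by hand from the sepal_width summary in this lesson.
sepal_width_stats = pd.DataFrame(
    {
        "species": ["setosa", "versicolor", "virginica"],
        "sepal_width_mean": [3.428, 2.770, 2.979592],
        "sepal_width_sd": [0.379064, 0.313798, 0.323380],
    }
)

ok = check_grouped_summary(
    sepal_width_stats,
    group_col="species",
    expected_groups=["setosa", "versicolor", "virginica"],
    value_cols=["sepal_width_mean", "sepal_width_sd"],
)
print("Grouped summary passed all checks:", ok)
```

The same helper could be applied to `group_summary` by passing its eight summary column names as `value_cols`.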

Exercise

Try the following:

Compute the mean and standard deviation of sepal_width by species
Create a table with the median of each numeric feature by species
Write one short paragraph interpreting your results without causal language

Solution

Code

sepal_width_stats = (
    df.groupby("species", observed=False)
      .agg(
          sepal_width_mean=("sepal_width", "mean"),
          sepal_width_sd=("sepal_width", "std"),
      )
      .reset_index()
)

print("Sepal width mean and standard deviation by species:")
print(sepal_width_stats)

median_table_ex = (
    df.groupby("species", observed=False)[num_cols]
      .median()
      .reset_index()
)

print("\nMedian table by species:")
print(median_table_ex)

print("\nYour interpretation:")
print("The grouped summaries suggest that species differ descriptively across several measurements, with petal-related variables showing clearer separation than sepal-related variables.")

Sepal width mean and standard deviation by species:
      species  sepal_width_mean  sepal_width_sd
0      setosa          3.428000        0.379064
1  versicolor          2.770000        0.313798
2   virginica          2.979592        0.323380

Median table by species:
      species  sepal_length  sepal_width  petal_length  petal_width
0      setosa           5.0          3.4          1.50          0.2
1  versicolor           5.9          2.8          4.35          1.3
2   virginica           6.5          3.0          5.60          2.0

Your interpretation:
The grouped summaries suggest that species differ descriptively across several measurements, with petal-related variables showing clearer separation than sepal-related variables.

Summary

You computed descriptive statistics for numeric and categorical variables
You compared group means, standard deviations, and medians
You used simple visuals to support statistical summaries
You examined linear relationships using a correlation matrix
You wrote an insights report grounded in descriptive evidence

This lesson closes the Foundations Track by showing how summary statistics support careful interpretation, not just calculation.

CDI Insight

Summary statistics are not the end of analysis.

They are a way to organize evidence so that interpretation becomes clearer and more defensible.

A responsible analyst does not report numbers alone.

They explain what those numbers mean, and what they do not.