Summary Statistics and Insights

Published

Mar 2026

  • ID: DS-L06
  • Type: Lesson
  • Audience: Public
  • Theme: Using summary statistics to move from numbers to interpretation

In this final lesson of the Foundations Track, you use summary statistics to move from raw measurements to interpretable evidence.

Summary statistics help describe the center, spread, and differences in a dataset without inspecting every row individually.

However, descriptive summaries are only useful when they are connected to clear interpretation.

In this lesson, you compute descriptive statistics, compare grouped summaries, examine simple relationships, and write careful insights grounded in evidence.


Lesson Overview

By the end of this lesson, you will be able to:

  • Compute descriptive statistics for numeric variables
  • Summarize categorical variables with counts
  • Compare groups using grouped means, medians, and standard deviations
  • Examine simple relationships between numeric features
  • Write short, careful interpretations based on descriptive evidence

Load the Dataset

We use the cleaned dataset created earlier in the guide.

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/iris_clean.csv")
df.head()
   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa
2           4.7          3.2           1.3          0.2   setosa
3           4.6          3.1           1.5          0.2   setosa
4           5.0          3.6           1.4          0.2   setosa

Inspect the Structure

Before summarizing, confirm the structure of the dataset.

Code
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
Shape: (149, 5)

Columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Data types:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species             str
dtype: object

Interpretation

Before computing summaries, ask:

  • which variables are numeric?
  • which variable defines the groups?
  • which summaries are likely to be most informative?

For the Iris dataset, the flower measurements are numeric and species provides a natural grouping variable.


Descriptive Statistics for Numeric Variables

describe() gives a compact summary of central tendency and spread.

Code
df.describe()
       sepal_length  sepal_width  petal_length  petal_width
count    149.000000   149.000000    149.000000   149.000000
mean       5.843624     3.059732      3.748993     1.194631
std        0.830851     0.436342      1.767791     0.762622
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.300000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Interpretation

This table helps you review:

  • the typical range of each feature
  • the approximate center using the mean
  • variability using the standard deviation
  • minimum and maximum values
  • quartiles that describe spread

A numeric summary is often the first checkpoint before deeper comparison.
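
The quartiles reported by describe() can be turned into two further measures of spread. The sketch below uses a small made-up frame named toy (not the lesson dataset) and derives the interquartile range and the coefficient of variation; the column names iqr and cv are illustrative choices, not pandas output.

```python
import pandas as pd

# Hypothetical toy values, not the lesson dataset
toy = pd.DataFrame(
    {
        "petal_length": [1.4, 1.5, 4.3, 4.5, 5.1, 5.6],
        "sepal_width": [3.5, 3.0, 2.9, 2.8, 3.0, 3.1],
    }
)

# Transpose so each feature is a row and each statistic is a column
summary = toy.describe().T

# Interquartile range: width of the middle 50% of values
summary["iqr"] = summary["75%"] - summary["25%"]

# Coefficient of variation: spread relative to the mean (unitless)
summary["cv"] = summary["std"] / summary["mean"]

print(summary[["mean", "std", "iqr", "cv"]].round(3))
```

Because the coefficient of variation is unitless, it can be a convenient way to compare variability between features whose means differ a lot. It is only meaningful for measurements with a true zero, such as lengths.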


Visual Check of Numeric Distributions

A simple set of histograms helps connect the summary table to visible distributions.

Code
num_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

df_long = df.melt(
    value_vars=num_cols,
    var_name="feature",
    value_name="value"
)

g = sns.displot(
    data=df_long,
    x="value",
    col="feature",
    col_wrap=2,
    bins=12,
    kde=True,
    height=3.5,
    aspect=1.2
)

g.figure.suptitle("Iris — Distributions of Numeric Features", y=1.02)

plt.show()

Iris — Distributions of Numeric Features

Interpretation

These histograms help you assess:

  • whether distributions are narrow or wide
  • whether some features appear more variable than others
  • whether there may be multiple peaks suggesting subgroup structure

A summary table alone may not reveal distribution shape.
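
To make that point concrete, here is a minimal sketch with two invented samples that share the same mean and standard deviation but have different shapes. The arrays two_clusters and one_peak are hypothetical values constructed for illustration only.

```python
import numpy as np

# Hypothetical values chosen so that both samples have mean 5 and SD 3
two_clusters = np.array([2.0, 2.0, 8.0, 8.0])
d = np.sqrt(18.0)
one_peak = np.array([5.0 - d, 5.0, 5.0, 5.0 + d])

shapes = {}
for name, values in [("two_clusters", two_clusters), ("one_peak", one_peak)]:
    # Three equal-width bins over the same range make the shapes comparable
    counts, _ = np.histogram(values, bins=3, range=(0.0, 10.0))
    shapes[name] = counts.tolist()
    # values.std() is the population SD (ddof=0); pandas defaults to ddof=1
    print(name, round(float(values.mean()), 3), round(float(values.std()), 3), shapes[name])
```

The printed means and standard deviations match, while the bin counts show one sample empty in the middle and the other concentrated there. A table of means and standard deviations would not distinguish the two.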


Categorical Summary

For a categorical variable, counts are often the simplest descriptive summary.

Code
species_counts = df["species"].value_counts()
species_counts
species
setosa        50
versicolor    50
virginica     49
Name: count, dtype: int64

Visual: Species Distribution

Code
fig, ax = plt.subplots(figsize=(7, 4))

species_counts.plot(kind="bar", ax=ax)

ax.set_title("Count of Samples by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Count")

plt.show()

Iris — Count of Samples by Species

Interpretation

Category counts help you assess whether groups are balanced.

Balanced group sizes make comparisons easier to interpret, while strong imbalance can affect how patterns appear in tables and plots.
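
Balance can also be expressed numerically. The sketch below rebuilds a species column from the counts shown above (50, 50, 49) and reports each group's share plus a simple smallest-to-largest ratio; balance_ratio is an illustrative name, and no particular threshold for "balanced" is implied.

```python
import pandas as pd

# Rebuild a species column from the counts reported in this lesson
species = pd.Series(
    ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 49,
    name="species",
)

counts = species.value_counts()

# normalize=True returns proportions instead of raw counts
shares = species.value_counts(normalize=True).round(3)

# Ratio of the smallest group to the largest; 1.0 means perfectly balanced
balance_ratio = counts.min() / counts.max()

print(shares)
print("Balance ratio:", round(balance_ratio, 2))
```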


Grouped Summary Statistics

A common workflow is to summarize numeric variables by a grouping column.

Code
group_summary = (
    df.groupby("species", observed=False)
      .agg(
          sepal_length_mean=("sepal_length", "mean"),
          sepal_length_sd=("sepal_length", "std"),
          sepal_width_mean=("sepal_width", "mean"),
          sepal_width_sd=("sepal_width", "std"),
          petal_length_mean=("petal_length", "mean"),
          petal_length_sd=("petal_length", "std"),
          petal_width_mean=("petal_width", "mean"),
          petal_width_sd=("petal_width", "std"),
      )
      .reset_index()
)

group_summary
      species  sepal_length_mean  sepal_length_sd  sepal_width_mean  sepal_width_sd  petal_length_mean  petal_length_sd  petal_width_mean  petal_width_sd
0      setosa           5.006000         0.352490          3.428000        0.379064           1.462000         0.173664          0.246000        0.105386
1  versicolor           5.936000         0.516171          2.770000        0.313798           4.260000         0.469911          1.326000        0.197753
2   virginica           6.604082         0.632113          2.979592        0.323380           5.561224         0.553706          2.028571        0.276887

Interpretation

Grouped summaries help answer questions such as:

  • which species has the largest average petals?
  • which variable differs most across species?
  • which groups appear similar and which appear distinct?

This is where descriptive statistics begin to support insight.
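
Questions like these can be answered directly from a table of group means. As a small sketch, the frame below copies the means from the grouped summary output and uses idxmax() and idxmin() to report which species has the largest and smallest mean for each feature; the name extremes is an illustrative choice.

```python
import pandas as pd

# Group means copied from the grouped summary output in this lesson
group_means = pd.DataFrame(
    {
        "sepal_length": [5.006000, 5.936000, 6.604082],
        "sepal_width": [3.428000, 2.770000, 2.979592],
        "petal_length": [1.462000, 4.260000, 5.561224],
        "petal_width": [0.246000, 1.326000, 2.028571],
    },
    index=["setosa", "versicolor", "virginica"],
)

# For each feature, the species with the largest and smallest mean
extremes = pd.DataFrame(
    {
        "largest_mean": group_means.idxmax(),
        "smallest_mean": group_means.idxmin(),
    }
)

print(extremes)
```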


Median Table by Species

Means are useful, but medians can provide a more robust summary when you want a center less influenced by unusual values.

Code
median_table = (
    df.groupby("species", observed=False)[num_cols]
      .median()
      .reset_index()
)

median_table
      species  sepal_length  sepal_width  petal_length  petal_width
0      setosa           5.0          3.4          1.50          0.2
1  versicolor           5.9          2.8          4.35          1.3
2   virginica           6.5          3.0          5.60          2.0

Interpretation

Comparing means and medians helps you assess whether a group’s center is stable or influenced by a few values.
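
A quick way to see this is to add one unusual value to a small sample and compare how far each measure moves. The values below are invented for illustration; the 14.0 stands in for a hypothetical data-entry error.

```python
import pandas as pd

# Hypothetical measurements, not taken from the lesson dataset
clean = pd.Series([1.3, 1.4, 1.4, 1.5, 1.5])

# Append one implausible value, e.g. a misplaced decimal point
with_outlier = pd.concat([clean, pd.Series([14.0])], ignore_index=True)

# How far each measure of center moves once the outlier is included
mean_shift = with_outlier.mean() - clean.mean()
median_shift = with_outlier.median() - clean.median()

print("Shift in mean:  ", round(mean_shift, 3))
print("Shift in median:", round(median_shift, 3))
```

In this toy sample the mean moves by about two units while the median moves by a few hundredths. A large gap between a group's mean and median is therefore a reasonable prompt to look for unusual values or skew.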


Visual Comparison of Group Means

Code
fig, ax = plt.subplots(figsize=(8, 5.5))

sns.barplot(
    data=df,
    x="species",
    y="petal_length",
    errorbar="sd",
    ax=ax
)

ax.set_title("Mean Petal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Length")

plt.show()

Iris — Mean Petal Length by Species

Interpretation

This plot combines summary and variation.

It helps you compare:

  • average petal length across species
  • within-group spread using standard deviation
  • whether group differences appear large or modest

When this agrees with earlier visualizations, confidence in the pattern increases.
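
Note that errorbar="sd" draws the spread of individual flowers, not the uncertainty of the mean. As a rough sketch of the difference, the snippet below converts the petal length standard deviations and group sizes reported earlier in this lesson into standard errors, using the usual formula SD / sqrt(n). The dictionary names are illustrative.

```python
import math

# Petal length SDs and group sizes copied from the outputs in this lesson
petal_length_sd = {"setosa": 0.173664, "versicolor": 0.469911, "virginica": 0.553706}
group_size = {"setosa": 50, "versicolor": 50, "virginica": 49}

# Standard error of the mean: SD divided by the square root of the group size
standard_error = {
    name: sd / math.sqrt(group_size[name])
    for name, sd in petal_length_sd.items()
}

for name, se in standard_error.items():
    print(f"{name}: sd={petal_length_sd[name]:.3f}, se={se:.3f}")
```

The standard errors are much smaller than the standard deviations, so a bar chart drawn with standard-error bars would look far tighter than the one above. Stating which quantity the bars show avoids that ambiguity.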


Key Feature Differences

A simple way to identify strong separation is to compare the range of group means across variables.

Code
# Compute the species means once, then take max - min for each feature
group_means = df.groupby("species", observed=False)[num_cols].mean()

feature_separation = {
    col: group_means[col].max() - group_means[col].min()
    for col in num_cols
}

feature_separation
{'sepal_length': np.float64(1.5980816326530611),
 'sepal_width': np.float64(0.6579999999999999),
 'petal_length': np.float64(4.099224489795919),
 'petal_width': np.float64(1.7825714285714285)}

Interpretation

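A larger difference in group means suggests that a feature may separate species more clearly. These ranges are in each feature's original units and ignore within-group spread, so read them alongside the standard deviations in the grouped summary.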

This does not establish classification performance, but it provides a useful descriptive signal about which variables are more informative.
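
One way to put the features on a comparable footing is to divide each range of means by a typical within-species standard deviation. The sketch below does this with the means and standard deviations copied from the grouped summary output. Using the unweighted average of the three standard deviations is a simplifying assumption for illustration, not a formal effect-size measure.

```python
import pandas as pd

species = ["setosa", "versicolor", "virginica"]

# Means and SDs copied from the grouped summary output in this lesson
means = pd.DataFrame(
    {
        "sepal_length": [5.006000, 5.936000, 6.604082],
        "sepal_width": [3.428000, 2.770000, 2.979592],
        "petal_length": [1.462000, 4.260000, 5.561224],
        "petal_width": [0.246000, 1.326000, 2.028571],
    },
    index=species,
)
sds = pd.DataFrame(
    {
        "sepal_length": [0.352490, 0.516171, 0.632113],
        "sepal_width": [0.379064, 0.313798, 0.323380],
        "petal_length": [0.173664, 0.469911, 0.553706],
        "petal_width": [0.105386, 0.197753, 0.276887],
    },
    index=species,
)

# Range of group means per feature (same quantity as feature_separation)
mean_range = means.max() - means.min()

# Rough "typical" within-species SD: unweighted average (an assumption)
typical_sd = sds.mean()

# Separation expressed in units of within-species spread
scaled_separation = (mean_range / typical_sd).sort_values(ascending=False)

print(scaled_separation.round(2))
```

On this scale the two petal features remain well ahead of the sepal features, and petal width sits much closer to petal length than the raw ranges suggest.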


Correlation Matrix

Correlations summarize linear relationships among numeric variables. By default, corr() computes Pearson correlation coefficients.

Code
corr = df[num_cols].corr()
corr
              sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.118129      0.873738     0.820620
sepal_width      -0.118129     1.000000     -0.426028    -0.362894
petal_length      0.873738    -0.426028      1.000000     0.962772
petal_width       0.820620    -0.362894      0.962772     1.000000

Correlation Heatmap

Code
fig, ax = plt.subplots(figsize=(7, 5.5))

sns.heatmap(
    corr,
    annot=True,
    fmt=".2f",
    ax=ax
)

ax.set_title("Correlation Heatmap")

plt.show()

Iris — Correlation Heatmap

Interpretation

A correlation heatmap helps you see:

  • which measurements tend to increase together
  • which relationships appear weak
  • where redundancy may exist between variables

This can guide later feature selection and interpretation.
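
A correlation computed over all rows mixes differences between groups with relationships inside each group. The sketch below uses a small invented frame (columns group, x, and y are hypothetical) to show how a pooled correlation can be strong even when the within-group correlations are moderate and of opposite sign.

```python
import pandas as pd

# Hypothetical data: two well-separated groups with little structure inside each
toy = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b", "b"],
        "x": [1.0, 2.0, 3.0, 5.0, 6.0, 7.0],
        "y": [1.2, 1.0, 1.1, 5.1, 5.0, 5.2],
    }
)

# Pearson correlation over all rows, ignoring the grouping
pooled_r = toy["x"].corr(toy["y"])

# Pearson correlation computed separately inside each group
within_r = {
    name: part["x"].corr(part["y"])
    for name, part in toy.groupby("group")
}

print("Pooled:", round(pooled_r, 3))
print("Within groups:", {k: round(v, 3) for k, v in within_r.items()})
```

Applying the same per-group loop to the lesson's df with "species" as the grouping column would show whether the pooled correlations above also hold within each species.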


Generate a Simple Insights Report

An insights report should link evidence to interpretation without overstating what the data can support.

Code
top_sep = (
    df.groupby("species", observed=False)["petal_length"]
      .mean()
      .sort_values(ascending=False)
      .reset_index(name="petal_length_mean")
)

report = f"""Insights Report (Iris)

1) Dataset size
- Rows: {df.shape[0]}
- Columns: {df.shape[1]}

2) Group separation
- Petal-related features show strong descriptive differences across species.
- Mean petal length by species:
{top_sep.to_string(index=False)}

3) Consistency with earlier plots
- These grouped summaries are consistent with the separation seen in visualization-based exploration.

4) Caution
- These are descriptive patterns only.
- They support comparison, but they do not justify causal claims.
"""

print(report)
Insights Report (Iris)

1) Dataset size
- Rows: 149
- Columns: 5

2) Group separation
- Petal-related features show strong descriptive differences across species.
- Mean petal length by species:
   species  petal_length_mean
 virginica           5.561224
versicolor           4.260000
    setosa           1.462000

3) Consistency with earlier plots
- These grouped summaries are consistent with the separation seen in visualization-based exploration.

4) Caution
- These are descriptive patterns only.
- They support comparison, but they do not justify causal claims.

Interpretation

A good insights report:

  • states what the evidence shows
  • links evidence to careful interpretation
  • avoids claims that go beyond the analysis

This habit is central to CDI-style reasoning.


Validation Checks

Code
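# Count missing values once so the printed total and the assertion agree
n_missing = int(df.isna().sum().sum())

print("Total missing values:", n_missing)
print("Numeric columns used:", num_cols)
print("Grouped summary shape:", group_summary.shape)
print("Median table shape:", median_table.shape)

assert n_missing == 0, "Missing values remain in the dataset."
assert "species" in df.columns, "Grouping variable species is missing."
# Each grouped table should contain one row per species
assert len(group_summary) == df["species"].nunique(), "Unexpected number of groups."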
Total missing values: 0
Numeric columns used: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
Grouped summary shape: (3, 9)
Median table shape: (3, 5)

Interpretation

Even in a summary-focused lesson, validation remains important.

It confirms that:

  • the dataset is still clean
  • expected variables are present
  • grouped summaries have the expected shape
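
These checks can be bundled into a small reusable function. The helper below, check_grouped_summary, is a hypothetical name and only a minimal sketch: it verifies the row count, the expected columns, and that each group mean lies within the observed range of the raw data. It is demonstrated on an invented toy frame rather than the lesson dataset.

```python
import pandas as pd


def check_grouped_summary(df, summary, group_col, value_cols):
    """Raise AssertionError if a table of group means does not match its source."""
    # One summary row per group present in the data
    assert len(summary) == df[group_col].nunique(), "Row count != number of groups."
    # The grouping column and every summarized column must be present
    missing = [c for c in [group_col, *value_cols] if c not in summary.columns]
    assert not missing, f"Missing columns: {missing}"
    # A group mean can never fall outside the observed range of the raw values
    for col in value_cols:
        in_range = summary[col].between(df[col].min(), df[col].max())
        assert in_range.all(), f"{col} has a mean outside the data range."
    return True


# Hypothetical toy data to exercise the helper
toy = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b"],
        "petal_length": [1.4, 1.6, 4.0, 4.4],
    }
)
toy_means = toy.groupby("species")["petal_length"].mean().reset_index()

print(check_grouped_summary(toy, toy_means, "species", ["petal_length"]))
```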

Exercise

Try the following:

  1. Compute the mean and standard deviation of sepal_width by species
  2. Create a table with the median of each numeric feature by species
  3. Write one short paragraph interpreting your results without causal language

Code
sepal_width_stats = (
    df.groupby("species", observed=False)
      .agg(
          sepal_width_mean=("sepal_width", "mean"),
          sepal_width_sd=("sepal_width", "std"),
      )
      .reset_index()
)

print("Sepal width mean and standard deviation by species:")
print(sepal_width_stats)

median_table_ex = (
    df.groupby("species", observed=False)[num_cols]
      .median()
      .reset_index()
)

print("\nMedian table by species:")
print(median_table_ex)

print("\nYour interpretation:")
print("The grouped summaries suggest that species differ descriptively across several measurements, with petal-related variables showing clearer separation than sepal-related variables.")
Sepal width mean and standard deviation by species:
      species  sepal_width_mean  sepal_width_sd
0      setosa          3.428000        0.379064
1  versicolor          2.770000        0.313798
2   virginica          2.979592        0.323380

Median table by species:
      species  sepal_length  sepal_width  petal_length  petal_width
0      setosa           5.0          3.4          1.50          0.2
1  versicolor           5.9          2.8          4.35          1.3
2   virginica           6.5          3.0          5.60          2.0

Your interpretation:
The grouped summaries suggest that species differ descriptively across several measurements, with petal-related variables showing clearer separation than sepal-related variables.

Summary

  • you computed descriptive statistics for numeric and categorical variables
  • you compared group means, standard deviations, and medians
  • you used simple visuals to support statistical summaries
  • you examined linear relationships using a correlation matrix
  • you wrote an insights report grounded in descriptive evidence

This lesson closes the Foundations Track by showing how summary statistics support careful interpretation, not just calculation.

CDI Insight

Summary statistics are not the end of analysis.

They are a way to organize evidence so that interpretation becomes clearer and more defensible.

A responsible analyst does not report numbers alone.

They explain what those numbers mean — and what they do not.