Summary Statistics and Insights
Summary statistics help move analysis from individual rows to interpretable evidence.
They describe the center, spread, balance, and relationships in a dataset without requiring us to inspect every observation individually.
However, descriptive summaries are only useful when they are connected to careful interpretation.
In this lesson, we compute descriptive statistics, compare grouped summaries, examine simple relationships, and write short insights grounded in evidence.
This chapter closes the core tidy-table workflow introduced in the Data Science Foundations System:
inspect
↓
clean
↓
wrangle
↓
visualize
↓
summarize and interpret
Lesson overview
By the end of this lesson, you will be able to:
- compute descriptive statistics for numeric variables
- summarize categorical variables with counts
- compare groups using means, medians, and standard deviations
- examine simple relationships between numeric features
- save reusable summary tables
- write careful interpretations based on descriptive evidence
- run a reusable summarization script from the command line
Chapter workflow
This chapter introduces the fifth reusable Python script in the system:
06-summary-statistics-and-insights.qmd
↓
scripts/python/summarize_table.py
↓
data/iris_wrangled.csv
results/summary/
Expected outputs:
results/summary/
├── numeric-summary.tsv
├── species-counts.tsv
├── grouped-summary.tsv
├── median-summary.tsv
├── feature-separation.tsv
├── correlation-matrix.tsv
└── analysis-insights.md
These outputs provide a compact evidence package for reporting and interpretation.
Load the wrangled dataset
We use the wrangled dataset created in Chapter 04.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("data/iris_wrangled.csv")
df.head()
Inspect the structure
Before summarizing, confirm the structure of the dataset.
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
Interpretation
Before computing summaries, ask:
- which variables are numeric?
- which variable defines the groups?
- which derived features are available?
- which summaries are likely to be most informative?
For the Iris dataset, the flower measurements and petal_area are numeric, while species provides a natural grouping variable.
Descriptive statistics for numeric variables
describe() gives a compact summary of central tendency and spread.
num_cols = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "petal_area"
]
numeric_summary = df[num_cols].describe().transpose()
numeric_summary
Interpretation
This table helps you review:
- the typical range of each feature
- the approximate center using the mean
- variability using the standard deviation
- minimum and maximum values
- quartiles that describe spread
A numeric summary is often the first checkpoint before deeper comparison.
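The describe() table can be extended with a couple of derived spread measures. The sketch below is illustrative only: it uses a small hypothetical table (toy) in place of df[num_cols], and the added iqr and cv columns are not part of the chapter's saved outputs.

```python
import pandas as pd

# Hypothetical toy table standing in for df[num_cols]; values are illustrative only.
toy = pd.DataFrame(
    {
        "petal_length": [1.4, 1.5, 4.5, 4.7, 5.9, 6.1],
        "petal_width": [0.2, 0.2, 1.4, 1.5, 2.1, 2.3],
    }
)

summary = toy.describe().transpose()

# Interquartile range: the spread of the middle 50% of values.
summary["iqr"] = summary["75%"] - summary["25%"]

# Coefficient of variation: standard deviation relative to the mean (unitless).
summary["cv"] = summary["std"] / summary["mean"]

print(summary[["mean", "std", "25%", "75%", "iqr", "cv"]].round(3))
```

Because the coefficient of variation is unitless, it can make variability easier to compare across features measured on different scales. It is only meaningful when the mean is well away from zero.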
Visual check of numeric distributions
A simple set of histograms helps connect the summary table to visible distributions.
df_long = df.melt(
    value_vars=num_cols,
    var_name="feature",
    value_name="value"
)
g = sns.displot(
    data=df_long,
    x="value",
    col="feature",
    col_wrap=3,
    bins=12,
    kde=True,
    height=3.3,
    aspect=1.15
)
g.fig.suptitle("Iris — Distributions of Numeric Features", y=1.03)
plt.show()
Interpretation
These histograms help you assess:
- whether distributions are narrow or wide
- whether some features appear more variable than others
- whether there may be multiple peaks suggesting subgroup structure
A summary table alone may not reveal distribution shape.
Categorical summary
For a categorical variable, counts are often the simplest descriptive summary.
species_counts = (
    df["species"]
    .value_counts()
    .sort_index()
    .reset_index()
)
species_counts.columns = ["species", "n"]
species_counts
Visual: species distribution
fig, ax = plt.subplots(figsize=(7, 4))
sns.countplot(
    data=df,
    x="species",
    ax=ax
)
ax.set_title("Count of Samples by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Count")
plt.show()
Interpretation
Category counts help you assess whether groups are balanced.
Balanced group sizes make comparisons easier to interpret, while strong imbalance can affect how patterns appear in tables and plots.
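Balance can also be expressed numerically. The following is a minimal sketch using a hypothetical, deliberately unbalanced species column; the names balance and imbalance_ratio are illustrative and do not appear elsewhere in the chapter.

```python
import pandas as pd

# Hypothetical, deliberately unbalanced grouping column (illustrative only).
species = pd.Series(
    ["setosa"] * 5 + ["versicolor"] * 5 + ["virginica"] * 2,
    name="species",
)

counts = species.value_counts().sort_index()

# Counts next to proportions make imbalance easier to read.
balance = pd.DataFrame({"n": counts, "proportion": counts / counts.sum()})

# Ratio of the largest to the smallest group; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(balance)
print("Largest-to-smallest group ratio:", imbalance_ratio)
```

A ratio close to 1.0 indicates groups of similar size. Larger ratios are a prompt to read grouped means and plots with more care.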
Grouped summary statistics
A common workflow is to summarize numeric variables by a grouping column.
group_summary = (
    df.groupby("species", observed=False)
    .agg(
        n=("species", "size"),
        sepal_length_mean=("sepal_length", "mean"),
        sepal_length_sd=("sepal_length", "std"),
        sepal_width_mean=("sepal_width", "mean"),
        sepal_width_sd=("sepal_width", "std"),
        petal_length_mean=("petal_length", "mean"),
        petal_length_sd=("petal_length", "std"),
        petal_width_mean=("petal_width", "mean"),
        petal_width_sd=("petal_width", "std"),
        petal_area_mean=("petal_area", "mean"),
        petal_area_sd=("petal_area", "std")
    )
    .reset_index()
)
group_summary
Interpretation
Grouped summaries help answer questions such as:
- which species has the largest average petals?
- which variable differs most across species?
- which groups appear similar and which appear distinct?
This is where descriptive statistics begin to support insight.
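When many columns share the same statistics, the named aggregation can be written more compactly. This is one possible alternative rather than the chapter's method: it applies a list of functions to several columns and then flattens the resulting two-level column index. The toy table and the suffix mapping are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy table; values are illustrative only.
toy = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
        "petal_length": [1.4, 1.6, 4.0, 4.4, 5.5, 6.1],
        "petal_width": [0.2, 0.4, 1.2, 1.4, 2.0, 2.4],
    }
)
cols = ["petal_length", "petal_width"]

# Apply the same statistics to every selected column.
wide = toy.groupby("species", observed=False)[cols].agg(["mean", "std"])

# Flatten the (feature, statistic) column pairs into names like petal_length_sd.
suffix = {"mean": "mean", "std": "sd"}
wide.columns = [f"{feature}_{suffix[stat]}" for feature, stat in wide.columns]

# Add group sizes as the first column, then restore species as a regular column.
wide.insert(0, "n", toy.groupby("species", observed=False).size())
wide = wide.reset_index()

print(wide.round(3))
```

The explicit named aggregation remains easier to read for a handful of columns. The compact form mainly helps as the list of numeric columns grows.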
Median table by species
Means are useful, but medians can provide a more robust summary when you want a center less influenced by unusual values.
median_table = (
    df.groupby("species", observed=False)[num_cols]
    .median()
    .reset_index()
)
median_table
Interpretation
Comparing means and medians helps you assess whether a group’s center is stable or influenced by a few values.
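The mean-versus-median comparison can be made explicit by placing both in one table. The sketch below uses hypothetical data in which group "b" contains one unusually large value; the column name mean_minus_median is illustrative.

```python
import pandas as pd

# Hypothetical data: group "b" contains one unusually large value (9.0).
toy = pd.DataFrame(
    {
        "species": ["a"] * 5 + ["b"] * 5,
        "petal_length": [1.0, 1.1, 1.2, 1.3, 1.4, 4.0, 4.1, 4.2, 4.3, 9.0],
    }
)

center = toy.groupby("species", observed=False)["petal_length"].agg(["mean", "median"])

# A gap between mean and median hints that a few values are pulling the mean.
center["mean_minus_median"] = center["mean"] - center["median"]

print(center.round(3))
```

In this toy example the gap is zero for group "a" and clearly positive for group "b", where a single large value pulls the mean upward while the median stays put.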
Visual comparison of group means
fig, ax = plt.subplots(figsize=(8, 5.5))
sns.barplot(
    data=df,
    x="species",
    y="petal_length",
    errorbar="sd",
    ax=ax
)
ax.set_title("Mean Petal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Length")
plt.show()
Interpretation
This plot combines summary and variation.
It helps you compare:
- average petal length across species
- within-group spread using standard deviation
- whether group differences appear large or modest
When this agrees with earlier visualizations, confidence in the pattern increases.
Key feature differences
A simple way to identify strong descriptive separation is to compare the range of group means across variables.
feature_separation = []
for col in num_cols:
    group_means = df.groupby("species", observed=False)[col].mean()
    feature_separation.append(
        {
            "feature": col,
            "min_group_mean": group_means.min(),
            "max_group_mean": group_means.max(),
            "range_of_group_means": group_means.max() - group_means.min()
        }
    )
feature_separation = (
    pd.DataFrame(feature_separation)
    .sort_values("range_of_group_means", ascending=False)
)
feature_separation
Interpretation
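A larger difference in group means suggests that a feature may separate species more clearly.
Because each range is expressed in the feature's own units, features measured on larger scales tend to show larger raw ranges (a derived area, for example, is not on the same scale as a length), so rankings across features should be read with that in mind.
This does not establish classification performance, but it provides a useful descriptive signal about which variables may be more informative.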
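One way to make the range of group means comparable across features is to divide it by the pooled within-group standard deviation. This is a swapped-in variant, not the chapter's feature_separation table; the helper scaled_separation and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: petal_area is on a larger scale than petal_length.
toy = pd.DataFrame(
    {
        "species": ["a", "a", "a", "b", "b", "b"],
        "petal_length": [1.0, 2.0, 3.0, 5.0, 6.0, 7.0],
        "petal_area": [10.0, 20.0, 30.0, 20.0, 30.0, 40.0],
    }
)


def scaled_separation(data, group_col, value_col):
    """Range of group means divided by the pooled within-group SD."""
    grouped = data.groupby(group_col, observed=False)[value_col]
    means = grouped.mean()
    n = grouped.size()
    var = grouped.var()  # sample variance (ddof=1) within each group
    pooled_sd = np.sqrt(((n - 1) * var).sum() / (n - 1).sum())
    return (means.max() - means.min()) / pooled_sd


for col in ["petal_length", "petal_area"]:
    raw_range = toy.groupby("species", observed=False)[col].mean().agg(lambda m: m.max() - m.min())
    print(col, "raw range:", raw_range, "scaled:", scaled_separation(toy, "species", col))
```

In this toy table the raw range ranks petal_area first, while the scaled version ranks petal_length first, because the petal_length groups are far apart relative to their own spread. The scaled value is still descriptive only.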
Correlation matrix
Correlations summarize linear relationships among numeric variables. By default, corr() computes pairwise Pearson correlation coefficients.
corr = df[num_cols].corr()
corr
Correlation heatmap
fig, ax = plt.subplots(figsize=(8, 5.5))
sns.heatmap(
    corr,
    annot=True,
    fmt=".2f",
    ax=ax
)
ax.set_title("Correlation Heatmap")
plt.show()
Interpretation
A correlation heatmap helps you see:
- which measurements tend to increase together
- which relationships appear weak
- where redundancy may exist between variables
This can guide later feature selection and interpretation.
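A correlation computed over all rows mixes within-group and between-group patterns. The sketch below uses hypothetical values constructed so that the pooled correlation is positive while the correlation inside each group is negative; it illustrates why checking correlations per group can be worthwhile, and makes no claim about the actual Iris values.

```python
import pandas as pd

# Hypothetical values: within each group the two columns move in opposite
# directions, but group "b" is larger than group "a" on both measurements.
toy = pd.DataFrame(
    {
        "species": ["a"] * 4 + ["b"] * 4,
        "sepal_length": [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0],
        "sepal_width": [4.0, 3.0, 2.0, 1.0, 9.0, 8.0, 7.0, 6.0],
    }
)

# Pearson correlation over all rows, ignoring the grouping.
pooled_r = toy["sepal_length"].corr(toy["sepal_width"])

# Pearson correlation computed separately inside each group.
within_r = {
    name: group["sepal_length"].corr(group["sepal_width"])
    for name, group in toy.groupby("species", observed=False)
}

print("Pooled correlation:", round(pooled_r, 3))
print("Within-group correlations:", {k: round(v, 3) for k, v in within_r.items()})
```

When pooled and within-group correlations disagree, the pooled value mostly reflects differences between groups rather than a relationship shared by individual observations.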
Generate a simple insights report
An insights report should link evidence to interpretation without overstating what the data can support.
top_sep = feature_separation.iloc[0]
top_petal_length = (
    df.groupby("species", observed=False)["petal_length"]
    .mean()
    .sort_values(ascending=False)
    .reset_index(name="petal_length_mean")
)
report = f"""# Insights Report: Iris Dataset
## 1. Dataset size
- Rows: {df.shape[0]}
- Columns: {df.shape[1]}
## 2. Group separation
Petal-related features show strong descriptive differences across species.
Mean petal length by species:
{top_petal_length.to_markdown(index=False)}
## 3. Strongest descriptive separation
The feature with the largest range of group means is `{top_sep["feature"]}`.
- Minimum group mean: {top_sep["min_group_mean"]:.3f}
- Maximum group mean: {top_sep["max_group_mean"]:.3f}
- Range of group means: {top_sep["range_of_group_means"]:.3f}
## 4. Consistency with earlier plots
These grouped summaries are consistent with the separation seen in visualization-based exploration.
## 5. Caution
These are descriptive patterns only. They support comparison, but they do not justify causal claims.
"""
print(report)
Interpretation
A good insights report:
- states what the evidence shows
- links evidence to careful interpretation
- avoids claims that go beyond the analysis
This habit is central to CDI-style reasoning.
Validation checks
print("Total missing values:", int(df.isna().sum().sum()))
print("Numeric columns used:", num_cols)
print("Grouped summary shape:", group_summary.shape)
print("Median table shape:", median_table.shape)
assert int(df.isna().sum().sum()) == 0, "Missing values remain in the dataset."
assert "species" in df.columns, "Grouping variable species is missing."
assert len(num_cols) > 0, "No numeric columns selected."
Interpretation
Even in a summary-focused chapter, validation remains important.
It confirms that:
- the dataset is still clean
- expected variables are present
- the grouped summary tables have the expected shapes
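Shape checks can be complemented with checks on the grouped values themselves. The following is a sketch on a hypothetical toy table; the specific assertions are suggestions, not checks taken from the chapter's script.

```python
import pandas as pd

# Hypothetical toy table; values are illustrative only.
toy = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b", "b", "c"],
        "petal_length": [1.0, 2.0, 4.0, 5.0, 6.0, 7.0],
    }
)

group_summary = (
    toy.groupby("species", observed=False)
    .agg(
        n=("petal_length", "size"),
        petal_length_mean=("petal_length", "mean"),
    )
    .reset_index()
)

# Every row should be counted in exactly one group.
assert group_summary["n"].sum() == len(toy), "Group sizes do not add up."

# One summary row per distinct group.
assert len(group_summary) == toy["species"].nunique(), "Unexpected number of groups."

# The size-weighted average of group means should equal the overall mean.
weighted = (group_summary["petal_length_mean"] * group_summary["n"]).sum() / group_summary["n"].sum()
assert abs(weighted - toy["petal_length"].mean()) < 1e-9, "Group means are inconsistent."

print("Grouped summary checks passed.")
```

These checks rely on there being no missing values in the grouping column or the measured column, which matches the earlier assertion that no missing values remain.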
Save summary outputs
Save the tables and report so they can be reused later.
from pathlib import Path
Path("results/summary").mkdir(parents=True, exist_ok=True)
numeric_summary.to_csv("results/summary/numeric-summary.tsv", sep="\t")
species_counts.to_csv("results/summary/species-counts.tsv", sep="\t", index=False)
group_summary.to_csv("results/summary/grouped-summary.tsv", sep="\t", index=False)
median_table.to_csv("results/summary/median-summary.tsv", sep="\t", index=False)
feature_separation.to_csv("results/summary/feature-separation.tsv", sep="\t", index=False)
corr.to_csv("results/summary/correlation-matrix.tsv", sep="\t")
Path("results/summary/analysis-insights.md").write_text(report, encoding="utf-8")
Run the reusable summary script
The manual steps above explain the logic. The reusable script creates the summary evidence package from the command line.
Run this from the project root:
python scripts/python/summarize_table.py data/iris_wrangled.csv results/summary
Expected outputs:
results/summary/
├── numeric-summary.tsv
├── species-counts.tsv
├── grouped-summary.tsv
├── median-summary.tsv
├── feature-separation.tsv
├── correlation-matrix.tsv
└── analysis-insights.md
What the summary script does
The script:
- reads the wrangled input table
- validates required columns
- creates numeric descriptive summaries
- creates categorical counts
- creates grouped mean and standard deviation summaries
- creates median summaries
- estimates descriptive feature separation
- computes a correlation matrix
- writes a short markdown insights report
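The contents of scripts/python/summarize_table.py are not shown in this chapter. As a rough illustration of how a command-line script with these responsibilities could be organized, the sketch below reads a table, validates columns, and writes two of the outputs. The function names summarize and main, the REQUIRED_COLUMNS list, and the toy data are all assumptions, not the actual script.

```python
import argparse
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical minimum set of columns; the real script may require more.
REQUIRED_COLUMNS = ["species", "petal_length"]


def summarize(input_csv, out_dir):
    """Read a table, validate it, and write a subset of the summary outputs."""
    df = pd.read_csv(input_csv)

    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    num_cols = df.select_dtypes("number").columns.tolist()

    # Numeric descriptive summary (index holds the feature names).
    df[num_cols].describe().transpose().to_csv(out / "numeric-summary.tsv", sep="\t")

    # Median of each numeric feature by species.
    medians = df.groupby("species", observed=False)[num_cols].median().reset_index()
    medians.to_csv(out / "median-summary.tsv", sep="\t", index=False)
    return out


def main(argv=None):
    parser = argparse.ArgumentParser(description="Write summary tables for a tidy table.")
    parser.add_argument("input_csv", help="Path to the wrangled CSV file")
    parser.add_argument("out_dir", help="Directory for the summary outputs")
    args = parser.parse_args(argv)
    return summarize(args.input_csv, args.out_dir)


# Demonstration on a temporary toy file instead of the project data.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy.csv"
    pd.DataFrame(
        {"species": ["a", "a", "b", "b"], "petal_length": [1.0, 2.0, 5.0, 6.0]}
    ).to_csv(csv_path, index=False)
    out_path = main([str(csv_path), str(Path(tmp) / "summary")])
    written = sorted(p.name for p in out_path.iterdir())

print(written)
```

Keeping the work in a plain function and the argument parsing in main makes the same logic usable from both a notebook and the command line.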
Exercise
Try the following:
- Compute the mean and standard deviation of sepal_width by species.
- Create a table with the median of each numeric feature by species.
- Open results/summary/analysis-insights.md.
- Write one short paragraph interpreting your results without causal language.
sepal_width_stats = (
    df.groupby("species", observed=False)
    .agg(
        sepal_width_mean=("sepal_width", "mean"),
        sepal_width_sd=("sepal_width", "std"),
    )
    .reset_index()
)
print("Sepal width mean and standard deviation by species:")
print(sepal_width_stats)
median_table_ex = (
    df.groupby("species", observed=False)[num_cols]
    .median()
    .reset_index()
)
print("\nMedian table by species:")
print(median_table_ex)
print("\nInterpretation:")
print("The grouped summaries suggest that species differ descriptively across several measurements, with petal-related variables showing clearer separation than sepal-related variables.")
CDI Insight
Summary statistics are not the end of analysis.
They are a way to organize evidence so that interpretation becomes clearer and more defensible.
A responsible analyst does not report numbers alone.
They explain what those numbers mean — and what they do not.
In CDI systems, summary tables and insights reports become reusable evidence, not just temporary output.
Summary
In this lesson, you:
- computed descriptive statistics for numeric and categorical variables
- compared group means, standard deviations, and medians
- used simple visuals to support statistical summaries
- examined linear relationships using a correlation matrix
- created reusable summary tables
- wrote an insights report grounded in descriptive evidence
- completed the core Data Science Foundations tidy-table workflow
Looking Ahead
The Foundations System has now produced a complete tidy-table analysis package: inspected data, cleaned data, wrangled outputs, figures, summary tables, and a written insights report. In the next chapter, we connect this foundation to modeling and the Applied Data Science System.