Theme: Using summary statistics to move from numbers to interpretation
In this final lesson of the Foundations Track, you use summary statistics to move from raw measurements to interpretable evidence.
Summary statistics help describe the center, spread, and differences in a dataset without inspecting every row individually.
However, descriptive summaries are only useful when they are connected to clear interpretation.
In this lesson, you compute descriptive statistics, compare grouped summaries, examine simple relationships, and write careful insights grounded in evidence.
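As a small illustration of what "center" and "spread" mean in practice, the sketch below summarizes the five petal_length values that appear in the first rows of the dataset preview. The variable names are chosen for this example only and are not part of the lesson's own code.

```python
import pandas as pd

# The five petal_length values shown in the first rows of the dataset preview.
petal_length = pd.Series([1.4, 1.4, 1.3, 1.5, 1.4], name="petal_length")

# Center: where the values sit.
center_mean = petal_length.mean()
center_median = petal_length.median()

# Spread: how far the values vary around the center.
spread_std = petal_length.std()  # sample standard deviation (ddof=1)
value_range = petal_length.max() - petal_length.min()

print("Mean:", round(center_mean, 3))
print("Median:", round(center_median, 3))
print("Standard deviation:", round(spread_std, 3))
print("Range:", round(value_range, 3))
```

With only five rows, these numbers describe the preview and nothing more; the same calls apply unchanged to a full column.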
Lesson Overview
By the end of this lesson, you will be able to:
Compute descriptive statistics for numeric variables
Summarize categorical variables with counts
Compare groups using grouped means, medians, and standard deviations
Examine simple relationships between numeric features
Write short, careful interpretations based on descriptive evidence
Load the Dataset
We use the cleaned dataset created earlier in the guide.
Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/iris_clean.csv")
df.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Inspect the Structure
Before summarizing, confirm the structure of the dataset.
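One possible way to confirm structure is to look at the shape, the numeric columns, and the missing-value counts. The sketch below is a minimal example under stated assumptions: it uses a small hand-typed sample with the same column names as the lesson's dataset (the values are illustrative, not the full data), and the names sample, numeric_cols, and missing_per_column are introduced here for illustration.

```python
import pandas as pd

# Small illustrative sample with the same columns as the lesson's dataset.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
        "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
        "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
        "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
        "species": [
            "setosa", "setosa",
            "versicolor", "versicolor",
            "virginica", "virginica",
        ],
    }
)

# Shape: number of rows and columns.
n_rows, n_cols = sample.shape

# Which columns are numeric (candidates for means, medians, and so on).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Missing values per column; all zeros is what a cleaned dataset should show.
missing_per_column = sample.isna().sum()

print("Rows, columns:", (n_rows, n_cols))
print("Numeric columns:", numeric_cols)
print("Missing values per column:")
print(missing_per_column)
```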
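# Assumed definition: the step that creates species_counts is not shown here.
species_counts = df["species"].value_counts()

fig, ax = plt.subplots(figsize=(7, 4))
species_counts.plot(kind="bar", ax=ax)
ax.set_title("Count of Samples by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Count")
plt.show()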
[Figure: Iris, Count of Samples by Species]
Interpretation
Category counts help you assess whether groups are balanced.
Balanced group sizes make comparisons easier to interpret, while strong imbalance can affect how patterns appear in tables and plots.
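A quick numeric check of balance can complement the bar chart. The sketch below is illustrative only: it uses a short made-up list of species labels, and the max-to-min ratio is one simple convention for describing imbalance, not a rule taken from this lesson.

```python
import pandas as pd

# Illustrative labels only; the real column would come from the dataset.
species = pd.Series(
    ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 2,
    name="species",
)

# Absolute counts and relative shares per category.
counts = species.value_counts()
shares = species.value_counts(normalize=True)

# Ratio of the largest group to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(3))
print("Largest-to-smallest group ratio:", imbalance_ratio)
```

A ratio close to 1 suggests that grouped means and medians rest on similar amounts of data; a much larger ratio is a reason to read group comparisons more cautiously.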
Grouped Summary Statistics
A common workflow is to summarize numeric variables by a grouping column.
This can guide later feature selection and interpretation.
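The grouping workflow described above can be sketched as follows. This is a minimal example on a small illustrative sample, not the lesson's own summary code; the names sample and summary are introduced here for illustration.

```python
import pandas as pd

# Illustrative sample: three petal_length values per species.
sample = pd.DataFrame(
    {
        "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
        "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
    }
)

# One row per species, one column per statistic.
summary = (
    sample.groupby("species")["petal_length"]
    .agg(["mean", "median", "std"])
    .round(3)
)

print(summary)
```

Placing the mean, median, and standard deviation side by side makes it easier to see whether groups differ in center, in spread, or in both.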
Generate a Simple Insights Report
An insights report should link evidence to interpretation without overstating what the data can support.
Code
top_sep = (
    df.groupby("species", observed=False)["petal_length"]
    .mean()
    .sort_values(ascending=False)
    .reset_index(name="petal_length_mean")
)

report = f"""
Insights Report (Iris)

1) Dataset size
- Rows: {df.shape[0]}
- Columns: {df.shape[1]}

2) Group separation
- Petal-related features show strong descriptive differences across species.
- Mean petal length by species:
{top_sep.to_string(index=False)}

3) Consistency with earlier plots
- These grouped summaries are consistent with the separation seen in visualization-based exploration.

4) Caution
- These are descriptive patterns only.
- They support comparison, but they do not justify causal claims.
"""

print(report)
Insights Report (Iris)
1) Dataset size
- Rows: 149
- Columns: 5
2) Group separation
- Petal-related features show strong descriptive differences across species.
- Mean petal length by species:
   species  petal_length_mean
 virginica           5.561224
versicolor           4.260000
    setosa           1.462000
3) Consistency with earlier plots
- These grouped summaries are consistent with the separation seen in visualization-based exploration.
4) Caution
- These are descriptive patterns only.
- They support comparison, but they do not justify causal claims.
Interpretation
A good insights report:
states what the evidence shows
links evidence to careful interpretation
avoids claims that go beyond the analysis
This habit is central to CDI-style reasoning.
Validation Checks
Code
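# num_cols, group_summary, and median_table are assumed to come from earlier steps in this lesson.
print("Total missing values:", int(df.isna().sum().sum()))
print("Numeric columns used:", num_cols)
print("Grouped summary shape:", group_summary.shape)
print("Median table shape:", median_table.shape)

# Fail loudly if the dataset is no longer clean or the grouping column is gone.
assert int(df.isna().sum().sum()) == 0, "Missing values remain in the dataset."
assert "species" in df.columns, "Grouping variable species is missing."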
Even in a summary-focused chapter, validation remains important.
It confirms that:
the dataset is still clean
expected variables are present
grouped summaries were constructed correctly
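The third point, that grouped summaries were constructed correctly, can also be checked in code. The sketch below shows one possible approach under stated assumptions: check_grouped_summary is a hypothetical helper written for this example (it is not part of pandas or of the lesson's code), and it expects the summary to be indexed by the grouping column.

```python
import pandas as pd


def check_grouped_summary(data, summary, group_col):
    """Return a list of problems; an empty list means every check passed."""
    if group_col not in data.columns:
        return [f"Grouping column {group_col!r} is missing."]

    problems = []
    if int(data.isna().sum().sum()) > 0:
        problems.append("Missing values remain in the dataset.")
    # The summary is expected to be indexed by group, one row per group.
    if set(summary.index) != set(data[group_col].dropna().unique()):
        problems.append("Summary groups do not match the groups in the data.")
    if summary.isna().any().any():
        problems.append("Summary contains missing values.")
    return problems


# Illustrative sample: two sepal_width values per species.
sample = pd.DataFrame(
    {
        "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
        "species": [
            "setosa", "setosa",
            "versicolor", "versicolor",
            "virginica", "virginica",
        ],
    }
)
good_summary = sample.groupby("species")[["sepal_width"]].mean()

print(check_grouped_summary(sample, good_summary, "species"))
```

Returning a list of messages rather than raising on the first failure is a design choice for this sketch; it lets you see every problem at once before deciding how to proceed.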
Exercise
Try the following:
Compute the mean and standard deviation of sepal_width by species
Create a table with the median of each numeric feature by species
Write one short paragraph interpreting your results without causal language
Solution
Code
sepal_width_stats = (
    df.groupby("species", observed=False)
    .agg(
        sepal_width_mean=("sepal_width", "mean"),
        sepal_width_sd=("sepal_width", "std"),
    )
    .reset_index()
)

print("Sepal width mean and standard deviation by species:")
print(sepal_width_stats)

median_table_ex = (
    df.groupby("species", observed=False)[num_cols]
    .median()
    .reset_index()
)

print("\nMedian table by species:")
print(median_table_ex)

print("\nYour interpretation:")
print("The grouped summaries suggest that species differ descriptively across several measurements, with petal-related variables showing clearer separation than sepal-related variables.")
Sepal width mean and standard deviation by species:
      species  sepal_width_mean  sepal_width_sd
0      setosa          3.428000        0.379064
1  versicolor          2.770000        0.313798
2   virginica          2.979592        0.323380
Median table by species:
      species  sepal_length  sepal_width  petal_length  petal_width
0      setosa           5.0          3.4          1.50          0.2
1  versicolor           5.9          2.8          4.35          1.3
2   virginica           6.5          3.0          5.60          2.0
Your interpretation:
The grouped summaries suggest that species differ descriptively across several measurements, with petal-related variables showing clearer separation than sepal-related variables.
Summary
In this lesson:
you computed descriptive statistics for numeric and categorical variables
you compared group means, standard deviations, and medians
you used simple visuals to support statistical summaries
you examined linear relationships using a correlation matrix
you wrote an insights report grounded in descriptive evidence
This chapter closes the Foundations Track by showing how summary statistics support careful interpretation, not just calculation.
CDI Insight
Summary statistics are not the end of analysis.
They are a way to organize evidence so that interpretation becomes clearer and more defensible.
A responsible analyst does not report numbers alone.
They explain what those numbers mean — and what they do not.