Load and Explore a Dataset
Analytical work does not begin with modeling.
It begins with looking carefully at the data.
Before fitting models or drawing conclusions, we need to understand:
- what the dataset contains
- how variables are structured
- whether values look plausible
- whether early patterns deserve closer attention
In CDI, this stage is not a formality. It is where analytical judgment begins.
In this lesson, we use the classic Iris dataset (Fisher 1936) as a small example table. The goal is not to study flowers. The goal is to learn a reusable pattern for loading, inspecting, saving, and exploring a tidy dataset.
You will:
- create an example dataset
- save it as data/iris.csv
- inspect its structure and column types
- generate table inspection outputs
- create early exploratory plots
- prepare the dataset for reuse in later lessons
Why this step matters
A dataset can appear clean at first glance and still contain issues that affect later analysis.
Early exploration helps answer practical questions:
- how many rows and columns are present?
- which variables are numeric and which are categorical?
- do any values look unusual?
- are there visible group differences?
- what should be examined more carefully next?
This is where defensible analysis begins.
Chapter workflow
This chapter introduces the first reusable Python script in the system:
02-load-and-explore-dataset.qmd
↓
scripts/python/inspect_table.py
↓
data/iris.csv
results/inspection/table-inspection-summary.txt
results/inspection/table-column-summary.tsv
results/inspection/table-missing-values.tsv
The chapter explains the workflow. The script makes it reusable.
Create and save the example dataset
We will use the Iris dataset from scikit-learn, convert it to a pandas DataFrame, standardize the column names, and save it as a CSV file.
Run this from the project root using Python:
from pathlib import Path
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.columns = [
"sepal_length",
"sepal_width",
"petal_length",
"petal_width",
"species"
]
Path("data").mkdir(exist_ok=True)
df.to_csv("data/iris.csv", index=False)
print("Saved dataset to: data/iris.csv")
print(df.head())
What happened here?
- iris.data contains the numeric measurements.
- iris.target contains the species labels encoded as integers.
- pd.Categorical.from_codes() converts the encoded values into readable species names.
- the columns are renamed to snake_case so later code stays clean and consistent.
- the dataset is saved to data/iris.csv so downstream lessons can reuse the same input.
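To make the decoding step concrete, here is a minimal sketch of how pd.Categorical.from_codes() maps integer codes to names. The codes and names lists are typed by hand for illustration; in the lesson they come from iris.target and iris.target_names.

```python
import pandas as pd

# Each integer code is an index into the list of category names:
# 0 -> "setosa", 1 -> "versicolor", 2 -> "virginica".
codes = [0, 0, 1, 2, 1]
names = ["setosa", "versicolor", "virginica"]

species = pd.Categorical.from_codes(codes, categories=names)

print(list(species))   # decoded labels, in the original row order
print(species.codes)   # the integer codes are still stored underneath
```

The readable labels are what end up in the CSV, which is why the saved file can be understood without the original encoding.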
Inspect the dataset manually
Before plotting or summarizing, inspect the shape and variable types.
import pandas as pd
df = pd.read_csv("data/iris.csv")
print("Columns:")
print(df.columns.tolist())
print("\nShape (rows, columns):")
print(df.shape)
print("\nData types:")
print(df.dtypes)
Interpretation
At this stage, you should notice:
- the dataset has 150 rows
- there are 4 numeric measurement columns
- species is a categorical grouping variable
That already tells us this is a small, tidy dataset that is well suited for learning exploratory analysis.
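One detail worth checking at this point is that a CSV file does not store pandas dtypes. The sketch below uses a hand-typed toy table (not the full Iris data) and an in-memory buffer instead of a file to show that a categorical column comes back as plain text after a round trip, and how select_dtypes() separates numeric columns from the rest.

```python
import io
import pandas as pd

# Hand-typed toy rows for illustration only (not the full Iris table).
toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": pd.Categorical(["setosa", "versicolor", "virginica"]),
})

# Round-trip through CSV text in memory instead of writing a file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
back = pd.read_csv(buffer)

# Split the columns by type, as an inspection step would.
numeric_cols = back.select_dtypes(include="number").columns.tolist()
other_cols = back.select_dtypes(exclude="number").columns.tolist()

print(toy["species"].dtype)    # category
print(back["species"].dtype)   # text after the round trip, no longer category
print(numeric_cols, other_cols)
```

If a later step needs a true categorical column, back["species"].astype("category") restores it after loading.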
Preview the data more carefully
df.sample(8, random_state=42)
Looking at a few random rows is often more informative than only checking the first five rows.
It helps confirm that values look realistic across the dataset instead of only at the top.
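The random_state argument is what makes the preview reproducible. A small sketch with a toy table (the row_id column is hypothetical) shows that the same seed returns the same rows, while head() only ever shows the top of the file.

```python
import pandas as pd

# Toy table with 20 numbered rows, for illustration only.
toy = pd.DataFrame({"row_id": range(20)})

first = toy.sample(5, random_state=42)
second = toy.sample(5, random_state=42)

print(toy.head().index.tolist())   # always the first five rows
print(first.index.tolist())        # rows drawn from across the table
print(first.equals(second))        # same seed, same sample
```

Fixing the seed means anyone rerunning the lesson sees the same preview.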
Run the reusable inspection script
The manual checks above are useful for learning. For a reusable CDI system, we also want a script that can inspect any tidy table.
Run:
python scripts/python/inspect_table.py data/iris.csv results/inspection
This creates:
results/inspection/
├── table-inspection-summary.txt
├── table-column-summary.tsv
└── table-missing-values.tsv
These outputs document the structure of the table and can be reused later for reporting or quality checks.
What the inspection script does
The script checks:
- file path
- table dimensions
- column names
- data types
- missing values
- numeric columns
- categorical columns
This turns early exploration into a repeatable step.
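The source of inspect_table.py is not shown in this chapter, so the following is only a rough sketch of how checks like these could be collected with pandas. The function name summarize_table and the dictionary keys are hypothetical, and treating every non-numeric column as categorical is a simplifying assumption.

```python
import pandas as pd


def summarize_table(df: pd.DataFrame) -> dict:
    """Collect basic structural facts about a tidy table (hypothetical helper)."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "columns": df.columns.tolist(),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "numeric_columns": numeric,
        # Simplification: anything that is not numeric is reported as categorical.
        "categorical_columns": [c for c in df.columns if c not in numeric],
    }


# Toy table with one deliberately missing value, for illustration only.
toy = pd.DataFrame({
    "petal_length": [1.4, None, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

summary = summarize_table(toy)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Returning a plain dictionary keeps the checks separate from how they are written to disk, so the same result could feed a text summary or a TSV file.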
Summary statistics
Use pandas to generate a first summary of the numeric columns:
df.describe()
Interpretation
These summary statistics help answer early questions:
- What is the average size of each measurement?
- Which variables have wider spread?
- Are minimum and maximum values plausible?
- Do some variables appear more variable than others?
For Iris, petal measurements separate the species more clearly than sepal measurements. The pooled summary above cannot show this on its own, so we will confirm it visually next.
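A grouped summary is one way to look at separation numerically before plotting. The sketch below uses a few hand-typed toy rows rather than the full dataset, so the numbers only illustrate the pattern of the calculation, not the real Iris statistics.

```python
import pandas as pd

# Hand-typed toy rows for illustration only (not the full Iris table).
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.7, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
})

# One row per species, one column per measurement.
group_means = toy.groupby("species")[["sepal_length", "petal_length"]].mean()
print(group_means)

# Gap between the largest and smallest group mean, per measurement.
print(group_means.max() - group_means.min())
```

On the real table, the same two lines can be run on df with all four measurement columns.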
Species counts
Before comparing groups, verify that each group is represented.
df["species"].value_counts().sort_index()
Interpretation
This dataset is balanced across species, which makes visual comparison easier.
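Balance can also be checked in code rather than by eye. This is a minimal sketch with hand-typed toy labels; the is_balanced name is hypothetical, and "balanced" is taken here to mean that every group has exactly the same number of rows.

```python
import pandas as pd

# Toy labels for illustration only: two rows per species.
species = pd.Series(["setosa", "virginica", "versicolor",
                     "setosa", "versicolor", "virginica"])

counts = species.value_counts().sort_index()
shares = species.value_counts(normalize=True).sort_index()

# Balanced here means every group has exactly the same number of rows.
is_balanced = counts.nunique() == 1

print(counts)
print(shares.round(3))
print("Balanced:", is_balanced)
```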
Exploratory visualizations
We will use seaborn and matplotlib for exploratory plotting.
These plots are not final presentation graphics. Their purpose is to help us inspect the data and notice patterns worth explaining.
Distribution of numeric features
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid", context="notebook")
df_long = df.melt(
value_vars=["sepal_length", "sepal_width", "petal_length", "petal_width"],
var_name="feature",
value_name="value"
)
g = sns.displot(
data=df_long,
x="value",
col="feature",
col_wrap=2,
bins=12,
height=3.6,
aspect=1.15,
facet_kws={"sharex": False, "sharey": False}
)
g.set_titles("{col_name}")
g.set_axis_labels("Value", "Count")
g.fig.suptitle("Iris — Distribution of Numeric Features", y=1.03)
plt.show()
Interpretation
These histograms help us see:
- the overall spread of each feature
- whether values are concentrated or dispersed
- whether some variables may contain overlapping or separated groups
These histograms pool all three species, so they cannot show which species produced which values, but they give a first impression of the measurement ranges.
Boxplot by species
A boxplot gives a compact summary of group differences using the median, quartiles, and overall spread.
fig, ax = plt.subplots(figsize=(8, 5.5))
sns.boxplot(
data=df,
x="species",
y="petal_length",
width=0.5,
fliersize=0,
ax=ax
)
ax.set_title("Petal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Length")
plt.show()
Interpretation
This plot makes it easy to compare:
- the typical petal length in each species
- the spread within each group
- whether groups overlap
However, a boxplot hides the individual observations. That means we cannot directly see how densely values are clustered or how points are distributed within each group.
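The quantities a boxplot draws can be computed directly. The sketch below uses hand-typed toy values and the conventional 1.5 * IQR rule for the whisker limits, which is the usual default in plotting libraries. The variable names are illustrative.

```python
import pandas as pd

# Toy values for illustration only; 20.0 is a deliberately extreme point.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 20.0])

# The box spans Q1 to Q3, with a line at the median.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Conventional 1.5 * IQR fences; points beyond them are drawn as fliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence})")
print("outliers:", outliers.tolist())
```

Note that with fliersize=0, as used in the plot above, points beyond the fences should be drawn with zero-size markers and so will not be visible.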
Boxplot with observed points
To make the distribution more visible, we can overlay the individual observations.
fig, ax = plt.subplots(figsize=(8, 5.5))
sns.boxplot(
data=df,
x="species",
y="petal_length",
width=0.5,
fliersize=0,
ax=ax
)
sns.stripplot(
data=df,
x="species",
y="petal_length",
color="black",
alpha=0.6,
size=4,
jitter=0.22,
ax=ax
)
ax.set_title("Petal Length by Species")
ax.set_xlabel("Species")
ax.set_ylabel("Petal Length")
plt.show()
Interpretation
Adding the observed points makes the plot more informative.
Now we can see:
- how values are distributed within each species
- whether observations are tightly clustered or more dispersed
- where overlap is limited or more substantial
Petal length differs markedly by species, which makes it a strong candidate for distinguishing between groups in later analysis.
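Overlap between groups can be roughly quantified by comparing each group's minimum with the maximum of the group below it. The sketch uses hand-typed toy values, and the overlaps_previous column is a hypothetical name. It only compares neighbouring groups after sorting by minimum, which is a simplification.

```python
import pandas as pd

# Hand-typed toy rows for illustration only (not the full Iris table).
toy = pd.DataFrame({
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
    "petal_length": [1.0, 1.5, 1.9, 3.0, 4.5, 5.1, 4.5, 5.6, 6.9],
})

# Observed range per species, ordered from the smallest minimum upward.
ranges = (
    toy.groupby("species")["petal_length"]
    .agg(["min", "max"])
    .sort_values("min")
)

# A group overlaps its lower neighbour if it starts at or below that
# neighbour's maximum. The first group has no neighbour, so it is False.
ranges["overlaps_previous"] = ranges["min"] <= ranges["max"].shift()

print(ranges)
```

A range check like this ignores how many points fall in the overlapping region, which is exactly what the strip plot makes visible.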
Scatter plot: sepal length vs petal length
fig, ax = plt.subplots(figsize=(8, 5.5))
sns.scatterplot(
data=df,
x="sepal_length",
y="petal_length",
hue="species",
s=70,
alpha=0.8,
ax=ax
)
ax.set_title("Sepal Length vs Petal Length")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Petal Length")
plt.show()
Interpretation
Scatter plots help us inspect relationships between two variables.
Here, the species begin to separate into visible clusters, especially when petal length is involved.
This is a strong exploratory signal that some measurements carry more discriminatory information than others.
Pairwise relationships
g = sns.pairplot(
df,
hue="species",
corner=True,
diag_kind="hist",
plot_kws={"alpha": 0.7, "s": 45}
)
g.fig.suptitle("Iris — Pairwise Relationships by Species", y=1.02)
plt.show()
Interpretation
The pairplot provides a broader view of the dataset:
- which variable pairs show clear separation
- where clusters overlap
- which variables appear more useful for distinguishing species
This type of overview is valuable before moving to formal modeling.
Create a simple derived feature
Feature creation is often part of exploration.
Here, we define a new variable, petal_area, as a simple combination of petal length and petal width.
df["petal_area"] = df["petal_length"] * df["petal_width"]
df.head()
This does not mean the new feature is automatically better.
It illustrates how exploratory analysis can lead to new candidate variables.
Exercise
Try the following:
- Print the last five rows of the dataset.
- Compute the mean of petal_area for each species.
- Create a scatter plot of sepal_width versus petal_width.
- Write one sentence describing which feature seems most useful for separating species.
print("Last 5 rows:")
print(df.tail())
print("\nMean petal_area by species:")
print(df.groupby("species", observed=False)["petal_area"].mean())
fig, ax = plt.subplots(figsize=(8, 5.5))
sns.scatterplot(
data=df,
x="sepal_width",
y="petal_width",
hue="species",
s=70,
alpha=0.8,
ax=ax
)
ax.set_title("Sepal Width vs Petal Width")
ax.set_xlabel("Sepal Width")
ax.set_ylabel("Petal Width")
plt.show()
Interpretation:
Petal-based features provide clearer separation between species than sepal-based features.
CDI Insight
Exploration is not just about making plots.
It is about understanding how the dataset behaves before it is used to support conclusions.
A responsible analyst does not move directly from data loading to modeling.
They pause, inspect, compare, and question, and only then move forward.
That habit is a foundation of reliable analysis.
Summary
In this lesson, you:
- created and saved the Iris dataset as data/iris.csv
- inspected structure, data types, and summary statistics
- generated reusable inspection outputs with inspect_table.py
- used exploratory visualizations to examine distributions and group differences
- created a simple derived feature for further exploration
Looking Ahead
In the next chapter, we clean and prepare the dataset for analysis. The cleaned table will become the starting point for downstream wrangling, visualization, and summary statistics.