Appendix
This appendix collects technical reference material that supports the Foundations Track.
It documents the project structure, environment setup, rendering workflow, and reusable patterns used throughout the guide.
Project Structure Overview
A standard CDI Quarto-first project looks like this:
```
data-science/
├── index.qmd
├── 00-preface.qmd
├── 01-setting-up-environment.qmd
├── ...
├── data/
├── figures/
├── scripts/
│   └── bash/
├── docs/
├── _quarto.yml
└── requirements.txt
```
Key Directories
- `data/`: stores raw and cleaned datasets
- `figures/`: stores plots that are explicitly saved during rendering
- `scripts/bash/`: contains helper scripts such as `setup-env.sh` and `build.sh`
- `docs/`: contains the rendered Quarto site (GitHub Pages output)
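This layout can be scaffolded from the command line. A minimal sketch, run from the folder that should contain the project (file contents are filled in later):

```shell
# Create the directory skeleton described above.
mkdir -p data-science/{data,figures,scripts/bash,docs}

# Create the top-level files as empty placeholders.
touch data-science/index.qmd data-science/_quarto.yml data-science/requirements.txt
```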
Environment Setup Reference
This project uses a local virtual environment (.venv) for reproducibility.
Create the environment manually:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Render the book:

```bash
quarto render
```

Or use the helper script:

```bash
bash scripts/bash/build.sh
```

Quarto Workflow Summary
The workflow for this guide is:
```
Python code
    ↓
Quarto chapter (.qmd)
    ↓
quarto render
    ↓
docs/ (static site)
```
All figures and outputs are generated during rendering.
Reusable Pandas Patterns
Select columns:

```python
df[["col1", "col2"]]
```

Filter rows:

```python
df[df["col"] > value]
```

Group and aggregate:

```python
df.groupby("group_col").agg(
    metric=("value_col", "mean")
)
```

Handle missing values:

```python
df["col"] = df["col"].fillna(df["col"].median())
```

Convert dtype:

```python
df["category_col"] = df["category_col"].astype("category")
```

Reproducibility Checklist
Before finalizing any chapter:
- confirm no missing values remain (unless justified)
- confirm no unintended duplicates
- confirm correct data types
- confirm figures render correctly
- rebuild the book (`quarto render`)
- open `docs/index.html` to verify output
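The data-quality items on this checklist can also be verified programmatically before a render. A minimal sketch, using a hypothetical DataFrame standing in for a chapter's cleaned dataset:

```python
import pandas as pd

# Hypothetical example table; a real chapter would load its cleaned dataset here.
df = pd.DataFrame({
    "species": pd.Series(["setosa", "versicolor", "setosa"], dtype="category"),
    "sepal_length": [5.1, 7.0, 4.9],
})

# Confirm no missing values remain.
assert df.notna().all().all(), "unexpected missing values"

# Confirm no unintended duplicate rows.
assert not df.duplicated().any(), "unexpected duplicate rows"

# Confirm the expected data types.
assert df["species"].dtype == "category"
assert df["sepal_length"].dtype == "float64"

print("checklist passed")
```

Running such checks at the top of a chapter turns the checklist into assertions that fail the render instead of silently producing wrong figures.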
Data Sources
- Iris dataset (Fisher 1936)
Software and Tools
This guide uses a small set of widely adopted tools for data analysis and visualization:
- Python (Python Software Foundation 2024) — general-purpose programming language for data analysis
- pandas (McKinney et al. 2010) — data manipulation and tabular data handling
- NumPy (Harris et al. 2020) — numerical computing and array operations
- matplotlib (Hunter 2007) — foundational plotting library
- seaborn (Waskom 2021) — statistical data visualization built on matplotlib
These tools form a standard ecosystem for reproducible data science workflows.
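Recording the exact versions of these packages alongside a render helps keep results reproducible. A minimal sketch using only the standard library:

```python
from importlib.metadata import version, PackageNotFoundError

# The packages from the ecosystem described above.
packages = ["pandas", "numpy", "matplotlib", "seaborn"]

# Collect installed versions, noting any package that is absent.
versions = {}
for name in packages:
    try:
        versions[name] = version(name)
    except PackageNotFoundError:
        versions[name] = "not installed"

# Print in a requirements.txt-style format.
for name, ver in versions.items():
    print(f"{name}=={ver}")
```

Pinning the reported versions in `requirements.txt` gives a future render the same inputs as the original one.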
Closing Note
The goal of this Foundations Track is not just to teach syntax.
It is to teach structure, discipline, and reproducibility.
The same workflow you used here can be applied to:
- new datasets
- new domains
- larger analytical projects
- future CDI guides