Lesson 1 Setting Up Your Data Analysis Environment

This project is purely Python-based, unless explicitly stated otherwise.

Before we begin exploring data, you need a simple and reliable workspace so you can run code smoothly and follow every lesson without issues.

1.2 Installing Required Libraries

Run the following command in a Jupyter Notebook cell.

!pip install pandas numpy matplotlib seaborn

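Note:
If you are using Anaconda, these libraries are usually preinstalled.
The command above installs any libraries that are missing. It does not upgrade libraries that are already installed; add the --upgrade flag if you want the latest versions.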
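
To confirm what is actually installed, a check along the following lines may help. This is a minimal sketch, not part of the course material: the helper name installed_versions is made up for illustration, and it relies only on importlib.metadata from the Python standard library.

```python
from importlib.metadata import PackageNotFoundError, version

# The four libraries installed by the command above.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing.

    Hypothetical helper for illustration only.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Any library reported as NOT INSTALLED can be installed by re-running the pip command.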

1.3 Practice Dataset

We will use the Iris dataset in the early lessons of this course.

You do not need to download it manually at this stage.

  • Lesson 01 will automatically generate data/iris.csv
  • This lesson focuses only on ensuring your environment is ready

1.4 Test Your Setup

You may run the test below after completing Lesson 01, once the dataset exists.

import pandas as pd

df = pd.read_csv('data/iris.csv')
print(df.head())

Output:

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

1.5 Troubleshooting

If something does not work as expected:

  • Restart your notebook and try again
  • Confirm the file path is data/iris.csv
  • Ensure the data/ folder exists
  • Re-run the installation command if needed

Setup issues are common, so take your time.
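
The path checks in the list above can be sketched in code. This is only an illustration under stated assumptions: the function name diagnose_dataset is hypothetical, and the default path data/iris.csv is resolved relative to the notebook's current working directory.

```python
from pathlib import Path


def diagnose_dataset(csv_path="data/iris.csv"):
    """Return a list of problems found with csv_path; an empty list means none.

    Hypothetical helper for illustration only.
    """
    path = Path(csv_path)
    problems = []
    if not path.parent.is_dir():
        # The containing folder (data/ by default) does not exist.
        problems.append(f"Folder not found: {path.parent.resolve()}")
    elif not path.is_file():
        # The folder exists, but the CSV file is not inside it.
        problems.append(f"File not found: {path.resolve()}")
    return problems


# Relative paths are resolved against this directory.
print("Working directory:", Path.cwd())
for message in diagnose_dataset() or ["data/iris.csv was found."]:
    print(message)
```

If the working directory is not the one you expected, the notebook was probably started from a different folder, which would explain why a relative path such as data/iris.csv cannot be found.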

1.6 Exercise

  • Confirm that Jupyter Notebook opens successfully
  • Run the library installation command without errors
  • Verify that you can run a simple Python cell
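
For the last item, any short cell will do. The one below is just one possible example: it uses only the standard library, reports which Python interpreter the notebook kernel is running, and performs a small computation so you can see output appear.

```python
import sys

# Which Python is the notebook kernel running?
print("Python version:", sys.version.split()[0])
print("Interpreter path:", sys.executable)

# A tiny computation to confirm that cells execute and display output.
squares = [n * n for n in range(5)]
print("Squares:", squares)
```

If the cell prints three lines without an error, the notebook can execute Python code.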

1.7 Summary

  • You set up a Python-based data analysis environment
  • You installed essential data science libraries
  • You verified that your notebook can execute Python code
  • You prepared your system for working with real datasets
