ETL with Dask

This website documents how to use Dask and related Python technologies to build ETL pipelines that serve recurring, large-scale production systems, including tasks like …

  • Data pre-processing

  • Large-scale queries

  • Model training

This documentation covers common problems and questions faced in building these systems, including pragmatic questions like …

  • How do I wrangle 20 TiB datasets?

  • How do I wrangle them every day?

  • How do I deploy this in the cloud?

  • How do I push changes to my pipeline with a CI/CD pipeline?

  • How do I anticipate costs?

  • How do I access logs?

Along the way, this website points to common tooling in the Python ecosystem.

Reference Architecture

This documentation also includes a self-contained reference architecture that implements these patterns and can easily be copied, run, and adapted to different applications. We refer to this worked example throughout. The reference architecture uses tools like …

  • Dask

  • Prefect

  • Coiled

  • Delta Lake

  • Streamlit

  • FastAPI

To get started, take a look at the reference architecture or investigate the table of contents to the left.