PyData Global Talk: Computations as Assets - a New Approach to Reproducibility and Transparency

PyData Global 2021

Link to the recording of the talk on YouTube.
Link to the 60fps Backblaze data video on YouTube.
Link to the 60fps New York taxi data video on YouTube.

The ExAx open source project from eBay provides reproducibility, transparency, and fast parallel processing in Python. This talk will show how reproducibility by design actually leads to a simpler and faster development process. To make our examples more interesting, we will use relatively large datasets such as those from NYC Taxi and Backblaze.

The ExAx project is designed for reproducibility and transparency. It treats computations as assets: each computed result is tagged with links to its input data and source code and stored permanently on disk, where it can easily be looked up and retrieved later. In addition, ExAx provides a simple way to process large datasets in parallel in Python on a single computer.
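
The idea of treating a computation as an asset can be illustrated with a small standard-library sketch. This is not ExAx's API or implementation; the names run_as_asset, count_rows, and result.json are hypothetical, and hashing the function's bytecode is only a simple stand-in for linking to the source code. The sketch stores each result in a directory named by a hash of the code and the input data, so an identical computation is looked up instead of being run again.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def count_rows(path):
    """Example computation (hypothetical): count the lines in a text file."""
    return len(Path(path).read_text().splitlines())


def run_as_asset(func, input_path, workdir):
    """Run func(input_path) at most once per (code, input data) pair.

    Returns (result, job_dir, reused). The result is stored on disk together
    with metadata that ties it to the code and the input that produced it.
    """
    meta = {
        "function": func.__name__,
        # Assumption: the bytecode hash stands in for a hash of the source file.
        "code_sha256": hashlib.sha256(func.__code__.co_code).hexdigest(),
        "input_sha256": hashlib.sha256(Path(input_path).read_bytes()).hexdigest(),
    }
    # The job id is derived only from the metadata, so it is reproducible.
    job_id = hashlib.sha256(json.dumps(meta, sort_keys=True).encode()).hexdigest()[:16]
    job_dir = Path(workdir) / job_id
    result_file = job_dir / "result.json"

    if result_file.exists():
        # Same code and same input seen before: retrieve the stored result.
        return json.loads(result_file.read_text())["result"], job_dir, True

    result = func(input_path)
    job_dir.mkdir(parents=True)
    result_file.write_text(json.dumps({"meta": meta, "result": result}))
    return result, job_dir, False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "rows.txt"
        data.write_text("a\nb\nc\n")
        jobs = Path(tmp) / "jobs"
        print(run_as_asset(count_rows, data, jobs))  # computed
        print(run_as_asset(count_rows, data, jobs))  # retrieved from disk
```

Because the directory name depends only on the code and the data, changing either one produces a new job directory, while the earlier result stays on disk and remains retrievable.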

ExAx runs on anything from laptops to rack servers, can be used in a production environment with multiple users, and easily handles datasets with tens of billions of rows.

Link to PyData Global
Link to the corresponding source code on GitHub

Additional Resources

The Accelerator’s Homepage (exax.org)
The Accelerator on GitHub/exaxorg
The Accelerator on PyPI
Reference Manual