When working with Dask clusters, you may want to distribute auxiliary files across your cluster to make them available to all your workers. In this Towards Data Science post, I discuss a sweet Dask Hack (the `once_per_worker` utility) that allows you to do exactly this.
`once_per_worker` is a utility for creating `dask.delayed` objects around functions that you only ever want to run once per distributed worker. This is useful when you have some large data baked into your Docker image and need to use that data as auxiliary input to another Dask operation (`df.map_partitions`, for example). Rather than transferring the serialised data between workers in the cluster, which would be slow given the size of the data, `once_per_worker` lets you call the parsing function once per worker and then reuse the same parsed object downstream.
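To make the "parse once per worker, reuse per partition" idea concrete, here is a minimal sketch. It is not the `once_per_worker` implementation itself; it approximates a similar effect with a per-process cache (`functools.lru_cache`), so each worker parses the baked-in file at most once. The file path, column names, and the `load_lookup_table` / `enrich_partition` helpers are all hypothetical, and the sketch assumes these functions live in a module importable on the workers (e.g. installed in the Docker image) so the cache persists for the life of each worker process.

```python
import functools

import dask.dataframe as dd
import pandas as pd


@functools.lru_cache(maxsize=1)
def load_lookup_table(path: str = "/opt/data/lookup.parquet") -> pd.DataFrame:
    # Hypothetical parse of the large data baked into the Docker image.
    # The cache lives inside the worker process, so this runs at most
    # once per worker rather than once per partition.
    return pd.read_parquet(path)


def enrich_partition(partition: pd.DataFrame) -> pd.DataFrame:
    # Executed on a worker by map_partitions; after the first call on a
    # given worker, load_lookup_table() returns the cached object.
    lookup = load_lookup_table()
    return partition.merge(lookup, on="key", how="left")


ddf = dd.read_parquet("s3://my-bucket/input/")  # hypothetical input data

# Passing meta explicitly (hypothetical columns here) avoids Dask calling
# enrich_partition on the client, where the baked-in file may not exist,
# just to infer the output schema.
result = ddf.map_partitions(
    enrich_partition,
    meta={"key": "int64", "value": "float64", "extra": "float64"},
).compute()
```

The real `once_per_worker` utility goes further: by producing `dask.delayed` objects, the once-per-worker result can feed into arbitrary parts of the task graph, whereas the caching trick above only helps when the call happens inside the task itself.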

Shoutout to Gabe Joseph for creating this sweet Dask Hack!