When working with Dask clusters, you may want to distribute auxiliary files across your cluster so that they are available to all of your workers. In this Towards Data Science post, I discuss a sweet Dask hack (the once_per_worker utility) that lets you do exactly that.
once_per_worker is a utility that creates dask.delayed objects around functions you only ever want to run once per distributed worker. This is useful when you have some large data baked into your Docker image and need to use it as auxiliary input to another Dask operation (df.map_partitions, for example). Rather than transfer the serialised data between workers in the cluster, which is slow because of the size of the data, once_per_worker lets you call the parsing function once per worker and then reuse the same parsed object downstream.
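To illustrate the idea (not the actual once_per_worker implementation), here is a minimal sketch of the underlying pattern: cache the result of the expensive parse in each worker process's memory, so that every task running on that worker reuses one parsed object instead of re-parsing. The names (`load_reference_data`, `process_partition`, the path) are illustrative assumptions, and a plain `functools.lru_cache` stands in for the real utility's delayed-object machinery.

```python
import functools

# Counter just to demonstrate that the parse runs once per process.
CALL_COUNT = {"loads": 0}

@functools.lru_cache(maxsize=None)
def load_reference_data(path):
    # Stand-in for an expensive parse of large data baked into the image.
    # Because lru_cache lives in this process's memory, the body runs at
    # most once per worker process for a given path.
    CALL_COUNT["loads"] += 1
    return f"parsed:{path}"

def process_partition(partition, path="/data/reference.json"):
    # Called once per partition (e.g. via df.map_partitions); the parsed
    # object comes from the per-process cache rather than being re-parsed
    # or shipped between workers.
    ref = load_reference_data(path)
    return [(row, ref) for row in partition]

# Simulate five partitions landing on the same worker process:
results = [process_partition([i]) for i in range(5)]
```

Running this, all five partitions see the same parsed object while the parse itself executes only once, which is exactly the saving once_per_worker provides across a real cluster.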
Shoutout to Gabe Joseph for creating this sweet Dask Hack!