Dask Hack: Distribute Auxiliary Files Across a Cluster

When working with Dask clusters, you may want to distribute auxiliary files across your cluster to make them available to all your workers. In this Towards Data Science post, I discuss a sweet Dask Hack (the once_per_worker utility) that allows you to do exactly this.

once_per_worker is a utility for creating dask.delayed objects around functions that you want to run at most once per distributed worker. This is useful when you have some large data baked into your Docker image and need to use that data as auxiliary input to another Dask operation (df.map_partitions, for example). Rather than transferring the serialised data between workers in the cluster, which would be slow given the size of the data, once_per_worker lets you call the parsing function once per worker and then reuse the same parsed object downstream.
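To make the idea concrete, here is a minimal sketch of the same pattern, not the once_per_worker utility's actual API: a parsing helper cached per worker process, reused inside map_partitions. The file path, column names, and dataset URL are hypothetical, and the sketch assumes the helper lives in a module importable on every worker (for example, baked into the same Docker image as the auxiliary data), so the cache survives across tasks on that worker.

```python
import functools

import pandas as pd
import dask.dataframe as dd


@functools.lru_cache(maxsize=1)
def load_lookup_table(path="/data/lookup.parquet"):
    # Hypothetical auxiliary file baked into the Docker image.
    # lru_cache means the expensive parse runs at most once per worker process.
    return pd.read_parquet(path)


def enrich(partition: pd.DataFrame) -> pd.DataFrame:
    # Reuses the cached object on whichever worker runs this partition,
    # so the parsed data is never shipped between workers.
    lookup = load_lookup_table()
    return partition.merge(lookup, on="key", how="left")


df = dd.read_parquet("s3://bucket/big-dataset/")  # hypothetical input data
result = df.map_partitions(enrich)
```

The once_per_worker utility itself goes further by tying the delayed objects to specific workers via the scheduler, but the goal is the same: parse once per worker, share the result across that worker's tasks.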

Shoutout to Gabe Joseph for creating this sweet Dask Hack!
