Train XGBoost on 20GB data in 20 seconds

If you’re looking for ways to train XGBoost models faster or on datasets larger than your machine’s memory, here’s a quick hack you might appreciate. It’ll let you train XGBoost in parallel in the cloud for free.

Perfect for when you’re competing in a Kaggle competition, for example, and need to iterate at the speed of light.

Train your Kaggle XGBoost Model in the Cloud

Yes, I work for the company that makes this hack possible so I may be a little biased… But this is a seriously sweet perk of Coiled’s free tier. And to be honest, I’m not sure my colleagues in Sales will thank me for showing you how to milk the free tier for all it’s worth 😉

Here are the facts:

You can train an XGBoost model on 20GB of data in less than 20 seconds.

You can run that training 18 times a month totally for free.

All you need is your GitHub credentials to log in.

Create a Coiled account with your GitHub credentials. Then follow this notebook to launch your distributed XGBoost model training in the cloud. I’ll walk you through the notebook below.

Parallel XGBoost with Dask

First, spin up the Coiled cluster. This is basically a group of computers (25 in this case) that you’re renting from a cloud provider through Coiled, and you’ll be using them to train your distributed XGBoost model in parallel. Each computer (“worker”) trains XGBoost on a portion of the data, and Dask coordinates the workers and combines their work into a single trained model.

import coiled

cluster = coiled.Cluster(
    name="xgboost",
    software="coiled-examples/xgboost",
    n_workers=25,
    worker_cpu=4,
    worker_memory="16GiB",
    backend_options={"spot": True},
)
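
The xgb.dask calls further down need a Dask client connected to this cluster (the notebook presumably creates one right after the cluster is up). A minimal sketch:

from dask.distributed import Client

# connect a Dask client to the Coiled cluster; this `client` object is what
# the xgb.dask.* calls use to submit work to the workers
client = Client(cluster)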

Then, load the dataset into a Dask DataFrame. In this case, the dataset is stored in the efficient Parquet file format in a public Amazon S3 bucket.

import dask.dataframe as dd

# download data from S3
# (`columns` and `categorical` are lists of column names defined earlier in the notebook)
data = dd.read_parquet(
    "s3://coiled-datasets/dea-opioid/arcos_washpost_comp.parquet",
    compression="lz4",
    storage_options={"anon": True, "use_ssl": True},
    columns=columns + categorical,
)
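
If you want to double-check what you just loaded, Dask can tell you the partition count and column dtypes without touching the data. An optional sanity check (not part of the original notebook):

# optional sanity check: partition count and dtypes are metadata-only and cheap
print("partitions:", data.npartitions)
print(data.dtypes)

# approximate in-memory size in GB (this one scans the full dataset, so it takes a while)
print(round(data.memory_usage(deep=True).sum().compute() / 1e9, 1))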

The code below conducts some basic preprocessing on our data: filling in missing values and turning the categorical columns into a numerical format that XGBoost can work with:

# dask-ml provides the Dask-aware Categorizer and train_test_split used below
from dask_ml.preprocessing import Categorizer
from dask_ml.model_selection import train_test_split

# fill NaN values
data.BUYER_CITY = data.BUYER_CITY.fillna(value="Unknown")
data.DOSAGE_UNIT = data.DOSAGE_UNIT.fillna(value=0)

# instantiate categorizer
ce = Categorizer(columns=categorical)

# fit categorizer and transform data
data = ce.fit_transform(data)

# replace values in categorical columns with their numerical codes
for col in categorical:
    data[col] = data[col].cat.codes

# rearrange columns
cols = data.columns.to_list()
cols_new = [cols[0]] + cols[2:] + [cols[1]]
data = data[cols_new]

# Create the train-test split
X, y = data.iloc[:, :-1], data["CALC_BASE_WT_IN_GM"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=13
)

# persist the train/test splits to cluster memory to speed up training
# (dask.persist returns new, persisted collections rather than persisting in place)
import dask

X_train, X_test, y_train, y_test = dask.persist(X_train, X_test, y_train, y_test)
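
Persisting happens in the background, so the call above returns right away. If you want to block until the splits are actually sitting in worker memory (handy before timing the training step), you can wait on them. This is optional and not in the original notebook; a minimal sketch:

from dask.distributed import wait

# block until all four persisted collections are fully materialized on the workers
wait([X_train, X_test, y_train, y_test])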

Next, create your XGBoost DMatrices. Note that because we’ll be training in parallel with Dask, we use the xgb.dask.DaskDMatrix() API instead of the regular xgb.DMatrix().

import xgboost as xgb

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
dtest = xgb.dask.DaskDMatrix(client, X_test, y_test)
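
The training call in the next step also expects a params dict of XGBoost hyperparameters, which the notebook defines earlier on. As a rough placeholder (these particular values are my assumption, not necessarily what the notebook uses), it could look something like this:

# hypothetical hyperparameters for a regression objective; tune to taste
params = {
    "objective": "reg:squarederror",
    "max_depth": 8,
    "eta": 0.1,
}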

Now you’re all set to train XGBoost on 20GB in parallel:

%%time 
# train the model 
output = xgb.dask.train(
    client, params, dtrain, num_boost_round=4,
    evals=[(dtrain, 'train')]
)
CPU times: user 938 ms, sys: 221 ms, total: 1.16 s
Wall time: 15.2 s

Even faster than you thought it would be…a mere 15.2 seconds!
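
Note that xgb.dask.train() returns a dict containing the trained Booster plus the evaluation history, rather than a bare Booster. As a follow-up (not shown in the notebook excerpt above), you can pull those out and generate test-set predictions in parallel on the cluster:

# the result is a dict with the trained Booster and the evaluation history
booster = output["booster"]
history = output["history"]
print(history)  # e.g. {'train': {'rmse': [...]}}

# predictions for the held-out test set, computed in parallel on the cluster
y_pred = xgb.dask.predict(client, booster, dtest)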

And heading over to the Coiled Dashboard, we can see that the run cost a total of $0.53, which means you could repeat it up to 18 times a month (roughly $9.50 all told) and still stay within the limits of the Coiled Free Tier.

Kaggle XGBoost Datasets

Below are links to some Big Data Kaggle Datasets that could benefit from running on a Coiled cluster for faster iteration:

Even Larger Datasets

For an example of running XGBoost on even larger datasets (100GB+), check out the tutorial I wrote for the Coiled blog.
