Using Grid Search to Optimise CatBoost Parameters

CatBoost is a gradient boosting library that was released by Yandex. In the benchmarks Yandex provides, CatBoost outperforms XGBoost and LightGBM. Seeing as XGBoost is used by many Kaggle competition winners, it is worth having a look at CatBoost!

Contents

A quick example
An Intro to Gradient Boosting
Parameters to tune for Classification
Parameter Search
Preventing Overfitting
CatBoost Ensembles

A quick example

To start, we can install it using: pip install catboost. I had no trouble with this on Windows 10 / Python 3.5; everything just worked. The interface to CatBoost is basically the same as most sklearn classifiers, so if you've used sklearn you'll have no trouble with CatBoost. CatBoost can handle missing values as well as categorical features; you just have to tell the classifier which columns are the categorical ones.

Let's do a quick experiment on the UCI Adult dataset. This dataset has around 32,000 training samples and 16,000 test samples. There are 14 features, a mix of categorical and continuous, with some missing values present. We'll use pandas to quickly parse the csv files (to look at a copy of the code used in this tutorial, see cb_adult.py):

import pandas
import numpy as np
import catboost as cb

# read in the train and test data from csv files
colnames = ['age','wc','fnlwgt','ed','ednum','ms','occ','rel','race','sex','cgain','closs','hpw','nc','label']
train_set = pandas.read_csv("adult.data.txt",header=None,names=colnames,na_values='?')
test_set = pandas.read_csv("adult.test.txt",header=None,names=colnames,na_values='?',skiprows=[0])

# convert categorical columns to integers
category_cols = ['wc','ed','ms','occ','rel','race','sex','nc','label']
for header in category_cols:
    train_set[header] = train_set[header].astype('category').cat.codes
    test_set[header] = test_set[header].astype('category').cat.codes

# split labels out of data sets    
train_label = train_set['label']
train_set = train_set.drop('label', axis=1) # remove labels
test_label = test_set['label']
test_set = test_set.drop('label', axis=1) # remove labels

# train default classifier    
clf = cb.CatBoostClassifier()
cat_dims = [train_set.columns.get_loc(i) for i in category_cols[:-1]] 
clf.fit(train_set, np.ravel(train_label), cat_features=cat_dims)
res = clf.predict(test_set)
print('error:',1-np.mean(res==np.ravel(test_label)))

To run, use python cb_adult.py. I get an error of around 12.91% averaged over 20 runs. This is better than all the sample classification results listed for the dataset, the best of which is naive Bayes at 14% error. Not bad considering we haven't optimised anything yet!

In the dataset, the categorical features are all represented as strings. We have to convert them to integers for CatBoost to use them, and pandas makes this pretty easy. CatBoost doesn't seem to be able to handle missing values in categorical columns itself; pandas fixes this by effectively making 'missing' its own category. Missing values in continuous columns are handled fine by CatBoost.
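As an aside, cat.codes maps missing values to the code -1, so NaNs in a categorical column effectively become their own category. An explicit fill before encoding works just as well; the column name below is only an example:

# fill missing categorical values explicitly before encoding ('wc' is just an example column)
train_set['wc'] = train_set['wc'].fillna('missing').astype('category').cat.codes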

An Intro to Gradient Boosting

CatBoost is a gradient boosting library, so in this section I'll give a short description of how gradient boosting works (see here for some more info).

'Gradient boosting' comes from the idea of 'boosting', or improving weak models by combining them with many other weak models to create a strong model. Gradient boosting is an extension of boosting where the process of additively generating weak models is formalised as a gradient descent algorithm over an objective function. Gradient boosting is a supervised learning method, which means that it takes a set of labelled training instances as input and builds a model that tries to correctly predict the label of new, unseen examples based on the features provided.

Many different types of models can be used for gradient boosting, but in practice decision trees are almost always used. To begin training, a single decision tree is built to predict the label. This first decision tree will predict some instances correctly but will fail on others. Subtracting the predicted label (\(\hat{y_i}\)) from the true label (\(y_i\)) shows whether the prediction is an underestimate or an overestimate. This is called the residual and is denoted as \(r_i\):

\(r_i = y_i - \hat{y_i}\)

To improve the model, we can build another decision tree, but this time try to predict the residuals instead of the original labels. This can be thought of as building another model to correct for the errors of the current model. After adding the new tree to the model, we make new predictions and then calculate residuals again. To make predictions with multiple trees, simply pass the given instance through every tree and sum up the predictions from each tree. By building predictors that estimate the residuals, we are actually performing gradient descent on the squared error between the real and predicted labels, since the residual is the negative of its gradient:

\(SSE(y, \hat{y}) = \frac{1}{2} \sum_{i}(y_i - \hat{y_i})^2\)

\(\frac{\partial\, SSE(y, \hat{y})}{\partial \hat{y_i}} = -(y_i - \hat{y_i})\)

If we don't want to use squared error, we can use some other differentiable loss such as cross entropy and predict its residuals instead. This covers the basics of gradient boosting, but there are extra terms for e.g. regularisation. You can see how XGBoost does it here.
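To make the residual-fitting loop concrete, here is a minimal sketch using sklearn's DecisionTreeRegressor as the weak learner and squared error as the loss. This is only an illustration of the general idea, not how CatBoost is implemented internally; all names and values are made up for the example.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # start from a zero prediction (a constant such as the mean is also common)
    pred = np.zeros(len(y))
    trees = []
    for _ in range(n_trees):
        residual = y - pred                      # negative gradient of the squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                    # fit the next tree to the residuals
        pred += learning_rate * tree.predict(X)  # take a small step in that direction
        trees.append(tree)
    return trees

def gradient_boost_predict(trees, X, learning_rate=0.1):
    # the model's prediction is the (scaled) sum of every tree's prediction
    return sum(learning_rate * tree.predict(X) for tree in trees)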

Parameters to tune for Classification

In the previous section I outlined the basics of how gradient boosting works; in this section I'll cover some of the knobs that CatBoost provides to tune our predictions. There are more parameters that you can find in the docs, but we don't need to tune them if all we're interested in is predictive performance.

argument description
iterations=500 The maximum number of trees that can be built when solving machine learning problems. Fewer may be used.
learning_rate=0.03 Used for reducing the gradient step. It affects the overall training time: the smaller the value, the more iterations are required.
depth=6 Depth of the tree. Can be any integer up to 32; good values are in the range 1 - 10.
l2_leaf_reg=3 The L2 regularisation coefficient. Any positive value is allowed; try different values to find the best one.
loss_function='Logloss' For 2-class classification use 'Logloss' or 'CrossEntropy'. For multiclass use 'MultiClass'.
border_count=32 The number of splits for numerical features. Allowed values are integers from 1 to 255 inclusive.
ctr_border_count=50 The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusive.

There are a couple more esoteric options, but the defaults work pretty well.
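For reference, all of these are passed as keyword arguments when constructing the classifier. The values below are just the defaults from the table above:

# constructing a classifier with explicit parameter values (the table defaults)
clf = cb.CatBoostClassifier(iterations=500,
                            learning_rate=0.03,
                            depth=6,
                            l2_leaf_reg=3,
                            loss_function='Logloss',
                            border_count=32,
                            ctr_border_count=50)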

Parameter Search

This step is usually known as grid search: basically, looking for the parameter values that score the best. An important thing to remember is that we can't use the test set to tune parameters, otherwise we'll overfit to the test set. We need to do cross-validation on the training set (or ideally use a separate validation set), without looking at the test set until the very final accuracy calculation.

We won't be doing a full grid search here; there are simply too many possibilities to try every parameter combination. Instead, we'll do grid search on small groups of parameters. This is more of a local search than a full grid search.

We start with the default parameter settings, as these are pretty good, and all the other testing we do will just modify the defaults. First we'll optimise border_count, ctr_border_count and l2_leaf_reg independently of everything else. We remember the best settings; then, because iterations and learning_rate are tightly coupled, we'll grid search them together. Finally, we'll find the best depth. Because we didn't do a full grid search we may not end up with the optimal settings, but we should be pretty close.

Since these steps are pretty much always the same for every problem you'll come across, I wrote some code that automates this process. Below is the code from above, modified to do parameter tuning using paramsearch.py; the rest of the code is in cb_adult.py. This first bit is basically the same as the code above: it just reads the datasets, converts the categorical features to integers and extracts the labels.

import pandas
import numpy as np
import catboost as cb
from sklearn.model_selection import KFold
from paramsearch import paramsearch
from itertools import product,chain

# read in the train and test data from csv files
colnames = ['age','wc','fnlwgt','ed','ednum','ms','occ','rel','race','sex','cgain','closs','hpw','nc','label']
train_set = pandas.read_csv("adult.data.txt",header=None,names=colnames,na_values='?')
test_set = pandas.read_csv("adult.test.txt",header=None,names=colnames,na_values='?',skiprows=[0])

# convert categorical columns to integers
category_cols = ['wc','ed','ms','occ','rel','race','sex','nc','label']
cat_dims = [train_set.columns.get_loc(i) for i in category_cols[:-1]] 
for header in category_cols:
    train_set[header] = train_set[header].astype('category').cat.codes
    test_set[header] = test_set[header].astype('category').cat.codes

# split labels out of data sets    
train_label = train_set['label']
train_set = train_set.drop('label', axis=1)
test_label = test_set['label']
test_set = test_set.drop('label', axis=1)

So far the code is the same as the previous block, but from here it differs. We first specify the parameters we want to do grid search over, along with the values we want to try.

params = {'depth':[3,1,2,6,4,5,7,8,9,10],
          'iterations':[250,100,500,1000],
          'learning_rate':[0.03,0.001,0.01,0.1,0.2,0.3], 
          'l2_leaf_reg':[3,1,5,10,100],
          'border_count':[32,5,10,20,50,100,200],
          'ctr_border_count':[50,5,10,20,100,200],
          'thread_count':4}

Now we define a function for doing cross-validation. This takes a specific set of parameters, does n-fold cross-validation on the training set and returns the mean accuracy across the folds.

# this function does 3-fold crossvalidation with catboostclassifier          
def crossvaltest(params,train_set,train_label,cat_dims,n_splits=3):
    kf = KFold(n_splits=n_splits,shuffle=True) 
    res = []
    for train_index, test_index in kf.split(train_set):
        train = train_set.iloc[train_index,:]
        test = train_set.iloc[test_index,:]

        labels = train_label.iloc[train_index]
        test_labels = train_label.iloc[test_index]

        clf = cb.CatBoostClassifier(**params)
        clf.fit(train, np.ravel(labels), cat_features=cat_dims)

        res.append(np.mean(clf.predict(test)==np.ravel(test_labels)))
    return np.mean(res)

This is the bit that actually calls the grid search function. We use chain from itertools to combine several iterators into a single iterator. We first search border_count while leaving everything else at its default. Then we search ctr_border_count using the best of the parameters we found previously. When we get to iterations and learning_rate we grid search them together (testing all possible combinations of the two). After that we find the best depth. Once we have tested all these parameter combinations, we just have to call CatBoostClassifier with the best one.

# this function runs grid search on several parameters
def catboost_param_tune(params,train_set,train_label,cat_dims=None,n_splits=3):
    ps = paramsearch(params)
    # search 'border_count', 'l2_leaf_reg' etc. individually 
    #   but 'iterations','learning_rate' together
    for prms in chain(ps.grid_search(['border_count']),
                      ps.grid_search(['ctr_border_count']),
                      ps.grid_search(['l2_leaf_reg']),
                      ps.grid_search(['iterations','learning_rate']),
                      ps.grid_search(['depth'])):
        res = crossvaltest(prms,train_set,train_label,cat_dims,n_splits)
        # save the crossvalidation result so that future iterations can reuse the best parameters
        ps.register_result(res,prms)
        print(res,prms,'best:',ps.bestscore(),ps.bestparam())
    return ps.bestparam()

bestparams = catboost_param_tune(params,train_set,train_label,cat_dims)

Now we just have to call CatBoost with the best parameters we have found:

# train classifier with tuned parameters    
clf = cb.CatBoostClassifier(**bestparams)
clf.fit(train_set, np.ravel(train_label), cat_features=cat_dims)
res = clf.predict(test_set)
print('error:',1-np.mean(res==np.ravel(test_label)))

Our final score after tuning the parameters is actually the same as before tuning! Nothing I did seemed to improve on the defaults. This speaks to the quality of the CatBoost classifier, and shows that the defaults are well chosen (at least for this problem). It might pay to increase n_splits in the cross-validation to reduce the noise you get just from running the classifier multiple times, but then the grid search will take much longer. If you want to test more parameters or different combinations, it is easy enough to modify the code; the files you'll need are cb_adult.py and paramsearch.py.

Preventing Overfitting

CatBoost provides a nice facility to prevent overfitting. If you set iterations to be high, the classifier will use many trees to build the final model and you risk overfitting. If you set use_best_model=True and eval_metric='Accuracy' when initialising the classifier, and then pass a validation set as eval_set when fitting, CatBoost won't use all the iterations; it will return the model from the iteration that gave the best accuracy on the eval set. This is similar to early stopping in neural networks. If you are having problems with overfitting, it would be a good idea to try this. I didn't see any improvement on this dataset though, probably because there are so many training points that it is hard to overfit.
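Here is a rough sketch of what that looks like, holding out the last 20% of the training data as the evaluation set. The 80/20 split is just for illustration; in practice you would shuffle or use a proper validation split:

# hold out part of the training data as an evaluation set for early stopping
cut = int(0.8 * len(train_set))
clf = cb.CatBoostClassifier(iterations=1000,
                            use_best_model=True,
                            eval_metric='Accuracy')
clf.fit(train_set.iloc[:cut], np.ravel(train_label)[:cut],
        cat_features=cat_dims,
        eval_set=(train_set.iloc[cut:], np.ravel(train_label)[cut:]))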

CatBoost Ensembles

An ensemble is a classifier built by combining many instances of some base classifier (or possibly different types of classifiers). For CatBoost this would mean running CatBoostClassifier e.g. 10 times and taking the most common prediction from the 10 classifiers as the final class label. Generally you want some variation between the classifiers that make up the ensemble; the best results come when the errors that each classifier makes are uncorrelated, which happens when the classifiers are as different from each other as possible.

We can get a bit of diversity by using CatBoost with different parameters. During the grid search procedure we saved all the parameters we tested along with their scores, so getting the 10 best parameter combinations is easy. Once we have these top 10, we just build a classifier with each of them and take the mode of the results. For this particular problem, I found that combining 10 poor parameter settings gave a good improvement over the poor individual results, but ensembling didn't seem to help much with the tuned settings. Since most Kaggle competitions are won by ensembling, there is obviously a benefit to doing it in general.
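A minimal sketch of such an ensemble is below. How you pull the 10 best parameter dictionaries out of paramsearch depends on the version you're using, so the top_params argument here is just a hypothetical list of parameter dicts:

from scipy import stats

def ensemble_predict(top_params, train_set, train_label, test_set, cat_dims):
    # train one classifier per parameter setting and collect their predictions
    preds = []
    for prms in top_params:
        clf = cb.CatBoostClassifier(**prms)
        clf.fit(train_set, np.ravel(train_label), cat_features=cat_dims)
        preds.append(np.ravel(clf.predict(test_set)))
    # take the most common prediction across classifiers for each test instance
    return np.ravel(stats.mode(np.array(preds), axis=0)[0])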
