For more background on trees, see the previous posts on the Classification And Regression Tree (CART) and on ensembles of decision trees.

In this post, I will show why we use the residual as the training target for each subtree in GBDT, whether the problem is regression or classification.

Suppose the GBDT model is composed of $K$ trees, with raw score

$$F(x) = \sum_{k=1}^{K} f_k(x).$$

  • For a regression problem, we have $\hat{y} = F(x)$.
  • For a binary classification problem, we have $\hat{y} = \sigma\!\left(F(x)\right) = \dfrac{1}{1 + e^{-F(x)}}$.
  • For a multi-class classification problem, we train one ensemble $F_c$ per class $c$ and have $\hat{y}_c = \dfrac{e^{F_c(x)}}{\sum_{c'} e^{F_{c'}(x)}}$.
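
To make the three output mappings concrete, here is a minimal sketch in Python; the toy subtrees and per-class scores are hypothetical stand-ins rather than fitted models:

```python
import numpy as np

# Hypothetical stand-ins for fitted subtrees: each maps a feature vector x
# to a scalar score, and the ensemble's raw score is their sum.
trees = [lambda x: 0.5 * x[0], lambda x: -0.2 * x[1], lambda x: 0.1]

def F(x):
    # F(x) = sum_k f_k(x)
    return sum(f(x) for f in trees)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

x = np.array([1.0, 2.0])
print(F(x))                                # regression: y_hat = F(x)
print(sigmoid(F(x)))                       # binary: y_hat = sigmoid(F(x))
F_per_class = np.array([F(x), 0.4, -0.3])  # multi-class: one ensemble per class
print(softmax(F_per_class))                # y_hat_c = softmax(F_c(x))
```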

And the loss is

$$L = \sum_i \ell\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),$$

where $\Omega(f_k)$ is the regularization of each subtree.

Before we show how to optimize the loss, let’s first decompose the GBDT additively:

$$F^{(t)}(x) = \sum_{k=1}^{t} f_k(x) = F^{(t-1)}(x) + f_t(x).$$

The objective function of the $t$-th tree is

$$L^{(t)} = \sum_i \ell\!\left(y_i, F^{(t-1)}(x_i) + f_t(x_i)\right) + \Omega(f_t),$$

where the loss $\ell$ is written as a function of the raw score, with the link function (sigmoid or softmax) for classification absorbed into $\ell$.

According to the second-order Taylor polynomial $f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$, we can rewrite the loss function as

$$L^{(t)} \approx \sum_i \left[ \ell\!\left(y_i, F^{(t-1)}(x_i)\right) + g_i\, f_t(x_i) + \frac{1}{2}\, h_i\, f_t^2(x_i) \right] + \Omega(f_t),$$

where $g_i = \dfrac{\partial\, \ell\left(y_i, F^{(t-1)}(x_i)\right)}{\partial F^{(t-1)}(x_i)}$, and

$$h_i = \dfrac{\partial^2\, \ell\left(y_i, F^{(t-1)}(x_i)\right)}{\partial \left(F^{(t-1)}(x_i)\right)^2}.$$
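
As a quick sanity check of these coefficients, the sketch below compares $g_i$ and $h_i$ computed by finite differences against their analytic values, using the squared loss as a concrete example (the sample values are made up):

```python
import numpy as np

# Sanity check of g_i and h_i by finite differences, using the squared
# loss l(y, F) = 0.5 * (y - F)**2 as a concrete differentiable example.
# Analytically, g = F - y and h = 1.
def loss(y, F):
    return 0.5 * (y - F) ** 2

y, F, eps = 1.3, 0.4, 1e-5
g_num = (loss(y, F + eps) - loss(y, F - eps)) / (2 * eps)
h_num = (loss(y, F + eps) - 2 * loss(y, F) + loss(y, F - eps)) / eps ** 2

print(g_num, F - y)  # both approx -0.9
print(h_num, 1.0)    # both approx 1.0
```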

To optimise the loss function, we follow the gradient descent method in function space, and thus get

$$f_t(x_i) \approx -\eta\, g_i,$$

i.e. each subtree is trained to predict the negative gradient of the loss with respect to the raw score of the previous ensemble, scaled by the learning rate $\eta$.
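
To see the procedure end to end, here is a minimal gradient-boosting sketch for regression, fitting each subtree to the negative gradient; the data, learning rate, tree depth, and ensemble size are hypothetical choices, with scikit-learn's `DecisionTreeRegressor` standing in for the subtree learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

eta, n_trees = 0.1, 50        # hypothetical learning rate and ensemble size
F = np.zeros_like(y)          # F^(0): start from a zero ensemble
trees = []
for t in range(n_trees):
    neg_grad = y - F          # -g_i for MSE: exactly the residual
    tree = DecisionTreeRegressor(max_depth=3).fit(X, neg_grad)
    trees.append(tree)
    F += eta * tree.predict(X)  # F^(t) = F^(t-1) + eta * f_t

print("train MSE:", np.mean((y - F) ** 2))
```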

For a regression problem with MSE, $\ell\left(y_i, F(x_i)\right) = \frac{1}{2}\left(y_i - F(x_i)\right)^2$, we have

$$-g_i = -\frac{\partial\, \ell}{\partial F^{(t-1)}(x_i)} = y_i - F^{(t-1)}(x_i).$$
For a binary classification problem with logloss, $\ell = -\left[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\right]$ with $p_i = \sigma\!\left(F^{(t-1)}(x_i)\right)$, we have, using $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$,

$$-g_i = -\frac{\partial\, \ell}{\partial F^{(t-1)}(x_i)} = y_i - p_i.$$
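
A quick numerical check of this identity (the label and raw score below are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logloss(y, F):
    p = sigmoid(F)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y, F, eps = 1.0, 0.7, 1e-5      # made-up label and raw score
g_num = (logloss(y, F + eps) - logloss(y, F - eps)) / (2 * eps)
print(-g_num, y - sigmoid(F))   # both approx 0.332: the residual y - p
```
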
For a multi-class classification problem with logloss (softmax cross-entropy), $\ell = -\sum_c y_{i,c} \log p_{i,c}$ with $p_{i,c} = \dfrac{e^{F_c^{(t-1)}(x_i)}}{\sum_{c'} e^{F_{c'}^{(t-1)}(x_i)}}$, we have, for the ensemble of class $c$,

$$-g_{i,c} = -\frac{\partial\, \ell}{\partial F_c^{(t-1)}(x_i)} = y_{i,c} - p_{i,c}.$$
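
And the same finite-difference check per class for the softmax case (again with a made-up score vector and one-hot label):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(y_onehot, F):
    return -np.sum(y_onehot * np.log(softmax(F)))

F = np.array([0.2, -0.5, 1.1])   # made-up per-class raw scores F_c(x_i)
y = np.array([0.0, 1.0, 0.0])    # one-hot label: true class is c = 1
eps = 1e-5
g_num = np.array([
    (ce_loss(y, F + eps * np.eye(3)[c]) - ce_loss(y, F - eps * np.eye(3)[c]))
    / (2 * eps)
    for c in range(3)
])
print(-g_num)           # approx y - softmax(F)
print(y - softmax(F))   # the per-class residual y_c - p_c
```
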
As shown above, in all three cases the learning target for each subtree is the residual between the label and the prediction of the previous ensemble (for classification, the predicted probability).