For more information about trees, see the previous posts on the Classification And Regression Tree and ensembles of decision trees.
In this post, I will show why we use the residual as the training target for each subtree in GBDT, no matter whether it is a regression problem or a classification problem.
Suppose the GBDT model is composed of $K$ trees $f_1, \dots, f_K$, whose raw scores are summed into $F(x_i) = \sum_{k=1}^{K} f_k(x_i)$. The prediction depends on the task:
- For a regression problem, we have $\hat{y}_i = F(x_i) = \sum_{k=1}^{K} f_k(x_i)$.
- For a binary classification problem, we have $\hat{y}_i = \sigma\big(F(x_i)\big) = \frac{1}{1 + e^{-F(x_i)}}$.
- For a multi-class classification problem with $C$ classes, each class $c$ has its own ensemble $F_c$, and we have $\hat{y}_{i,c} = \frac{e^{F_c(x_i)}}{\sum_{c'=1}^{C} e^{F_{c'}(x_i)}}$.
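To make the composition concrete, here is a minimal sketch (my own illustration, not from the original post) that represents each subtree as a callable returning its raw score and maps the summed scores to predictions for the three tasks:

```python
import numpy as np

def predict(trees, x, task="regression"):
    """Sum the raw scores of all subtrees, then map them to a prediction.

    `trees` is assumed to be a list of callables, each returning the raw
    score f_k(x) of one subtree; for multi-class, each callable is assumed
    to return a vector of per-class scores.
    """
    score = sum(tree(x) for tree in trees)       # F(x) = sum_k f_k(x)
    if task == "regression":
        return score                             # y_hat = F(x)
    if task == "binary":
        return 1.0 / (1.0 + np.exp(-score))      # sigmoid of F(x)
    if task == "multiclass":
        e = np.exp(score - np.max(score))        # numerically stable softmax
        return e / e.sum()
    raise ValueError(f"unknown task: {task}")
```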
And the loss is

$$\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $\Omega(f_k)$ is the regularization of each subtree.
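The derivation below leaves $\Omega$ abstract; for concreteness, one common choice (the one used in XGBoost) penalizes the number of leaves $T$ and the $L_2$ norm of the leaf weights $w$:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$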
Before we show how to optimize the loss, let's first decompose the GBDT additively. Let $F^{(t)}(x_i) = \sum_{k=1}^{t} f_k(x_i)$ be the score after the first $t$ trees, so that

$$F^{(t)}(x_i) = F^{(t-1)}(x_i) + f_t(x_i), \qquad F^{(0)}(x_i) = 0.$$

The trees are trained one at a time: the $t$-th tree is fit while $f_1, \dots, f_{t-1}$ are kept fixed.
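In code, this decomposition is just a loop that keeps a running score and adds one tree per round. A minimal training skeleton (a sketch with my own names, assuming a `fit_one_tree` helper whose job is derived below):

```python
import numpy as np

def fit_gbdt(X, y, n_trees, fit_one_tree):
    """Stagewise training: F^(t) = F^(t-1) + f_t, with earlier trees frozen.

    `fit_one_tree(X, y, score)` is assumed to build the next subtree from
    the current raw scores; how it should do so is derived below.
    """
    score = np.zeros(len(X))                # F^(0)(x_i) = 0
    trees = []
    for _ in range(n_trees):
        tree = fit_one_tree(X, y, score)    # fit f_t with f_1..f_{t-1} fixed
        score = score + tree.predict(X)     # F^(t) = F^(t-1) + f_t
        trees.append(tree)
    return trees
```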
The objective function of the $t$-th tree is

$$\mathcal{L}^{(t)} = \sum_{i} l\big(y_i, F^{(t-1)}(x_i) + f_t(x_i)\big) + \Omega(f_t)$$

where, for classification, we view $l$ as a function of the raw score by absorbing the sigmoid/softmax into the loss.
According to the second-order Taylor polynomial $f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$, we can rewrite the loss function as

$$\mathcal{L}^{(t)} \approx \sum_{i} \Big[ l\big(y_i, F^{(t-1)}(x_i)\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)$$

where

$$g_i = \frac{\partial\, l\big(y_i, F^{(t-1)}(x_i)\big)}{\partial F^{(t-1)}(x_i)}, \quad \text{and} \quad h_i = \frac{\partial^2\, l\big(y_i, F^{(t-1)}(x_i)\big)}{\partial F^{(t-1)}(x_i)^2}.$$
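As a quick sanity check (my own addition, not in the original post), the analytic $g_i$ and $h_i$ for logloss can be verified against finite differences of the loss as a function of the raw score:

```python
import numpy as np

def logloss(y, score):
    p = 1.0 / (1.0 + np.exp(-score))        # sigmoid of the raw score
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y, score, eps = 1.0, 0.3, 1e-5
p = 1.0 / (1.0 + np.exp(-score))

g_analytic = p - y                          # first derivative w.r.t. score
h_analytic = p * (1 - p)                    # second derivative w.r.t. score

g_numeric = (logloss(y, score + eps) - logloss(y, score - eps)) / (2 * eps)
h_numeric = (logloss(y, score + eps) - 2 * logloss(y, score)
             + logloss(y, score - eps)) / eps**2

print(g_analytic, g_numeric)                # both approximately -0.4256
print(h_analytic, h_numeric)                # both approximately  0.2445
```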
To optimize the loss function, we follow the gradient descent method: each new tree takes one step along the negative gradient of the loss in function space, thus we get

$$f_t(x_i) \approx -g_i$$

i.e., the $t$-th tree is trained to predict the negative gradient (in practice scaled by a learning rate $\eta$). The gradient descent step uses only the first-order term $g_i$; the second-order term $h_i$ is used by methods such as XGBoost to set the leaf weights.
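A single boosting round therefore amounts to fitting a regression tree to the negative gradients. A hedged sketch using scikit-learn's `DecisionTreeRegressor` (this is the `fit_one_tree` assumed in the earlier skeleton; `grad_fn` is my own name):

```python
from sklearn.tree import DecisionTreeRegressor

def fit_one_tree(X, y, score, grad_fn, max_depth=3):
    """One gradient-descent step in function space: fit f_t to -g_i.

    `grad_fn(y, score)` is assumed to return the per-example gradients g_i
    of the loss with respect to the current raw scores.
    """
    g = grad_fn(y, score)                       # g_i at the current scores
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, -g)                             # the regression target is -g_i
    return tree
```

With `functools.partial(fit_one_tree, grad_fn=...)` this plugs directly into the `fit_gbdt` skeleton above.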
For a regression problem with MSE, $l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$ and $\hat{y}_i = F^{(t-1)}(x_i)$, so we have

$$-g_i = y_i - F^{(t-1)}(x_i) = y_i - \hat{y}_i.$$
For a binary classification problem with logloss, $l(y_i, \hat{y}_i) = -\big[y_i \ln \hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i)\big]$ and $\hat{y}_i = \sigma\big(F^{(t-1)}(x_i)\big)$, so we have

$$-g_i = y_i - \sigma\big(F^{(t-1)}(x_i)\big) = y_i - \hat{y}_i$$

(the $\hat{y}_i(1 - \hat{y}_i)$ factor from $\partial l / \partial \hat{y}_i$ cancels with $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$ in the chain rule).
For a multi-class classification problem with logloss, $l(y_i, \hat{y}_i) = -\sum_{c=1}^{C} y_{i,c} \ln \hat{y}_{i,c}$ with one-hot labels $y_{i,c}$, so for the ensemble of class $c$ we have

$$-g_{i,c} = y_{i,c} - \hat{y}_{i,c}.$$
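The three cases can be confirmed numerically; the sketch below (made-up values, my own addition) checks that the finite-difference negative gradient equals label minus prediction for each loss:

```python
import numpy as np

eps = 1e-6

def num_grad(loss, score):
    return (loss(score + eps) - loss(score - eps)) / (2 * eps)

# 1) Regression, MSE: the prediction is the raw score itself.
y, F = 3.0, 2.2
print(-num_grad(lambda s: 0.5 * (y - s) ** 2, F), y - F)          # both 0.8

# 2) Binary, logloss: the prediction is sigmoid(score).
y, F = 1.0, 0.5
sig = lambda s: 1.0 / (1.0 + np.exp(-s))
print(-num_grad(lambda s: -(y * np.log(sig(s))
                            + (1 - y) * np.log(1 - sig(s))), F),
      y - sig(F))                                                 # both ~0.3775

# 3) Multi-class, softmax logloss: gradient w.r.t. one class's score.
y_vec = np.array([0.0, 1.0, 0.0])                                 # one-hot label
F_vec = np.array([0.2, 1.0, -0.4])
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()
loss_c0 = lambda s: -np.log(softmax(np.array([s, F_vec[1], F_vec[2]]))[1])
print(-num_grad(loss_c0, F_vec[0]), y_vec[0] - softmax(F_vec)[0]) # both ~-0.265
```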
As shown above, in all three cases the learning target for each subtree is the residual between the label and the prediction of the previous ensemble of trees.
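Putting everything together, here is an end-to-end sketch (my own illustration on synthetic data, not from the original post) of a binary-classification GBDT where every round fits a regression tree to the residual $y_i - \sigma\big(F(x_i)\big)$:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy separable labels

score = np.zeros(len(X))                    # F^(0) = 0
lr, trees = 0.3, []
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-score))        # current predicted probabilities
    residual = y - p                        # -g_i for logloss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    score += lr * tree.predict(X)           # F^(t) = F^(t-1) + lr * f_t
    trees.append(tree)

p = 1.0 / (1.0 + np.exp(-score))
print("training accuracy:", ((p > 0.5) == y).mean())   # high on this toy data
```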