Regression Decision Forest

Decision forest regression is a special case of the Decision Forest model.

Details

Given:

  • \(n\) feature vectors \(X = \{x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np}) \}\) of size \(p\);

  • their non-negative sample weights \(w = (w_1, \ldots, w_n)\);

  • the vector of responses \(y = (y_1, \ldots, y_n)\),

the problem is to build a decision forest regression model that minimizes the mean squared error (MSE) between the predicted and true values.

Training Stage

Decision forest regression follows the algorithmic framework of the decision forest training algorithm and uses the mean squared error (MSE) as the impurity metric [Breiman84]. If sample weights are provided as input, the library uses a weighted version of the algorithm.

The MSE impurity of a node is calculated as follows, where \(D\) is the set of observations that reach the node:

Decision Forest Regression: impurity calculations

Without sample weights:

  \(I_{\mathrm{MSE}}\left(D\right) = \frac{1}{W(D)} \sum_{i \in D} \left(y_i - \frac{1}{W(D)} \sum_{j \in D} y_j \right)^{2}\), where \(W(S) = \sum_{s \in S} 1\), which is equivalent to the number of elements in \(S\)

With sample weights:

  \(I_{\mathrm{MSE}}\left(D\right) = \frac{1}{W(D)} \sum_{i \in D} w_i \left(y_i - \frac{1}{W(D)} \sum_{j \in D} w_j y_j \right)^{2}\), where \(W(S) = \sum_{s \in S} w_s\)
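The impurity formulas above can be illustrated with a minimal Python sketch. The function name `mse_impurity` and its signature are hypothetical and are not part of the library API; the unweighted case is treated as the weighted case with all weights equal to 1, which makes \(W(D)\) the number of observations.

```python
def mse_impurity(y, w=None):
    """MSE impurity of the responses y that reach a node.

    y: responses of the observations in the node (the set D).
    w: optional non-negative sample weights; None means unweighted.
    Hypothetical helper for illustration only, not a library function.
    """
    if w is None:
        w = [1.0] * len(y)  # unweighted: W(D) equals the number of observations
    total_weight = sum(w)  # W(D)
    if total_weight <= 0:
        raise ValueError("the total weight W(D) must be positive")
    # Weighted mean response in the node.
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_weight
    # Weighted mean of squared deviations from that mean.
    return sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y)) / total_weight


responses = [1.0, 2.0, 3.0, 6.0]
print(mse_impurity(responses))                        # unweighted impurity
print(mse_impurity(responses, [1.0, 1.0, 1.0, 0.0]))  # last observation ignored
```

A zero weight removes an observation from both the mean and the impurity, so the second call gives the same value as the unweighted impurity of the first three responses.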

Prediction Stage

Given a decision forest regression model and vectors \(x_1, \ldots, x_r\), the problem is to calculate the responses for those vectors. For each query vector \(x_i\), the algorithm traverses every tree in the forest down to the leaf node that \(x_i\) falls into; that tree's response is the mean of the dependent variable over the training observations in that leaf. The forest predicts the response as the mean of the responses from all trees.
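The prediction procedure above can be sketched in a few lines of Python. The `Node` class, the "go left when the feature value is less than or equal to the threshold" convention, and the function names are assumptions made for this illustration; they do not describe the library's internal tree representation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical tree node: a leaf stores `value`, a split stores the rest."""
    value: Optional[float] = None   # mean response of the leaf (None for splits)
    feature: int = 0                # index of the feature tested at a split
    threshold: float = 0.0          # split threshold (assumed: <= goes left)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def predict_tree(root: Node, x: Sequence[float]) -> float:
    """Walk from the root to a leaf and return the leaf's mean response."""
    node = root
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def predict_forest(trees: Sequence[Node], x: Sequence[float]) -> float:
    """Forest response: the mean of the per-tree responses."""
    return sum(predict_tree(tree, x) for tree in trees) / len(trees)


# Two tiny hand-built trees, for illustration only.
tree_a = Node(feature=0, threshold=0.5, left=Node(value=1.0), right=Node(value=3.0))
tree_b = Node(feature=1, threshold=2.0, left=Node(value=2.0), right=Node(value=5.0))
print(predict_forest([tree_a, tree_b], [0.7, 1.0]))  # mean of 3.0 and 2.0
```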

Out-of-bag Error

Decision forest regression follows the algorithmic framework for calculating the decision forest out-of-bag (OOB) error. The out-of-bag predictions of all trees are aggregated, and the OOB error of the decision forest is calculated, as follows:

  • For each vector \(x_i\) in the dataset \(X\), predict its response \(\hat{y}_i\) as the mean of the predictions from the trees that contain \(x_i\) in their OOB set:

    \(\hat{y}_i = \frac{1}{|B_i|}\sum_{T_b \in B_i}\hat{y}_{ib}\), where \(B_i = \{T_b : x_i \in \overline{D_b}\}\) is the set of trees that contain \(x_i\) in their OOB set \(\overline{D_b}\), and \(\hat{y}_{ib}\) is the prediction for \(x_i\) by \(T_b\).

  • Calculate the OOB error of the decision forest \(T\) as the mean squared error (MSE):

    \[OOB(T) = \frac{1}{|D'|}\sum_{y_i \in D'}(y_i-\hat{y}_i)^{2}, \text{ where } D' = \bigcup_{b=1}^{B}\overline{D_b}\]

  • If the OOB error for each observation is required, calculate the prediction error for \(x_i\):

    \[OOB(x_i) = (y_i-\hat{y}_i)^{2}\]
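The aggregation steps above can be sketched as follows. The function `oob_error` and its input layout (one list of predictions and one list of OOB flags per tree) are hypothetical choices for this illustration, not the library's interface. Observations that are in no tree's OOB set are skipped, because they do not belong to \(D'\).

```python
def oob_error(y, tree_predictions, oob_masks):
    """Compute the forest OOB error and the per-observation OOB errors.

    y: true responses y_i.
    tree_predictions[b][i]: prediction of tree b for observation i.
    oob_masks[b][i]: True if observation i is in the OOB set of tree b.
    Returns (OOB(T), {i: OOB(x_i)}). Hypothetical helper, illustration only.
    """
    per_observation = {}
    for i, y_i in enumerate(y):
        # Predictions from the trees B_i that did not train on x_i.
        preds = [tree_predictions[b][i]
                 for b in range(len(oob_masks)) if oob_masks[b][i]]
        if not preds:
            continue  # x_i is in no OOB set, so it is not part of D'
        y_hat = sum(preds) / len(preds)
        per_observation[i] = (y_i - y_hat) ** 2
    if not per_observation:
        raise ValueError("no observation is out-of-bag for any tree")
    forest_error = sum(per_observation.values()) / len(per_observation)
    return forest_error, per_observation


y_true = [1.0, 2.0, 4.0]
predictions = [[1.5, 2.0, 3.0],   # tree 0
               [0.5, 3.0, 5.0]]   # tree 1
masks = [[True, False, False],    # OOB set of tree 0
         [True, True, False]]     # OOB set of tree 1
forest_err, per_obs = oob_error(y_true, predictions, masks)
print(forest_err, per_obs)
```

In this toy input the third observation is in-bag for both trees, so only the first two observations contribute to the forest-level error.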

Batch Processing

Decision forest regression follows the general workflow described in Decision Forest.

Training

For the description of the input and output, refer to Regression Usage Model.

In addition to the decision forest parameters described in Batch Processing, the training algorithm for decision forest regression has the following parameters:

Training Parameters for Decision Forest Regression (Batch Processing)

Parameter: algorithmFPType
Default Value: float
Description: The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

Parameter: method
Default Value: defaultDense
Description: The computation method used by the decision forest regression.

For CPU:

  • defaultDense - default performance-oriented method

  • hist - inexact histogram computation method

For GPU:

Output

In addition to the regression output described in Regression Usage Model, decision forest regression calculates the decision forest results. For more details, refer to Batch Processing.

Prediction

For the description of the input and output, refer to Regression Usage Model.

In addition to the parameters of regression, decision forest regression has the following parameters at the prediction stage:

Prediction Parameters for Decision Forest Regression (Batch Processing)

Parameter: algorithmFPType
Default Value: float
Description: The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

Parameter: method
Default Value: defaultDense
Description: The computation method used by the decision forest regression. The only prediction method supported so far is the default dense method.