A gradient boosting tree is an iterative decision-tree algorithm widely used for classification and regression. The model is an ensemble of multiple decision trees. Unlike a random forest, which produces its final result by voting over the outputs of many independently trained trees, a gradient boosting tree learns from the conclusions of all previous trees: at each iteration it fits a new weak learner to compensate for the shortcomings of the current model, which generally makes it more accurate. Note that gradient boosting trees are sensitive to outliers, and the classification provided here supports binary classification only.
This method runs the training procedure of the gradient boosting tree classification algorithm: a model is trained from the characteristics of the data, and the model is then used for prediction.
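As an illustration of the train-then-predict workflow described above, the following is a minimal sketch using scikit-learn's GradientBoostingClassifier as an analogous implementation; the product's own API is not shown here, and the synthetic dataset stands in for a real training Dataset.

```python
# Illustrative sketch only: scikit-learn's GradientBoostingClassifier is used
# as a stand-in for the tool's gradient boosting tree classification method.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for a training Dataset.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train: each boosting iteration fits a small tree to correct the errors of
# the ensemble built so far, so later trees make up for earlier shortcomings.
model = GradientBoostingClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Use the trained model for prediction on unseen data.
predictions = model.predict(X_test)
```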
Training Dataset: required parameter. The connection information for the Dataset to be trained, including the data type, connection parameters, Dataset name, etc. HBase data, dsf data, and local data can be connected.
Data Query Conditions: optional parameter. Filters the specified data for analysis according to the query conditions; attribute conditions and spatial queries are supported, e.g. SmID < 100 and BBOX(the_geom, 120, 30, 121, 31).
Explanatory Fields: required parameter. The fields of the explanatory variables. Enter one or more explanatory fields of the training Dataset as the independent variables of the model; they help predict the category.
Modeling Field: required parameter. The field used to train the model, i.e. the dependent variable. This field holds known (training) values of the variable that will be predicted at unknown locations.
Depth of the Tree: optional parameter. The maximum number of splits that can be made down a tree. The value range is 0-30 and the default value is 30. A larger maximum depth creates more splits, which may increase the likelihood of overfitting the model.
Maximum Iterations: the maximum number of iterations; must be greater than 0. The default value is 100.
Percent of data used during training: optional parameter. Specifies the fraction of features used to train each gradient boosting tree. The value range is 0-1.0; the default value is 1.0, i.e. 100% of the data. Using a lower percentage of the input data per tree can speed up the tool on large Datasets.
Leaf Node Splitting Threshold: optional parameter. The minimum number of observations required to retain a leaf (i.e., a terminal node of a tree that is not split further). The minimum value is 0 and the default value is 1. For very large data, increasing this value reduces the tool's runtime.
Model Save Directory: optional parameter. The trained model is saved to this path; an empty value means the model will not be saved.
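The parameters above can be sketched against scikit-learn's GradientBoostingClassifier, whose arguments play analogous roles; the tool's internal parameter names may differ, and the training Dataset and query conditions are assumed here to already be loaded into a feature matrix X and a label vector y. Saving the model with joblib is shown as one possible persistence format, not necessarily the one this tool uses.

```python
# Hedged mapping of the documented parameters onto scikit-learn equivalents.
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for data loaded from the Training Dataset after applying the
# Data Query Conditions; X = Explanatory Fields, y = Modeling Field.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = GradientBoostingClassifier(
    max_depth=30,        # Depth of the Tree: range 0-30, default 30
    n_estimators=100,    # Maximum Iterations: default 100
    subsample=1.0,       # Percent of data used per tree: range 0-1.0, default 1.0
    min_samples_leaf=1,  # Leaf Node Splitting Threshold: default 1
).fit(X, y)

# Model Save Directory: persist the trained model so it can be reloaded later.
save_dir = tempfile.mkdtemp()
path = os.path.join(save_dir, "gbt_model.joblib")
joblib.dump(model, path)
restored = joblib.load(path)
```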
gbtModelCharacteristics: Properties of the gradient boosting tree classification model.
Variable: the field array of the gradient boosting tree classification model, i.e. the independent-variable fields used in training the model.
featureImportances: the importance of each field, i.e. the degree of influence of each independent variable's feature on the dependent variable.
f1: the weighted F1-measure.
accuracy: the weighted accuracy.
weightedPrecision: the weighted precision.
weightedRecall: the weighted recall.
ClassificationDiagnostics: classification result diagnostics, including the f1Score, precision, recall, true positive rate, and false positive rate for each classification category.
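The weighted diagnostics listed above can be computed from a set of true and predicted labels; the following sketch uses scikit-learn's metrics functions as an illustration of the definitions, with small hand-made label vectors rather than output from the tool itself.

```python
# Illustration of the weighted diagnostics using scikit-learn metrics.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

# Toy binary-classification labels: true values vs. model predictions.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
# average="weighted" weights each class's score by its number of true samples.
f1 = f1_score(y_true, y_pred, average="weighted")
weightedPrecision = precision_score(y_true, y_pred, average="weighted")
weightedRecall = recall_score(y_true, y_pred, average="weighted")
```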