Forest-based regression training

Forest-based regression builds a forest model and uses it to predict a continuous variable. The training procedure fits a forest model to the characteristics of the training data; the resulting model can then be used for prediction.

When you create a forest-based regression training task, you need to set the following parameters:

Training Dataset: required parameter. The connection information for the dataset to be trained, including the data type, connection parameters, and dataset name. HBase data, dsf data, and local data are supported. Specify each value in the form '--key=value' and separate multiple values with a space. For HBase data: --providerType=hbase --hbase.zookeepers=192.168.12.34:2181 --hbase.catalog=demo --dataset=dltb; for dsf data: --providerType=dsf --path=hdfs://ip:9000/dsfdata; for local data: --providerType=dsf --path=/home/dsfdata
Data Query Conditions: optional parameter. Filters the data to be analyzed according to the query conditions; attribute conditions and spatial queries are supported. For example: SmID < 100 and BBOX(the_geom, 120,30,121,31).
Explanatory Fields: required parameter. The fields of the explanatory variables. Enter one or more fields of the training dataset as the independent variables of the model; these fields help predict the value of the modeling field.
Modeling field: required parameter. The field used to train the model, that is, the dependent variable. It holds the known values of the variable that the model will predict at unknown locations.
Number of Trees: optional parameter. The number of trees to create in the forest model. More trees generally produce more accurate predictions but increase the computation time. The value must be > 0; the default value is 100.
Depth of the tree: optional parameter. The maximum number of splits made down a tree. The value range is 0-30 and the default value is 30. A larger maximum depth creates more splits, which may increase the likelihood of overfitting the model.
Leaf Node Splitting Threshold: optional parameter. The minimum number of observations required to keep a leaf (that is, a terminal node of a tree that is not split further). The value must be > 0; the default value is 5. For very large datasets, increasing this value reduces the run time of the tool.
Percent of data used per decision tree: optional parameter. The percentage of the input data records used for each decision tree. The value range is 0-1.0; the default value is 1.0, which means 100% of the data. Using a lower percentage of the input data for each decision tree speeds up the tool on large datasets.
Distance explanatory variable Dataset: optional parameter. Point, line, and region datasets are supported. The tool calculates the closest distance between the features of the given dataset and the features of the training dataset, and automatically creates the corresponding explanatory variables.
Model Save Directory: optional parameter. The directory in which to save a model with a good training result. If the value is empty, the model is not saved.
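
The parameter rules above can be sketched in Python as follows. This is only an illustration, not the product's API: the function names build_connection_info and check_training_params and the keys num_trees, max_depth, leaf_size, and sample_rate are hypothetical. Only the '--key=value' format, the example connection keys, the value ranges, and the defaults are taken from the descriptions above; treating the range endpoints as valid is an assumption.

```python
def build_connection_info(options):
    """Join options into the space-separated '--key=value' form.

    `options` is a dict so that keys containing dots
    (for example 'hbase.zookeepers') can be passed unchanged.
    """
    return " ".join(f"--{key}={value}" for key, value in options.items())


# Defaults as documented above; the key names themselves are hypothetical.
DEFAULTS = {
    "num_trees": 100,    # Number of Trees
    "max_depth": 30,     # Depth of the tree
    "leaf_size": 5,      # Leaf Node Splitting Threshold
    "sample_rate": 1.0,  # Percent of data used per decision tree
}


def check_training_params(**overrides):
    """Merge overrides into the defaults and check the documented ranges."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    if params["num_trees"] <= 0:
        raise ValueError("Number of Trees must be > 0")
    # Assumption: both endpoints of the documented 0-30 range are accepted.
    if not 0 <= params["max_depth"] <= 30:
        raise ValueError("Depth of the tree must be in 0-30")
    if params["leaf_size"] <= 0:
        raise ValueError("Leaf Node Splitting Threshold must be > 0")
    # Assumption: both endpoints of the documented 0-1.0 range are accepted.
    if not 0 <= params["sample_rate"] <= 1.0:
        raise ValueError("Percent of data per tree must be in 0-1.0")
    return params


# Connection information for local data, as in the example above.
local_data = build_connection_info({"providerType": "dsf", "path": "/home/dsfdata"})
# Keep the defaults but train on half of the data per tree.
params = check_training_params(sample_rate=0.5)
```

Passing the connection options as a dictionary keeps dotted keys such as hbase.zookeepers intact, and checking the ranges before a task is submitted reports an invalid value early rather than after the task has started.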

After the training task is executed, the following result parameters are output:

fbModelCharacteristics: the attributes of the forest model, including the number of trees (numTrees), the minimum number of samples in a leaf node (leafSize), the maximum tree depth (maxTreesDepth), and the minimum tree depth (minTreesDepth).
Variable: the field array of the forest model, that is, the independent-variable fields used to train the model.
Variable Importances: the importance of each field, that is, the degree to which each explanatory variable influences the dependent variable.
MSE: mean squared error, the mean of the squared errors between the predicted values and the true values.
RMSE: root mean squared error, the square root of the mean of the squared errors between the predicted values and the true values (that is, the square root of MSE).
Mae: mean absolute error, the mean of the absolute values of the errors between the predicted values and the true values.
R2: coefficient of determination, used to judge the quality of the model. The value range is 0 to 1. Generally, the larger R2 is, the better the model fits. R2 is only a rough measure of accuracy: it tends to increase as the number of samples increases, so it cannot strictly quantify how accurate the model is.
explained Variance: the explained variance, that is, the amount of variance in the dependent variable that is explained by the model.
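
The error metrics above can be illustrated with a short Python sketch that uses the standard textbook definitions. The function name regression_metrics is hypothetical. The source does not give the formula for explained Variance, so the ratio form 1 - Var(error) / Var(true) used here is an assumption (one common definition), and the tool may compute it differently.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, R2 and explained variance for two sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n   # mean of squared errors
    rmse = math.sqrt(mse)                  # square root of MSE
    mae = sum(abs(e) for e in errors) / n  # mean of absolute errors

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are equal")
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    r2 = 1 - ss_res / ss_tot

    # Assumed definition: 1 - Var(error) / Var(true).
    mean_err = sum(errors) / n
    var_err = sum((e - mean_err) ** 2 for e in errors) / n
    explained_variance = 1 - var_err / (ss_tot / n)

    return {
        "MSE": mse,
        "RMSE": rmse,
        "Mae": mae,
        "R2": r2,
        "explainedVariance": explained_variance,
    }


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```

In this example the errors are 1, 0, -1, and 0, so MSE is 0.5, RMSE is about 0.707, Mae is 0.5, and R2 is 0.9.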