Support Vector Machine Classification Training

A support vector machine (SVM) is a widely used supervised learning algorithm for classification. In its basic form it is a binary classifier: it looks for a hyperplane that separates the positive samples from the negative samples, and it chooses the hyperplane that maximizes the margin between the two classes. SVMs cope well with practical classification problems such as small sample sizes, nonlinearity, high dimensionality, and local minima. They are widely used in image processing, data mining, and other fields.
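
The separating hyperplane can be illustrated with a minimal sketch. It assumes a linear model with a hypothetical weight vector w and bias b; these names and values are for illustration only and are not part of the training task. A sample is classified by the sign of w·x + b, and the margin between the two classes has width 2/||w||.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return 1 for the positive class and 0 for the negative class."""
    return 1 if decision_value(w, b, x) >= 0 else 0

def margin_width(w):
    """Width of the margin between the two classes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 + x2 - 3 = 0 (illustrative values only)
w, b = [1.0, 1.0], -3.0
label_a = predict(w, b, [3.0, 2.0])  # lies on the positive side
label_b = predict(w, b, [0.5, 1.0])  # lies on the negative side
width = margin_width(w)
```

Because the margin width is inversely proportional to ||w||, maximizing the margin amounts to keeping ||w|| small while still separating the samples.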

Training a support vector machine classification model fits the model to the characteristics of the training data; the trained model can then be used for prediction.

When creating a support vector machine classification training task, you need to set the following parameters:

Training Dataset: Required parameter. The connection information for the dataset to be trained, including the data type, connection parameters, dataset name, etc. You can connect to HBase data, dsf data, and local data.
Data Query Conditions: Optional parameter. Filters the data to be analyzed according to the query conditions; attribute conditions and spatial queries are supported, e.g. SmID<100 and BBOX(the_geom, 120,30,121,31).
Explanatory Fields: Required parameter. The fields of the explanatory variables. Enter one or more fields of the training dataset to serve as the independent variables of the model; they help predict the result.
Modeling Field: Required parameter. The field used to train the model, that is, the dependent variable. This field holds the known values of the variable that the model will predict at unknown locations.
Maximum Iterations: Optional parameter, value range >0. Default is 100.
Regularization Parameter: Optional parameter, value range ≥ 0, default value is 0.0. It is mainly used to prevent overfitting.
Distance Explanatory Variable Dataset: Optional parameter. Supports point, line, and region datasets. The closest distance between the elements of the given dataset and the elements of the training dataset is calculated, and a list of explanatory variables is created automatically from these distances.
Model Save Directory: Optional parameter. The address to which the trained model is saved. If it is empty, the model will not be saved.
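
How the Maximum Iterations and regularization settings influence training can be shown with a rough sketch. It is a generic linear SVM trained by subgradient descent on the hinge loss, not necessarily the optimizer this task uses. The function name, the step_size argument, and the toy data are all hypothetical; only the defaults of 100 iterations and 0.0 regularization are taken from the parameter list above.

```python
def train_linear_svm(features, labels, max_iterations=100, reg_param=0.0, step_size=0.1):
    """Fit weights w and bias b by subgradient descent on the regularized hinge loss.

    features: list of rows, one value per explanatory field
    labels:   list of 0/1 values taken from the modeling field
    """
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    signs = [1.0 if y == 1 else -1.0 for y in labels]  # map 0/1 to -1/+1
    for _ in range(max_iterations):
        # The regularization term pulls the weights toward zero.
        grad_w, grad_b = [reg_param * wi for wi in w], 0.0
        for x, s in zip(features, signs):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if s * score < 1.0:  # sample is inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= s * x[j] / n
                grad_b -= s / n
        w = [wi - step_size * g for wi, g in zip(w, grad_w)]
        b -= step_size * grad_b
    return w, b

# Hypothetical, linearly separable toy data with two explanatory fields
features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]]
labels = [0, 0, 0, 1, 1, 1]
w, b = train_linear_svm(features, labels)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0
               for x in features]
```

In this sketch a larger reg_param shrinks the weights more strongly at every step, which is the usual way regularization limits overfitting, and max_iterations simply caps the number of update steps.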

After the training task is executed, the following result parameters are output:

Variable: The array of model fields, that is, the independent-variable fields used in the trained model.
Coefficient: Regression coefficient.
NumClasses: The number of classes.
AreaUnderROC: The area under the ROC curve (AUC), a standard measure of the quality of a classification model. The value usually lies between 0.5 and 1.0, where 0.5 corresponds to random guessing; a larger AUC means that the model performs better.
AreaUnderPR: The area under the PR (precision-recall) curve. Like the ROC curve, the PR curve is an indicator of the quality of the model.
Accuracy: The accuracy rate, that is, the proportion of samples that are classified correctly.
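
To make the Accuracy and AreaUnderROC indicators concrete, the following sketch computes both from a handful of hypothetical labels and decision scores. The function names and sample values are illustrative and are not produced by the training task. The AUC is computed by pairwise comparison, which is equivalent to the area under the ROC curve.

```python
def accuracy(labels, predictions):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def area_under_roc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win; this equals the area under the ROC curve.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels and decision scores for six samples
labels = [0, 0, 1, 1, 0, 1]
scores = [-1.2, 0.3, 0.8, -0.1, -0.5, 1.5]
predictions = [1 if s >= 0 else 0 for s in scores]  # threshold the score at 0
acc = accuracy(labels, predictions)
auc = area_under_roc(labels, scores)
```

Accuracy depends on the threshold applied to the scores, whereas the AUC depends only on how the scores rank positives against negatives, so the two indicators can disagree about which model is better.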