.\" Text automatically generated by txt2man
.TH mlpack_random_forest 1 "12 December 2020" "mlpack-3.4.2" "User Commands"
.SH NAME
\fBmlpack_random_forest \fP- random forests
.SH SYNOPSIS
.nf
.fam C
 \fBmlpack_random_forest\fP [\fB-m\fP \fIunknown\fP] [\fB-l\fP \fIstring\fP] [\fB-D\fP \fIint\fP] [\fB-g\fP \fIdouble\fP] [\fB-n\fP \fIint\fP] [\fB-N\fP \fIint\fP] [\fB-a\fP \fIbool\fP] [\fB-s\fP \fIint\fP] [\fB-d\fP \fIint\fP] [\fB-T\fP \fIstring\fP] [\fB-L\fP \fIstring\fP] [\fB-t\fP \fIstring\fP] [\fB-V\fP \fIbool\fP] [\fB-M\fP \fIunknown\fP] [\fB-p\fP \fIstring\fP] [\fB-P\fP \fIstring\fP] [\fB-h\fP \fB-v\fP]
.fam T
.fi
.SH DESCRIPTION
This program is an implementation of the standard random forest classification algorithm by Leo Breiman.
A random forest can be trained and saved for later use, or a random forest may be loaded and predictions or class probabilities for points may be generated.
.PP
The training set and associated labels are specified with the '\fB--training_file\fP (\fB-t\fP)' and '\fB--labels_file\fP (\fB-l\fP)' parameters, respectively.
The labels should be in the range [0, num_classes - 1].
If '\fB--labels_file\fP (\fB-l\fP)' is not specified, the labels are assumed to be the last dimension of the training dataset.
.PP
When a model is trained, the '\fB--output_model_file\fP (\fB-M\fP)' output parameter may be used to save the trained model.
A model may be loaded for predictions with the '\fB--input_model_file\fP (\fB-m\fP)' parameter.
The '\fB--input_model_file\fP (\fB-m\fP)' parameter may not be specified when the '\fB--training_file\fP (\fB-t\fP)' parameter is specified.
The '\fB--minimum_leaf_size\fP (\fB-n\fP)' parameter specifies the minimum number of training points that must fall into each leaf for it to be split.
The '\fB--num_trees\fP (\fB-N\fP)' parameter controls the number of trees in the random forest.
The '\fB--minimum_gain_split\fP (\fB-g\fP)' parameter controls the minimum required gain for a decision tree node to split.
Larger values will force higher-confidence splits.
The '\fB--maximum_depth\fP (\fB-D\fP)' parameter specifies the maximum depth of the tree.
The '\fB--subspace_dim\fP (\fB-d\fP)' parameter is used to control the number of random dimensions chosen for an individual node's split.
If '\fB--print_training_accuracy\fP (\fB-a\fP)' is specified, the calculated accuracy on the training set will be printed.
.PP
Test data may be specified with the '\fB--test_file\fP (\fB-T\fP)' parameter, and if performance measures are desired for that test set, labels for the test points may be specified with the '\fB--test_labels_file\fP (\fB-L\fP)' parameter.
Predictions for each test point may be saved via the '\fB--predictions_file\fP (\fB-p\fP)' output parameter.
Class probabilities for each prediction may be saved with the '\fB--probabilities_file\fP (\fB-P\fP)' output parameter.
.PP
For example, to train a random forest with a minimum leaf size of 20 using 10 trees on the dataset contained in 'data.csv' with labels 'labels.csv', saving the output random forest to 'rf_model.bin' and printing the training accuracy, one could call
.PP
$ \fBmlpack_random_forest\fP \fB--training_file\fP data.csv \fB--labels_file\fP labels.csv \fB--minimum_leaf_size\fP 20 \fB--num_trees\fP 10 \fB--output_model_file\fP rf_model.bin \fB--print_training_accuracy\fP
.PP
Then, to use that model to classify points in 'test_set.csv' and print the test accuracy given the labels 'test_labels.csv', while saving the predictions for each point to 'predictions.csv', one could call
.PP
$ \fBmlpack_random_forest\fP \fB--input_model_file\fP rf_model.bin \fB--test_file\fP test_set.csv \fB--test_labels_file\fP test_labels.csv \fB--predictions_file\fP predictions.csv
.RE
.PP
.SH OPTIONAL INPUT OPTIONS
.TP
.B
\fB--help\fP (\fB-h\fP) [\fIbool\fP]
Default help info.
.TP
.B
\fB--info\fP [\fIstring\fP]
Print help on a specific option. Default value ''.
.TP
.B
\fB--input_model_file\fP (\fB-m\fP) [\fIunknown\fP]
Pre-trained random forest to use for classification.
.TP
.B
\fB--labels_file\fP (\fB-l\fP) [\fIstring\fP]
Labels for training dataset.
.TP
.B
\fB--maximum_depth\fP (\fB-D\fP) [\fIint\fP]
Maximum depth of the tree (0 means no limit). Default value 0.
.TP
.B
\fB--minimum_gain_split\fP (\fB-g\fP) [\fIdouble\fP]
Minimum gain needed to make a split when building a tree. Default value 0.
.TP
.B
\fB--minimum_leaf_size\fP (\fB-n\fP) [\fIint\fP]
Minimum number of points in each leaf node. Default value 1.
.TP
.B
\fB--num_trees\fP (\fB-N\fP) [\fIint\fP]
Number of trees in the random forest. Default value 10.
.TP
.B
\fB--print_training_accuracy\fP (\fB-a\fP) [\fIbool\fP]
If set, then the accuracy of the model on the training set will be printed (verbose must also be specified).
.TP
.B
\fB--seed\fP (\fB-s\fP) [\fIint\fP]
Random seed. If 0, 'std::time(NULL)' is used. Default value 0.
.TP
.B
\fB--subspace_dim\fP (\fB-d\fP) [\fIint\fP]
Dimensionality of random subspace to use for each split. '0' will autoselect the square root of data dimensionality. Default value 0.
.TP
.B
\fB--test_file\fP (\fB-T\fP) [\fIstring\fP]
Test dataset to produce predictions for.
.TP
.B
\fB--test_labels_file\fP (\fB-L\fP) [\fIstring\fP]
Test dataset labels, if accuracy calculation is desired.
.TP
.B
\fB--training_file\fP (\fB-t\fP) [\fIstring\fP]
Training dataset.
.TP
.B
\fB--verbose\fP (\fB-v\fP) [\fIbool\fP]
Display informational messages and the full list of parameters and timers at the end of execution.
.TP
.B
\fB--version\fP (\fB-V\fP) [\fIbool\fP]
Display the version of mlpack.
.SH OPTIONAL OUTPUT OPTIONS
.TP
.B
\fB--output_model_file\fP (\fB-M\fP) [\fIunknown\fP]
Model to save trained random forest to.
.TP
.B
\fB--predictions_file\fP (\fB-p\fP) [\fIstring\fP]
Predicted classes for each point in the test set.
.TP
.B
\fB--probabilities_file\fP (\fB-P\fP) [\fIstring\fP]
Predicted class probabilities for each point in the test set.
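.PP
As a further sketch combining several of the options listed above, the following shell fragment trains a forest whose labels are stored in the last dimension of the training file, then reuses the saved model to write both predicted classes and class probabilities.
The file names and parameter values are hypothetical; only the flags are taken from this page.
The run helper is an illustrative assumption and not part of mlpack: it prints a command instead of executing it when the program is not installed.
.PP
.nf
.fam C
```shell
#!/bin/sh
# Hypothetical file names and parameter values; every flag is documented on this page.

# Illustrative helper (not part of mlpack): run the command if its program is
# installed, otherwise only print it.
run() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "dry run: $*"
  fi
}

# Train 50 trees of depth at most 8. Since --labels_file is omitted, the labels
# are read from the last dimension of the training file.
run mlpack_random_forest --training_file train_with_labels.csv --num_trees 50 --maximum_depth 8 --seed 42 --output_model_file rf_model.bin

# Reuse the saved model; write predicted classes and class probabilities.
run mlpack_random_forest --input_model_file rf_model.bin --test_file test_set.csv --predictions_file predictions.csv --probabilities_file probabilities.csv
```
.fam T
.fi
.PP
Passing a nonzero value to '\fB--seed\fP (\fB-s\fP)' avoids the time-based default seed, which may help when comparing runs.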
.SH ADDITIONAL INFORMATION
For further information, including relevant papers, citations, and theory, consult the documentation found at http://www.mlpack.org or included with your distribution of mlpack.