mlpack_logistic_regression(26 December 2016)
NAME¶
mlpack_logistic_regression - L2-regularized logistic regression and prediction
SYNOPSIS¶
mlpack_logistic_regression [-h] [-v]
DESCRIPTION¶
An implementation of L2-regularized logistic regression using either the L-BFGS optimizer or SGD (stochastic gradient descent). This solves the regression problem
y = 1 / (1 + e^-(X * b))
where y takes values 0 or 1.
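The model above can be illustrated with a minimal Python sketch. This is not mlpack's code; the function name logistic is hypothetical, and the intercept term is omitted to match the formula as written, with x a single data point and b the parameter vector.

```python
import math

def logistic(x, b):
    """Return 1 / (1 + e^-(x . b)) for one data point x and parameter vector b."""
    z = sum(xi * bi for xi, bi in zip(x, b))
    # Branch on the sign of z so that exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(logistic([0.0, 0.0], [1.5, -2.0]))  # 0.5
```

The output always lies between 0 and 1, which is why it can be compared against a decision boundary to obtain a 0 or 1 class.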
This program allows loading a logistic regression model from a file (-m) or training a logistic regression model given training data (-t), or both at once. In addition, this program allows classification on a test dataset (-T) and will save the classification results to the given output file (-o). The logistic regression model itself may be saved to a file specified with the -M option.
The training data given with the -t option should have class labels as its last dimension (so, if the training data is in CSV format, labels should be the last column). Alternately, the -l (--labels_file) option may be used to specify a separate file of labels.
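To make the expected layout concrete, the sketch below separates a small in-memory CSV into predictors and labels, taking the label from the last column. It only illustrates the file layout described above; the helper split_labels and the sample data are hypothetical, and nothing here reflects how mlpack itself parses files.

```python
import csv
import io

def split_labels(csv_text):
    """Split CSV rows into (features, labels), taking the label from the last column."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row]
        features.append(values[:-1])
        labels.append(int(values[-1]))
    return features, labels

sample = "0.5,1.2,0\n1.5,0.3,1\n"
X, y = split_labels(sample)
print(X)  # [[0.5, 1.2], [1.5, 0.3]]
print(y)  # [0, 1]
```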
When a model is being trained, there are many options. L2 regularization (to prevent overfitting) can be specified with the -L (--lambda) option, and the optimizer used to train the model can be specified with the --optimizer option. Available optimizers are 'sgd' (stochastic gradient descent), 'lbfgs' (the L-BFGS optimizer), and 'minibatch-sgd' (minibatch stochastic gradient descent). There are also various parameters for the optimizer; the --max_iterations parameter specifies the maximum number of allowed iterations, and the --tolerance (-e) parameter specifies the tolerance for convergence. For the SGD and mini-batch SGD optimizers, the --step_size parameter controls the step size taken at each iteration by the optimizer. The batch size for mini-batch SGD is controlled with the --batch_size (-b) parameter. If the objective function for your data is oscillating between Inf and 0, the step size is probably too large. There are more parameters for the optimizers, but the C++ interface must be used to access these.
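As a rough illustration of how the step size, iteration limit, and L2 parameter interact, here is a toy SGD loop in Python. It is a sketch under stated assumptions, not mlpack's optimizer: the names train_sgd and sigmoid are hypothetical, there is no intercept term or convergence tolerance, and the penalty is assumed to be (lam / 2) * ||b||^2.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_sgd(X, y, step_size=0.01, max_iterations=10000, lam=0.0, seed=0):
    """Toy SGD for L2-regularized logistic regression (no intercept term)."""
    rng = random.Random(seed)
    b = [0.0] * len(X[0])
    for _ in range(max_iterations):
        i = rng.randrange(len(X))  # one iteration visits one point
        p = sigmoid(sum(xj * bj for xj, bj in zip(X[i], b)))
        # Gradient of the per-point log loss plus the assumed L2 penalty term.
        b = [bj - step_size * ((p - y[i]) * xj + lam * bj)
             for xj, bj in zip(X[i], b)]
    return b

X = [[1.0, 2.0], [1.0, 3.0], [1.0, -2.0], [1.0, -3.0]]
y = [1, 1, 0, 0]
b = train_sgd(X, y, step_size=0.1, max_iterations=2000)
print([round(sigmoid(sum(xj * bj for xj, bj in zip(x, b)))) for x in X])
```

Each update is scaled by step_size, so a value that is too large lets single updates overshoot, which is consistent with the oscillation warning above.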
For SGD, an iteration refers to a single point, and for mini-batch SGD, an iteration refers to a single batch. So to take a single pass over the dataset with SGD, --max_iterations should be set to the number of points in the dataset.
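The arithmetic above can be written as a small helper for choosing a --max_iterations value. The function iterations_for_passes is hypothetical, and rounding up for a final partial batch is an assumption; the text above only states that one iteration is one point (SGD) or one batch (mini-batch SGD).

```python
import math

def iterations_for_passes(num_points, passes=1, batch_size=None):
    """Iterations needed for `passes` sweeps over the data.

    batch_size=None models plain SGD (one point per iteration);
    otherwise one iteration consumes one mini-batch.
    """
    if batch_size is None:
        per_pass = num_points
    else:
        per_pass = math.ceil(num_points / batch_size)  # assumed: partial batch counts as one
    return passes * per_pass

print(iterations_for_passes(1000))                            # 1000
print(iterations_for_passes(1000, batch_size=50))             # 20
print(iterations_for_passes(1030, passes=3, batch_size=50))   # 63
```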
Optionally, the model can be used to predict the responses for another matrix of data points, if --test_file is specified. The --test_file option can be specified without --training_file, so long as an existing logistic regression model is given with --input_model_file. The output predictions from the logistic regression model are stored in the file given with --output_file.
This implementation of logistic regression does not support the general multi-class case but instead only the two-class case. Any responses must be either 0 or 1.
OPTIONAL INPUT OPTIONS¶
- --batch_size (-b) [int]
- Batch size for mini-batch SGD. Default value 50.
- --decision_boundary (-d) [double]
- Decision boundary for prediction; if the logistic function for a point is less than the boundary, the class is taken to be 0; otherwise, the class is 1. Default value 0.5.
- --help (-h)
- Default help info.
- --info [string]
- Get help on a specific module or option. Default value ''.
- --input_model_file (-m) [string]
- File containing existing model (parameters). Default value ''.
- --labels_file (-l) [string]
- A file containing labels (0 or 1) for the points in the training set (y). Default value ''.
- --lambda (-L) [double]
- L2-regularization parameter for training. Default value 0.
- --max_iterations (-n) [int]
- Maximum iterations for optimizer (0 indicates no limit). Default value 10000.
- --optimizer (-O) [string]
- Optimizer to use for training ('lbfgs' or 'sgd'). Default value 'lbfgs'.
- --step_size (-s) [double]
- Step size for SGD and mini-batch SGD optimizers. Default value 0.01.
- --test_file (-T) [string]
- File containing test dataset. Default value ''.
- --tolerance (-e) [double]
- Convergence tolerance for optimizer. Default value 1e-10.
- --training_file (-t) [string]
- A file containing the training set (the matrix of predictors, X). Default value ''.
- --verbose (-v)
- Display informational messages and the full list of parameters and timers at the end of execution.
- --version (-V)
- Display the version of mlpack.
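The rule stated for --decision_boundary can be sketched in Python as follows. The function classify and the sample probabilities are hypothetical; the comparison itself follows the option description (values below the boundary map to class 0, everything else to class 1).

```python
def classify(probability, decision_boundary=0.5):
    """Return class 0 if the logistic output is below the boundary, else 1."""
    return 0 if probability < decision_boundary else 1

probs = [0.10, 0.49, 0.50, 0.93]
print([classify(p) for p in probs])                          # [0, 0, 1, 1]
print([classify(p, decision_boundary=0.95) for p in probs])  # [0, 0, 0, 0]
```

Raising the boundary makes class 1 predictions rarer, which can be useful when false positives are more costly than false negatives.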
OPTIONAL OUTPUT OPTIONS¶
- --output_file (-o) [string]
- If --test_file is specified, this file is where the predictions for the test set will be saved. Default value ''.
- --output_model_file (-M) [string]
- File to save trained logistic regression model to. Default value ''.
- --output_probabilities_file (-p) [string]
- If --test_file is specified, this file is where the class probabilities for the test set will be saved. Default value ''.