.\" Text automatically generated by txt2man
.TH mlpack_preprocess_split 1 "11 January 2024" "mlpack-4.3.0" "User Commands"
.SH NAME
\fBmlpack_preprocess_split \fP- split data
.SH SYNOPSIS
.nf
.fam C
 \fBmlpack_preprocess_split\fP \fB-i\fP \fIunknown\fP [\fB-I\fP \fIunknown\fP] [\fB-S\fP \fIbool\fP] [\fB-s\fP \fIint\fP] [\fB-z\fP \fIbool\fP] [\fB-r\fP \fIdouble\fP] [\fB-V\fP \fIbool\fP] [\fB-T\fP \fIunknown\fP] [\fB-L\fP \fIunknown\fP] [\fB-t\fP \fIunknown\fP] [\fB-l\fP \fIunknown\fP] [\fB-h\fP \fB-v\fP] 
.fam T
.fi
.SH DESCRIPTION


This utility takes a dataset and, optionally, its labels, and splits them into a
training set and a test set. Before the split, the points in the dataset are
randomly reordered. The fraction of the dataset to be used as the test set
can be specified with the '\fB--test_ratio\fP (\fB-r\fP)' parameter; the default is 0.2
(20%).
.PP
The output training and test matrices may be saved with the '\fB--training_file\fP
(\fB-t\fP)' and '\fB--test_file\fP (\fB-T\fP)' output parameters.
.PP
Optionally, labels can also be split along with the data by specifying the
\(cq\fB--input_labels_file\fP (\fB-I\fP)' parameter. Splitting labels works the same way as
splitting the data. The output training and test labels may be saved with the
\(cq\fB--training_labels_file\fP (\fB-l\fP)' and '\fB--test_labels_file\fP (\fB-L\fP)' output parameters,
respectively.
.PP
As a simple example, to split the dataset 'X.csv' into 'X_train.csv' and 'X_test.csv'
with 60% of the data in the training set and 40% of the data in the test set,
we could run
.PP
$ \fBmlpack_preprocess_split\fP \fB--input_file\fP X.csv \fB--training_file\fP X_train.csv
\fB--test_file\fP X_test.csv \fB--test_ratio\fP 0.4
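.PP
The same split can also be requested with the short option names documented
in the option sections of this page; this is only a restatement of the
command above, not an additional feature:
.PP
$ \fBmlpack_preprocess_split\fP \fB-i\fP X.csv \fB-t\fP X_train.csv \fB-T\fP X_test.csv \fB-r\fP 0.4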
.PP
By default the dataset is shuffled before it is split. To keep the points in
their original order, provide the '\fB--no_shuffle\fP (\fB-S\fP)' option, for
example:
.PP
$ \fBmlpack_preprocess_split\fP \fB--input_file\fP X.csv \fB--training_file\fP X_train.csv
\fB--test_file\fP X_test.csv \fB--test_ratio\fP 0.4 \fB--no_shuffle\fP
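.PP
When shuffling is left on, the resulting split depends on the random seed. As
a sketch, assuming that a fixed nonzero '\fB--seed\fP (\fB-s\fP)' value makes repeated
runs produce the same split (the value 42 is arbitrary), a repeatable split
could be requested with:
.PP
$ \fBmlpack_preprocess_split\fP \fB--input_file\fP X.csv \fB--training_file\fP X_train.csv
\fB--test_file\fP X_test.csv \fB--test_ratio\fP 0.4 \fB--seed\fP 42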
.PP
If we had a dataset 'X.csv' and associated labels 'y.csv', and we wanted to
split these into 'X_train.csv', 'y_train.csv', 'X_test.csv', and 'y_test.csv',
with 30% of the data in the test set, we could run
.PP
$ \fBmlpack_preprocess_split\fP \fB--input_file\fP X.csv \fB--input_labels_file\fP y.csv
\fB--test_ratio\fP 0.3 \fB--training_file\fP X_train.csv \fB--training_labels_file\fP
y_train.csv \fB--test_file\fP X_test.csv \fB--test_labels_file\fP y_test.csv
.PP
To maintain the ratio of each class in the training and test sets, the '\fB--stratify_data\fP (\fB-z\fP)'
option can be used:
.PP
$ \fBmlpack_preprocess_split\fP \fB--input_file\fP X.csv \fB--training_file\fP X_train.csv
\fB--test_file\fP X_test.csv \fB--test_ratio\fP 0.4 \fB--stratify_data\fP
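.PP
Stratification is described on this page as working from the labels, so a
stratified run would typically also pass a labels file. The following is a
sketch that only combines options already shown above; whether labels are
strictly required for stratification is not stated here, so treat it as an
assumption:
.PP
$ \fBmlpack_preprocess_split\fP \fB--input_file\fP X.csv \fB--input_labels_file\fP y.csv
\fB--test_ratio\fP 0.3 \fB--training_file\fP X_train.csv \fB--training_labels_file\fP
y_train.csv \fB--test_file\fP X_test.csv \fB--test_labels_file\fP y_test.csv
\fB--stratify_data\fP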
.RE
.PP

.SH REQUIRED INPUT OPTIONS 

.TP
.B
\fB--input_file\fP (\fB-i\fP) [\fIunknown\fP]
Matrix containing data.  
.SH OPTIONAL INPUT OPTIONS 

.TP
.B
\fB--help\fP (\fB-h\fP) [\fIbool\fP]
Default help info. 
.TP
.B
\fB--info\fP [\fIstring\fP]
Print help on a specific option. Default value ''. 
.TP
.B
\fB--input_labels_file\fP (\fB-I\fP) [\fIunknown\fP]
Matrix containing labels. 
.TP
.B
\fB--no_shuffle\fP (\fB-S\fP) [\fIbool\fP]
Avoid shuffling the data before splitting. 
.TP
.B
\fB--seed\fP (\fB-s\fP) [\fIint\fP]
Random seed (0 for \fBstd::time\fP(NULL)). Default value 0. 
.TP
.B
\fB--stratify_data\fP (\fB-z\fP) [\fIbool\fP]
Stratify the data according to labels.
.TP
.B
\fB--test_ratio\fP (\fB-r\fP) [\fIdouble\fP]
Ratio of the dataset to use as the test set. Default value 0.2.
.TP
.B
\fB--verbose\fP (\fB-v\fP) [\fIbool\fP]
Display informational messages and the full list of parameters and timers at the end of execution. 
.TP
.B
\fB--version\fP (\fB-V\fP) [\fIbool\fP]
Display the version of mlpack.  
.SH OPTIONAL OUTPUT OPTIONS 

.TP
.B
\fB--test_file\fP (\fB-T\fP) [\fIunknown\fP]
Matrix to save test data to. 
.TP
.B
\fB--test_labels_file\fP (\fB-L\fP) [\fIunknown\fP]
Matrix to save test labels to. 
.TP
.B
\fB--training_file\fP (\fB-t\fP) [\fIunknown\fP]
Matrix to save training data to. 
.TP
.B
\fB--training_labels_file\fP (\fB-l\fP) [\fIunknown\fP]
Matrix to save training labels to.
.SH ADDITIONAL INFORMATION

For further information, including relevant papers, citations, and theory,
consult the documentation found at http://www.mlpack.org or included with your
distribution of mlpack.