mlpack_kmeans(1) | User Commands | mlpack_kmeans(1) |
NAME¶
mlpack_kmeans - k-means clustering
SYNOPSIS¶
mlpack_kmeans -c int -i string [-a string] [-e bool] [-P bool] [-I string] [-E bool] [-l bool] [-m int] [-p double] [-r bool] [-S int] [-s int] [-V bool] [-C string] [-o string] [-h -v]
DESCRIPTION¶
This program performs K-Means clustering on the given dataset. It can return the learned cluster assignments, and the centroids of the clusters. Empty clusters are not allowed by default; when a cluster becomes empty, the point furthest from the centroid of the cluster with maximum variance is taken to fill that cluster.
Optionally, the Bradley and Fayyad approach ("Refining initial points for k-means clustering", 1998) can be used to select initial points by specifying the '--refined_start (-r)' parameter. This approach works by taking random samplings of the dataset; to specify the number of samplings, the '--samplings (-S)' parameter is used, and to specify the percentage of the dataset to be used in each sample, the '--percentage (-p)' parameter is used (it should be a value between 0.0 and 1.0).
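The sampling step of the refined start can be pictured with a short sketch. It covers only how '--samplings' and '--percentage' determine the subsets that are drawn; clustering each subset and combining the results is left out. The function name, the 'rng_seed' argument, and the choice of sampling without replacement are assumptions made for illustration and are not taken from mlpack.

```python
import random

def draw_samplings(num_points, samplings=100, percentage=0.02, rng_seed=42):
    """Return `samplings` lists of point indices, each holding a
    `percentage` fraction of the dataset (hypothetical helper)."""
    rng = random.Random(rng_seed)
    # percentage is a fraction in [0.0, 1.0], as the option text requires.
    sample_size = max(1, int(percentage * num_points))
    # Assumption: each sample is drawn without replacement.
    return [rng.sample(range(num_points), sample_size) for _ in range(samplings)]

subsets = draw_samplings(num_points=5000)
print(len(subsets), len(subsets[0]))  # 100 100
```

With the default values (100 samplings, percentage 0.02), a dataset of 5000 points yields 100 subsets of 100 points each.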
There are several options available for the algorithm used for each Lloyd iteration, specified with the '--algorithm (-a)' option. The standard O(kN) approach can be used ('naive'). Other options include the Pelleg-Moore tree-based algorithm ('pelleg-moore'), Elkan's triangle-inequality based algorithm ('elkan'), Hamerly's modification to Elkan's algorithm ('hamerly'), the dual-tree k-means algorithm ('dualtree'), and the dual-tree k-means algorithm using the cover tree ('dualtree-covertree').
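To make the O(kN) cost of the 'naive' option concrete, the following is a minimal sketch of one Lloyd iteration in plain Python. It is an illustration of the textbook procedure, not mlpack's implementation; the function name and the handling of empty clusters (the old centroid is kept) are assumptions of the sketch.

```python
import math

def lloyd_iteration(points, centroids):
    """One naive Lloyd iteration: every point is compared with every
    centroid, giving k * N distance evaluations."""
    k = len(centroids)
    dim = len(points[0])
    assignments = []
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Assignment step: pick the closest of the k centroids.
        best = min(range(k), key=lambda j: math.dist(p, centroids[j]))
        assignments.append(best)
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += p[d]
    # Update step: each centroid becomes the mean of its points.
    # Assumption: an empty cluster keeps its previous centroid.
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return assignments, new_centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
assignments, centroids = lloyd_iteration(points, centroids)
print(assignments)  # [0, 0, 1, 1]
print(centroids)    # [[0.0, 0.5], [10.0, 10.5]]
```

The tree-based and triangle-inequality-based options aim to reach the same assignments while skipping many of these distance evaluations.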
The behavior for when an empty cluster is encountered can be modified with the '--allow_empty_clusters (-e)' option. When this option is specified and there is a cluster owning no points at the end of an iteration, that cluster's centroid will simply remain in its position from the previous iteration. If the '--kill_empty_clusters (-E)' option is specified, then when a cluster owns no points at the end of an iteration, the cluster centroid is simply filled with DBL_MAX, killing it and effectively reducing k for the rest of the computation. Note that the default option when neither empty cluster option is specified can be time-consuming to calculate; therefore, specifying either of these parameters will often accelerate runtime.
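The three empty-cluster behaviors can be compared side by side in a small sketch. This is an illustration under stated assumptions, not mlpack's code: the function and policy names are hypothetical, variance is taken as the mean squared distance to the centroid, and the donor cluster's centroid is left to be recomputed in the next iteration.

```python
import sys

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def handle_empty_cluster(points, assignments, centroids, empty, policy="default"):
    """Resolve the empty cluster with index `empty`, modifying the lists in place."""
    if policy == "allow":
        # Like --allow_empty_clusters: the centroid stays where it was.
        return
    if policy == "kill":
        # Like --kill_empty_clusters: fill with the largest double (DBL_MAX).
        centroids[empty] = [sys.float_info.max] * len(centroids[empty])
        return
    # Default: take the point furthest from the centroid of the
    # cluster with maximum variance and give it to the empty cluster.
    k = len(centroids)
    sse = [0.0] * k
    counts = [0] * k
    for p, a in zip(points, assignments):
        sse[a] += squared_dist(p, centroids[a])
        counts[a] += 1
    donor = max(range(k), key=lambda j: sse[j] / counts[j] if counts[j] else 0.0)
    members = [i for i, a in enumerate(assignments) if a == donor]
    far = max(members, key=lambda i: squared_dist(points[i], centroids[donor]))
    assignments[far] = empty
    centroids[empty] = list(points[far])

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 11.0), (10.0, 21.0)]
assignments = [0, 0, 1, 1, 1]
centroids = [[0.0, 1.0], [10.0, 14.0], [50.0, 50.0]]  # cluster 2 owns no points
handle_empty_cluster(points, assignments, centroids, empty=2)
print(assignments)   # [0, 0, 1, 1, 2]
print(centroids[2])  # [10.0, 21.0]
```

The default branch has to scan all points to find the donor cluster and its furthest member, which is consistent with the note that the other two options often run faster.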
Initial centroids may be specified using the '--initial_centroids_file (-I)' parameter, and the maximum number of iterations may be specified with the '--max_iterations (-m)' parameter.
As an example, to use Hamerly's algorithm to perform k-means clustering with k=10 on the dataset 'data.csv', saving the centroids to 'centroids.csv' and the assignments for each point to 'assignments.csv', the following command could be used:
$ mlpack_kmeans --input_file data.csv --clusters 10 --algorithm hamerly --output_file assignments.csv --centroid_file centroids.csv
To run k-means on that same dataset with initial centroids specified in 'initial.csv' with a maximum of 500 iterations, storing the output centroids in 'final.csv', the following command may be used:
$ mlpack_kmeans --input_file data.csv --initial_centroids_file initial.csv --clusters 10 --max_iterations 500 --centroid_file final.csv
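When the program is driven from a script, the command line can be assembled from the long option names documented below. The following sketch builds an argument list for the first example; the helper 'kmeans_command' is hypothetical, and it assumes that boolean options are passed as bare flags without a value.

```python
def kmeans_command(input_file, clusters, **options):
    """Build an argument list for mlpack_kmeans (hypothetical helper).
    Keyword names map to long options, e.g. max_iterations=500."""
    argv = ["mlpack_kmeans", "--input_file", input_file,
            "--clusters", str(clusters)]
    for name, value in options.items():
        if value is True:
            # Assumption: boolean options are bare flags.
            argv.append(f"--{name}")
        elif value is not False and value is not None:
            argv += [f"--{name}", str(value)]
    return argv

cmd = kmeans_command("data.csv", 10, algorithm="hamerly",
                     centroid_file="centroids.csv",
                     output_file="assignments.csv")
print(" ".join(cmd))
```

The resulting list can be handed to 'subprocess.run' from the Python standard library on a system where mlpack_kmeans is installed.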
REQUIRED INPUT OPTIONS¶
- --clusters (-c) [int]
- Number of clusters to find (0 autodetects from initial centroids).
- --input_file (-i) [string]
- Input dataset to perform clustering on.
OPTIONAL INPUT OPTIONS¶
- --algorithm (-a) [string]
- Algorithm to use for the Lloyd iteration ('naive', 'pelleg-moore', 'elkan', 'hamerly', 'dualtree', or 'dualtree-covertree'). Default value 'naive'.
- --allow_empty_clusters (-e) [bool]
- Allow empty clusters to persist.
- --help (-h) [bool]
- Default help info.
- --in_place (-P) [bool]
- If specified, a column containing the learned cluster assignments will be added to the input dataset file. In this case, --output_file is overridden. (Do not use in Python.)
- --info [string]
- Get help on a specific module or option. Default value ''.
- --initial_centroids_file (-I) [string]
- Start with the specified initial centroids. Default value ''.
- --kill_empty_clusters (-E) [bool]
- Remove empty clusters when they occur.
- --labels_only (-l) [bool]
- Only output labels into output file.
- --max_iterations (-m) [int]
- Maximum number of iterations before k-means terminates. Default value 1000.
- --percentage (-p) [double]
- Percentage of dataset to use for each refined start sampling (use when --refined_start is specified). Default value 0.02.
- --refined_start (-r) [bool]
- Use the refined initial point strategy by Bradley and Fayyad to choose initial points.
- --samplings (-S) [int]
- Number of samplings to perform for refined start (use when --refined_start is specified). Default value 100.
- --seed (-s) [int]
- Random seed. If 0, 'std::time(NULL)' is used. Default value 0.
- --verbose (-v) [bool]
- Display informational messages and the full list of parameters and timers at the end of execution.
- --version (-V) [bool]
- Display the version of mlpack.
OPTIONAL OUTPUT OPTIONS¶
- --centroid_file (-C) [string]
- If specified, the centroids of each cluster will be written to the given file. Default value ''.
- --output_file (-o) [string]
- Matrix to store output labels or labeled data to. Default value ''.
ADDITIONAL INFORMATION¶
For further information, including relevant papers, citations, and theory, consult the documentation found at http://www.mlpack.org or included with your distribution of mlpack.
18 November 2018 | mlpack-3.0.4 |