mlpack_kmeans(26 December 2016)
NAME
mlpack_kmeans - k-means clustering
SYNOPSIS
mlpack_kmeans [-h] [-v]
DESCRIPTION
This program performs K-Means clustering on the given dataset, storing the learned cluster assignments either as a column of labels in the file containing the input dataset or in a separate file. Empty clusters are not allowed by default; when a cluster becomes empty, the point furthest from the centroid of the cluster with maximum variance is taken to fill that cluster.
Optionally, the Bradley and Fayyad approach ("Refining initial points for k-means clustering", 1998) can be used to select initial points by specifying the --refined_start (-r) option. This approach works by taking random samples of the dataset; to specify the number of samples, the --samplings (-S) parameter is used, and to specify the portion of the dataset to be used in each sample, the --percentage (-p) parameter is used (it should be a fraction between 0.0 and 1.0).
There are several options available for the algorithm used for each Lloyd iteration, specified with the --algorithm (-a) option. The standard O(kN) approach can be used ('naive'). Other options include the Pelleg-Moore tree-based algorithm ('pelleg-moore'), Elkan's triangle-inequality based algorithm ('elkan'), Hamerly's modification to Elkan's algorithm ('hamerly'), the dual-tree k-means algorithm ('dualtree'), and the dual-tree k-means algorithm using the cover tree ('dualtree-covertree').
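To make the 'naive' option concrete, the following is a minimal Python sketch of one Lloyd iteration: every point is compared with every centroid, which is where the O(kN) cost comes from. It is an illustration only, not mlpack's implementation; the function name lloyd_iteration and the sample data are assumptions made for this example.

```python
import math


def lloyd_iteration(points, centroids):
    """One naive Lloyd iteration: assign every point, then recompute centroids.

    Illustrative only; this is not mlpack's implementation.
    """
    k = len(centroids)
    dim = len(centroids[0])

    # Assignment step: compare each of the N points with each of the k centroids.
    assignments = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        assignments.append(distances.index(min(distances)))

    # Update step: each centroid becomes the mean of the points assigned to it.
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, a in zip(points, assignments):
        counts[a] += 1
        for d in range(dim):
            sums[a][d] += p[d]

    new_centroids = []
    for j in range(k):
        if counts[j] == 0:
            # Empty cluster: this sketch simply keeps the old centroid.
            new_centroids.append(list(centroids[j]))
        else:
            new_centroids.append([s / counts[j] for s in sums[j]])
    return assignments, new_centroids


# Hypothetical two-dimensional data with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignments, centroids = lloyd_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
```

The other algorithms listed above aim to produce the same assignments while avoiding some of these distance computations.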
The behavior for when an empty cluster is encountered can be modified with the --allow_empty_clusters (-e) option. When this option is specified and there is a cluster owning no points at the end of an iteration, that cluster's centroid will simply remain in its position from the previous iteration. If the --kill_empty_clusters (-E) option is specified, then when a cluster owns no points at the end of an iteration, the cluster centroid is simply filled with DBL_MAX, killing it and effectively reducing k for the rest of the computation. Note that the default option when neither empty cluster option is specified can be time-consuming to calculate; therefore, specifying -e or -E will often accelerate runtime.
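The three empty-cluster behaviors described above can be sketched in Python as follows. This is one reading of the description, not mlpack's code: the function names, the policy labels 'allow' and 'kill', and the use of mean squared distance as the variance measure are all assumptions made for illustration.

```python
import math
import sys

DBL_MAX = sys.float_info.max  # Python's counterpart of C's DBL_MAX


def handle_empty_cluster(old_centroid, policy):
    """Return the centroid for a cluster that owns no points.

    policy is a hypothetical label: 'allow' mirrors -e, 'kill' mirrors -E.
    """
    if policy == "allow":
        # -e: the centroid stays where it was in the previous iteration.
        return list(old_centroid)
    if policy == "kill":
        # -E: fill the centroid with DBL_MAX so no point is ever nearest to it.
        return [DBL_MAX] * len(old_centroid)
    raise ValueError("unknown policy: " + policy)


def refill_empty_cluster(points, assignments, centroids, empty):
    """Default behavior: move the point furthest from the centroid of the
    maximum-variance cluster into the empty cluster."""
    k = len(centroids)
    sq_sums = [0.0] * k
    counts = [0] * k
    for p, a in zip(points, assignments):
        sq_sums[a] += math.dist(p, centroids[a]) ** 2
        counts[a] += 1

    # Only clusters with more than one point can give a point away.
    candidates = [j for j in range(k) if counts[j] > 1]
    if not candidates:
        return list(assignments)
    donor = max(candidates, key=lambda j: sq_sums[j] / counts[j])

    # Scanning every member of the donor cluster is the extra work that
    # makes the default slower than -e or -E.
    members = [i for i, a in enumerate(assignments) if a == donor]
    furthest = max(members, key=lambda i: math.dist(points[i], centroids[donor]))

    new_assignments = list(assignments)
    new_assignments[furthest] = empty
    return new_assignments


# Hypothetical one-dimensional data; cluster 2 currently owns no points.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (21.0,)]
assignments = [0, 0, 1, 1, 1]
centroids = [(0.5,), (14.0,), (100.0,)]
refilled = refill_empty_cluster(points, assignments, centroids, empty=2)
```

In this example cluster 1 has the largest variance, so its furthest point (21.0) is handed to the empty cluster.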
As of October 2014, the --overclustering option has been removed. If you want this support back, let us know by filing a bug at https://github.com/mlpack/mlpack/ or getting in touch through another means.
REQUIRED INPUT OPTIONS
- --clusters (-c) [int]
- Number of clusters to find (0 autodetects from initial centroids).
- --input_file (-i) [string]
- Input dataset to perform clustering on.
OPTIONAL INPUT OPTIONS
- --algorithm (-a) [string]
- Algorithm to use for the Lloyd iteration ('naive', 'pelleg-moore', 'elkan', 'hamerly', 'dualtree', or 'dualtree-covertree'). Default value 'naive'.
- --allow_empty_clusters (-e)
- Allow empty clusters to persist.
- --help (-h)
- Default help info.
- --in_place (-P)
- If specified, a column containing the learned cluster assignments will be added to the input dataset file. In this case, --output_file is overridden.
- --info [string]
- Get help on a specific module or option. Default value ''.
- --initial_centroids (-I) [string]
- Start with the specified initial centroids. Default value ''.
- --kill_empty_clusters (-E)
- Remove empty clusters when they occur.
- --labels_only (-l)
- Only output labels into output file.
- --max_iterations (-m) [int]
- Maximum number of iterations before k-means terminates. Default value 1000.
- --percentage (-p) [double]
- Percentage of dataset to use for each refined start sampling (use when --refined_start is specified). Default value 0.02.
- --refined_start (-r)
- Use the refined initial point strategy by Bradley and Fayyad to choose initial points.
- --samplings (-S) [int]
- Number of samplings to perform for refined start (use when --refined_start is specified). Default value 100.
- --seed (-s) [int]
- Random seed. If 0, 'std::time(NULL)' is used. Default value 0.
- --verbose (-v)
- Display informational messages and the full list of parameters and timers at the end of execution.
- --version (-V)
- Display the version of mlpack.
OPTIONAL OUTPUT OPTIONS
- --centroid_file (-C) [string]
- If specified, the centroids of each cluster will be written to the given file. Default value ''.
- --output_file (-o) [string]
- File to write output labels or labeled data to. Default value ''.
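As a usage illustration, the options above can be combined into a single invocation. The Python sketch below only assembles the argument list; the helper build_kmeans_command and the file names data.csv, labels.csv, and centroids.csv are hypothetical, while every flag it emits is taken from the option lists in this page.

```python
def build_kmeans_command(input_file, clusters, output_file=None,
                         centroid_file=None, algorithm=None, seed=None,
                         allow_empty=False):
    """Assemble an argument list for mlpack_kmeans.

    A hypothetical convenience helper; it is not part of mlpack.
    """
    # Required input options.
    cmd = ["mlpack_kmeans",
           "--input_file", input_file,
           "--clusters", str(clusters)]

    # Optional output options.
    if output_file is not None:
        cmd += ["--output_file", output_file]
    if centroid_file is not None:
        cmd += ["--centroid_file", centroid_file]

    # Optional input options.
    if algorithm is not None:
        cmd += ["--algorithm", algorithm]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if allow_empty:
        cmd.append("--allow_empty_clusters")
    return cmd


cmd = build_kmeans_command("data.csv", 3,
                           output_file="labels.csv",
                           centroid_file="centroids.csv",
                           algorithm="hamerly",
                           seed=42)
print(" ".join(cmd))
```

If mlpack is installed, the resulting list could be passed to subprocess.run(cmd, check=True). Setting a nonzero --seed keeps runs repeatable, since a seed of 0 falls back to the current time.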