NAME
kernel_pca - kernel principal components analysis
SYNOPSIS
kernel_pca [-h] [-v] -i string -k string -o string [-b double] [-c] [-D double] [-S double] [-d int] [-n] [-O double] [-s string] [-V]
DESCRIPTION
This program performs Kernel Principal Components Analysis (KPCA) on the
specified dataset with the specified kernel. This will transform the data onto
the kernel principal components, and optionally reduce the dimensionality by
ignoring the kernel principal components with the smallest eigenvalues.
For the case where a linear kernel is used, this reduces to regular PCA.
For example, the following will perform KPCA on the 'input.csv' file using the
Gaussian kernel and store the transformed data in the 'transformed.csv' file:
$ kernel_pca -i input.csv -k gaussian -o transformed.csv
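As a rough illustration of the computation described above, the following pure-Python sketch builds a kernel matrix, centers it in feature space, and extracts the first kernel principal component. It is a teaching sketch under stated assumptions, not mlpack's implementation: the helper names (kernel_matrix, center_kernel, top_component, first_component_scores) are hypothetical, and power iteration stands in for a full eigendecomposition.

```python
import math

def linear(x, y):
    """Linear kernel K(x, y) = x^T y."""
    return sum(a * b for a, b in zip(x, y))

def kernel_matrix(data, kernel):
    """Evaluate the kernel on every pair of points (n x n, symmetric)."""
    return [[kernel(x, y) for y in data] for x in data]

def center_kernel(K):
    """Center the kernel matrix in feature space (double centering)."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    total_mean = sum(row_means) / n
    # K is symmetric, so its column means equal its row means.
    return [[K[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]

def top_component(Kc, iterations=500):
    """Largest eigenvalue and unit eigenvector of Kc by power iteration."""
    n = len(Kc)
    v = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iterations):
        w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            return 0.0, [0.0] * n
        v = [a / norm for a in w]
    Kv = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, Kv))
    return eigenvalue, v

def first_component_scores(data, kernel):
    """Project each training point onto the first kernel principal component."""
    Kc = center_kernel(kernel_matrix(data, kernel))
    eigenvalue, v = top_component(Kc)
    # For a unit eigenvector v of Kc, the projections are sqrt(eigenvalue) * v.
    scale = math.sqrt(max(eigenvalue, 0.0))
    return eigenvalue, [scale * a for a in v]

# Three collinear points: with the linear kernel this is ordinary PCA.
data = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
eigenvalue, scores = first_component_scores(data, linear)
```

With the linear kernel the scores are the coordinates of the centered points along the line through them (up to an overall sign), which is what ordinary PCA would return for this data.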
The kernels that are supported are listed below:
- 'linear': the standard linear dot product (same as normal PCA):
  K(x, y) = x^T y
- 'gaussian': a Gaussian kernel; requires bandwidth:
  K(x, y) = exp(-(|| x - y || ^ 2) / (2 * (bandwidth ^ 2)))
- 'polynomial': polynomial kernel; requires offset and degree:
  K(x, y) = (x^T y + offset) ^ degree
- 'hyptan': hyperbolic tangent kernel; requires scale and offset:
  K(x, y) = tanh(scale * (x^T y) + offset)
- 'laplacian': Laplacian kernel; requires bandwidth:
  K(x, y) = exp(-(|| x - y ||) / bandwidth)
- 'epanechnikov': Epanechnikov kernel; requires bandwidth:
  K(x, y) = max(0, 1 - || x - y ||^2 / bandwidth^2)
- 'cosine': cosine distance:
  K(x, y) = 1 - (x^T y) / (|| x || * || y ||)
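The kernel formulas listed above can be transcribed directly into code. The sketch below does so in pure Python, for reference only: the function names and keyword arguments are assumptions chosen to mirror the option names on this page, and the formulas are copied from the list as written rather than taken from mlpack's source.

```python
import math

def dot(x, y):
    """Inner product x^T y."""
    return sum(a * b for a, b in zip(x, y))

def dist(x, y):
    """Euclidean distance || x - y ||."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linear(x, y):
    return dot(x, y)

def gaussian(x, y, bandwidth=1.0):
    return math.exp(-dist(x, y) ** 2 / (2.0 * bandwidth ** 2))

def polynomial(x, y, offset=0.0, degree=1.0):
    return (dot(x, y) + offset) ** degree

def hyptan(x, y, scale=1.0, offset=0.0):
    return math.tanh(scale * dot(x, y) + offset)

def laplacian(x, y, bandwidth=1.0):
    return math.exp(-dist(x, y) / bandwidth)

def epanechnikov(x, y, bandwidth=1.0):
    return max(0.0, 1.0 - dist(x, y) ** 2 / bandwidth ** 2)

def cosine(x, y):
    # Formula as given in the list above; undefined for zero-length vectors.
    return 1.0 - dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))
```

The default arguments mirror the default values documented for --bandwidth, --kernel_scale, --offset, and --degree.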
The parameters for each of the kernels should be specified with the options
--bandwidth, --kernel_scale, --offset, or --degree (or a combination of those
options).
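When the program is driven from a script, the pairing of kernels and parameter options can be written down as a small table. The following Python sketch assembles an argument list that way. The helper build_kernel_pca_args and the KERNEL_PARAMS table are hypothetical; rejecting options that a kernel does not use is a choice made for this sketch and is not documented behavior of kernel_pca.

```python
# Which parameter options each kernel takes, per the kernel list on this page.
KERNEL_PARAMS = {
    "linear": [],
    "gaussian": ["bandwidth"],
    "polynomial": ["offset", "degree"],
    "hyptan": ["kernel_scale", "offset"],
    "laplacian": ["bandwidth"],
    "epanechnikov": ["bandwidth"],
    "cosine": [],
}

def build_kernel_pca_args(input_file, kernel, output_file, **params):
    """Return an argv list for kernel_pca (hypothetical helper)."""
    if kernel not in KERNEL_PARAMS:
        raise ValueError("unknown kernel: " + kernel)
    allowed = KERNEL_PARAMS[kernel]
    unexpected = sorted(set(params) - set(allowed))
    if unexpected:
        raise ValueError("options not used by '%s': %s"
                         % (kernel, ", ".join(unexpected)))
    args = ["kernel_pca",
            "--input_file", input_file,
            "--kernel", kernel,
            "--output_file", output_file]
    # Emit options in a fixed order so the result is reproducible.
    for name in allowed:
        if name in params:
            args += ["--" + name, str(params[name])]
    return args

args = build_kernel_pca_args("input.csv", "polynomial", "transformed.csv",
                             degree=2, offset=1)
```

The resulting list could be passed to subprocess.run, assuming the kernel_pca binary is on the PATH.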
Optionally, the Nyström method ("Using the Nyström Method to Speed Up Kernel
Machines", 2001) can be used to approximate the kernel matrix by specifying
the --nystroem_method (-n) option. This approach uses a subset of the data as
a basis to reconstruct the kernel matrix. The --sampling (-s) option selects
the sampling scheme for the Nyström method: kmeans, random, or ordered.
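The idea of reconstructing the kernel matrix from a subset can be sketched as follows. Given m chosen landmark points, the standard Nyström approximation is K ≈ C W^-1 C^T, where C holds the kernel values between all points and the landmarks and W holds the kernel values among the landmarks. The pure-Python code below is a minimal illustration under that assumption; the function names are hypothetical, the landmark indices are passed in by hand rather than produced by the kmeans, random, or ordered schemes, and mlpack's implementation may differ in its details.

```python
import math

def gaussian(x, y, bandwidth=1.0):
    """Gaussian kernel as defined on this page."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def solve(A, B):
    """Solve A Z = B by Gauss-Jordan elimination with partial pivoting."""
    m = len(A)
    aug = [list(A[i]) + list(B[i]) for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("landmark kernel matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(m):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]

def nystroem(data, landmarks, kernel):
    """Approximate the full n x n kernel matrix from the landmark points."""
    n, m = len(data), len(landmarks)
    # C: kernel between every point and each landmark (n x m).
    C = [[kernel(x, data[j]) for j in landmarks] for x in data]
    # W: kernel among the landmarks only (m x m).
    W = [[kernel(data[i], data[j]) for j in landmarks] for i in landmarks]
    Ct = [list(col) for col in zip(*C)]  # transpose of C (m x n)
    Z = solve(W, Ct)                     # Z = W^-1 C^T
    return [[sum(C[i][k] * Z[k][j] for k in range(m)) for j in range(n)]
            for i in range(n)]

data = [[0.0], [1.0], [2.0], [3.0]]
approx = nystroem(data, [0, 2], gaussian)  # two landmarks out of four points
```

Only n * m kernel evaluations are needed instead of n * n, which is where the speedup comes from when m is much smaller than n. Rows and columns that correspond to landmark points are reproduced exactly; the remaining entries are approximations.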
REQUIRED OPTIONS
- --input_file (-i) [string]
- Input dataset to perform KPCA on.
- --kernel (-k) [string]
- The kernel to use; see the above documentation for the list of usable
kernels.
- --output_file (-o) [string]
- File to save modified dataset to.
OPTIONS
- --bandwidth (-b) [double]
- Bandwidth, for 'gaussian', 'laplacian', and 'epanechnikov' kernels. Default value 1.
- --center (-c)
- If set, the transformed data will be centered about the origin.
- --degree (-D) [double]
- Degree of polynomial, for 'polynomial' kernel. Default value 1.
- --help (-h)
- Default help info.
- --info [string]
- Get help on a specific module or option. Default value ''.
- --kernel_scale (-S) [double]
- Scale, for 'hyptan' kernel. Default value 1.
- --new_dimensionality (-d) [int]
- If not 0, reduce the dimensionality of the output dataset by ignoring the
dimensions with the smallest eigenvalues. Default value 0.
- --nystroem_method (-n)
- If set, the Nyström method will be used.
- --offset (-O) [double]
- Offset, for 'hyptan' and 'polynomial' kernels. Default value 0.
- --sampling (-s) [string]
- Sampling scheme to use for the Nyström method: 'kmeans', 'random', or
'ordered'. Default value 'kmeans'.
- --verbose (-v)
- Display informational messages and the full list of parameters and timers
at the end of execution.
- --version (-V)
- Display the version of mlpack.
For further information, including relevant papers, citations, and theory,
consult the documentation found at http://www.mlpack.org or included with your
distribution of mlpack.