.TH "nstutorial" 3 "Tue Sep 9 2014" "Version 1.0.10" "MLPACK" \" -*- nroff -*- .ad l .nh .SH NAME nstutorial \- NeighborSearch tutorial (k-nearest-neighbors) .SH "Introduction" .PP Nearest-neighbors search is a common machine learning task\&. In this setting, we have a \fBquery\fP and a \fBreference\fP dataset\&. For each point in the \fBquery\fP dataset, we wish to know the $k$ points in the \fBreference\fP dataset which are closest to the given query point\&. .PP Alternately, if the query and reference datasets are the same, the problem can be stated more simply: for each point in the dataset, we wish to know the $k$ nearest points to that point\&. .PP \fBmlpack\fP provides: .PP .IP "\(bu" 2 a \fBsimple command-line executable\fP to run nearest-neighbors search (and furthest-neighbors search) .IP "\(bu" 2 a \fBsimple C++ interface\fP to perform nearest-neighbors search (and furthest-neighbors search) .IP "\(bu" 2 a \fBgeneric, extensible, and powerful C++ class (NeighborSearch)\fP for complex usage .PP .SH "Table of Contents" .PP A list of all the sections this tutorial contains\&. .PP .IP "\(bu" 2 \fBIntroduction\fP .IP "\(bu" 2 \fBTable of Contents\fP .IP "\(bu" 2 \fBCommand-Line 'allknn'\fP .IP " \(bu" 4 \fBOne dataset, 5 nearest neighbors\fP .IP " \(bu" 4 \fBQuery and reference dataset, 10 nearest neighbors\fP .IP " \(bu" 4 \fBOne dataset, 3 nearest neighbors, leaf size of 15 points\fP .PP .IP "\(bu" 2 \fBThe 'AllkNN' class\fP .IP " \(bu" 4 \fB5 nearest neighbors on a single dataset\fP .IP " \(bu" 4 \fB10 nearest neighbors on a query and reference dataset\fP .IP " \(bu" 4 \fBNaive (exhaustive) search for 6 nearest neighbors on one dataset\fP .PP .IP "\(bu" 2 \fBThe extensible 'NeighborSearch' class\fP .IP " \(bu" 4 \fBSortPolicy policy class\fP .IP " \(bu" 4 \fBMetricType policy class\fP .IP " \(bu" 4 \fBTreeType policy class\fP .PP .IP "\(bu" 2 \fBFurther documentation\fP .PP .SH "Command-Line 'allknn'" .PP The simplest way to perform nearest-neighbors search in \fBmlpack\fP is to use the allknn executable\&. This program will perform nearest-neighbors search and place the resultant neighbors into one file and the resultant distances into another\&. The output files are organized such that the first row corresponds to the nearest neighbors of the first query point, with the first column corresponding to the nearest neighbor, and so forth\&. .PP Below are several examples of simple usage (and the resultant output)\&. The '-v' option is used so that output is given\&. Further documentation on each individual option can be found by typing .PP .PP .nf $ allknn --help .fi .PP .SS "One dataset, 5 nearest neighbors" .PP .nf $ allknn -r dataset\&.csv -n neighbors_out\&.csv -d distances_out\&.csv -k 5 -v [INFO ] Loading 'dataset\&.csv' as CSV data\&. [INFO ] Loaded reference data from 'dataset\&.csv'\&. [INFO ] Building reference tree\&.\&.\&. [INFO ] Trees built\&. [INFO ] Computing 5 nearest neighbors\&.\&.\&. [INFO ] Neighbors computed\&. [INFO ] Re-mapping indices\&.\&.\&. [INFO ] Saving CSV data to 'distances_out\&.csv'\&. [INFO ] Saving CSV data to 'neighbors_out\&.csv'\&. [INFO ] [INFO ] Execution parameters: [INFO ] distances_file: distances_out\&.csv [INFO ] help: false [INFO ] info: "" [INFO ] k: 5 [INFO ] leaf_size: 20 [INFO ] naive: false [INFO ] neighbors_file: neighbors_out\&.csv [INFO ] query_file: "" [INFO ] reference_file: dataset\&.csv [INFO ] single_mode: false [INFO ] verbose: true [INFO ] [INFO ] Program timers: [INFO ] computing_neighbors: 0\&.152495s [INFO ] total_time: 0\&.201274s [INFO ] tree_building: 0\&.005050s .fi .PP .PP Convenient program timers are given for different parts of the calculation at the bottom of the output, as well as the parameters the simulation was run with\&. Now, if we look at the output files: .PP .PP .nf $ head neighbors_out\&.csv 14,5,13,16,27 90,79,80,15,10 39,84,10,123,1 81,43,109,12,37 15,1,79,90,10 0,14,16,13,27 90,79,11,1,15 41,45,12,37,49 11,81,13,6,15 41,7,45,49,47 $ head distances_out\&.csv 7\&.09614421e-04,2\&.05940173e-03,4\&.05346068e-03,4\&.66175278e-03,1\&.09757665e-02 8\&.92190948e-04,1\&.69442242e-03,2\&.82750475e-03,4\&.06590850e-03,7\&.54169243e-03 5\&.91539406e-03,6\&.83482612e-03,8\&.02877800e-03,9\&.04907425e-03,1\&.61458442e-02 7\&.15652913e-03,9\&.18228524e-03,1\&.00540941e-02,1\&.07541171e-02,1\&.28892864e-02 5\&.37535983e-03,9\&.05721409e-03,9\&.89017184e-03,1\&.01457735e-02,1\&.14021593e-02 2\&.05940173e-03,5\&.14437192e-03,9\&.97483954e-03,1\&.02463627e-02,1\&.44355783e-02 4\&.27355419e-03,6\&.36750547e-03,6\&.72478577e-03,8\&.77323532e-03,1\&.04530549e-02 1\&.99935847e-03,3\&.88240331e-03,4\&.19118273e-03,9\&.30693568e-03,1\&.21237481e-02 2\&.15454276e-03,8\&.18895210e-03,1\&.18360450e-02,1\&.25135454e-02,1\&.27783327e-02 8\&.43087996e-03,1\&.22946325e-02,1\&.60472209e-02,1\&.88661413e-02,1\&.89727686e-02 .fi .PP .PP So, the nearest neighbor to point 0 is point 14, with a distance of 7\&.096144e-4\&. The second nearest neighbor to point 0 is point 5, with a distance of 2\&.059402e-3\&. The third nearest neighbor to point 5 is point 16, with a distance of 9\&.9748395e-3\&. .SS "Query and reference dataset, 10 nearest neighbors" .PP .nf $ allknn -q query_dataset\&.csv -r reference_dataset\&.csv -n neighbors_out\&.csv \ > -d distances_out\&.csv -k 10 -v [INFO ] Loading 'reference_dataset\&.csv' as CSV data\&. [INFO ] Loaded reference data from 'reference_dataset\&.csv'\&. [INFO ] Building reference tree\&.\&.\&. [INFO ] Loading 'query_dataset\&.csv' as CSV data\&. [INFO ] Query data loaded from 'query_dataset\&.csv'\&. [INFO ] Building query tree\&.\&.\&. [INFO ] Tree built\&. [INFO ] Computing 10 nearest neighbors\&.\&.\&. [INFO ] Neighbors computed\&. [INFO ] Re-mapping indices\&.\&.\&. [INFO ] Saving CSV data to 'distances_out\&.csv'\&. [INFO ] Saving CSV data to 'neighbors_out\&.csv'\&. [INFO ] [INFO ] Execution parameters: [INFO ] distances_file: distances_out\&.csv [INFO ] help: false [INFO ] info: "" [INFO ] k: 10 [INFO ] leaf_size: 20 [INFO ] naive: false [INFO ] neighbors_file: neighbors_out\&.csv [INFO ] query_file: query_dataset\&.csv [INFO ] reference_file: reference_dataset\&.csv [INFO ] single_mode: false [INFO ] verbose: true [INFO ] [INFO ] Program timers: [INFO ] computing_neighbors: 0\&.000081s [INFO ] total_time: 0\&.062828s [INFO ] tree_building: 0\&.004949s .fi .PP .SS "One dataset, 3 nearest neighbors, leaf size of 15 points" .PP .nf $ allknn -r dataset\&.csv -n neighbors_out\&.csv -d distances_out\&.csv -k 3 -l 15 -v [INFO ] Loading 'dataset\&.csv' as CSV data\&. [INFO ] Loaded reference data from 'dataset\&.csv'\&. [INFO ] Building reference tree\&.\&.\&. [INFO ] Trees built\&. [INFO ] Computing 3 nearest neighbors\&.\&.\&. [INFO ] Neighbors computed\&. [INFO ] Re-mapping indices\&.\&.\&. [INFO ] Saving CSV data to 'distances_out\&.csv'\&. [INFO ] Saving CSV data to 'neighbors_out\&.csv'\&. [INFO ] [INFO ] Execution parameters: [INFO ] distances_file: distances_out\&.csv [INFO ] help: false [INFO ] info: "" [INFO ] k: 3 [INFO ] leaf_size: 15 [INFO ] naive: false [INFO ] neighbors_file: neighbors_out\&.csv [INFO ] query_file: "" [INFO ] reference_file: dataset\&.csv [INFO ] single_mode: false [INFO ] verbose: true [INFO ] [INFO ] Program timers: [INFO ] computing_neighbors: 0\&.105119s [INFO ] total_time: 0\&.145321s [INFO ] tree_building: 0\&.005690s .fi .PP .PP Further documentation on options should be found by using the --help option\&. .SH "The 'AllkNN' class" .PP The 'AllkNN' class is, specifically, a typedef of the more extensible NeighborSearch class, querying for nearest neighbors using the Euclidean distance\&. .PP .PP .nf typedef NeighborSearch AllkNN; .fi .PP .PP Using the AllkNN class is particularly simple; first, the object must be constructed and given a dataset\&. Then, the method is run, and two matrices are returned: one which holds the indices of the nearest neighbors, and one which holds the distances of the nearest neighbors\&. These are of the same structure as the output --neighbors_file and --distances_file for the CLI interface (see above)\&. A handful of examples of simple usage of the AllkNN class are given below\&. .SS "5 nearest neighbors on a single dataset" .PP .nf #include using namespace mlpack::neighbor; // Our dataset matrix, which is column-major\&. extern arma::mat data; AllkNN a(data); // The matrices we will store output in\&. arma::Mat resultingNeighbors; arma::mat resultingDistances; a\&.Search(5, resultingNeighbors, resultingDistances); .fi .PP .PP The output of the search is stored in resultingNeighbors and resultingDistances\&. .SS "10 nearest neighbors on a query and reference dataset" .PP .nf #include using namespace mlpack::neighbor; // Our dataset matrices, which are column-major\&. extern arma::mat queryData, referenceData; AllkNN a(referenceData, queryData); // The matrices we will store output in\&. arma::Mat resultingNeighbors; arma::mat resultingDistances; a\&.Search(10, resultingNeighbors, resultingDistances); .fi .PP .SS "Naive (exhaustive) search for 6 nearest neighbors on one dataset" This example uses the O(n^2) naive search (not the tree-based search)\&. .PP .PP .nf #include using namespace mlpack::neighbor; // Our dataset matrix, which is column-major\&. extern arma::mat dataset; AllkNN a(dataset, true); // The matrices we will store output in\&. arma::Mat resultingNeighbors; arma::mat resultingDistances; a\&.Search(6, resultingNeighbors, resultingDistances); .fi .PP .PP Needless to say, naive search can be very slow\&.\&.\&. .SH "The extensible 'NeighborSearch' class" .PP The NeighborSearch class is very extensible, having the following template arguments: .PP .PP .nf template< typename SortPolicy = NearestNeighborSort, typename MetricType = mlpack::metric::EuclideanDistance, typename TreeType = mlpack::tree::BinarySpaceTree, QueryStat > > class NeighborSearch; .fi .PP .PP By choosing different components for each of these template classes, a very arbitrary neighbor searching object can be constructed\&. .SS "SortPolicy policy class" The SortPolicy template parameter allows specification of how the NeighborSearch object will decide which points are to be searched for\&. The \fBmlpack::neighbor::NearestNeighborSort\fP class is a well-documented example\&. A custom SortPolicy class must implement the same methods which NearestNeighborSort does: .PP .PP .nf static size_t SortDistance(const arma::vec& list, double newDistance); static bool IsBetter(const double value, const double ref); template static double BestNodeToNodeDistance(const TreeType* queryNode, const TreeType* referenceNode); template static double BestPointToNodeDistance(const arma::vec& queryPoint, const TreeType* referenceNode); static const double WorstDistance(); static const double BestDistance(); .fi .PP .PP The \fBmlpack::neighbor::FurthestNeighborSort\fP class is another implementation, which is used to create the 'AllkFN' typedef class, which finds the furthest neighbors, as opposed to the nearest neighbors\&. .SS "MetricType policy class" The MetricType policy class allows the neighbor search to take place in any arbitrary metric space\&. The \fBmlpack::metric::LMetric\fP class is a good example implementation\&. A MetricType class must provide the following functions: .PP .PP .nf // Empty constructor is required\&. MetricType(); // Compute the distance between two points\&. template double Evaluate(const VecType& a, const VecType& b); .fi .PP .PP Internally, the NeighborSearch class keeps an instantiated MetricType class (which can be given in the constructor)\&. This is useful for a metric like the Mahalanobis distance (\fBmlpack::metric::MahalanobisDistance\fP), which must store state (the covariance matrix)\&. Therefore, you can write a non-static MetricType class and use it seamlessly with NeighborSearch\&. .SS "TreeType policy class" The NeighborSearch class also allows a custom tree to be used\&. The standard MLPACK tree, \fBmlpack::tree::BinarySpaceTree\fP, is also highly extensible in its own right, and its documentation should be consulted for more information\&. Currently, the NeighborSearch tree requires a tree which only has left and right children, and no points in nodes (only in leaves), but this support is planned to be extended\&. .PP A simple usage of the TreeType policy could be to use a different type of bound with the tree\&. For instance, you could use a ball bound instead of a rectangular bound: .PP .PP .nf // Construct a NeighborSearch object with ball bounds\&. NeighborSearch< NearestNeighborSort, metric::EuclideanDistance, tree::BinarySpaceTree, QueryStat > > neighborSearch(dataset); .fi .PP .PP It is important to note that the NeighborSearch class requires use of the QueryStat tree statistic to function properly\&. Therefore, if you write a custom tree, be sure it can accept the QueryStat type\&. See the \fBmlpack::tree::BinarySpaceTree\fP documentation for more information on tree statistics\&. .SH "Further documentation" .PP For further documentation on the NeighborSearch class, consult the \fBcomplete API documentation\fP\&.