.TH "emst_tutorial" 3 "Tue Sep 9 2014" "Version 1.0.10" "MLPACK" \" -*- nroff -*- .ad l .nh .SH NAME emst_tutorial \- EMST Tutorial .SH "Introduction" .PP The Euclidean Minimum Spanning Tree problem is widely used in machine learning and data mining applications\&. Given a set $S$ of points in $\mathbf{R}^d$, our task is to compute lowest weight spanning tree in the complete graph on $S$ with edge weights given by the Euclidean distance between points\&. .PP Among other applications, the EMST can be used to compute hierarchical clusterings of data\&. A \fIsingle-linkage clustering\fP can be obtained from the EMST by deleting all edges longer than a given cluster length\&. This technique is also referred to as a \fIFriends-of-Friends\fP clustering in the astronomy literature\&. .PP MLPACK includes an implementation of \fBDual-Tree Boruvka\fP which uses $kd$-trees by default; this is the empirically and theoretically fastest EMST algorithm\&. In addition, the implementation supports the use of different trees via templates\&. For more details, see the following paper: .PP .PP .nf @inproceedings{march2010fast, title={Fast {E}uclidean minimum spanning tree: algorithm, analysis, and applications}, author={March, William B\&. and Ram, Parikshit and Gray, Alexander G\&.}, booktitle={Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10)}, pages={603--612}, year={2010}, organization={ACM} } .fi .PP .PP \fBmlpack\fP provides: .PP .IP "\(bu" 2 a \fBsimple command-line executable\fP to compute the EMST of a given data set .IP "\(bu" 2 a \fBsimple C++ interface\fP to compute the EMST .PP .SH "Table of Contents" .PP A list of all the sections this tutorial contains\&. .PP .IP "\(bu" 2 \fBIntroduction\fP .IP "\(bu" 2 \fBTable of Contents\fP .IP "\(bu" 2 \fBCommand-Line 'EMST'\fP .IP "\(bu" 2 \fBThe 'DualTreeBoruvka' class\fP .IP "\(bu" 2 \fBFurther documentation\fP .PP .SH "Command-Line 'EMST'" .PP The emst executable in \fBmlpack\fP will compute the EMST of a given set of points and store the resulting edge list to a file\&. .PP The output file contains an edge list representation of the MST in an $n-1 \times 3 $ matrix, where the first and second columns are labels of points and the third column is the edge weight\&. The edges are sorted in order of increasing weight\&. .PP Below are several examples of simple usage (and the resultant output)\&. The '-v' option is used so that verbose output is given\&. Further documentation on each individual option can be found by typing .PP .PP .nf $ emst --help .fi .PP .PP .PP .nf $ emst --input_file=dataset\&.csv --output_file=edge_list\&.csv -v [INFO ] Reading in data\&. [INFO ] Loading 'dataset\&.csv' as CSV data\&. [INFO ] Data read, building tree\&. [INFO ] Tree built, running algorithm\&. [INFO ] 4 edges found so far\&. [INFO ] 5 edges found so far\&. [INFO ] Total spanning tree length: 1002\&.45 [INFO ] Saving CSV data to 'edge_list\&.csv'\&. [INFO ] [INFO ] Execution parameters: [INFO ] help: false [INFO ] info: "" [INFO ] input_file: dataset\&.csv [INFO ] leaf_size: 1 [INFO ] naive: false [INFO ] output_file: edge_list\&.csv [INFO ] verbose: true [INFO ] [INFO ] Program timers: [INFO ] emst/mst_computation: 0\&.000179s [INFO ] emst/tree_building: 0\&.000061s [INFO ] total_time: 0\&.052641s .fi .PP .PP The code performs at most $\log N$ iterations for $N$ data points\&. It will print an update on the number of MST edges found after each iteration\&. Convenient program timers are given for different parts of the calculation at the bottom of the output, as well as the parameters the simulation was run with\&. .PP .PP .nf $ cat dataset\&.csv 0, 0 1, 1 3, 3 0\&.5, 0 1000, 0 1001, 0 $ cat edge_list\&.csv 0\&.0000000000e+00,3\&.0000000000e+00,5\&.0000000000e-01 4\&.0000000000e+00,5\&.0000000000e+00,1\&.0000000000e+00 1\&.0000000000e+00,3\&.0000000000e+00,1\&.1180339887e+00 1\&.0000000000e+00,2\&.0000000000e+00,2\&.8284271247e+00 2\&.0000000000e+00,4\&.0000000000e+00,9\&.9700451353e+02 .fi .PP .PP The input points are labeled 0-5\&. The output tells us that the MST connects point 0 to point 3, point 4 to point 5, point 1 to point 3, point 1 to point 2, and point 2 to point 4, with the corresponding edge weights given in the third column\&. The total length of the MST is also given in the verbose output\&. .PP Note that it is also possible to compute the EMST using a naive ( $O(N^2)$) algorithm for timing and comparison purposes\&. .SH "The 'DualTreeBoruvka' class" .PP The 'DualTreeBoruvka' class contains our implementation of the Dual-Tree Boruvka algorithm\&. .PP The class has two constructors: the first takes the data set, constructs the tree (where the type of tree constructed is the TreeType template parameter), and computes the MST\&. The second takes data set and an already constructed tree\&. .PP The class provides one method that performs the MST computation: .PP .nf void ComputeMST(const arma::mat& results); .fi .PP .PP This method stores the computed MST in the matrix results in the format given above\&. .SH "Further documentation" .PP For further documentation on the DualTreeBoruvka class, consult the \fBcomplete API documentation\fP\&.