.\" -*- mode: troff; coding: utf-8 -*-
.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
.ie n \{\
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "VNL-JOIN 1"
.TH VNL-JOIN 1 2024-05-27 "" vnlog
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH NAME
vnl\-join \- joins two log files on a particular field
.SH SYNOPSIS
.IX Header "SYNOPSIS"
.Vb 7
\& $ cat a.vnl
\& # a b
\& AA 11
\& bb 12
\& CC 13
\& dd 14
\& dd 123
\&
\& $ cat b.vnl
\& # a c
\& aa 1
\& cc 3
\& bb 4
\& ee 5
\& \- 23
\&
\& Try to join unsorted data on field \*(Aqa\*(Aq:
\& $ vnl\-join \-j a a.vnl b.vnl
\& # a b c
\& join: /dev/fd/5:3: is not sorted: CC 13
\& join: /dev/fd/6:3: is not sorted: bb 4
\&
\& Sort the data, and join on \*(Aqa\*(Aq:
\& $ vnl\-join \-\-vnl\-sort \- \-j a a.vnl b.vnl | vnl\-align
\& # a  b c
\& bb  12 4
\&
\& Sort the data, and join on \*(Aqa\*(Aq, ignoring case:
\& $ vnl\-join \-i \-\-vnl\-sort \- \-j a a.vnl b.vnl | vnl\-align
\& # a b c
\& AA 11 1
\& bb 12 4
\& CC 13 3
\&
\& Sort the data, and join on \*(Aqa\*(Aq. Also print the unmatched lines from both files:
\& $ vnl\-join \-a1 \-a2 \-\-vnl\-sort \- \-j a a.vnl b.vnl | vnl\-align
\& # a  b   c
\& \-   \-   23
\& AA   11 \-
\& CC   13 \-
\& aa  \-    1
\& bb   12  4
\& cc  \-    3
\& dd  123 \-
\& dd   14 \-
\& ee  \-    5
\&
\& Sort the data, and join on \*(Aqa\*(Aq. Print the unmatched lines from both files,
\& Output ONLY column \*(Aqc\*(Aq from the 2nd input:
\& $ vnl\-join \-a1 \-a2 \-o 2.c \-\-vnl\-sort \- \-j a a.vnl b.vnl | vnl\-align
\& # c
\& 23
\& \-
\& \-
\&  1
\&  4
\&  3
\& \-
\& \-
\&  5
.Ve
.SH DESCRIPTION
.IX Header "DESCRIPTION"
.Vb 7
\&  Usage: vnl\-join [join options]
\&                  [\-\-vnl\-sort \-|[sdfgiMhnRrV]+]
\&                  [ \-\-vnl\-[pre|suf]fix[1|2] xxx    |
\&                    \-\-vnl\-[pre|suf]fix xxx,yyy,zzz |
\&                    \-\-vnl\-autoprefix               |
\&                    \-\-vnl\-autosuffix ]
\&                  logfile1 logfile2
.Ve
.PP
This tool joins two vnlog files on a given field. \f(CW\*(C`vnl\-join\*(C'\fR is a
wrapper around the GNU coreutils \f(CW\*(C`join\*(C'\fR tool. Since this is a wrapper, most
commandline options and behaviors of the \f(CW\*(C`join\*(C'\fR tool are present; consult the
\&\fBjoin\fR\|(1) manpage for detail. The differences from GNU coreutils \f(CW\*(C`join\*(C'\fR are
.IP \(bu 4
The input and output to this tool are vnlog files, complete with a legend
.IP \(bu 4
The columns are referenced by name, not index. So instead of saying
.Sp
.Vb 1
\&  join \-j1
.Ve
.Sp
to join on the first column, you say
.Sp
.Vb 1
\&  join \-j time
.Ve
.Sp
to join on column "time".
.IP \(bu 4
\&\f(CW\-1\fR and \f(CW\-2\fR are supported, but \fImust\fR refer to the same field. Since
vnlog knows the identify of each field, it makes no sense for \f(CW\-1\fR and \f(CW\-2\fR
to be different. So pass \f(CW\*(C`\-j\*(C'\fR instead, it makes more sense in this context.
.IP \(bu 4
\&\f(CW\*(C`\-a\-\*(C'\fR is available as a shorthand for \f(CW\*(C`\-a1 \-a2\*(C'\fR: this is a full outer join,
printing unmatched records from both of the inputs. Similarly, \f(CW\*(C`\-v\-\*(C'\fR is
available as a shorthand for \f(CW\*(C`\-v1 \-v2\*(C'\fR: this will output \fIonly\fR the unique
records in both of the inputs.
.IP \(bu 4
\&\f(CW\*(C`vnl\-join\*(C'\fR\-specific options are available to adjust the field-naming in the
output:
.Sp
.Vb 8
\&  \-\-vnl\-prefix1
\&  \-\-vnl\-suffix1
\&  \-\-vnl\-prefix2
\&  \-\-vnl\-suffix2
\&  \-\-vnl\-prefix
\&  \-\-vnl\-suffix
\&  \-\-vnl\-autoprefix
\&  \-\-vnl\-autosuffix
.Ve
.Sp
See "Field names in the output" below for details.
.IP \(bu 4
A \f(CW\*(C`vnl\-join\*(C'\fR\-specific option \f(CW\*(C`\-\-vnl\-sort\*(C'\fR is available to sort the input
and/or output. See below for details.
.IP \(bu 4
By default we call the \f(CW\*(C`join\*(C'\fR tool to do the actual work. If the underlying
tool has a different name or lives in an odd path, this can be specified by
passing \f(CW\*(C`\-\-vnl\-tool TOOL\*(C'\fR
.IP \(bu 4
If no \f(CW\*(C`\-o\*(C'\fR is given, we output the join field, the remaining fields in
logfile1, the remaining fields in logfile2, .... This is what \f(CW\*(C`\-o auto\*(C'\fR does,
except we also handle empty vnlogs correctly.
.IP \(bu 4
\&\f(CW\*(C`\-e\*(C'\fR is not supported because vnlog uses \f(CW\*(C`\-\*(C'\fR to represent undefined fields.
.IP \(bu 4
\&\f(CW\*(C`\-\-header\*(C'\fR is not supported because vnlog assumes a specific header
structure, and \f(CW\*(C`vnl\-join\*(C'\fR makes sure that this header is handled properly
.IP \(bu 4
\&\f(CW\*(C`\-t\*(C'\fR is not supported because vnlog assumes whitespace-separated fields
.IP \(bu 4
\&\f(CW\*(C`\-\-zero\-terminated\*(C'\fR is not supported because vnlog assumes newline-separated
records
.IP \(bu 4
Rather than only 2\-way joins, this tool supports N\-way joins for any N > 2. See
below for details.
.PP
Past that, everything \f(CW\*(C`join\*(C'\fR does is supported, so see that man page for
detailed documentation. Note that all non-legend comments are stripped out,
since it's not obvious where they should end up.
.SS "Field names in the output"
.IX Subsection "Field names in the output"
By default, the field names in the output match those in the input. This is what
you want most of the time. It is possible, however that a column name adjustment
is needed. One common use case for this is if the files being joined have
identically-named columns, which would produce duplicate columns in the output.
Example: we fixed a bug in a program, and want to compare the results before and
after the fix. The program produces an x\-y trajectory as a function of time, so
both the bugged and the bug-fixed programs produce a vnlog with a legend
.PP
.Vb 1
\& # time x y
.Ve
.PP
Joining this on \f(CW\*(C`time\*(C'\fR will produce a vnlog with a legend
.PP
.Vb 1
\& # time x y x y
.Ve
.PP
which is confusing, and \fInot\fR what you want. Instead, we invoke \f(CW\*(C`vnl\-join\*(C'\fR as
.PP
.Vb 1
\& vnl\-join \-\-vnl\-suffix1 _buggy \-\-vnl\-suffix2 _fixed \-j time buggy.vnl fixed.vnl
.Ve
.PP
And in the output we get a legend
.PP
.Vb 1
\& # time x_buggy y_buggy x_fixed y_fixed
.Ve
.PP
Much better.
.PP
Note that \f(CW\*(C`vnl\-join\*(C'\fR provides several ways of specifying this. The above works
\&\fIonly\fR for 2\-way joins. An alternate syntax is available for N\-way joins, a
comma-separated list. The same could be expressed like this:
.PP
.Vb 1
\& vnl\-join \-a\- \-\-vnl\-suffix _buggy,_fixed \-j time buggy.vnl fixed.vnl
.Ve
.PP
Finally, if passing in structured filenames, \f(CW\*(C`vnl\-join\*(C'\fR can infer the desired
syntax from the filenames. The same as above could be expressed even simpler:
.PP
.Vb 1
\& vnl\-join \-\-vnl\-autosuffix \-j time buggy.vnl fixed.vnl
.Ve
.PP
This works by looking at the set of passed in filenames, and stripping out the
common leading and trailing strings.
.SS "Sorting of input and output"
.IX Subsection "Sorting of input and output"
The GNU coreutils \f(CW\*(C`join\*(C'\fR tool expects sorted columns because it can then take
only a single pass through the data. If the input isn't sorted, then we can use
normal shell substitutions to sort it:
.PP
.Vb 1
\& $ vnl\-join \-j key <(vnl\-sort \-s \-k key a.vnl) <(vnl\-sort \-s \-k key b.vnl)
.Ve
.PP
For convenience \f(CW\*(C`vnl\-join\*(C'\fR provides a \f(CW\*(C`\-\-vnl\-sort\*(C'\fR option. This allows the
above to be equivalently expressed as
.PP
.Vb 1
\& $ vnl\-join \-j key \-\-vnl\-sort \- a.vnl b.vnl
.Ve
.PP
The \f(CW\*(C`\-\*(C'\fR after the \f(CW\*(C`\-\-vnl\-sort\*(C'\fR indicates that we want to sort the \fIinput\fR
only. If we also want to sort the output, pass the short codes \f(CW\*(C`sort\*(C'\fR accepts
instead of the \f(CW\*(C`\-\*(C'\fR. For instance, to sort the input for \f(CW\*(C`join\*(C'\fR and to sort the
output numerically, in reverse, do this:
.PP
.Vb 1
\& $ vnl\-join \-j key \-\-vnl\-sort rg a.vnl b.vnl
.Ve
.PP
The reason this shorthand exists is to work around a quirk of \f(CW\*(C`join\*(C'\fR. The sort
order is \fIassumed\fR by \f(CW\*(C`join\*(C'\fR to be lexicographical, without any way to change
this. For \f(CW\*(C`sort\*(C'\fR, this is the default sort order, but \f(CW\*(C`sort\*(C'\fR has many options
to change the sort order, options which are sorely missing from \f(CW\*(C`join\*(C'\fR. A
real-world example affected by this is the joining of numerical data. If you
have \f(CW\*(C`a.vnl\*(C'\fR:
.PP
.Vb 4
\& # time a
\& 8 a
\& 9 b
\& 10 c
.Ve
.PP
and \f(CW\*(C`b.vnl\*(C'\fR:
.PP
.Vb 3
\& # time b
\& 9  d
\& 10 e
.Ve
.PP
Then you cannot use \f(CW\*(C`vnl\-join\*(C'\fR directly to join the data on time:
.PP
.Vb 6
\& $ vnl\-join \-j time a.vnl b.vnl
\& # time a b
\& join: /dev/fd/4:3: is not sorted: 10 c
\& join: /dev/fd/5:2: is not sorted: 10 e
\& 9 b d
\& 10 c e
.Ve
.PP
Instead you must re-sort both files lexicographically, \fIand\fR then (because you
almost certainly want to) sort it back into numerical order:
.PP
.Vb 5
\& $ vnl\-join \-j time <(vnl\-sort \-s \-k time a.vnl) <(vnl\-sort \-s \-k time b.vnl) |
\&   vnl\-sort \-s \-n \-k time
\& # time a b
\& 9 b d
\& 10 c e
.Ve
.PP
Yuck. The shorthand described earlier makes the interface part of this
palatable:
.PP
.Vb 4
\& $ vnl\-join \-j time \-\-vnl\-sort n a.vnl b.vnl
\& # time a b
\& 9 b d
\& 10 c e
.Ve
.PP
Note that the input sort is stable: \f(CW\*(C`vnl\-join\*(C'\fR will invoke \f(CW\*(C`vnl\-sort \-s\*(C'\fR.
If you want a stable post-sort, you need to ask for it with \f(CW\*(C`\-\-vnl\-sort
s...\*(C'\fR.
.SS "N\-way joins"
.IX Subsection "N-way joins"
The GNU coreutils \f(CW\*(C`join\*(C'\fR tool is inherently designed to join \fIexactly\fR two
files. \f(CW\*(C`vnl\-join\*(C'\fR extends this capability by chaining together a number of
\&\f(CW\*(C`join\*(C'\fR invocations to produce a generic N\-way join. This works exactly how you
would expect with the following caveats:
.IP \(bu 4
Full outer joins are supported by passing \f(CW\*(C`\-a\-\*(C'\fR, but no other \f(CW\*(C`\-a\*(C'\fR option is
supported. This is possible, but wasn't obviously worth the trouble.
.IP \(bu 4
\&\f(CW\*(C`\-v\*(C'\fR is not supported. Again, this is possible, but wasn't obviously worth the
trouble.
.IP \(bu 4
Similarly, \f(CW\*(C`\-o\*(C'\fR is not supported. This is possible, but wasn't obviously worth
the trouble, especially since the desired behavior can be obtained by
post-processing with \f(CW\*(C`vnl\-filter\*(C'\fR.
.SH "BUGS AND CAVEATS"
.IX Header "BUGS AND CAVEATS"
The underlying \f(CW\*(C`sort\*(C'\fR tool assumes lexicographic ordering, and matches fields
purely based on their textual contents. This means that for the purposes of
joining, \f(CW10\fR, \f(CW10.0\fR and \f(CW1.0e1\fR are all considered different. If needed,
you can normalize your keys with something like this:
.PP
.Vb 1
\& vnl\-filter \-p x=\*(Aqsprintf("%f",x)\*(Aq
.Ve
.SH COMPATIBILITY
.IX Header "COMPATIBILITY"
I use GNU/Linux\-based systems exclusively, but everything has been tested
functional on FreeBSD and OSX in addition to Debian, Ubuntu and CentOS. I can
imagine there's something I missed when testing on non-Linux systems, so please
let me know if you find any issues.
.SH "SEE ALSO"
.IX Header "SEE ALSO"
\&\fBjoin\fR\|(1)
.SH REPOSITORY
.IX Header "REPOSITORY"
https://github.com/dkogan/vnlog/
.SH AUTHOR
.IX Header "AUTHOR"
Dima Kogan \f(CW\*(C`<dima@secretsauce.net>\*(C'\fR
.SH "LICENSE AND COPYRIGHT"
.IX Header "LICENSE AND COPYRIGHT"
Copyright 2018 Dima Kogan \f(CW\*(C`<dima@secretsauce.net>\*(C'\fR
.PP
This library is free software; you can redistribute it and/or modify it under
the terms of the GNU Lesser General Public License as published by the Free
Software Foundation; either version 2.1 of the License, or (at your option) any
later version.