NAME¶
msort - sort records in complex ways
SYNOPSIS¶
msort <options> [<input file>]
DESCRIPTION¶
msort is a program for sorting text files in sophisticated ways. It was
developed initially for alphabetizing dictionaries of languages in which the
ordering may be quite different from English, but it has many other uses.
msort allows you to sort blocks of text delimited in a number of ways
rather than just lines, and to specify particular fields of a record as sort
keys either by their position, counted from either end, or by matching
regular expressions to their tags.
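The idea of block records and fields counted from either end can be sketched in Python. This is only an illustration of the concept, not msort's implementation; the helper names split_blocks and field_at are hypothetical, and whitespace-delimited fields are an assumption made for the example.

```python
import re

def split_blocks(text):
    """Split text into records separated by two or more newlines."""
    return [block for block in re.split(r"\n{2,}", text.strip("\n")) if block]

def field_at(record, position):
    """Return the field at a 1-based position; negative positions count from the end.

    Fields are assumed to be whitespace-delimited (an assumption of this sketch).
    """
    if position == 0:
        raise ValueError("field positions start at 1 or -1")
    fields = record.split()
    index = position - 1 if position > 0 else position
    return fields[index]

text = "zebra 3\nAfrica\n\napple 1 Norway\n\n\nmango 2 India\n"
records = split_blocks(text)

# Sort whole blocks on the first field, then on the last field.
by_first = sorted(records, key=lambda r: field_at(r, 1))
by_last = sorted(records, key=lambda r: field_at(r, -1))
```

Each record is kept intact, including its internal newline; only the selected field decides the order.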
msort is capable of sorting on multiple keys, so that when two records
tie on one key, the tie may be broken on another. Any or all keys may be
optional. How absent optional keys are ordered with respect to present keys
may be set separately for each key.
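A minimal Python sketch of tie-breaking across several keys, with a per-key rule for absent optional keys, might look as follows. It is not msort's code; the record layout (dictionaries) and the function names are assumptions chosen for illustration, with the symbols <, = and > taken from the -o option described under Key specific options.

```python
from functools import cmp_to_key

def compare_key(a, b, absent):
    """Compare two key values, either of which may be None (absent).

    absent is '<', '=' or '>': how an absent key ranks against a present one.
    """
    if a is None and b is None:
        return 0
    if a is None or b is None:
        rank = {"<": -1, "=": 0, ">": 1}[absent]
        return rank if a is None else -rank
    return (a > b) - (a < b)

def compare_records(r1, r2, keys):
    """Compare on each key in turn; later keys only break ties on earlier ones."""
    for name, absent in keys:
        result = compare_key(r1.get(name), r2.get(name), absent)
        if result != 0:
            return result
    return 0

records = [
    {"surname": "Smith", "given": "Zoe"},
    {"surname": "Smith"},
    {"surname": "Jones", "given": "Al"},
    {"surname": "Smith", "given": "Amy"},
]
# The absent "given" key sorts after every present one.
keys = [("surname", "<"), ("given", ">")]
ordered = sorted(records, key=cmp_to_key(lambda a, b: compare_records(a, b, keys)))
```

With the rule ">" on the second key, the record lacking a given name lands after the other Smith records; with "<" it would come first among them.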
msort allows you to specify arbitrary sort orders and to define virtually
unlimited numbers of multigraphs of effectively unlimited length. The sort
order and multigraphs are defined separately for each key. If your system has
locale support, you can also use locale collation rules instead of specifying
your own sort order.
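To illustrate what a custom sort order with multigraphs means, here is a small Python sketch. It does not reflect msort's sort order file format or its internals; the functions tokenize and make_key are hypothetical, and the example order is the traditional Spanish alphabet, in which "ch" and "ll" are treated as single letters.

```python
def tokenize(word, multigraphs):
    """Split a word into collation units, preferring the longest multigraph."""
    longest_first = sorted(multigraphs, key=len, reverse=True)
    units = []
    i = 0
    while i < len(word):
        for m in longest_first:
            if word.startswith(m, i):
                units.append(m)
                i += len(m)
                break
        else:
            # No multigraph matches here, so take a single character.
            units.append(word[i])
            i += 1
    return units

def make_key(order):
    """Build a sort key function from a list of collation units in rank order."""
    rank = {unit: n for n, unit in enumerate(order)}
    multigraphs = [u for u in order if len(u) > 1]

    def key(word):
        # Raises KeyError for characters missing from the order (sketch only).
        return [rank[u] for u in tokenize(word, multigraphs)]

    return key

order = list("abc") + ["ch"] + list("defghijkl") + ["ll"] + list("mnñopqrstuvwxyz")
words = ["chico", "cuna", "dama", "llama", "luz", "casa"]
custom = sorted(words, key=make_key(order))
```

Under this order "cuna" precedes "chico" and "luz" precedes "llama", whereas a plain code point sort puts them the other way round.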
msort provides twelve types of key comparison: lexicographic, numeric,
numeric string, hybrid, by string length, by angle, by date, by domain name,
by time, by ISO8601 date/time stamp, by month name, and random.
Which month names are recognized depends on the -s flag. If the -s flag is
used on the same key and its argument is the name of a file, the month names
are read from the file, which should be in the same format as a sort order
definition file. If the -s flag is used and its argument is a locale name,
the month names recognized will be the month names and abbreviations
associated with the specified locale. If the -s flag is not used, the month
names recognized will be the month names and abbreviations associated with
the current locale. If your system does not have locale support and you do
not use the -s flag to read the month names from a file, the month names
recognized will be the English month names and abbreviations.
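As a rough illustration of month-name comparison in the English fallback case, the following Python sketch ranks full names and three-letter abbreviations by calendar position. It is not msort's implementation; case-insensitive matching and the tolerance for a trailing period are assumptions of this sketch.

```python
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]

# Map each full name and its three-letter abbreviation to a month number.
MONTH_RANK = {}
for number, name in enumerate(MONTHS, start=1):
    MONTH_RANK[name] = number
    MONTH_RANK[name[:3]] = number

def month_key(text):
    """Return the calendar position of an English month name or abbreviation."""
    return MONTH_RANK[text.strip().rstrip(".").lower()]

in_calendar_order = sorted(["Mar", "January", "dec", "Feb."], key=month_key)
```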
msort can reverse the characters in a key, allowing it to be used to
generate reverse dictionaries.
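The effect of reversing a key can be shown with a short Python sketch, which is an illustration of the idea rather than of msort itself. Words are ordered by their endings, so words with the same suffix end up next to each other.

```python
words = ["walking", "table", "running", "cable", "sing"]

# Compare the reversed spelling, but output the words unchanged.
reverse_sorted = sorted(words, key=lambda w: w[::-1])
```

Note that this sketch reverses code points naively; text with combining characters would need more care.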
A choice of sorting algorithms is provided.
msort fully supports Unicode. The text to be sorted, and all
specifications, should be in UTF-8 Unicode. (If you have plain ASCII text,
this is not a problem as ASCII is a subset of Unicode.) Full Unicode
case-folding is available, in Turkic and non-Turkic variants. Unicode
normalization is performed before sorting.
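Why normalization and case folding matter for sorting can be seen in a small Python sketch using the standard unicodedata module. This is an illustration only; turkic_fold is a hypothetical helper that applies the two Turkic mappings (I to dotless ı, and İ to i) before Python's ordinary, non-Turkic case folding.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The two spellings look identical but compare unequal until normalized.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

def turkic_fold(s):
    """Case-fold with the Turkic mappings for dotted and dotless i."""
    return s.replace("I", "\u0131").replace("\u0130", "i").casefold()

plain = "ISIK".casefold()
turkic = turkic_fold("ISIK")
```

Without normalization, two records that differ only in how an accented letter is encoded could sort into different places.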
For usage information, execute msort with no arguments.
Full information about msort is currently to be found in the reference
manual, which is distributed as a PDF (Portable Document Format) file. If a
copy is not available locally, you can download it from msort's home page:
http://billposer.org/Software/msort.html
OPTIONS¶
- -h,--help
- Print usage message
- -v,--version
- Print version message
- -D,--defaults
- List defaults
- -F,--general-options
- List general command line options
- -G,--gnu-equivalences
- List equivalents for GNU sort command line options.
- -H,--informational-options
- List informational command line options
- -K,--key-specific-options
- List key-specific command line options
- -L,--limits
- List limits
- -N,--number-systems
- List the supported number systems.
General options¶
- -b,--block
- A record is terminated by two or more newlines
- -l,--line
- A record consists of a single line
- -r,--record-separator <separator>
- A record is terminated by the specified separator character
- -O,--fixed-size-record <bytes>
- A record consists of the specified number of bytes.
- -d,--field-separators <character>+
- Fields are delimited by the named character(s)
- -w,--whole
- Sort on the entire text of the record
- -a,--algorithm <algorithm>
- Use the specified sort algorithm. The choices are: I(nsertionSort),
M(ergeSort), Q(uickSort), and S(hellSort). Note that InsertionSort and
MergeSort are stable, while QuickSort and ShellSort are unstable. The
default is QuickSort.
- -M,--initial-maximum-records <records>
- Set initial maximum number of records
- -m,--line-end-carriage-return
- End-of-line in the input data is marked by Carriage Return (0x0D) as on
the Macintosh rather than by Line Feed (0x0A) as on Unix systems.
- -I,--invert-globally
- Invert sense of comparisons globally
- -B,--BMP
- No characters fall outside the Basic Multilingual Plane (that is, none
have values greater than 0xFFFF).
- -Z,--skip-first-record
- Copy the first record in the input to the output without sorting it. This
is useful for sorting files with a header.
- -p,--reserve-private-use-area
- Do not make internal use of the Private Use areas. By default, multigraphs
are assigned internally to codepoints in the Supplementary Private Use
areas if full Unicode is in use or to codepoints in the Private Use area
if input is restricted to the Basic Multilingual Plane by means of the
-B option. If your input makes use of the Private Use areas, this
option prevents interference with your input. In this case, multigraphs
will be assigned to the Low and High Surrogate areas (0xD800-0xDFFF). Note
that this limits the number of multigraphs to 2,048.
- -P,--random-seed <seed>
- Set the seed for the random number generator. If not set here, it is set
to a value determined by the time. The seed used is reported in the log.
This option allows runs to be replicated.
- -Q,--check-only
- Check whether the input is already sorted. Do not generate any output.
Exit status is 0 if input is already sorted, 11 if not sorted.
- -1,--in <input file name>
- -2,--out <output file name>
- If the output file is the same as the input file, the input file will be
overwritten. The input file will not be overwritten if the run is
unsuccessful.
- -j,--suppress-log
- Suppress output to the log. If this flag is given before there is any
output to the log from a command line flag, nothing will be written to the
log and the log file will not be created. If a command line flag generates
a log message before this flag is processed, the log file will be created
but no log messages will be written to it once this flag is processed. To
guarantee that no attempt will be made to open a log file, give this flag
first.
- -q,--quiet
- Be quiet - do not chat while working
- -u,--unicode-normalization <mode>
- Select Unicode normalization mode. The choices of mode are: c for
normalization form C (NFC), d for normalization form D (NFD),
C for normalization form KC (NFKC), D for normalization form
KD (NFKD), and n for no normalization. The default is NFC.
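Two of the general options above, -Z and -Q, can be pictured with the following Python sketch. It is not msort's code: the function names are hypothetical, records are compared as plain strings, and only the exit status values 0 and 11 come from the option description.

```python
def sort_keeping_header(records):
    """Copy the first record through unsorted and sort the rest (compare -Z)."""
    if not records:
        return []
    return [records[0]] + sorted(records[1:])

def check_sorted(records):
    """Return 0 if the records are already in order, 11 otherwise (compare -Q)."""
    in_order = all(a <= b for a, b in zip(records, records[1:]))
    return 0 if in_order else 11

table = ["name,qty", "pear,4", "apple,7", "fig,2"]
with_header = sort_keeping_header(table)
status_before = check_sorted(table[1:])
status_after = check_sorted(with_header[1:])
```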
Key specific options¶
- -e,--character-range <m,n>
- Sort on characters m through n. Positive indices start from one. Negative
indices indicate position with respect to the end of the record. For
example, the range 3,-2 consists of the third character through the
next-to-last character.
- -n,--position <POS>(,<POS>)
- Sort on the specified POS or contiguous range of POSs, where a POS is of
the form <field number>(.<character number>). Both counts
begin at one. Field numbers but not character numbers may be negative, in
which case they are counted from the right. Thus, 1.2 is the second
character of the first field; -2.1 is the first character of the next to
last field.
- -t,--tag <tag regexp>
- Sort on the field with the specified tag
- -o,--optional <comparison>
- The key is optional. If it is absent, the record compares as less than
(<), equal to (=), or greater than (>) records in which the key is present.
- -C,--fold-case
- Fold case
- -z,--fold-case-turkic
- Fold case with additional Turkic conversions.
- -c,--comparison-type <comparison type>
- a(ngle), l(exicographic), i(so8601 date/time), t(ime), D(omain name/email
address), d(ate), m(onth name), n(umeric), N(umeric string), s(ize),
h(ybrid), r(andom)
- -y,--number-system <number system>
- Specifies the number system expected for this key. This affects only
numeric and numeric string keys. There are two special values. If the
number system is "all", records may contain any number system
that msort can interpret. Different records may contain different number
systems. If the number system is "any", records may contain any
number system that msort can interpret, but all records must make use of
the same number system. msort sets the number system on the basis
of the first record.
- -f,--date-format <date format>
- Permutation of ymd with separators, e.g. y-m-d for international date
format, m/d/y for American date format, or a permutation of yd with
separators, e.g. y-d, for day-of-year dates. All three components may be
numbers in any available number system. The month field may also be a
month name, determined by the same devices as independent month name
fields.
- -W,--sort-order-file-separators <file name>
- Read the list of characters to be treated as separators in the sort order
definition file.
- -S,--substitutions <file name>
- Read substitutions from named file
- -s,--sort-order <file name>|<locale name>|"locale"
- If the argument is a file name, it is taken to be a sort order file and
the sort order for the key is read from the file. If the argument is a
locale name, the collation rules for that locale are used. If the argument
is "locale", the collation rules for the current locale are
used.
- -T,--transformations <(d)(e)(s)>
- Apply the specified transformations. d specifies that diacritics
are to be stripped. Separately encoded combining diacritics are removed.
Characters with diacritics represented by single codepoints are replaced
with the corresponding ASCII character without the diacritics, if there is
one. e specifies that enclosed characters, that is, characters
within circles or parentheses, are to be replaced with the corresponding
plain ASCII character if there is one. s specifies that characters
in special styles are to be replaced with the corresponding plain ASCII
character if there is one. Stylistic equivalents include: small capitals
(e.g. U+1D04), script forms (e.g. U+212C), black letter forms (e.g.
U+212D), Arabic presentation forms (e.g. U+FE81), Hebrew presentation
forms (e.g. U+FB1D), fullwidth forms (e.g. U+FF01), halfwidth forms (e.g.
U+FF7B), and the mathematical alphanumeric symbols (e.g. U+1D400).
- -x,--exclusion-file <file name>
- Read exclusions from named file
- -X,--exclude-characters <exclusions>
- Exclude specified characters
- -i,--invert-locally
- Invert sense of comparisons
- -R,--reverse-key
- Reverse characters of key
- -A,--first-character-only
- Ignore all but the first character of the field, after substitutions,
exclusions, etc.
Note: long options may not be available on your system.
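The index conventions of -e and -n described above can be sketched in Python as follows. This only illustrates the arithmetic of one-based and negative indices; char_range and char_at are hypothetical names, and the fields are assumed to have been split already.

```python
def char_range(text, m, n):
    """Characters m through n, 1-based and inclusive; negatives count from the end."""
    length = len(text)
    start = m if m > 0 else length + m + 1
    end = n if n > 0 else length + n + 1
    return text[start - 1:end]

def char_at(fields, field_no, char_no=1):
    """Locate a POS: 1-based character of a field; field_no may be negative."""
    field = fields[field_no - 1] if field_no > 0 else fields[field_no]
    return field[char_no - 1]

# The range 3,-2: third character through next-to-last character.
middle = char_range("abcdefg", 3, -2)

fields = ["alpha", "beta", "gamma"]
second_of_first = char_at(fields, 1, 2)          # POS 1.2
first_of_next_to_last = char_at(fields, -2, 1)   # POS -2.1
```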
SEE ALSO¶
sort(1), uninum(3)
AUTHOR¶
Bill Poser (billposer@alum.mit.edu)
LICENSE¶
GNU General Public License (http://www.gnu.org/licenses/gpl.html), version 3.