.TH UTFCHECK 1 "2018 Sep 01" UTFCHECK "User Commands"
.SH NAME
utfcheck \- Check a file to verify that it is valid UTF-8 or ASCII
.SH SYNOPSIS
.br
.B utfcheck
[\-a] [\-q] [\-\-expurgated] [\-i \fIinput_file.beta\fP] [\-o \fIoutput_file.utf8\fP]
.SH DESCRIPTION
\fButfcheck\fP(1)
reads an input file and prints messages about contents that might
be unexpected (even if legal Unicode) in a UTF-8 or ASCII file,
such as embedded control characters or Unicode "noncharacters".
No diagnostic messages are printed for the control characters
horizontal tab, vertical tab, line feed, or form feed.  The final
summary indicates whether null, carriage return, or escape
characters were read.
.PP
.B utfcheck
detects a UTF-16 big-endian or little-endian Byte Order Mark
at the beginning of a file and exits with a fatal error if it finds one.
UTF-16 files are not parsed beyond this initial detection of the
Byte Order Mark.
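.PP
As a rough illustration of this kind of Byte Order Mark detection,
the leading bytes of a file can be compared against the encoded forms
of U+FEFF.  The following Python sketch is not taken from the program's
source; the function name sniff_bom is hypothetical, and the labels
simply mirror the character set names used elsewhere in this page.
.PP
.nf
```python
def sniff_bom(data):
    """Return a label for the Byte Order Mark at the start of data, or None."""
    # Encoded forms of U+FEFF in UTF-8 and in both UTF-16 byte orders.
    signatures = [
        (bytes([0xEF, 0xBB, 0xBF]), "UTF-8"),
        (bytes([0xFE, 0xFF]), "UTF-16-BE"),
        (bytes([0xFF, 0xFE]), "UTF-16-LE"),
    ]
    for prefix, label in signatures:
        if data.startswith(prefix):
            return label
    return None
```
.fi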
.SH OPTIONS
.TP 6
\-a
Test for a pure ASCII file.  ASCII control characters are allowed,
but \fButfcheck\fP will fail if it encounters a byte with value
greater than hexadecimal 7F (the delete control character).
.TP
\-i
Specify the input file. The default is STDIN.
.TP
\-o
Specify the output file. The default is STDOUT.
.TP
\-q
Quiet mode.  Do not print any output unless an illegal byte sequence
is detected.
.TP
\-\-expurgated
Check a UTF-8 file against the "expurgated" version of the Unicode Standard,
the one without the Byte Order Mark, after Monty Python's "Bookshop"
skit with the "expurgated" version of \fIOlsen's Standard Book of
British Birds,\fP the one without the gannet\(embecause the customer
didn't like them.  (But they've all got the Byte Order Mark.  It's a
standard part of the Unicode Standard, the Byte Order Mark.  It's in
all the books.)\| This option is not abbreviated, to keep the user mindful
of the questionable nature of testing for the lack of something even though
it is a legitimate part of the Unicode Standard.  \fButfcheck\fP will fail
if this option is selected and the UTF-8 Byte Order Mark (officially the
zero width no-break space) is detected anywhere in the input file.
.PP
Sample usage:
.PP
.RS
utfcheck \-i \fImy_input_file.txt\fP \-o \fImy_output_file.log\fP
.RE
.SH MESSAGES
.SS "IMMEDIATE MESSAGES"
Certain uncommon characters are reported as soon as they are encountered.
Some of these conditions are fatal errors and others are not, as noted
for each message below.
.TP 5
.B ASCII-CONTROL: U+\fInnnn\fP
The file contains an ASCII control character in the range U+0001 through
U+001F, inclusive, other than Horizontal Tab, Line Feed, Vertical Tab,
Form Feed, New Line, or Carriage Return; or the file contains the
Delete character (U+007F).
.TP
.B ASCII-NULL
The file contains an ASCII NULL character (U+0000).
.TP
.B BINARY-DATA: 0x\fInn\fP
The file contains a byte value that is not part of a well-formed UTF-8
character.  This is considered a fatal error and the program will terminate
with exit status EXIT_FAILURE.
.TP
.B NON-ASCII-DATA: 0x\fInn\fP
The \fB\-a\fP (ASCII only) option was selected and the file contains non-ASCII
data (i.e., a byte with the high bit set).  This is considered a fatal error
and the program will terminate with exit status EXIT_FAILURE.
.TP
.B SURROGATE-PAIR-CODE-POINT: 0x\fInn\|.\|.\|.\fP (U+\fInnnn\fP)
The file contains a Unicode surrogate code point (U+D800 through U+DFFF,
inclusive) encoded as UTF-8.  Surrogate code points are reserved for
UTF-16, so they should never appear in a UTF-8 file.
The byte values are printed first, followed by the decoded Unicode
code point in parentheses.
This is considered a fatal error and the program will terminate with
exit status EXIT_FAILURE.
.TP
.B UTF-16-BE: Unsupported
The file begins with a big-endian UTF-16 Byte Order Mark.
Because \fButfcheck\fP does not support UTF-16, this is considered
a fatal error and the program will terminate with exit status EXIT_FAILURE.
.TP
.B UTF-16-LE: Unsupported
The file begins with a little-endian UTF-16 Byte Order Mark.
Because \fButfcheck\fP does not support UTF-16, this is considered
a fatal error and the program will terminate with exit status EXIT_FAILURE.
.TP
.B UTF-8-BOM-BEGIN
The file begins with a Byte Order Mark (U+FEFF) in UTF-8 form.
If the \fB\-\-expurgated\fP option is selected and this condition
is detected, this is considered a fatal error and the program will
terminate with exit status EXIT_FAILURE; otherwise, the program continues.
.TP
.B UTF-8-BOM-EMBEDDED
The file contains a Byte Order Mark (U+FEFF) after the start of the file.
If the \fB\-\-expurgated\fP option is selected and this condition
is detected, this is considered a fatal error and the program will
terminate with exit status EXIT_FAILURE; otherwise, the program continues.
.TP
.B UTF-8-CONTROL: 0x\fInn\|.\|.\|.\fP (U+\fInnnn\fP)
The file contains a C1 control character (U+0080 through U+009F, inclusive)
encoded as UTF-8.
The byte values are printed first, followed by the decoded Unicode
code point in parentheses.
.TP
.B UTF-8-NONCHARACTER: 0x\fInn\|.\|.\|.\fP (U+\fInnnn\fP)
The file contains a Unicode "noncharacter".  This can be a code point
in the range U+FDD0 through U+FDEF, inclusive, or one of the last two
code points of any Unicode plane from Plane 0 through Plane 16, inclusive
(for example, U+FFFE and U+FFFF in Plane 0).
The byte values are printed first, followed by the decoded Unicode
code point in parentheses.
A noncharacter is allowed in well-formed Unicode text,
so this condition is not considered an error.
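.PP
The code point ranges named in the messages above can be expressed as
simple comparisons.  The following Python sketch is illustrative only;
it is not taken from the program's source, and the function names are
hypothetical.
.PP
.nf
```python
def is_surrogate(cp):
    """True for U+D800 through U+DFFF, which are reserved for UTF-16."""
    return 0xD800 <= cp <= 0xDFFF


def is_c1_control(cp):
    """True for the control characters U+0080 through U+009F."""
    return 0x80 <= cp <= 0x9F


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # In Planes 0 through 16 the low 16 bits are FFFE or FFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE
```
.fi
.PP
Counting every code point for which is_noncharacter is true gives 66,
which matches the number of noncharacters defined by the Unicode Standard.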
.SS "END OF FILE SUMMARY"
If the \fB\-q\fP option is not selected and the program has not encountered
a fatal error before reaching the end of the input stream, \fButfcheck\fP
prints a summary of the file contents after the input stream has reached
its end.  This will begin with the line "FILE-SUMMARY:".  This is followed by
a line beginning with "Character-Set: " followed by one of "ASCII", "UTF-8",
"UTF-16-BE" (UTF-16 Big Endian), "UTF-16-LE" (UTF-16 Little Endian),
or "BINARY".  (Note that UTF-16 parsing is not currently implemented,
so the UTF-16-BE and UTF-16-LE types will not appear in this final summary
at present.)  The following messages can appear in this end of file summary
if the program encountered the corresponding types of Unicode code points.
.TP 5
.B BOM-AT-START
The file begins with a UTF-8 Byte Order Mark (U+FEFF).
.TP
.B BOM-AFTER-START
The file contains a UTF-8 Byte Order Mark (U+FEFF) after the start of the file.
.TP
.B CONTAINS-NULLS
The file contains null characters (U+0000).
.TP
.B CONTAINS-CARRIAGE_RETURN
The file contains carriage returns (U+000D).
.TP
.B CONTAINS-CONTROL_CHARACTERS
The file contains ASCII control characters in the range U+0001 through
U+001F, inclusive, except for Horizontal Tab, Line Feed, Vertical Tab,
Form Feed, New Line, or Carriage Return; or contains the Delete character
(U+007F) or control characters in the range U+0080 through U+009F, inclusive.
.TP
.B CONTAINS-ESCAPE_SEQUENCES
The file contains at least one ASCII escape character (U+001B), which
is interpreted to be part of an escape sequence (for example, a VT-100 or
ANSI terminal control sequence).
.TP
.B Plane-0-PUA: \fIn\fP characters
Number of Plane 0 Private Use Area characters in file.
.TP
.B Plane-15-PUA: \fIn\fP characters
Number of Plane 15 Private Use Area characters in file.
.TP
.B Plane-16-PUA: \fIn\fP characters
Number of Plane 16 Private Use Area characters in file.
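.PP
The three Private Use Area counts correspond to fixed ranges in the
Unicode Standard: U+E000 through U+F8FF in Plane 0, U+F0000 through
U+FFFFD in Plane 15, and U+100000 through U+10FFFD in Plane 16.
The following Python sketch shows one way such counts could be tallied.
It is not the program's own code; the function name count_private_use
is hypothetical, and the dictionary keys merely reuse the summary labels
listed above.
.PP
.nf
```python
def count_private_use(text):
    """Count Private Use Area characters in text, grouped by plane."""
    # Private Use Area ranges defined by the Unicode Standard.
    ranges = {
        "Plane-0-PUA": (0xE000, 0xF8FF),
        "Plane-15-PUA": (0xF0000, 0xFFFFD),
        "Plane-16-PUA": (0x100000, 0x10FFFD),
    }
    counts = {label: 0 for label in ranges}
    for ch in text:
        cp = ord(ch)
        for label, (first, last) in ranges.items():
            if first <= cp <= last:
                counts[label] += 1
    return counts
```
.fi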
.SH "EXIT STATUS"
.B utfcheck
will exit with a status of EXIT_SUCCESS if the input file contains only
valid text, or with a status of EXIT_FAILURE if it encounters any of the
fatal errors described under MESSAGES, such as invalid bytes.
.SH FILES
ASCII or UTF-8 text files.
.SH AUTHOR
.B utfcheck
was written by Paul Hardy.
.SH LICENSE
.B utfcheck
is Copyright \(co 2018 Paul Hardy.
.PP
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
.SH BUGS
No known bugs exist.