NAME¶
catdoc - reads MS-Word file and puts its content as plain text on standard
output
SYNOPSIS¶
catdoc [
-vlu8btawxV] [
-m number] [
-s
charset] [
-d charset] [
-f output-format]
file
DESCRIPTION¶
catdoc behaves much like
cat(1) but it reads MS-Word file and
produces human-readable text on standard output. Optionally it can use
latex(1) escape sequences for characters which have special meaning for
LaTeX. It also makes some effort to recognize MS-Word tables, although it
never tries to write correct headers for LaTeX tabular environment. Additional
output formats, such is HTML can be easily defined.
catdoc doesn't attempt to extract formatting information other than
tables from MS-Word document, so different output modes means mainly that
different characters should be escaped and different ways used to represent
characters, missing from output charset. See CHARACTER SUBSTITUTION below
catdoc uses internal
unicode(7) representation of text, so it is
able to convert texts when charset in source document doesn't match charset on
target system. See CHARACTER SETS below.
If no file names supplied,
catdoc processes its standard input unless it
is terminal. It is unlikely that somebody could type Word document from
keyboard, so if
catdoc invoked without arguments and stdin is not
redirected, it prints brief usage message and exits. Processing of standard
input (even among other files) can be forced using dash '-' as file name.
By default,
catdoc wraps lines which are more than 72 chars long and
separates paragraphs by blank lines. This behavior can be turned of by
-w switch. In
wide mode
catdoc prints each paragraph as one
long line, suitable for import into word processors which perform word
wrapping.
OPTIONS¶
- -a
- - shortcut for -f ascii. Produces ASCII text as output.
Separates table columns with TAB
- -b
- - process broken MS-Word file. Normally, catdoc checks
if first 8 bytes of file is Microsoft OLE signature. If so, it
processes file, otherwise it just copies it to stdin. It is intended to
use catdoc as filter for viewing all files with .doc
extension.
- -dcharset
- - specifies destination charset name. Charset file has
format described in CHARACTER SETS below and should have .txt
extension and reside in catdoc library directory (
${exec_prefix}/lib/catdoc). By default, current locale charset is used
if langinfo support compiled in.
- -fformat
- - specifies output format as described in CHARACTER
SUBSTITUTION below. catdoc comes with two output formats - ascii
and tex. You can add your own if you wish.
- -l
- Causes catdoc to list names of available charsets to
the stdout and exit successfully.
- -mnumber
- Specifies right margin for text (default 72). -m 0
is equivalent to -w
- -scharset
- Specifies source charset. (one used in Word document), if
Word document doesn't contain UTF-16 text. When reading rtf documents, it
is typically not necessary, because rtf documents contain ansicpg
specification. But it can be set wrong by Word (I've seen RTF documents on
Russian, where cp1252 was specified). In this case this option would take
precedence over charset, specified in the document. But source_charset
statement in the configuration file have less priority than charset in the
document.
- -t
- - shortcut for -f tex
converts all printable chars, which have special meaning for
LaTeX(1) into appropriate control sequences. Separates table
columns by &.
- -u
- - declares that Word document contain UNICODE (UTF-16)
representation of text (as some Word-97 documents). If catdoc fails to
correct Word document with default charset, try this option.
- -8
- - declares is Word document is 8 bit. Just in case that
catdoc
recognizes file format incorrectly.
- -w
- disables word wrapping. By default catdoc output is
split into lines not longer than 72 (or number, specified by -m option)
characters and paragraphs are separated by blank line. With this option
each paragraph is one long line.
- -x
- causes catdoc to output unknown UNICODE character as
\xNNNN, instead of question marks.
- -v
- causes catdoc to print some useless information about word
document structure to stdout before actual start of text.
- -V
- outputs catdoc version
CHARACTER SETS¶
When processing MS-Word file
catdoc uses information about two character
sets, typically different
- input and output. They are stored in plain text files in
catdoc data
directory. Character set files should contain two whitespace-separated
hexadecimal numbers - 8-bit code in character set and 16-bit Unicode code.
Anything from hash mark to end of line is ignored, as well as blank lines.
catdoc distribution includes some of these character sets. Additional
character set definitions, directly usable by
catdoc can be obtained
from ftp.unicode.org. Charset files have
.txt suffix, which shouldn't
be specified in command-line or configuration files.
Note that
catdoc is distributed with Cyrillic charsets as default. If you
are not Russian, you probably don't want it, an should reconfigure catdoc at
compile time or in runtime configuration file.
When dealing with documents with charsets other than default, remember that
Microsoft never uses ISO charsets. While letters in, say cp1252 are at the
same position as in ISO-8859-1, some punctuation signs would be lost, if you
specify ISO-8859-1 as input charset. If you use cp1252, catdoc would deal with
those signs as described in CHARACTER SUBSTITUTION below.
CHARACTER SUBSTITUTION¶
catdoc converts MS-Word file into following internal Unicode
representation:
- 1. Paragraphs are separated by ASCII Line Feed symbol
(0x000A)
- 2. Table cells within row are separated by ASCII Field
Separator symbol
- (0x001C)
- 3. Table rows are separated by ASCII Record Separator
(0x001E)
- 4. All printable characters, including whitespace are
represented with their
- respective UNICODE codes.
This UNICODE representation is subsequently converted into 8-bit text in target
character set using following four-step algorithm:
- 1. List of special characters is searched for given Unicode
character.
- If found, then appropriate multi-character sequence is
output instead of character.
- 2. If there is an equivalent in target character set, it is
output.
- 3. Otherwise, replacement list is searched and, if there is
multi-character
- substitution for this UNICODE char, it is output.
- 4. If all above fails, "Unknown char" symbol
(question mark) is output.
Lists of special characters and list of substitution are character
set-independent, because special chars should be escaped regardless of their
existence in target character set (usually, they are parts of US-ASCII, and
therefore exist in any character set) and replacement list is searched only
for those characters, which are not found in target character set.
These lists are stored in
catdoc data directory in files with prefix of
format name. These files have following format:
Each line can be either comment (starting with hash mark) or contain hexadecimal
UNICODE value, separated by whitespace from string, which would be substituted
instead of it. If string contain no whitespace it can be used as is, otherwise
it should be enclosed in single or double quotes. Usual backslash sequences
like
'\n',
'\t' can be used in these string.
RUNTIME CONFIGURATION¶
Upon startup catdoc reads its system-wide configuration file
/etc/catdocrc
and then user-specific configuration file
${HOME}/.catdocrc.
These files can contain following directives:
- source_charset = charset-name
- Sets default source charset, which would be used if no
-s option specified. Consult configuration of nearby windows
workstation to find one you need.
- target_charset = charset-name
-
Sets default output charset. You probably know, which one you use.
- charset_path = directory-list
- colon-separated list of directories, which are searched for
charset files. This allows you to install additional charsets in your home
directory. If first directory component of path is ~ it is replaced by
contents of HOME environment variable. On MS-DOS platform, if
directory name starts with %s, it is replaced with directory of executable
file. Empty element in list (i.e. two consequitve colons) is considered
current directory.
- map_path = directory-list
- colon-separated list of directories, which are searched for
special character map and replacement map. Same substitution rules as in
charset_path are applied.
- format = format name
- Output format which would be used by default. catdoc
comes with two formats - ascii and tex but nothing prevents
you from writing your own format (set two map files - special character
map and replacement map).
- unknown_char = character specification
- sets character to output instead of unknown Unicode
character (default '?') Character specification can have one of two form -
character enclosed in single quotes or hexadecimal code.
- use_locale =(yes|no)
- Enables or disables automatic selection of output charset
(default yes),
based on system locale settings (if enabled at compile time). If automatic
detection is enabled, than output charset settings in the configuration
files (but not in the command line) are ignored, and current system locale
charset is used instead. There are no automatic choice of input charset,
based of locale language, because most modern Word files (since Word 97)
are Unicode anyway
BUGS¶
Doesn't handle fast-saves properly. Prints footnotes as separate paragraphs at
the end of file, instead of producing correct LaTeX commands. Cannot
distinguish between empty table cell and end of table row.
SEE ALSO¶
xls2csv(1),
cat(1),
strings(1),
utf(4),
unicode(7)
AUTHOR¶
V.B.Wagner <vitus@45.free.net>