NAME
konwert - interface for various character encoding conversions
SYNOPSIS
konwert FILTER [FILE]... [-o DEST | -O]
DESCRIPTION
Konwert allows filtering multiple files through multiple filters. It filters the
specified FILEs, or stdin if none are given.
A simple FILTER is the name of an executable file in the directory
~/.konwert/filters or in the system-wide one, normally
/usr/share/konwert/filters. Such a program itself filters stdin to stdout.
The filtering rule can be more complex:
konwert FILTER1+FILTER2 means konwert FILTER1 | konwert FILTER2.
konwert FORMAT1-FORMAT2, unless such a filter exists, tries to find a common
FORMAT3 such that both filters FORMAT1-FORMAT3 and FORMAT3-FORMAT2 exist.
konwert FILTER/ARG/... passes arguments to the filter. Arguments can also be
specified here: FORMAT1/ARGS-FORMAT2. The meaning of arguments depends on the
particular filter.
konwert '(COMMAND ARGS...)' executes this arbitrary shell command. This is
useful with the -o or -O options. The command cannot contain the string )+,
which terminates this filter's specification.
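The chaining rules above can be sketched in Python. This is only an illustration of the syntax as described here, not konwert's own parser; the function name split_chain and the ("filter", ...) / ("shell", ...) tuples are hypothetical.

```python
def split_chain(spec):
    """Split a filter specification such as 'A+(cmd args)+B' into stages.

    Assumption: a '(' at the start of a stage opens a shell command that
    runs until the next ')+' or until a ')' at the very end of the spec.
    """
    stages = []
    i = 0
    while i < len(spec):
        if spec[i] == "(":
            end = spec.find(")+", i)
            if end == -1:
                if not spec.endswith(")"):
                    raise ValueError("unterminated shell command")
                end = len(spec) - 1
            stages.append(("shell", spec[i + 1:end]))
            i = end + 2  # skip the ')' and the '+' that follows it
        else:
            end = spec.find("+", i)
            if end == -1:
                end = len(spec)
            stages.append(("filter", spec[i:end]))
            i = end + 1  # skip the '+'
    return stages
```

A '+' inside a shell command is kept as part of the command as long as it does not directly follow a ')'.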
OPTIONS
- -o DEST
- output goes to this file/directory instead of stdout
- -O
- every input file is replaced with its translation
- --help
- display help and exit
- --version
- output version information and exit
Redirecting output to one of the source files with either -o or > instead of
-O will corrupt it! Option -O creates a temporary file in /tmp and later copies
it back onto the source.
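The temporary-file approach can be sketched in Python. This is a minimal illustration, not konwert's actual implementation: the name filter_in_place and the transform callback are hypothetical, and the temporary file goes to the system default temporary directory rather than specifically to /tmp.

```python
import os
import shutil
import tempfile


def filter_in_place(path, transform):
    """Replace the file at `path` with transform(contents) via a temp file."""
    fd, tmp_path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as tmp:
            # The source is read completely before it is overwritten.
            with open(path, "rb") as src:
                tmp.write(transform(src.read()))
        # Only now is the result copied back onto the source file.
        shutil.copyfile(tmp_path, path)
    finally:
        os.remove(tmp_path)
```

A plain shell redirection such as > FILE truncates FILE before the filter has read it, which is why the result has to be staged elsewhere first.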
CHARACTER ENCODING CONVERSIONS
You can convert text between any two charsets, for example konwert cp437-iso2.
Characters unavailable in the target charset are replaced with approximations
built from available characters. An approximation need not be a single
character.
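The idea of approximating unavailable characters can be sketched with Python's codec machinery. Everything here is an assumption for illustration: the APPROX table is a tiny made-up sample (konwert ships its own, much larger tables), the error handler name "approximate" is hypothetical, and Python's cp437 and iso8859_2 codecs stand in for konwert's cp437 and iso2.

```python
import codecs

# Illustrative sample only; not konwert's real approximation table.
APPROX = {"\u00bd": "1/2", "\u00a9": "(c)", "\u2192": "->"}


def approximate(err):
    """Codec error handler: substitute an approximation for each bad char."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(APPROX.get(ch, "?") for ch in bad), err.end


codecs.register_error("approximate", approximate)


def convert(data, src, dst):
    """Decode bytes from charset `src` and re-encode them in charset `dst`."""
    return data.decode(src).encode(dst, errors="approximate")
```

For example, convert(b"\xab cup", "cp437", "iso8859_2") turns the CP-437 one-half sign, which ISO-8859-2 lacks, into the three characters 1/2, while characters present in both charsets are simply remapped.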
The following character sets are currently supported:
- ascii
- 7bit ASCII
- utf8 = unicode
- Unicode UTF-8
- iso1 = isolatin1
- ISO-8859-1 aka ISO Latin 1 (Western European)
- iso2 = isolatin2
- ISO-8859-2 aka ISO Latin 2 (Central European)
- iso3 = isolatin3
- ISO-8859-3 aka ISO Latin 3 (Esperanto)
- iso4 = isolatin4
- ISO-8859-4 aka ISO Latin 4 (Baltic)
- iso5 = isolatincyr
- ISO-8859-5 (Cyrillic)
- iso6 = isolatinarabic
- ISO-8859-6 (Arabic)
- iso7 = isolatingreek
- ISO-8859-7 (Greek)
- iso8 = isolatinhebrew
- ISO-8859-8 (Hebrew)
- iso9 = isolatin5 = isolatintur
- ISO-8859-9 aka ISO Latin 5 (Turkish)
- iso10 = isolatin6 = isolatinnordic
- ISO-8859-10 aka ISO Latin 6 (Nordic)
- iso12 = isolatin7 = isolatinceltic
- ISO-8859-12 aka ISO Latin 7 (Celtic) - Draft
- iso13 = isolatin8 = isolatinbaltic
- ISO-8859-13 aka ISO Latin 8 (Baltic) - Draft
- iso14 = isolatin9 = isolatinsami
- ISO-8859-14 aka ISO Latin 9 (Sámi) - Draft
- iso15
- ISO-8859-15 - Draft
- koi8r
- KOI8-R (Russian)
- koi8u
- KOI8-U (Ukrainian, Byelorussian)
- koi8uni
- KOI8-Uni (Cyrillic)
- cp1250 = wince = winlatin2
- Windows CP-1250 aka Win Latin 2 (Central European)
- cp1251 = wincyr
- Windows CP-1251 (Cyrillic)
- cp1252 = winwest = winlatin1
- Windows CP-1252 aka Win Latin 1 (Western European)
- cp1253 = wingr
- Windows CP-1253 (Greek)
- cp1254 = wintur
- Windows CP-1254 (Turkish)
- cp1255 = winhebrew
- Windows CP-1255 (Hebrew)
- cp1256 = winarabic
- Windows CP-1256 (Arabic)
- cp1257 = winbaltic
- Windows CP-1257 (Baltic)
- cp1258 = winviet
- Windows CP-1258 (Vietnamese)
- cp437 = icmeng
- DOS CP-437 (English)
- cp737 = dosgreek
- DOS CP-737 (Greek)
- cp775 = dosbaltic
- DOS CP-775 (Baltic)
- cp850 = doswest = doslatin1
- DOS CP-850 aka DOS Latin 1 (Western European)
- cp852 = dosce = doslatin2
- DOS CP-852 aka DOS Latin 2 (Central European)
- cp855 = doscyr
- DOS CP-855 (Cyrillic)
- cp857 = dostur
- DOS CP-857 (Turkish)
- cp860 = dosportugal
- DOS CP-860 (Portuguese)
- cp861 = dosiceland
- DOS CP-861 (Icelandic)
- cp862 = doshebrew
- DOS CP-862 (Hebrew)
- cp863 = doscanadfr
- DOS CP-863 (Canadian French)
- cp864 = dosarabic
- DOS CP-864 (Arabic)
- cp865 = dosnordic
- DOS CP-865 (Nordic)
- cp866 = dosrussian
- DOS CP-866 (Russian)
- cp869 = dosgreek2
- DOS CP-869 (Greek2)
- cp874 = dosthai
- DOS CP-874 (Thai)
- mac
- Macintosh Roman (Western European)
- macce
- Macintosh Central European
- maccyr
- Macintosh Cyrillic
- macgreek
- Macintosh Greek
- maciceland
- Macintosh Icelandic
- mactur
- Macintosh Turkish
- csk, cyfromat, dhn, fidomazovia, iea, logic, mazovia, microvex
- DOS charsets for Polish
- amigapl, fat, xjp
- Amiga charsets for Polish
- kamenicky
- DOS charset for Czech and Slovak
- wingreek
- WinGreek (Windows font-based encoding for ancient Greek)
- babelpl
- TeX [polish]{babel}: "a"c"e"l"n"o"s"z"r
- ciachy
- TeX \prefixing: /a/c/e/l/n/o/s/x/z
- xmetodo
- Esperanto: cx gx hx jx sx ux (vx w)
- hmetodo
- Esperanto: ch gh hh jh sh u
- antauxcxap
- Esperanto: ^c ^g ^h ^j ^s ^u (~u)
- postcxap
- Esperanto: c^ g^ h^ j^ s^ u^ (u~)
- apostrofoj
- Esperanto: c' g' h' j' s' u'
- malapostrofoj
- Esperanto: c` g` h` j` s` u`
- viscii
- VISCII (Vietnamese)
- viqri
- Vietnamese Quoted Readable Implicit
- htmldec
- SGML/HTML character references (decimal): &#198; &#283; &#8594;
- htmlhex
- SGML/HTML character references (hexadecimal): &#xC6; &#x11B; &#x2192;
- htmlent
- SGML/HTML character entities (names): &AElig; &ecaron; &rarr;
- html
- All three above (only as input format)
- tex
- TeX with some LaTeX or AMS-TeX extensions. There is no distinction between
normal and math mode - you will probably have to insert some $'s
manually.
- mnemonic
- RFC 1345 mnemonics preceded by &
- mnemonic1
- RFC 1345 mnemonics preceded by `
- any/LANGUAGE (e.g. any/pl-iso2)
- This special input format will detect the encoding automatically, based
on the frequencies of characters found in the text. Every language is
associated with a set of possible encodings used for it and average
frequencies of its letters (excluding ASCII letters). The best fitting
encoding is used for conversion. Currently supported languages are
cs (Czech), de (German), el (Greek), eo
(Esperanto), es (Spanish), fr (French), he (Hebrew),
it (Italian), pl (Polish), pt (Portuguese), ru
(Russian), and sv (Swedish).
- varpl
- Mixed Polish ISO-8859-2, CP-1250, and UTF-8. If you are reading Polish
newsgroups I suggest putting it as a filter in your newsreader (for speed
improvement it's better to call it directly, rather than through
konwert).
- vareo
- Mixed various Esperanto encodings.
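The frequency-based detection described for any/LANGUAGE can be sketched as follows. This is not konwert's algorithm: the frequency values in PL_FREQ are invented for the example, the candidate list is a guess at encodings commonly used for Polish, and the scoring rule (sum of letter weights, with a penalty for any other non-ASCII character) is a simplification.

```python
# Hypothetical relative weights of Polish non-ASCII letters.
PL_FREQ = {"ą": 1.0, "ę": 1.1, "ł": 1.8, "ó": 0.9, "ś": 0.7,
           "ż": 0.8, "ć": 0.4, "ń": 0.2, "ź": 0.1}

# Python codec names standing in for iso2, cp1250, cp852 and utf8.
CANDIDATES = ["iso8859_2", "cp1250", "cp852", "utf-8"]


def guess_encoding(data, freq=PL_FREQ, candidates=CANDIDATES):
    """Return the candidate encoding whose decoding best fits `freq`."""
    best, best_score = None, float("-inf")
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        # ASCII characters are ignored; unexpected non-ASCII ones cost 1.
        score = sum(freq.get(ch.lower(), -1.0) for ch in text if ord(ch) > 127)
        if score > best_score:
            best, best_score = encoding, score
    return best
```

A wrong guess decodes the accented letters into box-drawing characters, control codes, or letters foreign to the language, which drives its score down.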
OPTIONS CONTROLLING THE ABOVE CONVERSIONS
- /1 (e.g. konwert iso2-ascii/1)
- Each unavailable character will be replaced only with a single approximate
char, not string. This is useful with the filterm program or with
preformatted text. This option is automatically turned on when a filter is
used as output for filterm.
- /html
- Text is assumed to be HTML. The characters " & < >
resulting from other characters' approximations will be properly escaped
as &quot; &amp; &lt; &gt;. The <META
http-equiv="content-type" content="text/html;
charset=..."> header will be fixed if present.
- /htmldec
- Converts META as above. Unavailable characters will be encoded as
decimal &#Unicode;.
- /htmlhex
- Converts META as above. Unavailable characters will be encoded as
hexadecimal &#xUnicode;.
- /tex
- Unavailable characters will be described in TeX. Characters # $ % &
\ ^ _ { | } ~ resulting from some characters' approximations will be
properly escaped into \# \$ \% \& $\backslash$ \^{} \_ \{ $|$
\} \~{}.
- /asciichar
- Recognizes some ASCII representations of characters, e.g. (c) ...
1/2 >=.
- /rosyjski
- Russian text will be replaced with its Polish phonetic transcription.
Some output filters can use the language information for choosing better
approximations of unavailable letters, for example /de (German): ä → ae
instead of a.
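The fallback used by /htmldec and /htmlhex can be imitated in Python. The decimal form relies on Python's built-in xmlcharrefreplace error handler; the hexadecimal handler, registered under the hypothetical name "hexcharref", is written here for illustration. Neither is konwert code, and the META header rewriting is not covered.

```python
import codecs


def hex_charref(err):
    """Codec error handler: emit &#xHEX; for each unencodable character."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(f"&#x{ord(ch):X};" for ch in bad), err.end


codecs.register_error("hexcharref", hex_charref)


def to_htmldec(text, charset):
    """Encode text, writing unavailable characters as decimal references."""
    return text.encode(charset, errors="xmlcharrefreplace")


def to_htmlhex(text, charset):
    """Encode text, writing unavailable characters as hexadecimal references."""
    return text.encode(charset, errors="hexcharref")
```

With an ASCII target, the three sample characters Æ ě → become &#198; &#283; &#8594; in the decimal form and &#xC6; &#x11B; &#x2192; in the hexadecimal form.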
OTHER FILTERS
- any/LANGUAGE-test
- Detects the encoding, but instead of text conversion only shows the
encoding's name. The additional option /all shows all possible
encodings, sorted from better to worse ones.
- cr
- lf
- crlf
- Force specific end-of-line marker convention. cr = Macintosh,
lf = Unix and Amiga, crlf = Windows and DOS. The input
convention is detected automatically.
- expand
- Expands tabs into spaces (uses the textutils program expand).
- unexpand
- Compresses spaces into tabs (uses the textutils program unexpand).
- rmspacesateol
- Removes spaces and tabs at end of line.
- qp-8bit
- 8bit-qp
- MIME Quoted Printable encoding: =A3=F3d=BC.
- rtf-8bit
- 8bit-rtf
- Rich Text Format: \'a3\'f3d\'9f.
- txt-htmlchar
- Escapes " & < > into the SGML/HTML entities
&quot; &amp; &lt; &gt;. Useful for including
a text file inside HTML <PRE> </PRE> tags.
- htmlchar-txt
- The reverse conversion.
- rot13
- Guvf vf n qrzbafgengvba bs ebg13.
- toupper
- tolower
- Self-explanatory. Currently ASCII only.
- prn7pl
- Converts Polish characters to control sequences for EPSON-compatible printers.
Using only 7-bit characters, backspacing the printer's head, and vertically
positioning the characters ,.'` it creates pseudo-Polish glyphs. You can specify
options: /nlq (default), which optimizes output for better-quality
printers, and /draft, useful e.g. for 9-pin printers.
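A few of the simpler filters above have close equivalents in the Python standard library. The functions below are rough stand-ins written for illustration, with hypothetical names; they are not the konwert filters themselves and may differ from them in edge cases.

```python
import codecs
import quopri
import re


def rot13(text):
    """Rotate ASCII letters by 13 places, like the rot13 filter."""
    return codecs.encode(text, "rot13")


def qp_to_8bit(data):
    """Decode MIME Quoted Printable bytes, like qp-8bit."""
    return quopri.decodestring(data)


def force_eol(text, eol):
    """Rewrite every CR, LF or CRLF line ending as `eol`, like cr/lf/crlf."""
    return re.sub(r"\r\n|\r|\n", lambda match: eol, text)


def txt_to_htmlchar(text):
    """Escape " & < > as entities, like txt-htmlchar."""
    # '&' must be replaced first so the other entities are not re-escaped.
    return (text.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace('"', "&quot;"))
```

Decoding the Quoted Printable sample =A3=F3d=BC and reading the resulting bytes as ISO-8859-2 gives the Polish city name Łódź.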
FILES
- /usr/share/konwert/filters/*
- ~/.konwert/filters/*
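How filter names in these directories could be used to resolve a FORMAT1-FORMAT2 request through an intermediate format can be sketched in Python. This is a guess at the general idea from the DESCRIPTION section, not konwert's real lookup: the function names are hypothetical, and the rule of taking the first intermediate format in alphabetical order is an assumption made to keep the example deterministic.

```python
import os

# Directories named in this manual page.
FILTER_DIRS = [os.path.expanduser("~/.konwert/filters"),
               "/usr/share/konwert/filters"]


def available_filters(dirs=FILTER_DIRS):
    """Collect the names of executable files found in the filter directories."""
    names = set()
    for directory in dirs:
        if not os.path.isdir(directory):
            continue
        for name in os.listdir(directory):
            full = os.path.join(directory, name)
            if os.path.isfile(full) and os.access(full, os.X_OK):
                names.add(name)
    return names


def find_intermediate(src, dst, filters):
    """Return None if src-dst exists, else a FORMAT3 bridging src and dst."""
    if f"{src}-{dst}" in filters:
        return None
    for name in sorted(filters):
        head, sep, middle = name.partition("-")
        if sep and head == src and f"{middle}-{dst}" in filters:
            return middle
    raise LookupError(f"no filter chain from {src} to {dst}")
```

With only cp437-utf8 and utf8-iso2 installed, a request for cp437-iso2 would resolve to the intermediate format utf8.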
SEE ALSO
trs(1), filterm(1)
BUGS
The APPLE character in mac* charsets and the CH and ch characters in koi8cs are
not preserved in conversion even when they are available. They also do not
respect the /1 option. Reason: they are not in Unicode.
COPYRIGHT
Konwert is a package for conversion between various character encodings.
Copyright (c) 1998 Marcin 'Qrczak' Kowalczyk
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA
AUTHOR
__("< Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.home.ml.org/
\__/ GCS/M d- s+:-- a21 C+++>+++$ UL++>++++$ P+++ L++>++++$ E->++
^^ W++ N+++ o? K? w(---) O? M- V? PS-- PE++ Y? PGP->+ t
QRCZAK 5? X- R tv-- b+>++ DI D- G+ e>++++ h! r--%>++ y-