.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.43)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "LaTeX::ToUnicode 3pm"
.TH LaTeX::ToUnicode 3pm "2023-12-14" "perl v5.36.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
LaTeX::ToUnicode \- Convert LaTeX commands to Unicode
.SH "VERSION"
.IX Header "VERSION"
version 0.54
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\&  use LaTeX::ToUnicode qw( convert debuglevel $endcw );
\&
\&  # simple examples:
\&  convert( \*(Aq{\e"a}\*(Aq              ) eq \*(Aqä\*(Aq;      # true
\&  convert( \*(Aq{\e"a}\*(Aq, entities=>1 ) eq \*(Aq&#00EF;\*(Aq; # true
\&  convert( \*(Aq"a\*(Aq, german=>1      ) eq \*(Aqä\*(Aq;      # true, \`german\*(Aq package syntax
\&  convert( \*(Aq"a\*(Aq,                ) eq \*(Aq"a\*(Aq;      # false, not enabled by default
\&  
\&  # more generally:
\&  my $latexstr;
\&  my $unistr = convert($latexstr);  # get literal (binary) Unicode characters
\&
\&  my $entstr = convert($latexstr, entities=>1);          # get &#xUUUU;
\&  
\&  my $htmstr = convert($latexstr, entities=>1, html=>1); # also html markup
\&  
\&  my $unistr = convert($latexstr, hook=>\e&my_hook); # user\-defined hook
\&  
\&  # if nonzero, dumps various info; perhaps other levels in the future.
\&  LaTeX::ToUnicode::debuglevel($verbose);
\&
\&  # regexp for terminating TeX control words, e.g., in hooks.
\&  my $endcw = $LaTeX::ToUnicode::endcw;
\&  $string =~ s/\e\enewline$endcw/ /g; # translate \enewline to space
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
This module provides a method to convert LaTeX markups for accents etc.
into their Unicode equivalents. It translates some commands for special
characters or accents into their Unicode (or \s-1HTML\s0) equivalents and
removes formatting commands. It is not at all bulletproof or complete.
.PP
This module is intended to convert fragments of LaTeX source, such as
bibliography entries and abstracts, into plain text (or, optionally,
simplistic \s-1HTML\s0). It is not a document conversion system. Math, tables,
figures, sectioning, etc., are not handled in any way, and mostly left
in their TeX form in the output. The translations assume standard LaTeX
meanings for characters and control sequences; macros in the input are
not considered.
.PP
The aim for all the output is utter simplicity and minimalism, not
faithful translation. For example, although Unicode has a code point for
a thin space, the LaTeX \f(CW\*(C`\ethinspace\*(C'\fR (etc.) command is translated to
the empty string; such spacing refinements desirable in the TeX output
are, in our experience, generally not desired in the \s-1HTML\s0 output from
this tool.
.PP
As another example, TeX \f(CW\*(C`%\*(C'\fR comments are not removed, even on lines by
themselves, because they may be inside verbatim blocks, and we don't
attempt to keep any such context. In practice, TeX comments are rare in
the text fragments intended to be handled, so removing them in advance
has not been a great burden.
.PP
As another example, LaTeX ties, \f(CW\*(C`~\*(C'\fR characters, are replaced with
normal spaces (exception: unless they follow a \f(CW\*(C`/\*(C'\fR character or at the
beginning of a line, when they're assumed to be part of a url or a
pathname), rather than a no-break space character, because in our
experience most ties intended for the TeX output would just cause
trouble in plain text or \s-1HTML.\s0
.PP
Regarding normal whitespace: all leading and trailing horizontal
whitespace (that is, \s-1SPC\s0 and \s-1TAB\s0) is removed. All internal horizontal
whitespace sequences are collapsed to a single space.
.PP
After the conversions, all brace characters (\f(CW\*(C`{}\*(C'\fR) are simply removed
from the returned string. This turns out to be a significant convenience
in practice, since many LaTeX commands which take arguments don't need
to do anything for our purposes except output the argument.
.PP
On the other hand, backslashes are not removed. This is so the caller
can check for \f(CW\*(C`\e\e\*(C'\fR and thus discover untranslated commands. Of course
there are many other constructs that might not be translated, or
translated wrongly. There is no escaping the need to carefully look at
the output.
.PP
Suggestions and bug reports are welcome for practical needs; we know
full well that there are hundreds of commands not handled that could be.
Virtually all the behavior mentioned here would be easily made
customizable, if there is a need to do so.
.SH "FUNCTIONS"
.IX Header "FUNCTIONS"
.ie n .SS "convert( $latex_string, %options )"
.el .SS "convert( \f(CW$latex_string\fP, \f(CW%options\fP )"
.IX Subsection "convert( $latex_string, %options )"
Convert the text in \f(CW$latex_string\fR into a plain(er) Unicode string.
Escape sequences for accented and special characters (e.g., \f(CW\*(C`\ei\*(C'\fR,
\&\f(CW\*(C`\e"a\*(C'\fR, ...) are converted. A few basic formatting commands (e.g.,
\&\f(CW\*(C`{\eit ...}\*(C'\fR) are removed. See the LaTeX::ToUnicode::Tables submodule
for the full conversion tables.
.PP
These keys are recognized in \f(CW%options\fR:
.ie n .IP """entities""" 4
.el .IP "\f(CWentities\fR" 4
.IX Item "entities"
Output \f(CW\*(C`&#xUUUU;\*(C'\fR entities (valid in \s-1XML\s0); in this case, also convert
the <, >, \f(CW\*(C`&\*(C'\fR metacharacters to entities. Recognized non-ASCII
Unicode characters in the original input are also converted to entities,
not only the translations from TeX commands.
.Sp
The default is to output literal (binary) Unicode characters, and
not change any metacharacters.
.ie n .IP """german""" 4
.el .IP "\f(CWgerman\fR" 4
.IX Item "german"
If this option is set, the commands introduced by the package `german'
(e.g. \f(CW\*(C`"a\*(C'\fR eq \f(CW\*(C`ä\*(C'\fR, note the missing backslash) are also
handled.
.ie n .IP """html""" 4
.el .IP "\f(CWhtml\fR" 4
.IX Item "html"
If this option is set, the output is simplistic html rather than plain
text. This affects only a few things: 1) the output of urls from
\&\f(CW\*(C`\eurl\*(C'\fR and \f(CW\*(C`\ehref\*(C'\fR; 2) the output of markup commands like
\&\f(CW\*(C`\etextbf\*(C'\fR (but nested markup commands don't work); 3) two other
random commands, \f(CW\*(C`\eenquote\*(C'\fR and \f(CW\*(C`\epath\*(C'\fR, because they are needed.
.ie n .IP """hook""" 4
.el .IP "\f(CWhook\fR" 4
.IX Item "hook"
The value must be a function that takes two arguments and returns a
string. The first argument is the incoming string (may be multiple
lines), and the second argument is a hash reference of options, exactly
what was passed to this \f(CW\*(C`convert\*(C'\fR function. Thus the hook can detect
whether html is needed.
.Sp
The hook is called (almost) right away, before any of the other
conversions have taken place. That way the hook can make use of the
predefined conversions instead of repeating them. The only changes made
to the input string before the hook is called are trivial: leading and
trailing whitespace (space and tab) on each line are removed, and, for
\&\s-1HTML\s0 output, incoming ampersand, less-than, and greater-than characters
are replaced with their entities.
.Sp
Any substitutions that result in Unicode code points must use
\&\f(CW\*(C`\e\ex{nnnn}\*(C'\fR on the right hand side: that's two backslashes and a
four-digit hex number.
.Sp
As an example, here is a skeleton of the hook function for TUGboat:
.Sp
.Vb 2
\&  sub LaTeX_ToUnicode_convert_hook {
\&    my ($string,$options) = @_;
\&
\&    my $endcw = $LaTeX::ToUnicode::endcw;
\&    die "no endcw regexp in LaTeX::ToUnicode??" if ! $endcw;
\&
\&    ...
\&    $string =~ s/\e\enewline$endcw/ /g;
\&
\&    # TUB\*(Aqs \eacro{} takes an argument, but we do nothing with it.
\&    # The braces will be removed by convert().
\&    $string =~ s/\e\eacro$endcw//g;
\&    ...
\&    $string =~ s/\e\eCTAN$endcw/CTAN/g;
\&    $string =~ s/\e\eDash$endcw/\e\ex{2014}/g; # em dash; replacement is string
\&    ...
\&
\&    # ignore \ebegin{abstract} and \eend{abstract} commands.
\&    $string =~ s,\e\e(begin|end)$endcw\e{abstract\e}\es*,,g;
\&
\&    # Output for our url abbreviations, and other commands, depends on
\&    # whether we\*(Aqre generating plain text or HTML.
\&    if ($options\->{html}) {
\&        # HTML.
\&        # \etbsurl{URLBASE} \-> <a href="https://URLBASE">URLBASE</a>
\&        $string =~ s,\e\etbsurl$endcw\e{([^}]*)\e}
\&                    ,<a href="https://$1">$1</a>,gx;
\&        ...
\&        # varepsilon, and no line break at hyphen.
\&        $string =~ s,\e\eeTeX$endcw,\e\ex{03B5}<nobr>\-</nobr>TeX,g;
\&
\&    } else {
\&        # for plain text, we can just prepend the protocol://.
\&        $string =~ s,\e\etbsurl$endcw,https://,g;
\&        ...
\&        $string =~ s,\e\eeTeX$endcw,\e\ex{03B5}\-TeX,g;
\&    }
\&    ...
\&    return $string;
\&  }
.Ve
.Sp
As shown here for \f(CW\*(C`\eeTeX\*(C'\fR (an abbreviation macro defined in the
TUGboat style files), if markup is desired in the output, the
substitutions must be different for \s-1HTML\s0 and plain text. Otherwise, the
desired \s-1HTML\s0 markup is transliterated as if it were plain text. Or else
the translations must be extended so that TeX markup can be used on the
rhs to be replaced with the desired \s-1HTML\s0 (\f(CW\*(C`&lt;nobr&gt;\*(C'\fR in this case).
.Sp
For the full definition (and plenty of additional information),
see the file \f(CW\*(C`ltx2crossrefxml\-tugboat.cfg\*(C'\fR in the TUGboat source
repository at
<https://github.com/TeXUsersGroup/tugboat/tree/trunk/capsules/crossref>.
.Sp
The hook function is specified in the \f(CW\*(C`convert()\*(C'\fR call like this:
.Sp
.Vb 1
\&  LaTeX::ToUnicode::convert(..., { hook => \e&LaTeX_ToUnicode_convert_hook })
.Ve
.ie n .SS "debuglevel( $level )"
.el .SS "debuglevel( \f(CW$level\fP )"
.IX Subsection "debuglevel( $level )"
Output debugging information if \f(CW$level\fR is nonzero.
.ie n .SS "$endcw"
.el .SS "\f(CW$endcw\fP"
.IX Subsection "$endcw"
A predefined regexp for terminating TeX control words (not control
symbols!). Can be used in, for example, hook functions:
.PP
.Vb 2
\&  my $endcw = $LaTeX::ToUnicode::endcw;
\&  $string =~ s/\e\enewline$endcw/ /g; # translate \enewline to space
.Ve
.PP
It's defined as follows:
.PP
.Vb 1
\&  our $endcw = qr/(?<=[a\-zA\-Z])(?=[^a\-zA\-Z]|$)\es*/;
.Ve
.PP
That is, look behind for an alphabetic character, then look ahead for a
non-alphabetic character (or end of line), then consume whitespace.
Fingers crossed.
.SH "AUTHOR"
.IX Header "AUTHOR"
Gerhard Gossen <gerhard.gossen@googlemail.com>,
Boris Veytsman <boris@varphi.com>,
Karl Berry <karl@freefriends.org>
.PP
<https://github.com/borisveytsman/bibtexperllibs>
.SH "COPYRIGHT AND LICENSE"
.IX Header "COPYRIGHT AND LICENSE"
Copyright 2010\-2023 Gerhard Gossen, Boris Veytsman, Karl Berry
.PP
This is free software; you can redistribute it and/or modify it under
the same terms as the Perl5 programming language system itself.