'\" t
.\" Title: herold
.\" Author: Michael Fuchs
.\" Generator: DocBook XSL Stylesheets v1.79.2
.\" Date: 06/13/2022
.\" Manual: User Commands
.\" Source: herold
.\" Language: English
.\"
.TH "HEROLD" "1" "06/13/2022" "herold" "User Commands"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
herold \- HTML to DocBook converter
.SH "SYNOPSIS"
.HP \w'\fBherold\fR\ 'u
\fBherold\fR [OPTIONS]
.SH "DESCRIPTION"
.PP
The reuse of HTML content in presentation\-neutral form is a frequent problem\&. One possible solution is to convert HTML to DocBook XML, because DocBook is a semantic markup language for documentation, which enables its users to create document content that captures the logical structure of the content\&.
.PP
The command line tool herold can be used to convert HTML to DocBook\&. Because HTML elements are often not used as intended, the possibilities for such a transformation are somewhat limited\&. herold is part of the dbdoclet suite of tools\&. For more information visit \m[blue]\fBhttp://www\&.dbdoclet\&.org\fR\m[]\&.
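.PP
As an illustration, a minimal invocation using the \-\-in, \-\-out and \-\-profile options described under OPTIONS might look like the following sketch\&. The file names input\&.html, output\&.xml and my\-profile\&.her are placeholders chosen for this example and are not shipped with herold\&.
.sp
.if n \{\
.RS 4
.\}
.nf
# convert one HTML file to DocBook XML (placeholder file names)
herold \-\-in input\&.html \-\-out output\&.xml
# the same conversion, controlled by a profile file (placeholder name)
herold \-\-in input\&.html \-\-out output\&.xml \-\-profile my\-profile\&.her
.fi
.if n \{\
.RE
.\}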
.SH "OPTIONS"
.PP
\-\-docbook\-add\-index, \-x
.RS 4
Automatically adds an index element at the end of the document\&.
.RE
.PP
\-\-docbook\-decompose\-tables, \-T
.RS 4
Decomposes the tables from the HTML code into single paragraphs\&. This can be useful if a document contains a lot of tables that are used only for formatting\&.
.RE
.PP
\-\-docbook\-encoding, \-d
.RS 4
Specifies the encoding of the generated DocBook XML files\&.
.RE
.PP
\-\-docbook\-root\-element, \-r
.RS 4
The root element of the document\&. Possible values are: book, article, reference, part, chapter or section\&. The default value for this option is \*(Aqarticle\*(Aq\&.
.RE
.PP
\-\-docbook\-title, \-t
.RS 4
The title for the resulting document\&.
.RE
.PP
\-\-in, \-i
.RS 4
Specifies the HTML input file\&.
.RE
.PP
\-\-help, \-h
.RS 4
Prints a help page on the console\&.
.RE
.PP
\-\-html\-encoding, \-s
.RS 4
Specifies the encoding of the HTML source files, such as ISO\-8859\-1\&.
.RE
.PP
\-\-out, \-o
.RS 4
Specifies the DocBook XML destination file\&.
.RE
.PP
\-\-profile, \-p
.RS 4
A profile file with predefined settings\&.
.RE
.PP
\-\-verbose, \-v
.RS 4
Enables verbose console output\&.
.RE
.PP
\-\-version, \-V
.RS 4
Displays the version of herold\&.
.RE
.SH "CONFIGURATION"
.PP
The details of a transformation are controlled by a profile file\&. A profile file offers more possibilities to influence the transformation than the command line arguments\&. The following example shows a typical profile file\&.
.sp
.if n \{\
.RS 4
.\}
.nf
transformation html2docbook;
section section\-detection {
  attribute\-class = ["^MsoHeading(\ed+)$"];
  section\-numbering\-pattern = "((\ed+\e\&.)+)?\ed*\e\&.?\ep{Z}*";
}
section list\-detection {
  itemized\-attribute\-class = ["^MsoListBullet(\ew*)$", "Aufzhlung(\ew+)$"];
  itemized\-strip\-prefix = [ "\-", "o", "\eu00b7" ];
  ordered\-attribute\-class = ["^MsoListNumbered(\ew*)$"];
  ordered\-strip\-prefix = [ "\ed+\e\&.\es+" ];
}
section HTML {
  encoding = "windows\-1252";
  exclude = [ "//p[starts\-with(@class, \*(AqMsoToc\*(Aq)]", "" ];
}
section DocBook {
  abstract = """Lorem ipsum
  Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
  eiusmod tempor incididunt ut labore et dolore magna aliqua\&. Ut enim
  ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut
  aliquip ex ea commodo consequat\&. Duis aute irure dolor in
  reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
  pariatur\&. Excepteur sint occaecat cupidatat non proident, sunt in
  culpa qui officia deserunt mollit anim id est laborum\&.sed, dolor amet\&.""";
  add\-index = true;
  author\-email = "me@somewhere\&.de";
  author\-firstname = "Michael";
  author\-surname = "Fuchs";
  chunk\-elements = [ "chapter", "section", "appendix" ];
  // Syntax: chunk\-<element>\-depth = <number>;
  chunk\-section\-depth = 3;
  collapse\-protected\-space = "true";
  copyright\-holder = "Ingenieurbüro Michael Fuchs";
  copyright\-year = "2015";
  corporation = "";
  create\-condition\-attribute = false;
  create\-prolog = true;
  create\-remap\-attribute = false;
  create\-xref\-label = false;
  decompose\-tables = false;
  detect\-trapped\-br = true;
  documentation\-id = "doc01";
  document\-element = "book";
  encoding = "UTF\-8";
  hyphenation\-char = "soft\-hyphen";
  image\-data\-formats = [ "gif", "base64" ];
  image\-path = "\&./figures";
  language = "de";
  release\-info = "Version 3\&.1";
  table\-style = "all";
  title = "Tutorial";
  title\-normalize\-space = true;
  use\-absolute\-image\-path = false;
}
.fi
.if n \{\
.RE
.\}
.SS "Syntax"
.PP
A profile file consists mainly of sections\&. Sections group parameters which share the same context\&. Every section must start with the keyword \fIsection\fR followed by the name of the section\&. After the name comes the block of parameters, which is surrounded by curly braces\&. Parameters can be of type String, Number, Boolean or Array\&. Strings must be enclosed in double quotes\&. If the String contains newlines, use three double quotes instead of one\&. Arrays are enclosed in square brackets\&. Inside an array, the elements must be comma separated\&. Every assignment must be terminated by a semicolon\&. Multi\-line comments have the form \fI/* my comment */\fR, single\-line comments look like \fI// my comment\en\fR\&.
.SS "Mandatory Elements"
.PP
A profile for herold must start with the line transformation html2docbook;\&.
.SS "Section HTML"
.PP
The section HTML defines parameters which control the loading and parsing of the HTML input data\&.
.PP
.PP
encoding
.RS 4
The character set used to read the input stream\&.
.RE
.PP
exclude
.RS 4
Defines an array of XPath expressions\&. All matches are removed from the HTML DOM tree before transformation\&.
.RE
.SS "Section DocBook"
.PP
.PP
abstract
.RS 4
The text for the abstract element of the info section\&. If the text is structured with newlines, use three double quotes as delimiters\&. If the text starts with a "<" character, it is embedded into an abstract element, otherwise the text is embedded into a para element inside of an abstract element\&. The text will be parsed and can contain DocBook elements\&.
.RE
.PP
add\-index
.RS 4
If set to true, an index element is inserted at the end of the DocBook XML\&.
.RE
.PP
author\-email
.RS 4
The email address of the author\&. If this parameter is set, it is used to create an info section at the beginning of the document\&.
.RE
.PP
author\-firstname
.RS 4
The first name of the author\&.
If this parameter is set, it is used to create an info section at the beginning of the document\&.
.RE
.PP
author\-surname
.RS 4
The surname of the author\&. If this parameter is set, it is used to create an info section at the beginning of the document\&.
.RE
.PP
chunk\-elements
.RS 4
Defines an array of element names\&. If an element of this list is detected while writing the output, the element and all its child nodes are written to a separate file\&. This new file is included into the parent file with an \fBxi:include\fR tag\&. Recursive structures result in recursive includes\&. You might want to use this if you are transforming big HTML files and the resulting DocBook XML file becomes uncomfortably large\&.
.RE
.PP
chunk\-<element>\-depth
.RS 4
Defines the nesting depth down to which an element listed in chunk\-elements is split off, e\&.g\&. chunk\-section\-depth = 3\&. If an element defined for chunking is nested recursively, you might want to control the depth to which the chunking should be done\&. The default depth is 1, which means only the topmost element is separated\&.
.RE
.PP
create\-xref\-label
.RS 4
If set to false, anchor elements don\*(Aqt get an xreflabel attribute\&.
.RE
.PP
decompose\-tables
.RS 4
If set to true, table structures are ignored\&. The content of the table cells is inserted into the DocBook XML as a sequence of paragraphs\&. This parameter can be useful if your HTML contains tables for formatting purposes\&. Normally you want to get rid of them, because they distort the logical structure\&.
.RE
.PP
document\-element
.RS 4
The document element you want to use\&. Must be one of article, book, part or reference\&.
.RE
.PP
encoding
.RS 4
The character set which will be used for writing the output file\&.
.RE
.PP
image\-data\-formats
.RS 4
An array of image formats\&. These formats are inserted as imageobject elements, in addition to the format found in the src attribute of the corresponding img element\&.
The original format is inserted twice with the roles "html" and "fo"\&. The other formats are inserted as "html\-<format>" and "fo\-<format>"\&.
.RE
.PP
title
.RS 4
The title of the resulting document\&. If this parameter is undefined, herold tries to detect the title from the head section of the HTML data\&.
.RE
.PP
use\-absolute\-image\-path
.RS 4
If you want absolute image paths in the fileref attribute of the imagedata element, set this parameter to true\&.
.RE
.SS "Section node"
.PP
The mapping of HTML elements to DocBook elements can be fine\-tuned by using node sections\&. If you have HTML code which looks like the following fragment:
.sp
.if n \{\
.RS 4
.\}
.nf
<ol class="procedure">
  <li>Step 1</li>
  <li>Step 2</li>
  <li>Step 3</li>
</ol>
.fi
.if n \{\
.RE
.\}
.sp
The resulting DocBook XML after the transformation would normally look like:
.sp
.if n \{\
.RS 4
.\}
.nf
<orderedlist>
  <listitem><para>Step 1</para></listitem>
  <listitem><para>Step 2</para></listitem>
  <listitem><para>Step 3</para></listitem>
</orderedlist>
.fi
.if n \{\
.RE
.\}
.sp
But what you would like to have is something like:
.sp
.if n \{\
.RS 4
.\}
.nf
<procedure>
  <step><para>Step 1</para></step>
  <step><para>Step 2</para></step>
  <step><para>Step 3</para></step>
</procedure>
.fi
.if n \{\
.RE
.\}
.sp
To achieve this, you can use the following rules in your profile:
.sp
.if n \{\
.RS 4
.\}
.nf
node "//ol[@class=\*(Aqprocedure\*(Aq]" {
  map\-to = "procedure";
}
node "//ol[@class=\*(Aqprocedure\*(Aq]/li" {
  map\-to = "step";
}
.fi
.if n \{\
.RE
.\}
.sp
The keyword node is followed by an XPath expression which is matched against the document element of the HTML file (typically <html>)\&. The parameter map\-to defines the DocBook element which is used instead of the default mapping\&.
.SS "Section attribute"
.PP
An attribute section is more or less the same as a node section\&. Instead of redefining the mapping of an HTML element to a DocBook element, the mapping for an attribute is changed\&. The following section maps an attribute class=\*(Aqprocedure\*(Aq to role=\*(Aqprocedure\*(Aq\&.
.sp
.if n \{\
.RS 4
.\}
.nf
attribute "//@class[contains(\&., \*(Aqprocedure\*(Aq)]" {
  map\-to = "role";
}
.fi
.if n \{\
.RE
.\}
.sp
.SS "Section section\-detection"
.PP
The section \fIsection\-detection\fR is used to detect section elements in HTML code and to strip off any numbering prefix from the titles\&.
.PP
Many authoring tools allow deeply nested sections\&. When exporting HTML, the nesting can become deeper than six levels\&. HTML provides header elements for up to six levels, h1\-h6, but no h7 or beyond\&. At this point, the formatting is normally done with the help of CSS and div or p elements\&. herold is able to detect the HTML header elements, but it cannot know the export format of a specific tool\&. To handle at least some of these cases, you can specify the parameter \fIattribute\-class\fR\&.
It consists of a list of regular expressions, which are matched against the class attribute of each HTML element\&. If a match is found, the element is treated as a section element\&. The regular expression can have a group, which is interpreted as the level indicator\&. The group must be the first group and it must match a number, e\&.g\&. ^heading(\ed+)$\&. If the level cannot be detected, a level of seven is assumed\&.
.PP
Because DocBook XSL stylesheets take care of the section numbering while transforming the DocBook XML to a specific output, it is often necessary to strip the numbering already defined in the HTML page\&. Otherwise you end up with two sets of numbers in front of your titles\&. To help herold with the detection of numbering patterns, use the parameter \fIsection\-numbering\-pattern\fR\&.
.PP
.PP
attribute\-class
.RS 4
A regular expression, which is applied to every p and div element\&. If the expression matches, the current element is handled as a section element\&. If the regular expression has groups, the first group is used as the nesting level, otherwise level seven is assumed\&.
.RE
.PP
section\-numbering\-pattern
.RS 4
Normally you want to get rid of the section numbering that comes with the HTML data, because it becomes part of the title text in DocBook\&. The section numbers would then appear twice in your target media: once from the HTML and once from the DocBook XSL processing\&. The parameter section\-numbering\-pattern defines a regular expression, which is matched against the beginning of every section title\&. If it matches, the matching part is removed\&.
.RE
.SS "Section list\-detection"
.PP
Sometimes lists are not represented with ul, ol or dl tags, but as p tags with additional CSS formatting\&. If you use a tool which creates or exports HTML with such a construct, the conversion ends up with para elements instead of the corresponding list elements in DocBook\&.
To recreate the lists in at least some cases, you can use the section \fIlist\-detection\fR\&. The parameters \fIitemized\-attribute\-class\fR and \fIordered\-attribute\-class\fR let you define lists of regular expressions, which are matched against the class attribute of the list item elements in the HTML\&. herold tries to rebuild the proper list structure from this information, even for nested lists\&.
.SH "COPYRIGHT"
.PP
Copyright 2001\-2015 Michael Fuchs\&. License GPLv3+: GNU GPL version 3 or later \m[blue]\fBhttp://gnu\&.org/licenses/gpl\&.html\fR\m[]\&. This is free software: you are free to change and redistribute it\&. There is NO WARRANTY, to the extent permitted by law\&.
.SH "AUTHOR"
.PP
\fBMichael Fuchs\fR
.RS 4
Software Engineer
.RE