NAME
htmlstrip - Strip HTML markup code
SYNOPSIS
htmlstrip [-o outputfile] [-O level] [-b blocksize] [-v] [inputfile]
DESCRIPTION
HTMLstrip reads from
inputfile or from "stdin" and strips the
contained HTML markup. Use this program to shrink and compact your HTML
files in a safe way.
Recognized Content Types
HTMLstrip recognizes three disjoint types of content
while parsing:
- HTML Tag (tag)
- This is just a single HTML tag, i.e. a string beginning
with an opening angle bracket directly followed by an identifier,
optionally followed by attributes and ending with a closing angle
bracket.
- Preformatted (pre)
- This is any content enclosed in one of the following
container tags:
1. <nostrip>
2. <pre>
3. <xmp>
The non-HTML-3.2-conforming "<nostrip>" tag is special here:
it acts like "<pre>" as a protection container for
HTMLstrip, but the tag itself is also stripped from the output. Use it as a
pseudo-block that preserves its body during HTMLstrip processing while the
tag itself is removed from the output.
- Plain Text (txt)
- This is anything not falling into one of the two other
categories, i.e. any content both outside of preformatted areas and outside
of HTML tags.
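The three content types above can be illustrated with a small sketch. This is not HTMLstrip's own parser; the function name `classify` and the regular expression are hypothetical, and treating closing tags such as "</p>" as tags is an assumption made for the example.

```python
import re

# Hypothetical tokenizer, NOT HTMLstrip's actual parser. It only illustrates
# the three content types described above: pre, tag and txt.
_SEGMENT = re.compile(
    # Preformatted: a <nostrip>, <pre> or <xmp> container including its body.
    r"(?P<pre><(?P<name>nostrip|pre|xmp)\b[^>]*>.*?</(?P=name)\s*>)"
    # Tag: "<" directly followed by an identifier, up to the closing ">".
    # Accepting a leading "/" (closing tags) is an assumption of this sketch.
    r"|(?P<tag></?[A-Za-z][^>]*>)",
    re.DOTALL | re.IGNORECASE,
)


def classify(html):
    """Return a list of (kind, text) pairs, kind in {'pre', 'tag', 'txt'}."""
    segments = []
    pos = 0
    for match in _SEGMENT.finditer(html):
        # Everything between two matches is plain text.
        if match.start() > pos:
            segments.append(("txt", html[pos:match.start()]))
        kind = "pre" if match.group("pre") else "tag"
        segments.append((kind, match.group(0)))
        pos = match.end()
    # Trailing plain text after the last match.
    if pos < len(html):
        segments.append(("txt", html[pos:]))
    return segments
```

Calling `classify("<p>Hi <pre> a\n b </pre>end")` yields one tag segment, one txt segment, one pre segment that keeps its inner newline and spaces, and a final txt segment.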
Supported Stripping Levels
The amount of stripping is controlled by an optimization level, specified via
option
-O (see below). Higher levels also include all of the lower
levels. The following stripping is done on each level:
- Level 0:
- No real stripping, just removing the sharp/comment-lines
("#...") [txt,tag]. Such lines are a standard feature of WML, so
this is always done.
- Level 1:
- Minimal stripping: Same as level 0 plus stripping of blank
and empty lines [txt].
- Level 2:
- Good stripping: Same as level 1 plus compression of
multiple whitespaces (more than one in sequence) to single whitespaces
[txt,tag] and stripping of trailing whitespaces at the end of a line
[txt,tag,pre].
This level is the default because it provides good optimization while
the HTML markup is not destroyed and remains human readable.
- Level 3:
- Best stripping: Same as level 2 plus stripping of leading
whitespaces on a line [txt]. This level can still be recommended when you
want to make sure that the HTML markup is not destroyed in any case, but
the resulting code is a little ugly because of the removed
whitespaces.
- Level 4:
- Expert stripping: Same as level 3 plus stripping of HTML
comment lines ("<!-- ... -->") and crunching of HTML
tag ends [tag]. BE CAREFUL HERE: Comment lines are widely
used to hide Java or JavaScript code from browsers which are not
capable of ignoring such code. When using this optimization level, make
sure all your JavaScript code stays hidden correctly by placing HTMLstrip's
"<nostrip>" tags around the comment delimiters.
- Level 5:
- Crazy stripping: Same as level 4 plus wrapping lines around
to fit in an 80 column view window. This saves some newlines, but it both
leads to really unreadable markup code and opens the door to a lot of
problems when this code is used to lay out the page in a browser. Use
with care. This is only experimental!
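As a rough illustration of how levels 0 to 3 build on each other, here is a minimal sketch that handles plain text (txt) only. It is not HTMLstrip's implementation: the function name `strip_text` is hypothetical, a sharp line is assumed to be a line whose first character is "#", and tags, preformatted areas and levels 4 and 5 are ignored.

```python
import re


def strip_text(text, level=2):
    """Apply a simplified version of levels 0-3 to plain text only."""
    out = []
    for line in text.split("\n"):
        # Level 0 (always): drop sharp/comment lines ("#...").
        # Assumption: a sharp line starts with "#" in the first column.
        if line.startswith("#"):
            continue
        if level >= 2:
            # Level 2: compress runs of spaces/tabs and strip trailing
            # whitespace at the end of the line.
            line = re.sub(r"[ \t]+", " ", line).rstrip()
        if level >= 3:
            # Level 3: strip leading whitespace.
            line = line.lstrip()
        # Level 1: drop blank and empty lines.
        if level >= 1 and line.strip() == "":
            continue
        out.append(line)
    return "\n".join(out)
```

For the input `"# comment\nHello   world  \n\n   indented  text\n"`, level 1 removes the comment and the empty lines, level 2 additionally gives `"Hello world\n indented text"`, and level 3 also removes the remaining leading space.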
Additionally the following global strippings are done:
- "^\n":
- A leading newline is always stripped.
- "<suck>":
- The "<suck>" tag just absorbs itself and
all whitespace around it. This is like the backslash for
line-continuation, but is done in Pass 8, i.e. really at the end. Use this
inside HTML tag definitions to absorb whitespace, for instance around
%body when used inside "<table>" structures, which in some
places are newline-sensitive in Netscape Navigator.
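The two global strippings can be sketched as follows. This is only an approximation under stated assumptions: the function name `apply_global_strippings` is hypothetical, the tag is matched literally as "<suck>", and the order of the two steps is chosen for the example rather than taken from HTMLstrip.

```python
import re


def apply_global_strippings(text):
    """Sketch of the two global strippings; not HTMLstrip's own code."""
    # Absorb every <suck> tag together with all whitespace around it,
    # including newlines on either side.
    text = re.sub(r"\s*<suck>\s*", "", text)
    # Strip a single leading newline, if there is one.
    if text.startswith("\n"):
        text = text[1:]
    return text
```

With the input `"\n<td>\n  <suck>\n  cell\n  <suck>\n</td>"` the sketch returns `"<td>cell</td>"`, which is the kind of whitespace-free table cell the "<suck>" tag is meant to produce.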
OPTIONS
- -o outputfile
- This redirects the output to outputfile. The
output is sent to "stdout" if no such option is specified
or outputfile is "-".
- -O level
- This sets the optimization/stripping level, i.e. how much
HTMLstrip should compress the contents.
- -b blocksize
- For efficiency reasons, input is divided into blocks of
16384 characters. If you run into performance problems, you may try changing
this value. Any value between 1024 and 32766 is allowed. With a value of
0, input is not divided into blocks.
- -v
- This sets verbose mode where some processing information
will be given on the console.
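When calling htmlstrip from a script, the options above can be assembled programmatically. The following sketch builds an argument list from the documented options; the helper name `build_htmlstrip_argv` is hypothetical, and the range checks simply mirror the values stated on this page (levels 0 to 5, block size 0 or 1024 to 32766) rather than anything htmlstrip itself is known to enforce.

```python
def build_htmlstrip_argv(inputfile=None, outputfile=None, level=None,
                         blocksize=None, verbose=False):
    """Build an argv list for htmlstrip from the options documented above."""
    argv = ["htmlstrip"]
    if outputfile is not None:
        # "-" (or omitting -o) sends the output to stdout.
        argv += ["-o", outputfile]
    if level is not None:
        # Assumption: only the levels 0..5 described on this page are valid.
        if not 0 <= level <= 5:
            raise ValueError("level must be between 0 and 5")
        argv += ["-O", str(level)]
    if blocksize is not None:
        # 0 disables block splitting; otherwise 1024..32766 is allowed.
        if blocksize != 0 and not 1024 <= blocksize <= 32766:
            raise ValueError("blocksize must be 0 or between 1024 and 32766")
        argv += ["-b", str(blocksize)]
    if verbose:
        argv.append("-v")
    if inputfile is not None:
        # Without an input file, htmlstrip reads from stdin.
        argv.append(inputfile)
    return argv
```

If htmlstrip is installed, the resulting list can be handed to `subprocess.run(argv, check=True)` from the Python standard library.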
AUTHORS
Ralf S. Engelschall
rse@engelschall.com
www.engelschall.com
Denis Barbier
barbier@engelschall.com