NAME
MojoMojo::Declaw - Cleans HTML as well as CSS of scripting and other executable
contents, and neutralises XSS attacks. Derived from HTML::Defang version 1.01.
SYNOPSIS
use MojoMojo::Declaw;

my $InputHtml = "<html><body></body></html>";
my $Defang = MojoMojo::Declaw->new(
    context => $Self,
    fix_mismatched_tags => 1,
    tags_to_callback => [ qw(br embed img) ],
    tags_callback => \&DefangTagsCallback,
    url_callback => \&DefangUrlCallback,
    css_callback => \&DefangCssCallback,
    attribs_to_callback => [ qw(border src) ],
    attribs_callback => \&DefangAttribsCallback
);
my $SanitizedHtml = $Defang->defang($InputHtml);

# Callback for custom handling of specific HTML tags
sub DefangTagsCallback {
    my ($Self, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;
    return 1 if $lcTag eq 'br'; # Explicitly defang this tag, even though it is safe
    return 0 if $lcTag eq 'embed'; # Explicitly whitelist this tag, even though it is unsafe
    return 2 if $lcTag eq 'img'; # No special decision for this tag, so process it as HTML::Defang normally would
}

# Callback for custom handling of URLs in HTML attributes as well as style tag/attribute declarations
sub DefangUrlCallback {
    my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash, $HtmlR) = @_;
    return 0 if $$AttrValR =~ /safesite\.com/i; # Explicitly allow this URL in tag attributes or stylesheets
    return 1 if $$AttrValR =~ /evilsite\.com/i; # Explicitly defang this URL in tag attributes or stylesheets
    return 2; # Process any other URL normally
}

# Callback for custom handling of style tags/attributes
sub DefangCssCallback {
    my ($Self, $Defang, $Selectors, $SelectorRules, $Tag, $IsAttr) = @_;
    my $i = 0;
    foreach (@$Selectors) {
        my $SelectorRule = $$SelectorRules[$i];
        foreach my $KeyValueRules (@$SelectorRule) {
            foreach my $KeyValueRule (@$KeyValueRules) {
                my ($Key, $Value) = @$KeyValueRule;
                $$KeyValueRule[2] = 1 if $Value =~ /!important/; # Comment out any '!important' directive
                $$KeyValueRule[2] = 1 if $Key =~ /position/ && $Value =~ /fixed/; # Comment out any 'position: fixed' declaration
            }
        }
        $i++;
    }
}

# Callback for custom handling of HTML tag attributes
sub DefangAttribsCallback {
    my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR) = @_;
    $$AttrValR = '0' if $lcAttrKey eq 'border'; # Change all 'border' attribute values to zero
    return 1 if $lcAttrKey eq 'src'; # Defang all 'src' attributes
    return 0;
}
DESCRIPTION
This module accepts an input HTML and/or CSS string and removes any executable
code including scripting, embedded objects, applets, etc., and neutralises any
XSS attacks. A whitelist based approach is used which means only HTML known to
be safe is allowed through.
HTML::Defang uses a custom HTML tag parser. The parser has been designed and
tested to work with nasty real-world HTML and to emulate as closely as
possible what browsers actually do with strange-looking constructs. The test
suite has been built from examples from a range of sources such as
http://ha.ckers.org/xss.html and
http://imfo.ru/csstest/css_hacks/import.php
to ensure that as many XSS attack scenarios as possible have been dealt with.
HTML::Defang can make callbacks to client code when it encounters the following:
- When a specified tag is parsed
- When a specified attribute is parsed
- When a URL is parsed as part of an HTML attribute, or CSS
property value.
- When style data is parsed, as part of an HTML style
attribute, or as part of an HTML <style> tag.
The callbacks include details about the current tag/attribute that is being
parsed, and also give a scalar reference to the input HTML. Querying
pos() on the input HTML should indicate where the module is with
parsing. This gives the client code flexibility in working with HTML::Declaw.
HTML::Declaw can defang whole tags, any attribute in a tag, any URL that appears
as an attribute or style property, or any CSS declaration in a declaration
block in a style rule. This helps one to precisely block the most specific
unwanted elements in the contents (for example, block just an offending
attribute instead of the whole tag), while retaining any safe HTML/CSS.
CONSTRUCTOR
- MojoMojo::Declaw->new(%Options)
- Constructs a new HTML::Declaw object. The following options
are supported:
- Options
- tags_to_callback
- Array reference of tags for which a callback should be
made. If a tag in this array is parsed, the subroutine
tags_callback() is invoked.
- attribs_to_callback
- Array reference of tag attributes for which a callback
should be made. If an attribute in this array is parsed, the subroutine
attribs_callback() is invoked.
- tags_callback
- Subroutine reference to be invoked when a tag listed in
@$tags_to_callback is parsed.
- attribs_callback
- Subroutine reference to be invoked when an attribute listed
in @$attribs_to_callback is parsed.
- url_callback
- Subroutine reference to be invoked when a URL is detected
in an HTML tag attribute or a CSS property.
- css_callback
- Subroutine reference to be invoked when CSS data is found
either as the contents of a 'style' attribute in an HTML tag, or as the
contents of a <style> HTML tag.
- fix_mismatched_tags
- This property, if set, fixes mismatched tags in the HTML
input. By default, tags present in the default %mismatched_tags_to_fix
hash are fixed. This set of tags can be overridden by passing an array
reference as the mismatched_tags_to_fix option to the constructor. Any opened
tags in the set are automatically closed if no corresponding closing tag is
found. If an unbalanced closing tag is found, it is commented out.
- mismatched_tags_to_fix
- Array reference of tags for which the code would check for
matching opening and closing tags. See the property
$fix_mismatched_tags.
- context
- You can pass an arbitrary scalar as a 'context' value
that's then passed as the first parameter to all callback functions. Most
commonly this is something like '$Self'.
- Debug
- If set, prints debugging output.
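A minimal sketch of a constructor call that combines fix_mismatched_tags with an explicit mismatched_tags_to_fix list. The tag list and the input string are illustrative assumptions, not defaults of the module. The module is loaded inside eval so the snippet still runs where MojoMojo::Declaw is not installed.

```perl
use strict;
use warnings;

# Options documented above; the tag list itself is only an example.
my %Options = (
    fix_mismatched_tags    => 1,
    mismatched_tags_to_fix => [ qw(div span table) ],
    Debug                  => 0,
);

# Load the module at run time so a missing install is reported, not fatal.
if ( eval { require MojoMojo::Declaw; 1 } ) {
    my $Defang = MojoMojo::Declaw->new(%Options);

    # The input has two unclosed tags that appear in the list above.
    print $Defang->defang('<div><span>text'), "\n";
}
else {
    print "MojoMojo::Declaw is not installed; options were only built.\n";
}
```

The exact output depends on the installed version, so none is shown here.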
CALLBACK METHODS
- COMMON PARAMETERS
- A number of the callbacks share the same parameters. These
common parameters are documented here. Certain variables may have specific
meanings in certain callbacks, so be sure to check the documentation for
that method first before referring to this section.
- $context
- You can pass an arbitrary scalar as a 'context' value
that's then passed as the first parameter to all callback functions. Most
commonly this is something like '$Self'.
- $Defang
- Current HTML::Declaw instance
- $OpenAngle
- Opening angle (<) sign of the current tag.
- $lcTag
- Lower case version of the HTML tag that is currently being
parsed.
- $IsEndTag
- Has the value '/' if the current tag is a closing tag.
- $AttributeHash
- A reference to a hash containing the attributes of the
current tag and their values. Each value is a scalar reference to the
value, rather than just a scalar value. You can add attributes (remember
to make it a scalar ref, eg $AttributeHash{"newattr"} =
\"newval"), delete attributes, or modify attribute values in
this hash, and any changes you make will be incorporated into the output
HTML stream.
The attribute values will have any entity references decoded before being
passed to you, and any unsafe values will be re-encoded back into the HTML
stream.
So for instance, the tag:
<div title="&lt;&quot;Hi there &lt;">
Will have the attribute hash:
{ title => \q[<"Hi there <] }
And will be turned back into the HTML on output:
<div title="&lt;&quot;Hi there &lt;">
- $CloseAngle
- Anything after the end of last attribute including the
closing HTML angle(>)
- $HtmlR
- A scalar reference to the input HTML. The input HTML is
parsed using m/\G$SomeRegex/c constructs, so to continue from where
HTML::Declaw left off, clients can use m/\G$SomeRegex/c for further processing
on the input. One can also use the pos() function to determine where
HTML::Declaw left off. This combined with the add_to_output() method should give
reasonable flexibility for the client to process the input.
- $OutR
- A scalar reference to the processed output HTML so
far.
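The \G and pos() mechanics described for $HtmlR can be tried in isolation. The sketch below does not use the module at all; $Html and the two regexes are assumptions standing in for the parser state a callback would see. It uses /gc, since /g in scalar context is what advances pos() and /c keeps the position when a match fails.

```perl
use strict;
use warnings;

my $Html  = '<b>bold</b> trailing text';
my $HtmlR = \$Html;    # callbacks receive a reference like this

# Consume the opening tag, as a parser would.
$$HtmlR =~ m/\G<b>/gc or die "expected <b>";
my $PosAfterTag = pos($$HtmlR);    # offset just past "<b>"

# Continue from that position and capture the text up to the next "<".
my $Text = '';
if ( $$HtmlR =~ m/\G([^<]*)/gc ) {
    $Text = $1;
}
my $PosAfterText = pos($$HtmlR);

print "after tag: $PosAfterTag, text: '$Text', after text: $PosAfterText\n";
# prints: after tag: 3, text: 'bold', after text: 7
```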
- tags_callback($context, $Defang,
$OpenAngle , $lcTag, $IsEndTag,
$AttributeHash, $CloseAngle,
$HtmlR, $OutR)
- If $Defang->{tags_callback} exists, and HTML::Declaw has
parsed a tag present in $Defang->{tags_to_callback}, the above callback
is made to the client code. The return value of this method determines
whether the tag is defanged or not. More details below.
- Return values
- 0
- The current tag will not be defanged.
- 1
- The current tag will be defanged.
- 2
- The current tag will be processed normally by HTML::Declaw
as if there was no callback method specified.
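As a sketch of how the three return values and the $AttributeHash scalar references fit together, the hypothetical callback below adds a missing alt attribute to img tags and always defangs embed. The function name and the choice of tags are assumptions. It is called directly with hand-built arguments so that it runs without the module.

```perl
use strict;
use warnings;

# Hypothetical callback; parameters follow the documented order.
sub MyTagsCallback {
    my ($Context, $Defang, $OpenAngle, $lcTag, $IsEndTag,
        $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;

    if ( $lcTag eq 'img' && !$IsEndTag ) {
        # Values are scalar refs, so a new attribute must be a ref too.
        $AttributeHash->{alt} = \"" unless exists $AttributeHash->{alt};
        delete $AttributeHash->{onerror};
        return 2;    # let the module apply its normal rules
    }
    return 1 if $lcTag eq 'embed';    # always defang
    return 2;                         # default handling for everything else
}

# Direct call with hand-built arguments, only to show the behaviour.
my %Attrs = ( src => \'a.png', onerror => \'alert(1)' );
my $Result = MyTagsCallback( undef, undef, '<', 'img', '', \%Attrs, '>', \'', \'' );
print "returned $Result, attributes: ", join( ',', sort keys %Attrs ), "\n";
# prints: returned 2, attributes: alt,src
```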
- attribs_callback($context, $Defang,
$lcTag, $lcAttrKey, $AttrVal,
$HtmlR, $OutR)
- If $Defang->{attribs_callback} exists, and HTML::Declaw
has parsed an attribute present in $Defang->{attribs_to_callback}, the
above callback is made to the client code. The return value of this method
determines whether the attribute is defanged or not. More details
below.
- Method parameters
- $lcAttrKey
- Lower case version of the HTML attribute that is currently
being parsed.
- $AttrVal
- Reference to the HTML attribute value that is currently
being parsed.
See $AttributeHash for details of decoding.
- Return values
- 0
- The current attribute will not be defanged.
- 1
- The current attribute will be defanged.
- 2
- The current attribute will be processed normally by
HTML::Declaw as if there was no callback method specified.
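Because the attribute value arrives as a reference, a callback can rewrite it in place and still return 0 to keep the attribute. The sketch below assumes attribs_to_callback => [ qw(width target) ]; the function name and the cap of 800 are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical callback for the attributes 'width' and 'target'.
sub MyAttribsCallback {
    my ($Context, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR, $OutR) = @_;

    if ( $lcAttrKey eq 'width' ) {
        # Rewrite through the reference; 800 is an assumed upper limit.
        $$AttrValR = 800 if $$AttrValR =~ /^\d+$/ && $$AttrValR > 800;
        return 0;    # keep the (possibly rewritten) attribute
    }
    return 1 if $lcAttrKey eq 'target';    # defang every target attribute
    return 2;                              # otherwise use normal processing
}

# Direct call with a hand-built value, only to show the behaviour.
my $Width = '4000';
my $Code  = MyAttribsCallback( undef, undef, 'img', 'width', \$Width, \'', \'' );
print "code $Code, width now $Width\n";
# prints: code 0, width now 800
```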
- url_callback($context, $Defang,
$lcTag , $lcAttrKey, $AttrVal,
$AttributeHash, $HtmlR,
$OutR)
- If $Defang->{url_callback} exists, and HTML::Declaw has
parsed a URL, the above callback is made to the client code. The return
value of this method determines whether the attribute containing the URL
is defanged or not. URL callbacks can be made from <style> tags as
well as style attributes, in which case the particular style declaration will
be commented out. More details below.
- Method parameters
- $lcAttrKey
- Lower case version of the HTML attribute that is currently
being parsed. However if this callback is made as a result of parsing a
URL in a style attribute, $lcAttrKey will be set to the string
style, or will be set to undef if this callback is made as a
result of parsing a URL inside a style tag.
- $AttrVal
- Reference to the URL value that is currently being
parsed.
- $AttributeHash
- A reference to a hash containing the attributes of the
current tag and their values. Each value is a scalar reference to the
value, rather than just a scalar value. You can add attributes (remember
to make it a scalar ref, eg $AttributeHash{"newattr"} =
\"newval"), delete attributes, or modify attribute values in
this hash, and any changes you make will be incorporated into the output
HTML stream. Will be set to undef if the callback is made due to
URL in a <style> tag or attribute.
- Return values
- 0
- The current URL will not be defanged.
- 1
- The current URL will be defanged.
- 2
- The current URL will be processed normally by HTML::Declaw
as if there was no callback method specified.
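One way to use these return values is an allow list of hosts. In the sketch below the host names, the function name and the URL regex are all assumptions. Anything that is not an absolute http(s) URL is handed back to the module with 2.

```perl
use strict;
use warnings;

# Assumed allow list; the host names are placeholders.
my %AllowedHost = map { ( $_ => 1 ) } qw(example.org static.example.org);

sub MyUrlCallback {
    my ($Context, $Defang, $lcTag, $lcAttrKey, $AttrValR,
        $AttributeHash, $HtmlR, $OutR) = @_;

    # Extract the host from an absolute http(s) URL.
    my ($Host) = $$AttrValR =~ m{^https?://([^/:?#]+)}i;
    return 2 unless defined $Host;              # not absolute: normal rules
    return $AllowedHost{ lc $Host } ? 0 : 1;    # allow listed hosts only
}

# $lcAttrKey and $AttributeHash are undef for a URL inside a <style> tag.
my $Good = 'https://example.org/logo.png';
my $Bad  = 'http://evilsite.com/x.js';
my $GoodCode = MyUrlCallback( undef, undef, 'img',   'src', \$Good, {},    \'', \'' );
my $BadCode  = MyUrlCallback( undef, undef, 'style', undef, \$Bad,  undef, \'', \'' );
print "good: $GoodCode, bad: $BadCode\n";
# prints: good: 0, bad: 1
```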
- css_callback($context, $Defang,
$Selectors , $SelectorRules,
$lcTag, $IsAttr, $OutR)
- If $Defang->{css_callback} exists, and HTML::Declaw has
parsed a <style> tag or style attribute, the above callback is made
to the client code. The return value of this method determines whether a
particular declaration in the style rules is defanged or not. More details
below.
- Method parameters
- $Selectors
- Reference to an array containing the selectors in a style
tag or attribute.
- $SelectorRules
- Reference to an array containing the style declaration
blocks of all selectors in a style tag or attribute. Consider the below
CSS:
a { b:c; d:e}
j { k:l; m:n}
The declaration blocks will get parsed into the following data structure:
[
[
[ "b", "c", 2],
[ "d", "e", 2]
],
[
[ "k", "l", 2],
[ "m", "n", 2]
]
]
So, generally each property:value pair in a declaration is parsed into an
array of the form
["property", "value", X]
where X can be 0, 1 or 2, and 2 is the default value. A client can manipulate
this value to instruct HTML::Declaw to defang this property:value pair.
0 - Do not defang
1 - Defang the property:value pair
2 - Process this as if there is no callback specified
- $IsAttr
- True if the currently processed item is a style attribute.
False if the currently processed item is a style tag.
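The flag in each ["property", "value", X] triple can be set from a css_callback. The sketch below is a hypothetical callback that marks every declaration containing url() for defanging. Since the SYNOPSIS iterates one level deeper than the structure printed above, the helper VisitDeclarations (an assumption, not part of the module) descends until it reaches a triple, so it works with either nesting.

```perl
use strict;
use warnings;

# Visit every [property, value, flag] triple, whatever the nesting depth.
sub VisitDeclarations {
    my ($Node, $Visitor) = @_;
    if ( @$Node && !ref $Node->[0] ) {
        $Visitor->($Node);    # first element is a plain string: a triple
        return;
    }
    VisitDeclarations( $_, $Visitor ) for @$Node;
}

# Hypothetical css_callback: defang url() values, leave the rest untouched.
sub MyCssCallback {
    my ($Context, $Defang, $Selectors, $SelectorRules, $lcTag, $IsAttr, $OutR) = @_;
    VisitDeclarations( $SelectorRules, sub {
        my ($Rule) = @_;
        $Rule->[2] = 1 if $Rule->[1] =~ /url\s*\(/i;
    } );
    return;
}

# The structure from the example above, with one value changed to a url().
my @Selectors = ( 'a', 'j' );
my @Rules = (
    [ [ 'b', 'c', 2 ], [ 'd', 'url(http://x.test/i.png)', 2 ] ],
    [ [ 'k', 'l', 2 ], [ 'm', 'n', 2 ] ],
);
MyCssCallback( undef, undef, \@Selectors, \@Rules, 'style', 0, \'' );
print join( ' ', map { $_->[2] } map { @$_ } @Rules ), "\n";
# prints: 2 1 2 2
```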
METHODS
- PUBLIC METHODS
- defang($InputHtml)
- Cleans up $InputHtml of any executable code including
scripting, embedded objects, applets, etc., and defangs any XSS
attacks.
- Method parameters
- $InputHtml
- The input HTML string that needs to be sanitized.
Returns the cleaned HTML. If fix_mismatched_tags is set, any tags that appear in
@$mismatched_tags_to_fix that are unbalanced are automatically commented or
closed.
- add_to_output($String)
- Appends $String to the output after the current parsed tag
ends. Can be used by client code in callback methods to add HTML text to
the processed output. If the HTML text needs to be defanged, client code
can safely call HTML::Declaw->defang() recursively from within
the callback.
- Method parameters
- $String
- The string that is added after the current parsed tag
ends.
- defang_and_add_to_output
- Defangs and adds the result to the output.
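A sketch of add_to_output() used from inside a tags callback: the tag is defanged and a visible note is queued after it. StubDefang is a stand-in written only so the snippet runs without the module; it records what add_to_output() receives and is not part of MojoMojo::Declaw. The callback name and the note text are likewise assumptions.

```perl
use strict;
use warnings;

# Stand-in for a real instance; it only records the strings it is given.
{
    package StubDefang;
    sub new { return bless { added => [] }, shift }
    sub add_to_output {
        my ($Self, $String) = @_;
        push @{ $Self->{added} }, $String;
        return;
    }
}

# Hypothetical tags_callback: defang <iframe> and leave a visible note.
sub NoteRemovedFrames {
    my ($Context, $Defang, $OpenAngle, $lcTag, $IsEndTag) = @_;
    if ( $lcTag eq 'iframe' && !$IsEndTag ) {
        $Defang->add_to_output('<em>[embedded frame removed]</em>');
        return 1;    # defang the tag itself
    }
    return 2;        # normal processing for everything else
}

my $Stub = StubDefang->new;
my $Code = NoteRemovedFrames( undef, $Stub, '<', 'iframe', '' );
print "code $Code, queued: $Stub->{added}[0]\n";
# prints: code 1, queued: <em>[embedded frame removed]</em>
```

With a real instance, the same callback would be registered through tags_callback with iframe listed in tags_to_callback.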
- INTERNAL METHODS
- Generally these methods never need to be called by users of
the class, because they'll be called internally as the appropriate tags
are encountered, but they may be useful for some users in some cases.
- defang_script($OutR, $HtmlR,
$TagOps , $OpenAngle, $IsEndTag,
$Tag, $TagTrail, $Attributes,
$CloseAngle)
- This method is invoked when a <script> tag is parsed.
Defangs the <script> opening tag, and any closing tag. Any scripting
content is also commented out, so browsers don't display them.
Returns 1 to indicate that the <script> tag must be defanged.
- Method parameters
- $OutR
- A reference to the processed output HTML before the tag
that is currently being parsed.
- $HtmlR
- A scalar reference to the input HTML.
- $TagOps
- Indicates what operation should be done on a tag. Can be
undefined, integer or code reference. Undefined indicates an unknown tag
to HTML::Declaw, 1 indicates a known safe tag, 0 indicates a known unsafe
tag, and a code reference indicates a subroutine that should be called to
parse the current tag. For example, <style> and <script> tags
are parsed by dedicated subroutines.
- $OpenAngle
- Opening angle(<) sign of the current tag.
- $IsEndTag
- Has the value '/' if the current tag is a closing tag.
- $Tag
- The HTML tag that is currently being parsed.
- $TagTrail
- Any space after the tag, but before attributes.
- $Attributes
- A reference to an array of the attributes and their values,
including any surrounding spaces. Each element of the array is added by
'push' calls like below.
push @$Attributes, [ $AttributeName, $SpaceBeforeEquals, $EqualsAndSubsequentSpace, $QuoteChar, $AttributeValue, $QuoteChar, $SpaceAfterAtributeValue ];
- $CloseAngle
- Anything after the end of last attribute including the
closing HTML angle(>)
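The seven-field layout of each $Attributes entry can be explored without the module. The sketch below builds two entries by hand (the attribute names and values are assumptions) and shows that, under this reading of the layout, joining the fields gives back the attribute text with its original spacing and quoting.

```perl
use strict;
use warnings;

# One entry per attribute: name, space before '=', '=' plus following
# space, opening quote, value, closing quote, trailing space.
my @Attributes;
push @Attributes, [ 'src',    '',  '=',  '"', 'a.png', '"', ' ' ];
push @Attributes, [ 'border', ' ', '= ', "'", '0',     "'", ''  ];

# Joining the fields of each entry restores the attribute text.
my $AttributeText = join '', map { join '', @$_ } @Attributes;
print "<img $AttributeText>\n";
# prints: <img src="a.png" border = '0'>

# Field 0 is the name and field 4 the value.
my %Values = map { ( lc( $_->[0] ) => $_->[4] ) } @Attributes;
print "$_=$Values{$_}\n" for sort keys %Values;
```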
- defang_style($OutR, $HtmlR,
$TagOps , $OpenAngle, $IsEndTag,
$Tag, $TagTrail, $Attributes,
$CloseAngle, $IsAttr)
- Builds a list of selectors and declarations from HTML style
tags as well as style attributes in HTML tags and calls
defang_stylerule() to do the actual defanging.
Returns 0 to indicate that style tags must not be defanged.
- Method parameters
- $IsAttr
- Whether we are currently parsing a style attribute or style
tag. $IsAttr will be true if we are currently parsing a style
attribute.
For a description of other parameters, see documentation of
defang_script() method
- cleanup_style($StyleString)
- Helper function to clean up CSS data. This function
directly operates on the input string without taking a copy.
- Method parameters
- $StyleString
- The input style string that is cleaned.
- defang_stylerule($SelectorsIn,
$StyleRules, $lcTag, $IsAttr,
$HtmlR, $OutR)
- Defangs style data.
- Method parameters
- $SelectorsIn
- An array reference to the selectors in the style
tag/attribute contents.
- $StyleRules
- An array reference to the declaration blocks in the style
tag/attribute contents.
- $lcTag
- Lower case version of the HTML tag that is currently being
parsed.
- $IsAttr
- Whether we are currently parsing a style attribute or style
tag. $IsAttr will be true if we are currently parsing a style
attribute.
- $HtmlR
- A scalar reference to the input HTML.
- $OutR
- A scalar reference to the processed output so far.
- defang_attributes($OutR, $HtmlR,
$TagOps , $OpenAngle, $IsEndTag,
$Tag, $TagTrail, $Attributes,
$CloseAngle)
- Defangs attributes, defangs tags, does tag, attrib, css and
url callbacks.
- Method parameters
- For a description of the method parameters, see
documentation of defang_script() method
- cleanup_attribute($AttributeString)
- Helper function to clean up attributes.
- Method parameters
- $AttributeString
- The value of the attribute.
get_applicable_charset
Get the charset from the content meta attribute?
SEE ALSO
HTML::Defang,
<http://mailtools.anomy.net/>,
<http://htmlcleaner.sourceforge.net/>,
HTML::StripScripts,
HTML::Detoxifier,
HTML::Sanitizer,
HTML::Scrubber
AUTHOR
Kurian Jose Aerthail <cpan@kurianja.fastmail.fm>. Thanks to Rob Mueller
<cpan@robm.fastmail.fm> for the initial code, guidance, support and bug
fixes.
COPYRIGHT AND LICENSE
HTML::Declaw is a modified version of HTML::Defang, which has the following
license:
Copyright (C) 2003-2009 by The FastMail Partnership
This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.