.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.42) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is >0, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{\ . if \nF \{\ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{\ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" ======================================================================== .\" .IX Title "CompactTree 3pm" .TH CompactTree 3pm "2022-06-28" "perl v5.34.0" "User Contributed Perl Documentation" .\" For nroff, turn off justification. 
Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. .if n .ad l .nh .SH "NAME" XML::CompactTree \- builder of compact tree structures from XML documents .SH "VERSION" .IX Header "VERSION" Version 0.03 .SH "SYNOPSIS" .IX Header "SYNOPSIS" .Vb 2 \& use XML::CompactTree; \& use XML::LibXML::Reader; \& \& my $reader = XML::LibXML::Reader\->new(location => $url); \& ... \& my $tree = XML::CompactTree::readSubtreeToPerl($reader); \& ... .Ve .SH "DESCRIPTION" .IX Header "DESCRIPTION" This module provides functions that use XML::LibXML::Reader to parse an \s-1XML\s0 document into a parse tree formed of nested arrays (and hashes). .PP It aims to be fast in doing that and to preserve all relevant information from the \s-1XML\s0 (including namespaces, document order, mixed content, etc.). It sacrifices user friendliness for speed. .PP \&\s-1IMPORTANT:\s0 There is an even more efficient \s-1XS\s0 implementation of this module called XML::CompactTree::XS with 100% equivalent functionality. .SH "PURPOSE" .IX Header "PURPOSE" I wrote this module because I noticed that repeated calls to methods implemented in C (\s-1XS\s0) were very expensive in Perl. .PP Therefore traversing a large \s-1DOM\s0 tree using XML::LibXML or iterating over an \s-1XML\s0 stream using XML::LibXML::Reader was much slower than traversing similarly large and structured native Perl data structures. .PP This module allows the user to build a document parse tree consisting of native Perl data structures (arrays and optionally hashes) using XML::LibXML::Reader with a minimal number of \s-1XS\s0 calls. .PP (Note that XML::CompactTree::XS is a 100% equivalent of this module that achieves the same with just one \s-1XS\s0 call.) .PP It does not provide full \s-1DOM\s0 navigation but attempts to provide the maximum amount of information. Its memory footprint should be somewhat smaller than that of a corresponding XML::LibXML \s-1DOM\s0 tree. 
.SH "EXPORT" .IX Header "EXPORT" By default, the following constants are exported (\f(CW\*(C`:flags\*(C'\fR export tag) to be used as flags for the tree builder: .PP .Vb 11 \& XCT_IGNORE_WS \& XCT_IGNORE_SIGNIFICANT_WS \& XCT_IGNORE_PROCESSING_INSTRUCTIONS \& XCT_IGNORE_COMMENTS \& XCT_USE_QNAMES /* not yet implemented */ \& XCT_KEEP_NS_DECLS \& XCT_TEXT_AS_STRING /* not yet implemented */ \& XCT_ATTRIBUTE_ARRAY \& XCT_PRESERVE_PARENT /* not yet implemented */ \& XCT_MERGE_TEXT_NODES /* not yet implemented */ \& XCT_DOCUMENT_ROOT .Ve .SH "FUNCTIONS" .IX Header "FUNCTIONS" .ie n .SS "readSubtreeToPerl( $reader, $flags, \emy %ns )" .el .SS "readSubtreeToPerl( \f(CW$reader\fP, \f(CW$flags\fP, \emy \f(CW%ns\fP )" .IX Subsection "readSubtreeToPerl( $reader, $flags, my %ns )" Uses a given XML::LibXML::Reader parser object to parse the subtree at the current reader position and build a tree formed of nested arrays (see \*(L"\s-1OUTPUT FORMAT\*(R"\s0). .IP "reader" 4 .IX Item "reader" An XML::LibXML::Reader object to use as the reader. While building the tree, the reader moves to the next node on the current or higher level. .IP "flags" 4 .IX Item "flags" An integer consisting of single-bit flags (see constants in the \s-1EXPORT\s0 section). Use bitwise or (|) to combine individual flags. .Sp The following flags are \s-1NOT\s0 implemented yet: .Sp .Vb 1 \& XCT_USE_QNAMES, XCT_TEXT_AS_STRING, XCT_PRESERVE_PARENT, XCT_MERGE_TEXT_NODES .Ve .IP "ns" 4 .IX Item "ns" You may pass an empty hash reference that will be populated with a namespace_uri to namespace_index map that can be used to decode namespace indexes in the resulting data structure (see \s-1OUTPUT FORMAT\s0). .ie n .SS "readLevelToPerl( $reader, $flags, $ns )" .el .SS "readLevelToPerl( \f(CW$reader\fP, \f(CW$flags\fP, \f(CW$ns\fP )" .IX Subsection "readLevelToPerl( $reader, $flags, $ns )" Like \f(CW\*(C`readSubtreeToPerl\*(C'\fR, but reads the subtree at the current reader position and all its following siblings. 
It returns an array reference of representations of these subtrees in the format described in \*(L"\s-1OUTPUT FORMAT\*(R"\s0. .SH "OUTPUT FORMAT" .IX Header "OUTPUT FORMAT" The result of parsing a subtree is a Perl array reference \f(CW$node\fR that contains a node type followed by node data; the interpretation of the further positions in \f(CW$node\fR depends on the node type, as described below: .SS "Any Node" .IX Subsection "Any Node" .IP "\(bu" 5 \&\f(CW$node\fR\->[0] is an integer representing the node type. Use XML::LibXML::Reader node-type constants, e.g. \s-1XML_READER_TYPE_ELEMENT\s0 for an element node, \s-1XML_READER_TYPE_TEXT\s0 for text node, etc. .SS "Document or Document Fragment Nodes" .IX Subsection "Document or Document Fragment Nodes" .IP "\(bu" 5 \&\f(CW$node\fR\->[1] contains the document encoding .IP "\(bu" 5 \&\f(CW$node\fR\->[2] is an array reference containing similar representations of all the child nodes of the document (fragment). .PP Note: XML::LibXML::Reader does not provide a document node by default, which means that calling readSubtreeToPerl on a reader object in its initial state only parses the first node in the document (which can be the root element, but also a comment or a processing instruction). Use the \&\s-1XCT_DOCUMENT_ROOT\s0 flag to force creation of a document node in such a case. .SS "Element nodes" .IX Subsection "Element nodes" .IP "\(bu" 5 \&\f(CW$node\fR\->[1] is the local name (\s-1UTF\-8\s0 encoded character string) .IP "\(bu" 5 \&\f(CW$node\fR\->[2] is the namespace index (see \s-1NAMESPACES\s0 below) .IP "\(bu" 5 \&\f(CW$node\fR\->[3] is undef if the element has no attributes. Otherwise, if the \&\s-1XCT_ATTRIBUTE_ARRAY\s0 flag was used, \f(CW$node\fR\->[3] is an array reference of the form \f(CW\*(C`[ name1, value1, name2, value2, ....]\*(C'\fR of attribute names and corresponding values. 
If the \s-1XCT_ATTRIBUTE_ARRAY\s0 flag was not used, then \&\f(CW$node\fR\->[3] is a hash reference mapping attribute names to the corresponding attribute values \f(CW\*(C`{ name1=>value1, name2=>value2...}\*(C'\fR. .Sp The flag \s-1XCT_KEEP_NS_DECLS\s0 controls whether namespace declarations (xmlns=... or xmlns:prefix=...) are included along with normal attributes or not. .Sp Note: there is no support for namespaced attributes yet, but the attribute names are stored as QNames, so one can always use \&\s-1XCT_KEEP_NS_DECLS\s0 to keep track of namespace prefix declarations and do the resolving manually. Support for namespaced attributes is planned. .IP "\(bu" 5 If the \s-1XCT_LINE_NUMBERS\s0 flag was used, \f(CW$node\fR\->[4] contains the line number of the element and \f(CW$node\fR\->[5] contains an array reference containing similar representations of the child nodes of the current node. .IP "\(bu" 5 If the \s-1XCT_LINE_NUMBERS\s0 flag was \s-1NOT\s0 used, \f(CW$node\fR\->[4] contains an array reference of similar representations of the child nodes of the current node. 
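.PP
The element layout described above can be sketched on a hand-built structure, without running the parser. This is only an illustration: the tree is written by hand rather than produced by the module, the numeric type constants are local copies of the XML::LibXML::Reader values so that the sketch runs on its own, and attributes_of and element_names are hypothetical helpers that are not part of XML::CompactTree.

```perl
use strict;
use warnings;

# Local copies of the reader node-type values, so the sketch is
# self-contained; real code would import XML_READER_TYPE_* from
# XML::LibXML::Reader instead.
use constant {
    TYPE_ELEMENT => 1,    # XML_READER_TYPE_ELEMENT
    TYPE_TEXT    => 3,    # XML_READER_TYPE_TEXT
};

# Hand-built stand-in for <p class="x">Hi<b>there</b></p>, laid out as
# [ type, local name, namespace index, attributes, children ].
my $tree = [ TYPE_ELEMENT, 'p', 0, { class => 'x' }, [
    [ TYPE_TEXT, 'Hi' ],
    [ TYPE_ELEMENT, 'b', 0, undef, [ [ TYPE_TEXT, 'there' ] ] ],
] ];

# Hypothetical helper: return the attributes as a hash reference,
# whether slot 3 holds undef, an array reference or a hash reference.
sub attributes_of {
    my ($node) = @_;
    my $attrs = $node->[3];
    return {} unless defined $attrs;
    return ref($attrs) eq 'ARRAY' ? +{ @$attrs } : $attrs;
}

# Hypothetical helper: collect element names in document order.
# The children sit in slot 5 when line numbers were requested,
# otherwise in slot 4.
sub element_names {
    my ($node, $has_line_numbers) = @_;
    return () unless $node->[0] == TYPE_ELEMENT;
    my $children = $node->[ $has_line_numbers ? 5 : 4 ] || [];
    return ( $node->[1],
             map { element_names($_, $has_line_numbers) } @$children );
}

print join(' ', element_names($tree, 0)), "\n";
print attributes_of($tree)->{class}, "\n";
```

Normalising the attribute slot in one place keeps the rest of a traversal independent of whether the attribute-array flag was used when the tree was built.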
.SS "Text, \s-1CDATA,\s0 Comment and White-Space Nodes" .IX Subsection "Text, CDATA, Comment and White-Space Nodes" .IP "\(bu" 5 \&\f(CW$node\fR\->[1] contains the node value (\s-1UTF\-8\s0 encoded character string) .SS "Unparsed Entity, Processing-Instruction, and Notation Nodes" .IX Subsection "Unparsed Entity, Processing-Instruction, and Notation Nodes" .IP "\(bu" 5 \&\f(CW$node\fR\->[1] contains the local name (there is no support for namespaces on these types of nodes yet) .IP "\(bu" 5 \&\f(CW$node\fR\->[2] contains the node value .SS "Skipping Less-Significant Nodes" .IX Subsection "Skipping Less-Significant Nodes" White-space (non-significant or significant), processing-instruction and comment nodes can be completely skipped, using the following flags: .PP .Vb 4 \& XCT_IGNORE_WS \& XCT_IGNORE_SIGNIFICANT_WS \& XCT_IGNORE_PROCESSING_INSTRUCTIONS \& XCT_IGNORE_COMMENTS .Ve .SH "NAMESPACES" .IX Header "NAMESPACES" Namespaces of element nodes are stored in the element node as an integer. 0 always represents nodes without a namespace; all other namespaces are assigned unique numbers in increasing order as they appear. You can pass an empty hash reference to the parsing functions to obtain the mapping. 
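.PP
Decoding a namespace index can be sketched as follows. The hash here is hand-written and only assumed to have the same shape as the one the parsing functions populate (namespace URI to index); the URIs and the helper namespace_of are hypothetical and chosen purely for illustration.

```perl
use strict;
use warnings;

# Hand-written map shaped like the populated hash:
# namespace URI => namespace index. Index 0 never appears in it.
my %ns = (
    'http://www.w3.org/1999/xhtml' => 1,
    'http://www.w3.org/2000/svg'   => 2,
);

# Invert the map into an array so an index can be looked up directly.
my @ns_uri;
$ns_uri[ $ns{$_} ] = $_ for keys %ns;

# Hypothetical helper: resolve the index stored in $node->[2];
# index 0 means the element has no namespace.
sub namespace_of {
    my ($index) = @_;
    return $index ? $ns_uri[$index] : undef;
}

for my $index (0 .. 2) {
    my $uri = namespace_of($index);
    print "$index => ", (defined $uri ? $uri : '(no namespace)'), "\n";
}
```

Treating index 0 explicitly avoids interpolating an undefined value for elements that have no namespace.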
.SS "Example" .IX Subsection "Example" .Vb 2 \& use XML::CompactTree; \& use XML::LibXML::Reader; \& \& my $reader = XML::LibXML::Reader\->new(location => $ARGV[0]); \& my %ns; \& my $data = XML::CompactTree::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT, \e%ns ); \& my @ns_map; # namespace index => namespace URI \& $ns_map[$ns{$_}]=$_ for keys %ns; \& my @nodes = ($data); \& while (@nodes) { \& my $node = shift @nodes; \& my $type = $node\->[0]; \& if ($type == XML_READER_TYPE_ELEMENT) { \& print "element $node\->[1] is from ns $node\->[2] \*(Aq$ns_map[$node\->[2]]\*(Aq\en"; \& push @nodes, @{$node\->[4]}; # queue children \& } elsif ($type == XML_READER_TYPE_DOCUMENT) { \& push @nodes, @{$node\->[2]}; # queue children \& } \& } .Ve .SH "PLANNED FEATURES" .IX Header "PLANNED FEATURES" Planned flags: .PP .Vb 4 \& XCT_USE_QNAMES \- use QNames instead of local names for all nodes \& XCT_TEXT_AS_STRING \- put text nodes into the tree as plain scalars \& XCT_PRESERVE_PARENT \- add a slot with a weak reference to the parent node \& XCT_MERGE_TEXT_NODES \- merge adjacent text/cdata nodes together .Ve .PP Features: allow blessing the array refs to default or user-specified classes; the default classes would provide a very small subset of \s-1DOM\s0 methods to retrieve node information, manipulate the tree, and possibly serialize the parse tree back to \s-1XML.\s0 .SH "AUTHOR" .IX Header "AUTHOR" Petr Pajas, \f(CW\*(C`\*(C'\fR .SH "BUGS" .IX Header "BUGS" Please report any bugs or feature requests to \&\f(CW\*(C`bug\-xml\-compacttree\-xs@rt.cpan.org\*(C'\fR, or through the web interface at . I will be notified, and then you'll automatically be notified of progress on your bug as I make changes. .SH "COPYRIGHT & LICENSE" .IX Header "COPYRIGHT & LICENSE" Copyright 2008\-2009 Petr Pajas, All Rights Reserved. .PP This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. .SH "SEE ALSO" .IX Header "SEE ALSO" .Vb 1 \& XML::CompactTree::XS \& \& XML::LibXML::Reader .Ve