.\" -*- mode: troff; coding: utf-8 -*-
.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
.ie n \{\
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "Text::Ngram 3pm"
.TH Text::Ngram 3pm 2024-03-07 "perl v5.38.2" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH NAME
Text::Ngram \- Ngram analysis of text
.SH SYNOPSIS
.IX Header "SYNOPSIS"
.Vb 4
\&  use Text::Ngram qw(ngram_counts add_to_counts);
\&  my $text = "abcdefghijklmnop";
\&  my $hash_r = ngram_counts($text, 3); # Window size = 3
\&  # $hash_r => { abc => 1, bcd => 1, ... }
\&
\&  add_to_counts($more_text, 3, $hash_r);
.Ve
.SH DESCRIPTION
.IX Header "DESCRIPTION"
n\-Gram analysis is a field in textual analysis which uses sliding window
character sequences in order to aid topic analysis, language
determination and so on. The n\-gram spectrum of a document can be used
to compare and filter documents in multiple languages, prepare word
prediction networks, and perform spelling correction.
.PP
The neat thing about n\-grams, though, is that they're really easy to
determine. For n=3, for instance, we compute the n\-gram counts like so:
.PP
.Vb 5
\&  the cat sat on the mat
\&  \-\-\-                     $counts{"the"}++;
\&   \-\-\-                    $counts{"he "}++;
\&    \-\-\-                   $counts{"e c"}++;
\&     ...
.Ve
.PP
This module provides an efficient XS-based implementation of n\-gram
spectrum analysis.
.PP
There are two functions which can be imported:
.SS ngram_counts
.IX Subsection "ngram_counts"
This first function returns a hash reference with the n\-gram histogram
of the text for the given window size. The default window size is 5.
.PP
.Vb 1
\&  $href = ngram_counts(\e%config, $text, $window_size);
.Ve
.PP
As of version 0.14, the \f(CW%config\fR may instead be passed in as named
arguments:
.PP
.Vb 1
\&  $href = ngram_counts($text, $window_size, %config);
.Ve
.PP
The only required parameter is \f(CW$text\fR.
.PP
The possible values for \f(CW%config\fR are:
.PP
\fIflankbreaks\fR
.IX Subsection "flankbreaks"
.PP
If set to 1 (default), breaks are flanked by spaces; if set to 0,
they're not. Breaks are punctuation and other non-alphabetic
characters, which, unless you use \f(CW\*(C`punctuation => 1\*(C'\fR in your
configuration, do not make it into the returned hash.
.PP
Here's an example, supposing you're using the default value for
punctuation (0):
.PP
.Vb 2
\&  my $text = "Hello, world";
\&  my $hash = ngram_counts($text, 5);
.Ve
.PP
That produces the following ngrams:
.PP
.Vb 6
\&  {
\&    \*(AqHello\*(Aq => 1,
\&    \*(Aqello \*(Aq => 1,
\&    \*(Aq worl\*(Aq => 1,
\&    \*(Aqworld\*(Aq => 1,
\&  }
.Ve
.PP
On the other hand, this:
.PP
.Vb 2
\&  my $text = "Hello, world";
\&  my $hash = ngram_counts({flankbreaks => 0}, $text, 5);
.Ve
.PP
produces the following ngrams:
.PP
.Vb 5
\&  {
\&    \*(AqHello\*(Aq => 1,
\&    \*(Aq worl\*(Aq => 1,
\&    \*(Aqworld\*(Aq => 1,
\&  }
.Ve
.PP
\fIlowercase\fR
.IX Subsection "lowercase"
.PP
If set to 0, casing is preserved. If set to 1, all letters are
lowercased before counting ngrams. Default is 1.
.PP
.Vb 2
\&  # Get all ngrams of size 4 preserving case
\&  $href_p = ngram_counts( {lowercase => 0}, $text, 4 );
.Ve
.PP
\fIpunctuation\fR
.IX Subsection "punctuation"
.PP
If set to 0 (default), punctuation is removed before calculating the
ngrams. Set to 1 to preserve it.
.PP
.Vb 2
\&  # Get all ngrams of size 2 preserving punctuation
\&  $href_p = ngram_counts( {punctuation => 1}, $text, 2 );
.Ve
.PP
\fIspaces\fR
.IX Subsection "spaces"
.PP
If set to 0 (default is 1), no ngrams containing spaces will be
returned.
.PP
.Vb 2
\&  # Get all ngrams of size 3 that do not contain spaces
\&  $href = ngram_counts( {spaces => 0}, $text, 3);
.Ve
.PP
If you're going to request both types of ngrams, then the best way to
avoid calculating the same thing twice is probably this:
.PP
.Vb 3
\&  $href_with_spaces = ngram_counts($text, $window);   # $window is optional
\&  $href_no_spaces   = { %$href_with_spaces };         # copy, don\*(Aqt alias
\&  for (keys %$href_no_spaces) { delete $href_no_spaces\->{$_} if / / }
.Ve
.SS add_to_counts
.IX Subsection "add_to_counts"
This incrementally adds to the supplied hash; if \f(CW$window\fR is zero or
undefined, then the window size is computed from the hash keys.
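.PP
The following is a rough usage sketch of the incremental behaviour
described above. It relies only on the calling convention documented
on this page; the sample strings are invented for illustration, and
the exact keys you get back depend on the cleaning options described
under \f(CW\*(C`ngram_counts\*(C'\fR.
.PP
.Vb 3
\&  use strict;
\&  use warnings;
\&  use Text::Ngram qw(ngram_counts add_to_counts);
\&
\&  # Start a trigram histogram from a first chunk of text.
\&  my $href = ngram_counts("the cat sat", 3);
\&
\&  # Fold in a second chunk, naming the window size explicitly.
\&  add_to_counts(" on the mat", 3, $href);
\&
\&  # Pass 0 so the window size is taken from the existing hash keys.
\&  add_to_counts(" by the door", 0, $href);
\&
\&  # Print the combined histogram, one n\-gram per line.
\&  print "\*(Aq$_\*(Aq => $href\->{$_}\en" for sort keys %$href;
.Ve
.PP
The sketch sticks to a single window size throughout, so that every
key in the hash has the same length.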
.PP
.Vb 1
\&  add_to_counts($more_text, $window, $href)
.Ve
.SH "TO DO"
.IX Header "TO DO"
.IP \(bu 6
Look further into the tests. Sort them and add more.
.SH "SEE ALSO"
.IX Header "SEE ALSO"
Cavnar, W. B. (1993). N\-gram-based text filtering for TREC\-2. In
D. Harman (Ed.), \fIProceedings of TREC\-2: Text Retrieval Conference 2\fR.
Washington, DC: National Bureau of Standards.
.PP
Shannon, C. E. (1951). Prediction and entropy of printed English.
\&\fIThe Bell System Technical Journal, 30\fR. 50\-64.
.PP
Ullmann, J. R. (1977). Binary n\-gram technique for automatic correction
of substitution, deletion, insertion and reversal errors in words.
\&\fIComputer Journal, 20\fR. 141\-147.
.SH AUTHOR
.IX Header "AUTHOR"
Maintained by Alberto Simoes, \f(CW\*(C`ambs@cpan.org\*(C'\fR.
.PP
Previously maintained by Jose Castro, \f(CW\*(C`cog@cpan.org\*(C'\fR.
Originally created by Simon Cozens, \f(CW\*(C`simon@cpan.org\*(C'\fR.
.SH "COPYRIGHT AND LICENSE"
.IX Header "COPYRIGHT AND LICENSE"
Copyright 2006 by Alberto Simoes
.PP
Copyright 2004 by Jose Castro
.PP
Copyright 2003 by Simon Cozens
.PP
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.