.\" .\" SWISH++ .\" extract++.1 .\" .\" Copyright (C) 1998 Paul J. Lucas .\" .\" This program is free software; you can redistribute it and/or modify .\" it under the terms of the GNU General Public License as published by .\" the Free Software Foundation; either version 2 of the License, or .\" (at your option) any later version. .\" .\" This program is distributed in the hope that it will be useful, .\" but WITHOUT ANY WARRANTY; without even the implied warranty of .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the .\" GNU General Public License for more details. .\" .\" You should have received a copy of the GNU General Public License .\" along with this program; if not, write to the Free Software .\" Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. .\" .\" --------------------------------------------------------------------------- .\" define code-start macro .de cS .sp .nf .RS 5 .ft CW .ta .5i 1i 1.5i 2i 2.5i 3i 3.5i 4i 4.5i 5i 5.5i .. .\" define code-end macro .de cE .ft 1 .RE .fi .if !'\\$1'0' .sp .. .\" --------------------------------------------------------------------------- .TH \f3extract++\fP 1 "November 1, 2002" "SWISH++" .SH NAME extract++ \- SWISH++ text extractor .SH SYNOPSIS .B extract++ [ .I options ] .I directory... .I file... .SH DESCRIPTION .B extract++ is the SWISH++ text extractor, a utility to extract what text there is from a (mostly) binary file (similar to the .BR strings (1) command) prior to indexing. Original files are untouched. .PP Text is extracted from the specified files and files in the specified directories; text from files in subdirectories of specified directories is also extracted by default (unless the .BR \-r , .BR \-\-no-recurse , .BR \-f , or .B \-\-filter option or the .B RecurseSubdirs or .B ExtractFilter variable is given). .PP Ordinarily, text is extracted from files either only if their filename matches one of the patterns in the set specified with either the .B \-e or .B \-\-pattern option or the .B IncludeFile variable (unless standard input is used; see next paragraph) or is not among the set specified with either the .B \-E or .B \-\-no-pattern option or the .B ExcludeFile variable. .PP If there is a single filename of `\f(CW-\f1', the list of directories and files to extract is instead taken from standard input (one per line). In this case, filename patterns of files to extract need not be specified explicitly: all files, regardless of whether they match a pattern (unless they are among the set not to extract specified with either the .B \-E or .B \-\-no-pattern option or the .B ExcludeFile variable), are extracted, i.e., .B extract++ assumes you know what you're doing when specifying filenames in this manner. .PP Ordinarily, the text extracted from a file is written to another file in the same directory having the same filename but with the ``\f(CW.txt\fP'' extension appended by default, e.g., ``\f(CWfoo.doc\fP'' becomes ``\f(CWfoo.doc.txt\fP'' after extraction. (See also the .B \-x or .B \-\-extension option or the .B ExtractExtension variable.) However, extraction is not performed if the extracted text file exists. .PP If either the .B \-f or .B \-\-filter option or the .B ExtractFilter variable is given, then only a single file specified on the command line is extracted to standard output. In this case, filename patterns are not used and the existence of an extracted text file is irrelevant. .SS Filters Via the .B FilterFile configuration file variable, files having particular patterns can be filtered prior to extraction. (See the examples in .BR swish++.conf (5).) .SS Character Mapping and Word Determination .B extract++ performs the same character mapping, character entity conversions, and word determination heuristics used by .BR index++ (1) but also additionally: .TP 4 1. Considers all PostScript Level 2 operators that are not also English words to be stop words. Such words in a file usually indicate an encapsulated PostScript (EPS) file and such should not be indexed. .TP 2. Looks specifically for encapsulated PostScript (EPS) data between everything between one of \f(CW%%BeginSetup\fP, \f(CW%%BoundingBox\fP, \f(CW%%Creator\fP, \f(CW%%EndComments\fP, or \f(CW%%Title\fP and \f(CW%%Trailer\fP and discards it. .TP 3. Discards strings of ASCII hex data \f(CWWord_Hex_Min_Size\fP characters or longer, e.g., ``\f(CW7F454C46\fP.'' (Default is 5.) .SS Motivation .B extract++ was developed to be able to index non-text files in proprietary formats such as Microsoft Office documents. There are a couple of reasons why the functionality of .B extract++ isn't simply built into .BR index++ (1): .TP 4 1. Users who do not need to index such documents shouldn't have to pay the performance penalty for doing the extra checks for PostScript and hex data. .TP 2. While .BR index++ (1) can uncompress files on the fly using filters also, uncompressing them every time indexing is performed is excessive. Text extraction, on the other hand, is done only once per file; if the file is updated, the text-extracted version should be deleted and recreated. .SH OPTIONS Options begin with either a `\f(CW-\f1' for short options or a ``\f(CW--\f1'' for long options. Either a `\f(CW-\f1' or ``\f(CW--\f1'' by itself explicitly ends the options; however, the difference is that `\f(CW-\f1' is returned as the first non-option whereas ``\f(CW--\f1'' is skipped entirely. Long option names may be abbreviated so long as the abbreviation is unambiguous. .PP For a short option that takes an argument, the argument is either taken to be the remaining characters of the same option, if any, or, if not, is taken from the next option unless said option begins with a `\f(CW-\f1'. .PP Short options that take no arguments can be grouped (but the last option in the group can take an argument), e.g., \f(CW-lrv4\fP is equivalent to \f(CW-l \-r \-v4\fP. .PP For a long option that takes an argument, the argument is either taken to be the characters after a `\f(CW=\fP', if any, or, if not, is taken from the next option unless said option begins with a `\f(CW-\fP'. .TP 18 .B \-? .br .ns .TP .B \-\-help Print the usage (``help'') message and exit. .TP .BI \-c c .br .ns .TP .BI \-\-config-file= c The name of the configuration file, .IR c , to use. (Default is \f(CWswish++.conf\f1 in the current directory.) A configuration file is not required: if none is specified and the default does not exist, none is used; however, if one is specified and it does not exist, then this is an error. .TP .BI \-e p [, p ...] .br .ns .TP .BI \-\-pattern= p [, p ...] A filename pattern (or set of patterns separated by commas), .IR p , of files to extract text from. Case is significant. Multiple .B \-e or .B \-\-pattern options may be specified. .TP .BI \-E p [, p ...] .br .ns .TP .BI \-\-no-pattern= p [, p ...] A filename pattern or patterns, .IR p , of files .I not to extract text from. Case is significant. Multiple .B \-E or .B \-\-no-pattern options may be specified. .TP .B \-f .br .ns .TP .B \-\-filter Extract a single file to standard output and exit. .TP .B \-l .br .ns .TP .B \-\-follow-links Follow symbolic links during extraction. The default is not to follow them. (This option is not available under Microsoft Windows since it doesn't support symbolic links.) .TP .B \-r .br .ns .TP .B \-\-no-recurse Do not recursively extract the files in subdirectories, that is: when a directory is encountered, all the files in that directory are extracted (modulo the filename patterns specified via the .BR \-e , .BR \-\-pattern , .BR \-E , or .B \-\-no-pattern options or the .B IncludeFile or .B ExcludeFile variables) but subdirectories encountered are ignored and therefore the files contained in them are not extracted. (This option is most useful when specifying the directories and files to extract via standard input.) The default is to extract the files in subdirectories recursively. .TP .BI \-s f .br .ns .TP .BI \-\-stop-file= f The name of a file, .IR f , containing the set stop-words to use instead of the built-in set. Whitespace, including blank lines, and characters starting with \f(CW#\f1 and continuing to the end of the line (comments) are ignored. .TP .B \-S .br .ns .TP .B \-\-dump-stop Dump the built-in set of stop-words to standard output and exit. .TP .BI \-v c .br .ns .TP .BI \-\-verbosity= v The verbosity level, .IR v , for printing additional information to standard output during indexing. The verbosity levels, 0-4, are: .PP .RS 18 .PD 0 .TP 4 0 No output is generated (except for errors). .TP 1 Only run statistics (elapsed time, number of files, word count) are printed. .TP 2 Directories are printed as extraction progresses. .TP 3 Directories and files are printed with a word-count for each file. .TP 4 Same as 3 but also prints all files that are not extracted and why. .RE .PD .RE .TP 18 .B \-V .br .ns .TP .B \-\-version Print the version number of .BR SWISH++ and exit. .TP .BI \-x e .br .ns .TP .BI \-\-extension= e The extension to append to filenames during extraction. (It can be specified with or without the dot; default is \f(CWtxt\f1.) .SH CONFIGURATION FILE The following variables can be set in a configuration file. Variables and command-line options can be mixed. .PP .RS 5 .PD 0 .TP 18 .B ExcludeFile Same as .B \-E or .B \-\-no-pattern .TP .B ExtractExtension Same as .B \-x or .B \-\-extension .TP .B ExtractFilter Same as .B \-f or .B \-\-filter .TP .B FilterAttachment (See FILTERS in .BR swish++.conf (5).) .TP .B FilterFile (See FILTERS in .BR swish++.conf (5).) .TP .B FollowLinks Same as .B \-l or .B \-\-follow-links .TP .B IncludeFile Same as .B \-e or .B \-\-pattern .TP .B RecurseSubdirs Same as .B \-r or .B \-\-no-recurse .TP .B StopWordFile Same as .B \-s or .B \-\-stop-file .TP .B Verbosity Same as .B \-v or .B \-\-verbosity .PD .RE .SH EXAMPLES .SS Extraction To extract text from all Microsoft Office files on a web server: .cS cd /home/www/htdocs extract++ \-v3 \-e '*.doc' \-e '*.ppt' \-e '*.xls' . .cE .SS Filters (See the examples in .BR swish++.conf (5).) .SH EXIT STATUS Exits with one of the values given below: .PP .RS 5 .PD 0 .TP 5 0 Success. .TP 1 Error in configuration file. .TP 2 Error in command-line options. .TP 20 File to extract does not exist. .TP 30 Unable to read stop-word file. .PD .RE .SH CAVEATS .TP 4 1. Text extraction is not perfect, nor can be. .TP 2. As with .BR index++ (1), the word-determination heuristics employed are heavily geared for English. Using SWISH++ as-is to extract files in non-English languages is not recommended. .SH FILES .PD 0 .TP 18 \f(CWswish++.conf\f1 default configuration file name .PD .SH SEE ALSO .BR index++ (1), .BR search++ (1), .BR strings (1), .BR swish++.conf (5), .BR glob (7) .PP Adobe Systems Incorporated. .I PostScript Language Reference Manual, 2nd ed. Addison-Wesley, Reading, MA. pp. 346-359. .PP International Standards Organization. ``ISO/IEC 9945-2: Information Technology -- Portable Operating System Interface (POSIX) -- Part 2: Shell and Utilities,'' 1993. .SH AUTHOR Paul J. Lucas .RI < pauljlucas@mac.com >