'\" t
.\" Title: pdf2txt
.\" Author: Jakub Wilk
.\" Generator: DocBook XSL Stylesheets v1.79.1
.\" Date: 12/30/2018
.\" Manual: PDFMiner Manual
.\" Source: pdf2txt
.\" Language: English
.\"
.TH "PDF2TXT" "1" "12/30/2018" "pdf2txt" "PDFMiner Manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
pdf2txt \- extracts text contents of PDF files
.SH "SYNOPSIS"
.HP \w'\fBpdf2txt\fR\ 'u
\fBpdf2txt\fR [\fIoption\fR...] \fIfile\fR...
.SH "DESCRIPTION"
.PP
\fBpdf2txt\fR
extracts the text contents of a PDF file\&. It extracts all the text that is to be rendered programmatically, i\&.e\&. text represented as ASCII or Unicode strings\&. It cannot recognize text drawn as images, which would require optical character recognition\&. It also extracts the corresponding locations, font names, font sizes, and writing direction (horizontal or vertical) for each text portion\&. You need to provide a password for a protected PDF document whose access is restricted\&. You cannot extract any text from a PDF document which does not have extraction permission\&.
.SH "OPTIONS"
.PP
\fB\-o \fR\fB\fIfile\fR\fR
.RS 4
Specifies the output file name\&.
The default is to print the extracted contents to standard output in text format\&.
.RE
.PP
\fB\-p \fR\fB\fIpageno\fR\fR\fB\fI[,pageno,\&...]\fR\fR
.RS 4
Specifies a comma\-separated list of the page numbers to extract\&. Page numbers start at one\&. By default, text is extracted from all the pages\&.
.RE
.PP
\fB\-c \fR\fB\fIcodec\fR\fR
.RS 4
Specifies the output codec (character encoding)\&.
.RE
.PP
\fB\-t \fR\fB\fItype\fR\fR
.RS 4
Specifies the output format\&. The following formats are currently supported:
.PP
text
.RS 4
Text format\&. This is the default\&.
.RE
.PP
html
.RS 4
HTML format\&. It is not recommended\&.
.RE
.PP
xml
.RS 4
XML format\&. It provides the most information\&.
.RE
.PP
tag
.RS 4
\(lqTagged PDF\(rq
format\&. A tagged PDF has its own contents annotated with HTML\-like tags\&.
\fBpdf2txt\fR
tries to extract its content streams rather than inferring its text locations\&. Tags used here are defined in the
\m[blue]\fBPDF Reference, Sixth Edition\fR\m[]\&\s-2\u[1]\d\s+2
(\(sc10\&.7
\(lqTagged PDF\(rq)\&.
.RE
.RE
.PP
\fB\-D \fR\fB\fIwriting\-mode\fR\fR
.RS 4
Specifies the writing mode of text outputs:
.PP
lr\-tb
.RS 4
Left\-to\-right, top\-to\-bottom\&.
.RE
.PP
tb\-rl
.RS 4
Top\-to\-bottom, right\-to\-left\&.
.RE
.PP
auto
.RS 4
Determine the writing mode automatically\&.
.RE
.RE
.PP
\fB\-M \fR\fB\fIchar\-margin\fR\fR, \fB\-L \fR\fB\fIline\-margin\fR\fR, \fB\-W \fR\fB\fIword\-margin\fR\fR
.RS 4
These are the parameters used for layout analysis\&. In an actual PDF file, a run of text might be split into several chunks partway through, depending on the authoring software\&. Therefore, text extraction needs to splice text chunks together\&. Two text chunks whose distance is less than the
\fIchar\-margin\fR
are considered continuous and are grouped into one\&. Likewise, two lines whose distance is less than the
\fIline\-margin\fR
are grouped into a text box, which is a rectangular area that contains a
\(lqcluster\(rq
of text portions\&.
Furthermore, a blank character (space) may need to be inserted when the distance between two words is greater than the
\fIword\-margin\fR, because a blank between words might not be represented as a space character, but only indicated by the positioning of each word\&.
.sp
Each value is specified not as an actual length, but as a ratio of that length to the size of the characters in question\&. The default values are
\fIchar\-margin\fR
= 1\&.0,
\fIline\-margin\fR
= 0\&.3, and
\fIword\-margin\fR
= 0\&.2, respectively\&.
.RE
.PP
\fB\-n\fR
.RS 4
Suppress layout analysis\&.
.RE
.PP
\fB\-A\fR
.RS 4
Force layout analysis for all the text strings, including text contained in figures\&.
.RE
.PP
\fB\-V\fR
.RS 4
Enable detection of vertical writing\&.
.RE
.PP
\fB\-s \fR\fB\fIscale\fR\fR
.RS 4
Specifies the output scale\&. This option applies to HTML output only\&.
.RE
.PP
\fB\-m \fR\fB\fIn\fR\fR
.RS 4
Specifies the maximum number of pages to extract\&. By default, all the pages in a document are extracted\&.
.RE
.PP
\fB\-P \fR\fB\fIpassword\fR\fR
.RS 4
Provides the user password to access PDF contents\&.
.RE
.PP
\fB\-d\fR
.RS 4
Increase the debug level\&.
.RE
.SH "EXAMPLES"
.PP
Extract text as an HTML file whose filename is output\&.html:
.sp
.if n \{\
.RS 4
.\}
.nf
$ \fBpdf2txt\fR \-o output\&.html samples/naacl06\-shinyama\&.pdf
.fi
.if n \{\
.RE
.\}
.PP
Extract Japanese text in vertical writing as an HTML file:
.sp
.if n \{\
.RS 4
.\}
.nf
$ \fBpdf2txt\fR \-c euc\-jp \-D tb\-rl \-o output\&.html samples/jo\&.pdf
.fi
.if n \{\
.RE
.\}
.PP
Extract text from an encrypted PDF file:
.sp
.if n \{\
.RS 4
.\}
.nf
$ \fBpdf2txt\fR \-P mypassword \-o output\&.txt secret\&.pdf
.fi
.if n \{\
.RE
.\}
.sp
.SH "SEE ALSO"
.PP
\fBdumppdf\fR(1)
.SH "AUTHORS"
.PP
\fBJakub Wilk\fR <\&jwilk@debian\&.org\&>
.RS 4
Wrote this manual page for the Debian system\&.
.RE
.PP
\fBYusuke Shinyama\fR <\&yusuke@cs\&.nyu\&.edu\&>
.RS 4
Author of PDFMiner and its original HTML documentation\&.
.RE
.SH "NOTES"
.IP " 1." 4
PDF Reference, Sixth Edition
.RS 4
\%http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
.RE
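.PP
The splicing controlled by
\fIchar\-margin\fR
and
\fIword\-margin\fR
(see OPTIONS) can be illustrated with a minimal Python sketch\&. It is a simplified illustration under stated assumptions, not the actual PDFMiner implementation: the function name splice_chunks, the (x0, x1, text) chunk tuples, and the single char_size argument are hypothetical, and only chunks lying on one horizontal line are considered\&.
.sp
.if n \{\
.RS 4
.\}
.nf
```python
# Illustrative sketch only; this is NOT the actual PDFMiner code.
# Each chunk is a hypothetical (x0, x1, text) tuple on a single line,
# where x0 and x1 are the left and right edges of the chunk.

def splice_chunks(chunks, char_size, char_margin=1.0, word_margin=0.2):
    """Group chunks left to right; margins are ratios of char_size."""
    groups = []
    current = ""
    prev_x1 = None
    for x0, x1, text in sorted(chunks):
        if prev_x1 is None:
            current = text
        else:
            gap = x0 - prev_x1
            if gap < char_margin * char_size:
                # Close enough to be continuous: stay in the same group,
                # inserting a space when the gap exceeds the word margin.
                if gap > word_margin * char_size:
                    current += " "
                current += text
            else:
                # Too far apart: close the group and start a new one.
                groups.append(current)
                current = text
        prev_x1 = x1
    if prev_x1 is not None:
        groups.append(current)
    return groups

chunks = [(0, 20, "Hel"), (20, 30, "lo"), (33, 60, "world"), (100, 130, "aside")]
print(splice_chunks(chunks, char_size=10))
```
.fi
.if n \{\
.RE
.\}
.PP
With a character size of 10, the first three chunks are spliced into one group, with a space inserted before the third chunk because its gap of 3 exceeds 0\&.2 times the character size\&. The fourth chunk lies 40 units away, beyond 1\&.0 times the character size, so it starts a separate group\&.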