'\" t
.\"     Title: pdf2txt
.\"    Author: Jakub Wilk <jwilk@debian.org>
.\" Generator: DocBook XSL Stylesheets v1.79.1 <http://docbook.sf.net/>
.\"      Date: 12/30/2018
.\"    Manual: PDFMiner Manual
.\"    Source: pdf2txt
.\"  Language: English
.\"
.TH "PDF2TXT" "1" "12/30/2018" "pdf2txt" "PDFMiner Manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
pdf2txt \- extract text contents of PDF files
.SH "SYNOPSIS"
.HP \w'\fBpdf2txt\fR\ 'u
\fBpdf2txt\fR [\fIoption\fR...] \fIfile\fR...
.SH "DESCRIPTION"
.PP
\fBpdf2txt\fR
extracts the text contents of PDF files\&. It extracts all text that is rendered programmatically, i\&.e\&. text represented as ASCII or Unicode strings\&. It cannot recognize text drawn as images, which would require optical character recognition\&. It also extracts the corresponding location, font name, font size, and writing direction (horizontal or vertical) of each text portion\&. For a protected PDF document whose access is restricted, you need to provide a password\&. You cannot extract any text from a PDF document that does not grant extraction permission\&.
.SH "OPTIONS"
.PP
\fB\-o \fR\fB\fIfile\fR\fR
.RS 4
Specifies the output file name\&. The default is to print the extracted contents to standard output in text format\&.
.RE
.PP
\fB\-p \fR\fB\fIpageno\fR\fR\fB\fI[,pageno,\&...]\fR\fR
.RS 4
Specifies a comma\-separated list of page numbers to extract\&. Page numbers start at one\&. By default, text is extracted from all pages\&.
.RE
.PP
\fB\-c \fR\fB\fIcodec\fR\fR
.RS 4
Specifies the output codec, i\&.e\&. the character encoding of the output\&.
.RE
.PP
\fB\-t \fR\fB\fItype\fR\fR
.RS 4
Specifies the output format\&. The following formats are currently supported:
.PP
text
.RS 4
Text format\&. This is the default\&.
.RE
.PP
html
.RS 4
HTML format\&. It is not recommended\&.
.RE
.PP
xml
.RS 4
XML format\&. It provides the most information\&.
.RE
.PP
tag
.RS 4
\(lqTagged PDF\(rq format\&. A tagged PDF has its own contents annotated with HTML\-like tags\&.
\fBpdf2txt\fR
tries to extract its content streams rather than inferring its text locations\&. Tags used here are defined in the
\m[blue]\fBPDF Reference, Sixth Edition\fR\m[]\&\s-2\u[1]\d\s+2
(\(sc10\&.7 \(lqTagged PDF\(rq)\&.
.RE
.RE
.PP
\fB\-D \fR\fB\fIwriting\-mode\fR\fR
.RS 4
Specifies the writing mode of the text output:
.PP
lr\-tb
.RS 4
Left\-to\-right, top\-to\-bottom\&.
.RE
.PP
tb\-rl
.RS 4
Top\-to\-bottom, right\-to\-left\&.
.RE
.PP
auto
.RS 4
Determine the writing mode automatically\&.
.RE
.RE
.PP
\fB\-M \fR\fB\fIchar\-margin\fR\fR, \fB\-L \fR\fB\fIline\-margin\fR\fR, \fB\-W \fR\fB\fIword\-margin\fR\fR
.RS 4
These are the parameters used for layout analysis\&. In an actual PDF file, a run of text might be split into several chunks, depending on the authoring software\&. Therefore, text extraction needs to splice text chunks together\&. Two text chunks whose distance is less than the
\fIchar\-margin\fR
are considered continuous and are grouped into one\&. Likewise, two lines whose distance is less than the
\fIline\-margin\fR
are grouped into a text box, which is a rectangular area that contains a \(lqcluster\(rq of text portions\&. Furthermore, blank characters (spaces) may need to be inserted where the distance between two words is greater than the
\fIword\-margin\fR, because a blank between words might not be represented as a space character, but only indicated by the positioning of each word\&.
.sp
Each value is specified not as an actual length, but as a proportion of the size of each character in question\&. The default values are
\fIchar\-margin\fR
= 1\&.0,
\fIline\-margin\fR
= 0\&.3, and
\fIword\-margin\fR
= 0\&.2\&.
.RE
.PP
\fB\-n\fR
.RS 4
Suppress layout analysis\&.
.RE
.PP
\fB\-A\fR
.RS 4
Force layout analysis for all the text strings, including text contained in figures\&.
.RE
.PP
\fB\-V\fR
.RS 4
Enable detection of vertical writing\&.
.RE
.PP
\fB\-s \fR\fB\fIscale\fR\fR
.RS 4
Specifies the output scale\&. This option applies to HTML output only\&.
.RE
.PP
\fB\-m \fR\fB\fIn\fR\fR
.RS 4
Specifies the maximum number of pages to extract\&. By default, all the pages in a document are extracted\&.
.RE
.PP
\fB\-P \fR\fB\fIpassword\fR\fR
.RS 4
Provides the user password to access PDF contents\&.
.RE
.PP
\fB\-d\fR
.RS 4
Increase the debug level\&.
.RE
.SH "EXAMPLES"
.PP
Extract text as an HTML file named output\&.html:
.sp
.if n \{\
.RS 4
.\}
.nf
$ \fBpdf2txt\fR \-o output\&.html samples/naacl06\-shinyama\&.pdf
.fi
.if n \{\
.RE
.\}
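.PP
Extract only pages 1 and 3 in XML format (a sketch that reuses the sample file from the previous example; the page numbers and the output file name output\&.xml are arbitrary choices for illustration):
.sp
.if n \{\
.RS 4
.\}
.nf
$ \fBpdf2txt\fR \-p 1,3 \-t xml \-o output\&.xml samples/naacl06\-shinyama\&.pdf
.fi
.if n \{\
.RE
.\}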
.PP
Extract Japanese text in vertical writing as an HTML file:
.sp
.if n \{\
.RS 4
.\}
.nf
$ \fBpdf2txt\fR \-c euc\-jp \-D tb\-rl \-o output\&.html samples/jo\&.pdf
.fi
.if n \{\
.RE
.\}
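.PP
Adjust the layout analysis parameters (a sketch only; the margin values and the file names input\&.pdf and output\&.txt are illustrative assumptions, not recommended settings, and suitable values depend on the document):
.sp
.if n \{\
.RS 4
.\}
.nf
$ \fBpdf2txt\fR \-M 2\&.0 \-L 0\&.5 \-W 0\&.1 \-o output\&.txt input\&.pdf
.fi
.if n \{\
.RE
.\}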
.PP
Extract text from an encrypted PDF file:
.sp
.if n \{\
.RS 4
.\}
.nf
$ \fBpdf2txt\fR \-P mypassword \-o output\&.txt secret\&.pdf
.fi
.if n \{\
.RE
.\}
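.PP
Extract at most the first five pages of a document (a sketch; the page limit and the file names input\&.pdf and output\&.txt are placeholders chosen for illustration):
.sp
.if n \{\
.RS 4
.\}
.nf
$ \fBpdf2txt\fR \-m 5 \-o output\&.txt input\&.pdf
.fi
.if n \{\
.RE
.\}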
.sp
.SH "SEE ALSO"
.PP
\fBdumppdf\fR(1)
.SH "AUTHORS"
.PP
\fBJakub Wilk\fR <\&jwilk@debian\&.org\&>
.RS 4
Wrote this manual page for the Debian system\&.
.RE
.PP
\fBYusuke Shinyama\fR <\&yusuke@cs\&.nyu\&.edu\&>
.RS 4
Author of PDFMiner and its original HTML documentation\&.
.RE
.SH "NOTES"
.IP " 1." 4
PDF Reference, Sixth Edition
.RS 4
\%http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
.RE