.\" [created by setup.py sdist]
'\" t
.\"     Title: ocrodjvu
.\"    Author: Jakub Wilk <jwilk@jwilk.net>
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\"      Date: 04/22/2014
.\"    Manual: ocrodjvu manual
.\"    Source: ocrodjvu 0.7.18
.\"  Language: English
.\"
.TH "OCRODJVU" "1" 2014-04-22 "ocrodjvu 0\&.7\&.18" "ocrodjvu manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
ocrodjvu \- OCR for DjVu files
.SH "SYNOPSIS"
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR {\fB\-o\fR | \fB\-\-save\-bundled\fR} \fIoutput\-djvu\-file\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR {\fB\-i\fR | \fB\-\-save\-indirect\fR} \fIindex\-djvu\-file\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR \fB\-\-save\-script\fR \fIscript\-file\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR \fB\-\-in\-place\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR \fB\-\-dry\-run\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR {\fB\-\-version\fR | \fB\-\-help\fR | \fB\-h\fR | \fB\-\-list\-engines\fR | \fB\-\-list\-languages\fR}
.SH "DESCRIPTION"
.PP
ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files\&.
.PP
The following OCR engines are supported:
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fIOCRopus\fR\m[]\&\s-2\u[1]\d\s+2
(internally, ocrodjvu calls
\fBocroscript\fR\*(Aqs
\fBrecognize\fR
(or
\fBrec\-tess\fR) command, so that ultimately
Tesseract
acts as the OCR backend)\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fICuneiform for Linux\fR\m[]\&\s-2\u[2]\d\s+2\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fIOcrad\fR\m[]\&\s-2\u[3]\d\s+2\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fIGOCR\fR\m[]\&\s-2\u[4]\d\s+2\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
Stand\-alone
\m[blue]\fITesseract\fR\m[]\&\s-2\u[5]\d\s+2\&.
.RE
.sp
.SH "OPTIONS"
.SS "OCR engine options"
.PP
\fB\-e\fR, \fB\-\-engine=\fR\fB\fIengine\-id\fR\fR
.RS 4
Use this OCR engine\&. The default is
\(lqocropus\(rq
(OCRopus)\&.
.RE
.PP
\fB\-\-list\-engines\fR
.RS 4
Print the list of available OCR engines\&.
.RE
.SS "Options controlling output"
.PP
\fB\-o\fR, \fB\-\-save\-bundled=\fR\fB\fIoutput\-djvu\-file\fR\fR
.RS 4
Save OCR results as a bundled multi\-page document into
\fIoutput\-djvu\-file\fR\&.
.RE
.PP
\fB\-i\fR, \fB\-\-save\-indirect=\fR\fB\fIindex\-djvu\-file\fR\fR
.RS 4
Save OCR results as an indirect multi\-page document\&. Use
\fIindex\-djvu\-file\fR
as the index file name; put the component files into the same directory\&. The directory must exist and be writable\&.
.RE
.PP
\fB\-\-save\-script=\fR\fB\fIscript\-file\fR\fR
.RS 4
Save a
\fBdjvused\fR
script with OCR results into
\fIscript\-file\fR\&.
.RE
.PP
\fB\-\-in\-place\fR
.RS 4
Save OCR results in place\&.
.sp
(Use this option to retain compatibility with ocrodjvu < 0\&.2\&.)
.RE
.PP
\fB\-\-dry\-run\fR
.RS 4
Don\*(Aqt change any files; throw OCR results away\&.
.RE
.PP
It is mandatory to use exactly one of the above options\&.
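.PP
As a sketch, each of the following command lines selects exactly one output mode\&. The names
\fIin\&.djvu\fR,
\fIout\&.djvu\fR,
\fIout/index\&.djvu\fR
and
\fIocr\&.djvused\fR
are placeholders chosen for illustration, and applying the saved script with
\fBdjvused\fR
is an assumed workflow rather than something ocrodjvu does itself:
.sp
.if n \{\
.RS 4
.\}
.nf
# bundled output in a new file; in\&.djvu is left untouched
\fBocrodjvu \-o out\&.djvu in\&.djvu\fR

# indirect output; the target directory must already exist
\fBmkdir \-p out && ocrodjvu \-i out/index\&.djvu in\&.djvu\fR

# save a djvused script, then (assumed workflow) apply it to the document
\fBocrodjvu \-\-save\-script ocr\&.djvused in\&.djvu\fR
\fBdjvused \-f ocr\&.djvused \-s in\&.djvu\fR

# run the OCR but discard the results
\fBocrodjvu \-\-dry\-run in\&.djvu\fR
.fi
.if n \{\
.RE
.\}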
.PP
\fB\-\-ocr\-only\fR
.RS 4
If OCR results are to be saved to a separate document (\fB\-o\fR/\fB\-\-save\-bundled\fR
or
\fB\-i\fR/\fB\-\-save\-indirect\fR), save only the pages selected for OCR\&.
.sp
The default is to save all pages, even when the
\fB\-p\fR/\fB\-\-pages\fR
option is in effect\&.
.RE
.PP
\fB\-\-clear\-text\fR
.RS 4
Remove existing hidden text, if present, from the pages not selected for OCR\&.
.sp
(Use this option to retain compatibility with ocrodjvu < 0\&.2\&.)
.RE
.PP
\fB\-\-save\-raw\-ocr=\fR\fB\fIoutput\-directory\fR\fR
.RS 4
Save raw OCR results (typically in the hOCR format) into
\fIoutput\-directory\fR\&. The directory must exist and be writable\&.
.RE
.PP
\fB\-\-raw\-ocr\-filename\-template=\fR\fB\fItemplate\fR\fR
.RS 4
Specifies the file naming scheme for raw OCR results\&.
.sp
The template language uses the
\m[blue]\fIPython string formatting syntax\fR\m[]\&\s-2\u[6]\d\s+2\&. The following fields are available:
.PP
\fIpage\fR, \fIpage+\fR\fI\fIN\fR\fR, \fIpage\-\fR\fI\fIN\fR\fR
.RS 4
page number, optionally shifted by a number
\fIN\fR
.RE
.PP
\fIid\fR
.RS 4
page identifier
.RE
.PP
\fIid\-ext\fR
.RS 4
page identifier without file extension
.RE
.sp
The default template is
\(lq{id\-ext}\(rq\&.
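.sp
For instance, the following sketch keeps the raw OCR output and names each file after its page number instead of its page identifier\&. The directory
\fIraw\fR
and the file names are placeholders, and the directory has to be created beforehand:
.sp
.if n \{\
.RS 4
.\}
.nf
# the output directory must exist and be writable
\fBmkdir \-p raw\fR
\fBocrodjvu \-o out\&.djvu \-\-save\-raw\-ocr=raw \-\-raw\-ocr\-filename\-template=\*(Aq{page}\*(Aq in\&.djvu\fR
.fi
.if n \{\
.RE
.\}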
.RE
.SS "Text segmentation options"
.PP
\fB\-t lines\fR, \fB\-\-details=lines\fR
.RS 4
Record the location of every line\&. Don\*(Aqt record the locations of individual words or characters\&.
.sp
This is the default for OCRopus 0\&.2\&. The option is ineffective with stand\-alone Tesseract 2\&.0\&.
.RE
.PP
\fB\-t words\fR, \fB\-\-details=words\fR
.RS 4
Record the location of every line and every word\&. Don\*(Aqt record the locations of individual characters\&.
.sp
This is the default for most OCR engines\&.
.sp
This option is ineffective with OCRopus 0\&.2 and stand\-alone Tesseract 2\&.0\&.
.RE
.PP
\fB\-t chars\fR, \fB\-\-details=chars\fR
.RS 4
Record the location of every line, every word and every character\&.
.sp
This option is ineffective with OCRopus 0\&.2 and stand\-alone Tesseract 2\&.0\&.
.RE
.PP
\fB\-\-word\-segmentation=simple\fR
.RS 4
Consider each non\-empty sequence of non\-whitespace characters a single word\&.
.sp
This is the default, despite being linguistically incorrect\&.
.RE
.PP
\fB\-\-word\-segmentation=uax29\fR
.RS 4
Use the
\m[blue]\fIUnicode Text Segmentation\fR\m[]\&\s-2\u[7]\d\s+2
algorithm to break lines into words\&.
.sp
This option breaks assumptions of some DjVu tools that words are separated by spaces, and therefore it is not recommended\&.
.RE
.SS "Other options"
.PP
\fB\-l\fR, \fB\-\-language=\fR\fB\fIlanguage\-id\fR\fR
.RS 4
Set recognition language\&.
\fIlanguage\-id\fR
is typically an ISO 639\-2/T three\-letter code\&.
.sp
Tesseract \(>= 3\&.02 allows specifying multiple languages separated by
\(lq+\(rq
characters\&.
.sp
For OCRopus, the default is
\(lqeng\(rq
(English), unless the
\fItesslanguage\fR
environment variable is set\&. For other OCR engines, the default is always
\(lqeng\(rq\&.
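.sp
As an illustration, the commands below first list the identifiers accepted by the selected engine and then request German\&. The file names are placeholders, whether
\(lqdeu\(rq
is available depends on the installed language data, and the engine identifier
\(lqtesseract\(rq
in the last command is an assumption that should be checked against the output of
\fB\-\-list\-engines\fR:
.sp
.if n \{\
.RS 4
.\}
.nf
# show the language identifiers known to the default engine
\fBocrodjvu \-\-list\-languages\fR

# recognize German text
\fBocrodjvu \-o out\&.djvu \-l deu in\&.djvu\fR

# English and German together; needs Tesseract >= 3\&.02 (engine id assumed)
\fBocrodjvu \-o out\&.djvu \-e tesseract \-l eng+deu in\&.djvu\fR
.fi
.if n \{\
.RE
.\}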
.RE
.PP
\fB\-\-list\-languages\fR
.RS 4
Print the list of available languages for the currently selected OCR engine\&.
.RE
.PP
\fB\-\-render=mask\fR
.RS 4
Render only masks of page images\&.
.sp
This is the default\&.
.RE
.PP
\fB\-\-render=foreground\fR
.RS 4
Render only foreground layers of page images\&.
.RE
.PP
\fB\-\-render=all\fR
.RS 4
Render all layers of page images\&.
.sp
This option is necessary to OCR DjVu files with invalid foreground/background separation\&.
.RE
.PP
\fB\-p\fR, \fB\-\-pages=\fR\fB\fIpage\-range\fR\fR
.RS 4
Specifies pages to process\&.
\fIpage\-range\fR
is a comma\-separated list of sub\-ranges\&. Each sub\-range is either a single page (e\&.g\&.\ \&17) or a contiguous range of pages (e\&.g\&.\ \&37\-42)\&. Pages are numbered from 1\&.
.sp
The default is to process all pages\&.
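.sp
For example, with placeholder file names, the first command below runs OCR on pages 1, 17 and 37 through 42 only, while still saving every page\&. The second adds
\fB\-\-ocr\-only\fR
so that only those pages are saved:
.sp
.if n \{\
.RS 4
.\}
.nf
\fBocrodjvu \-o out\&.djvu \-p 1,17,37\-42 in\&.djvu\fR
\fBocrodjvu \-o out\&.djvu \-p 1,17,37\-42 \-\-ocr\-only in\&.djvu\fR
.fi
.if n \{\
.RE
.\}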
.RE
.PP
\fB\-j\fR, \fB\-\-jobs=\fR\fB\fIn\fR\fR
.RS 4
Start up to
\fIn\fR
OCR processes\&.
.RE
.PP
\fB\-\-version\fR
.RS 4
Output version information and exit\&.
.RE
.PP
\fB\-h\fR, \fB\-\-help\fR
.RS 4
Display help and exit\&.
.RE
.SS "Advanced options"
.PP
\fB\-D\fR, \fB\-\-debug\fR
.RS 4
To ease debugging, don\*(Aqt delete intermediate files\&.
.RE
.PP
\fB\-X \fR\fB\fIkey\fR\fR\fB=\fR\fB\fIvalue\fR\fR
.RS 4
Set an internal option that controls some details of how ocrodjvu operates\&. For example,
\fB\-X fix\-html=1\fR
enables the workaround described in the BUGS section\&.
.RE
.PP
\fB\-\-on\-error=abort\fR
.RS 4
Stop program execution when an exceptional situation (e\&.g\&., malformed output from the OCR engine or an internal ocrodjvu error) occurs\&.
.sp
This is the default\&.
.RE
.PP
\fB\-\-on\-error=resume\fR
.RS 4
Attempt to recover from exceptional situations\&.
.sp
This option is strongly discouraged\&.
.RE
.PP
\fB\-\-html5\fR
.RS 4
Use an
\m[blue]\fIHTML5 parser\fR\m[]\&\s-2\u[8]\d\s+2, which is more robust but slower than the default parser\&.
.RE
.SH "ENVIRONMENT"
.PP
The following environment variables affect ocrodjvu:
.PP
\fItesslanguage\fR
.RS 4
Recognition language for Tesseract\&.
.sp
(Use of this variable is deprecated in favor of the
\fB\-\-language\fR
option\&.)
.RE
.PP
\fITMPDIR\fR
.RS 4
ocrodjvu makes heavy use of temporary files\&. It will store them in a directory specified by this variable\&. The default is
/tmp\&.
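.sp
A minimal sketch of redirecting the temporary files for a single run\&. The directory
/var/tmp
and the file names are only examples; any existing writable directory should do:
.sp
.if n \{\
.RS 4
.\}
.nf
\fBTMPDIR=/var/tmp ocrodjvu \-o out\&.djvu in\&.djvu\fR
.fi
.if n \{\
.RE
.\}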
.RE
.SH "BUGS"
.SS "Known bugs"
.PP
Tesseract 3\&.00 is affected by a bug
\&\s-2\u[9]\d\s+2
making it produce invalid hOCR output in certain circumstances\&. ocrodjvu does not try to recover from this fault (which couldn\*(Aqt be done reliably anyway) unless you pass the
\fB\-X fix\-html=1\fR
option\&.
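.PP
As a sketch, the workaround could be requested as follows\&. The file names are placeholders, and the engine identifier
\(lqtesseract\(rq
is an assumption that should be checked against the output of
\fB\-\-list\-engines\fR:
.sp
.if n \{\
.RS 4
.\}
.nf
# engine id assumed; see ocrodjvu \-\-list\-engines
\fBocrodjvu \-e tesseract \-X fix\-html=1 \-o out\&.djvu in\&.djvu\fR
.fi
.if n \{\
.RE
.\}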
.PP
When using Tesseract \(>= 3\&.00, extracting bounding boxes of individual characters (which happens when either
\fB\-\-details=chars\fR
or
\fB\-\-word\-segmentation=uax29\fR is used) is inefficient\&. This is due to limitations of the Tesseract command\-line interface\&.
.SS "Reporting new bugs"
.PP
Please report bugs at:
\m[blue]\fI\%https://bitbucket.org/jwilk/ocrodjvu/issues\fR\m[]
.SH "SEE ALSO"
.PP
\fBdjvu\fR(1),
\fBdjvu2hocr\fR(1),
\fBhocr2djvused\fR(1)
.PP
\fBocroscript\fR(1),
\fBtesseract\fR(1),
\fBcuneiform\fR(1),
\fBocrad\fR(1),
\fBgocr\fR(1)
.SH "NOTES"
.IP " 1." 4
OCRopus
.RS 4
\m[blue]\fI\%https://code.google.com/p/ocropus/\fR\m[]
.RE
.IP " 2." 4
Cuneiform for Linux
.RS 4
\m[blue]\fI\%https://launchpad.net/cuneiform-linux\fR\m[]
.RE
.IP " 3." 4
Ocrad
.RS 4
\m[blue]\fI\%https://www.gnu.org/software/ocrad/\fR\m[]
.RE
.IP " 4." 4
GOCR
.RS 4
\m[blue]\fI\%http://jocr.sourceforge.net/\fR\m[]
.RE
.IP " 5." 4
Tesseract
.RS 4
\m[blue]\fI\%https://code.google.com/p/tesseract-ocr/\fR\m[]
.RE
.IP " 6." 4
Python string formatting syntax
.RS 4
\m[blue]\fI\%https://docs.python.org/library/string.html#format-string-syntax\fR\m[]
.RE
.IP " 7." 4
Unicode Text Segmentation
.RS 4
\m[blue]\fI\%http://unicode.org/reports/tr29/\fR\m[]
.RE
.IP " 8." 4
HTML5 parser
.RS 4
\m[blue]\fI\%http://www.whatwg.org/specs/web-apps/current-work/#html-parser\fR\m[]
.RE
.IP " 9." 4
\m[blue]\fI\%https://code.google.com/p/tesseract-ocr/issues/detail?id=376\fR\m[]