.\" [created by setup.py sdist]
'\" t
.\" Title: ocrodjvu
.\" Author: Jakub Wilk
.\" Generator: DocBook XSL Stylesheets v1.78.1
.\" Date: 04/22/2014
.\" Manual: ocrodjvu manual
.\" Source: ocrodjvu 0.7.18
.\" Language: English
.\"
.TH "OCRODJVU" "1" 2014-04-22 "ocrodjvu 0\&.7\&.18" "ocrodjvu manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
ocrodjvu \- OCR for DjVu files
.SH "SYNOPSIS"
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR {\fB\-o\fR | \fB\-\-save\-bundled\fR} \fIoutput\-djvu\-file\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR {\fB\-i\fR | \fB\-\-save\-indirect\fR} \fIindex\-djvu\-file\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR \fB\-\-save\-script\fR \fIscript\-file\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR \fB\-\-in\-place\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR \fB\-\-dry\-run\fR [\fIoption\fR...]
\fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR {\fB\-\-version\fR | \fB\-\-help\fR | \fB\-h\fR | \fB\-\-list\-engines\fR | \fB\-\-list\-languages\fR}
.SH "DESCRIPTION"
.PP
ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files\&.
.PP
The following OCR engines are supported:
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fIOCRopus\fR\m[]\&\s-2\u[1]\d\s+2 (internally, ocrodjvu calls \fBocroscript\fR\*(Aqs \fBrecognize\fR (or \fBrec\-tess\fR) command, so that ultimately Tesseract acts as the OCR backend);
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fICuneiform for Linux\fR\m[]\&\s-2\u[2]\d\s+2\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fIOcrad\fR\m[]\&\s-2\u[3]\d\s+2\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fIGOCR\fR\m[]\&\s-2\u[4]\d\s+2\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
Stand\-alone \m[blue]\fITesseract\fR\m[]\&\s-2\u[5]\d\s+2\&.
.RE
.sp
.SH "OPTIONS"
.SS "OCR engine options"
.PP
\fB\-e\fR, \fB\-\-engine=\fR\fB\fIengine\-id\fR\fR
.RS 4
Use this OCR engine\&. The default is \(lqocropus\(rq (OCRopus)\&.
.RE
.PP
\fB\-\-list\-engines\fR
.RS 4
Print list of available OCR engines\&.
.RE
.SS "Options controlling output"
.PP
\fB\-o\fR, \fB\-\-save\-bundled=\fR\fB\fIoutput\-djvu\-file\fR\fR
.RS 4
Save OCR results as a bundled multi\-page document into \fIoutput\-djvu\-file\fR\&.
.RE
.PP
\fB\-i\fR, \fB\-\-save\-indirect=\fR\fB\fIindex\-djvu\-file\fR\fR
.RS 4
Save OCR results as an indirect multi\-page document\&. Use \fIindex\-djvu\-file\fR as the index file name; put the component files into the same directory\&. The directory must exist and be writable\&.
.RE
.PP
\fB\-\-save\-script=\fR\fB\fIscript\-file\fR\fR
.RS 4
Save a \fBdjvused\fR script with OCR results into \fIscript\-file\fR\&.
.RE
.PP
\fB\-\-in\-place\fR
.RS 4
Save OCR results in place\&.
.sp
(Use this option to retain compatibility with ocrodjvu < 0\&.2\&.)
.RE
.PP
\fB\-\-dry\-run\fR
.RS 4
Don\*(Aqt change any files; discard the OCR results\&.
.RE
.PP
It is mandatory to use exactly one of the above options\&.
.PP
\fB\-\-ocr\-only\fR
.RS 4
If OCR results are to be saved to a separate document (\fB\-o\fR/\fB\-\-save\-bundled\fR or \fB\-i\fR/\fB\-\-save\-indirect\fR), save only the pages selected for OCR\&.
.sp
The default is to save all pages, even when the \fB\-p\fR/\fB\-\-pages\fR option is in effect\&.
.RE
.PP
\fB\-\-clear\-text\fR
.RS 4
Remove existing hidden text if present in the pages not selected for OCR\&.
.sp
(Use this option to retain compatibility with ocrodjvu < 0\&.2\&.)
.RE
.PP
\fB\-\-save\-raw\-ocr=\fR\fB\fIoutput\-directory\fR\fR
.RS 4
Save raw OCR results (typically in the hOCR format) into \fIoutput\-directory\fR\&. The directory must exist and be writable\&.
.RE
.PP
\fB\-\-raw\-ocr\-filename\-template=\fR\fB\fItemplate\fR\fR
.RS 4
Specifies the file naming scheme for raw OCR results\&.
.sp
The template language uses the \m[blue]\fIPython string formatting syntax\fR\m[]\&\s-2\u[6]\d\s+2\&. The following fields are available:
.PP
\fIpage\fR, \fIpage+\fR\fI\fIN\fR\fR, \fIpage\-\fR\fI\fIN\fR\fR
.RS 4
page number, optionally shifted by a number \fIN\fR
.RE
.PP
\fIid\fR
.RS 4
page identifier
.RE
.PP
\fIid\-ext\fR
.RS 4
page identifier without file extension
.RE
.sp
The default template is \(lq{id\-ext}\(rq\&.
.RE
.SS "Text segmentation options"
.PP
\fB\-t lines\fR, \fB\-\-details=lines\fR
.RS 4
Record location of every line\&. Don\*(Aqt record locations of particular words or characters\&.
.sp
This is the default for OCRopus 0\&.2\&. The option is ineffective with stand\-alone Tesseract 2\&.0\&.
.RE
.PP
\fB\-t words\fR, \fB\-\-details=words\fR
.RS 4
Record location of every line and every word\&. Don\*(Aqt record locations of particular characters\&.
.sp
This is the default for most OCR engines\&.
.sp
This option is ineffective with OCRopus 0\&.2 and stand\-alone Tesseract 2\&.0\&.
.RE
.PP
\fB\-t chars\fR, \fB\-\-details=chars\fR
.RS 4
Record location of every line, every word and every character\&.
.sp
This option is ineffective with OCRopus 0\&.2 and stand\-alone Tesseract 2\&.0\&.
.RE
.PP
\fB\-\-word\-segmentation=simple\fR
.RS 4
Consider each non\-empty sequence of non\-whitespace characters a single word\&.
.sp
This is the default, despite being linguistically incorrect\&.
.RE
.PP
\fB\-\-word\-segmentation=uax29\fR
.RS 4
Use the \m[blue]\fIUnicode Text Segmentation\fR\m[]\&\s-2\u[7]\d\s+2 algorithm to break lines into words\&.
.sp
This option breaks assumptions of some DjVu tools that words are separated by spaces, and therefore it is not recommended\&.
.RE
.SS "Other options"
.PP
\fB\-l\fR, \fB\-\-language=\fR\fB\fIlanguage\-id\fR\fR
.RS 4
Set recognition language\&. \fIlanguage\-id\fR is typically an ISO 639\-2/T three\-letter code\&.
.sp
Tesseract \(>= 3\&.02 allows specifying multiple languages separated by \(lq+\(rq characters\&.
.sp
For OCRopus, the default is \(lqeng\(rq (English), unless the \fItesslanguage\fR environment variable is set\&. For other OCR engines, the default is always \(lqeng\(rq\&.
.RE
.PP
\fB\-\-list\-languages\fR
.RS 4
Print list of available languages for the currently selected OCR engine\&.
.RE
.PP
\fB\-\-render=mask\fR
.RS 4
Render only masks of page images\&.
.sp
This is the default\&.
.RE
.PP
\fB\-\-render=foreground\fR
.RS 4
Render only foreground layers of page images\&.
.RE
.PP
\fB\-\-render=all\fR
.RS 4
Render all layers of page images\&.
.sp
This option is necessary to OCR DjVu files with invalid foreground/background separation\&.
.RE
.PP
\fB\-p\fR, \fB\-\-pages=\fR\fB\fIpage\-range\fR\fR
.RS 4
Specifies pages to process\&. \fIpage\-range\fR is a comma\-separated list of sub\-ranges\&.
Each sub\-range is either a single page (e\&.g\&.\ \&17) or a contiguous range of pages (e\&.g\&.\ \&37\-42)\&. Pages are numbered from 1\&.
.sp
The default is to process all pages\&.
.RE
.PP
\fB\-j\fR, \fB\-\-jobs=\fR\fB\fIn\fR\fR
.RS 4
Start up to \fIn\fR OCR processes\&.
.RE
.PP
\fB\-\-version\fR
.RS 4
Output version information and exit\&.
.RE
.PP
\fB\-h\fR, \fB\-\-help\fR
.RS 4
Display help and exit\&.
.RE
.SS "Advanced options"
.PP
\fB\-D\fR, \fB\-\-debug\fR
.RS 4
To ease debugging, don\*(Aqt delete intermediate files\&.
.RE
.PP
\fB\-X \fR\fB\fIkey\fR\fR\fB=\fR\fB\fIvalue\fR\fR
.RS 4
This option allows controlling some details of how ocrodjvu operates\&.
.RE
.PP
\fB\-\-on\-error=abort\fR
.RS 4
Stop program execution when an exceptional situation (e\&.g\&., malformed output from the OCR engine or an internal ocrodjvu error) occurs\&.
.sp
This is the default\&.
.RE
.PP
\fB\-\-on\-error=resume\fR
.RS 4
Attempt to recover from exceptional situations\&.
.sp
This option is strongly discouraged\&.
.RE
.PP
\fB\-\-html5\fR
.RS 4
Use an \m[blue]\fIHTML5 parser\fR\m[]\&\s-2\u[8]\d\s+2, which is more robust but slower than the default parser\&.
.RE
.SH "ENVIRONMENT"
.PP
The following environment variables affect ocrodjvu:
.PP
\fItesslanguage\fR
.RS 4
Recognition language for Tesseract\&.
.sp
(Use of this variable is deprecated in favor of the \fB\-\-language\fR option\&.)
.RE
.PP
\fITMPDIR\fR
.RS 4
ocrodjvu makes heavy use of temporary files\&. It will store them in a directory specified by this variable\&. The default is /tmp\&.
.RE
.SH "BUGS"
.SS "Known bugs"
.PP
Tesseract 3\&.00 is affected by a bug \&\s-2\u[9]\d\s+2 making it produce invalid hOCR output in certain circumstances\&. ocrodjvu does not try to recover from this fault (which couldn\*(Aqt be done reliably anyway) unless you pass the \fB\-X fix\-html=1\fR option\&.
.PP
When using Tesseract \(>= 3\&.00, extracting bounding boxes of particular characters (which happens when either \fB\-\-details=chars\fR or \fB\-\-word\-segmentation=uax29\fR is in effect) is inefficient\&. This is due to limitations of the Tesseract command\-line interface\&.
.SS "Reporting new bugs"
.PP
Please report bugs at: \m[blue]\fI\%https://bitbucket.org/jwilk/ocrodjvu/issues\fR\m[]
.SH "SEE ALSO"
.PP
\fBdjvu\fR(1), \fBdjvu2hocr\fR(1), \fBhocr2djvused\fR(1)
.PP
\fBocroscript\fR(1), \fBtesseract\fR(1), \fBcuneiform\fR(1), \fBocrad\fR(1), \fBgocr\fR(1)
.SH "NOTES"
.IP " 1." 4
OCRopus
.RS 4
\m[blue]\fI\%https://code.google.com/p/ocropus/\fR\m[]
.RE
.IP " 2." 4
Cuneiform for Linux
.RS 4
\m[blue]\fI\%https://launchpad.net/cuneiform-linux\fR\m[]
.RE
.IP " 3." 4
Ocrad
.RS 4
\m[blue]\fI\%https://www.gnu.org/software/ocrad/\fR\m[]
.RE
.IP " 4." 4
GOCR
.RS 4
\m[blue]\fI\%http://jocr.sourceforge.net/\fR\m[]
.RE
.IP " 5." 4
Tesseract
.RS 4
\m[blue]\fI\%https://code.google.com/p/tesseract-ocr/\fR\m[]
.RE
.IP " 6." 4
Python string formatting syntax
.RS 4
\m[blue]\fI\%https://docs.python.org/library/string.html#format-string-syntax\fR\m[]
.RE
.IP " 7." 4
Unicode Text Segmentation
.RS 4
\m[blue]\fI\%http://unicode.org/reports/tr29/\fR\m[]
.RE
.IP " 8." 4
HTML5 parser
.RS 4
\m[blue]\fI\%http://www.whatwg.org/specs/web-apps/current-work/#html-parser\fR\m[]
.RE
.IP " 9." 4
\m[blue]\fI\%https://code.google.com/p/tesseract-ocr/issues/detail?id=376\fR\m[]