OCRODJVU(1) | ocrodjvu manual | OCRODJVU(1) |
NAME
ocrodjvu - OCR for DjVu files
SYNOPSIS
ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file
ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file
ocrodjvu --save-script script-file [option...] djvu-file
ocrodjvu --in-place [option...] djvu-file
ocrodjvu --dry-run [option...] djvu-file
ocrodjvu {--version | --help | -h | --list-engines | --list-languages}
DESCRIPTION
ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files. The following OCR engines are supported:
• OCRopus[1] (internally, ocrodjvu calls ocroscript's recognize (or rec-tess) command, so that ultimately Tesseract acts as the OCR backend);
• Cuneiform for Linux[2];
• Ocrad[3];
• GOCR[4];
• Stand-alone Tesseract[5].
OPTIONS
OCR engine options
-e, --engine=engine-id
Use this OCR engine. The default is ‘ocropus’ (OCRopus).
--list-engines
Print list of available OCR engines.
Options controlling output
It is mandatory to use exactly one of the following options:
-o, --save-bundled=output-djvu-file
Save OCR results as a bundled multi-page document into output-djvu-file.
-i, --save-indirect=index-djvu-file
Save OCR results as an indirect multi-page
document. Use index-djvu-file as the index file name; put the component
files into the same directory. The directory must exist and be writable.
--save-script=script-file
Save a djvused script with OCR results
into script-file.
--in-place
Save OCR results in place.
(Use this option to retain compatibility with ocrodjvu < 0.2.)
--dry-run
Don't change any files; discard the OCR results.
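The "exactly one output option" rule above can be sketched as a small Python helper that builds an argument list for ocrodjvu. This is only an illustration: build_ocrodjvu_argv and OUTPUT_MODES are hypothetical names, not part of ocrodjvu, and the helper only assembles the command line without running anything. The option spellings are the long forms listed above.

```python
from typing import List, Optional

# Output-mode options from the list above. The value is True when the
# option takes a file argument (hypothetical table, for illustration only).
OUTPUT_MODES = {
    "--save-bundled": True,
    "--save-indirect": True,
    "--save-script": True,
    "--in-place": False,
    "--dry-run": False,
}


def build_ocrodjvu_argv(mode: str, djvu_file: str,
                        target: Optional[str] = None) -> List[str]:
    """Return an argument list containing exactly one output-mode option.

    Hypothetical helper; it does not invoke ocrodjvu itself.
    """
    if mode not in OUTPUT_MODES:
        raise ValueError(f"unknown output mode: {mode}")
    needs_target = OUTPUT_MODES[mode]
    if needs_target and target is None:
        raise ValueError(f"{mode} requires a file argument")
    if not needs_target and target is not None:
        raise ValueError(f"{mode} takes no file argument")
    argv = ["ocrodjvu"]
    # Options with a file argument use the --option=file form.
    argv.append(f"{mode}={target}" if needs_target else mode)
    argv.append(djvu_file)
    return argv
```

Assuming ocrodjvu is installed, the resulting list could be passed to subprocess.run(), for example subprocess.run(build_ocrodjvu_argv("--save-bundled", "in.djvu", "out.djvu"), check=True).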
Text segmentation options
-t lines, --details=lines
Record location of every line. Don't record locations of particular words or characters.
This is the default for OCRopus 0.2. This option is ineffective with stand-alone Tesseract 2.0.
-t words, --details=words
Record location of every line and every word.
Don't record locations of particular characters.
This is the default for most OCR engines.
This option is ineffective with OCRopus 0.2 and stand-alone Tesseract 2.0.
-t chars, --details=chars
Record location of every line, every word and
every character.
This option is ineffective with OCRopus 0.2 and stand-alone Tesseract 2.0.
--word-segmentation=simple
Consider each non-empty sequence of
non-whitespace characters a single word.
This is the default, despite being linguistically incorrect.
--word-segmentation=uax29
Use the Unicode Text Segmentation[6]
algorithm to break lines into words.
This option breaks assumptions of some DjVu tools that words are separated by
spaces, and therefore it is not recommended.
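The rule behind --word-segmentation=simple (every non-empty run of non-whitespace characters is one word) can be illustrated with a short Python sketch. It is not ocrodjvu's own code; simple_words is a hypothetical name, and the character offsets merely stand in for the word locations that ocrodjvu records.

```python
import re
from typing import List, Tuple


def simple_words(line: str) -> List[Tuple[int, int, str]]:
    """Split a line into (start, end, text) for each run of non-whitespace."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\S+", line)]


# Punctuation stays attached to the neighbouring word.
print(simple_words("Don't panic, please."))
# [(0, 5, "Don't"), (6, 12, 'panic,'), (13, 20, 'please.')]
```

The attached punctuation is what makes this rule linguistically incorrect. The Unicode Text Segmentation algorithm would generally split such punctuation off as separate segments, which is why its output no longer matches the one-word-per-space assumption mentioned above.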
Other options
--clear-text
Remove existing hidden text if present in the pages not selected for OCR.
(Use this option to retain compatibility with ocrodjvu < 0.2.)
--ocr-only
Don't save pages that were not
processed.
-l, --language=language-id
Set recognition language. language-id
is typically an ISO 639-2/T three-letter code.
For OCRopus, the default is ‘eng’ (English), unless the
tesslanguage environment variable is set. For other OCR engines, the
default is always ‘eng’.
--list-languages
Print list of available languages for the
currently selected OCR engine.
--render=mask
Render only masks of page images.
This is the default.
--render=foreground
Render only foreground layers of page
images.
--render=all
Render all layers of page images.
This option is necessary to OCR DjVu files with invalid foreground/background
separation.
-p, --pages=page-range
Specifies pages to process. page-range
is a comma-separated list of sub-ranges. Each sub-range is either a single
page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Pages
are numbered from 1.
The default is to process all pages.
-j, --jobs=n
Start up to n OCR processes.
--version
Output version information and exit.
-h, --help
Display help and exit.
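The page-range syntax accepted by --pages (comma-separated sub-ranges, each a single page or a contiguous first-last range, numbered from 1) can be expanded with a sketch like the following. parse_page_range is a hypothetical name and this is not ocrodjvu's actual parser. Rejecting descending ranges such as 5-3 is an assumption of this sketch; the manual does not say how they are handled.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a page-range string such as '17,37-42' into page numbers."""
    pages: List[int] = []
    for sub in spec.split(","):
        sub = sub.strip()
        first, sep, last = sub.partition("-")
        start = int(first)            # raises ValueError on non-numeric input
        end = int(last) if sep else start
        # Pages are numbered from 1; descending ranges are rejected here
        # (an assumption of this sketch).
        if start < 1 or end < start:
            raise ValueError(f"invalid sub-range: {sub!r}")
        pages.extend(range(start, end + 1))
    return pages


print(parse_page_range("17,37-42"))
# [17, 37, 38, 39, 40, 41, 42]
```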
Advanced options
-D, --debug
To ease debugging, don't delete intermediate files.
-X key=value
This option allows you to control some details of how ocrodjvu operates.
--on-error=abort
Stop program execution when an exceptional
situation (e.g., malformed output from the OCR engine, internal ocrodjvu
error, etc.) occurs.
This is the default.
--on-error=resume
Attempt to recover from exceptional
situations.
This option is strongly discouraged.
--html5
Use an HTML5 parser[7], which is more
robust but slower than the default parser.
ENVIRONMENT
The following environment variables affect ocrodjvu:
tesslanguage
Recognition language for Tesseract.
(Use of this variable is deprecated in favor of the --language option.)
TMPDIR
ocrodjvu makes heavy use of temporary files.
It will store them in a directory specified by this variable. The default is
/tmp.
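When ocrodjvu is started from a script, TMPDIR can be set for that one process without changing the caller's environment. The following Python sketch shows one way to prepare such an environment; env_with_tmpdir is a hypothetical helper and the directory name in the usage note is only an example.

```python
import os
from typing import Dict, Optional


def env_with_tmpdir(tmpdir: str,
                    base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with TMPDIR pointing at tmpdir."""
    # Copy so that neither os.environ nor the caller's mapping is modified.
    env = dict(os.environ if base is None else base)
    env["TMPDIR"] = tmpdir
    return env
```

The result can be passed as the env argument of subprocess.run(), for example subprocess.run(["ocrodjvu", "--dry-run", "in.djvu"], env=env_with_tmpdir("/var/tmp/ocr")), assuming ocrodjvu is installed and the directory exists and is writable.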
BUGS
Tesseract 3.00 is affected by a bug[8] that makes it produce invalid hOCR output in certain circumstances. ocrodjvu does not try to recover from this fault (which couldn't be done reliably anyway) unless you pass the -X fix-html=1 option.
When using Tesseract ≥ 3.00, extracting bounding boxes of particular characters (which happens when either --details=chars or --word-segmentation=uax29 is used) is inefficient. This is due to limitations of the Tesseract command-line interface.
SEE ALSO
AUTHOR
Jakub Wilk <jwilk@jwilk.net>
Author.
NOTES
1. OCRopus
2. Cuneiform for Linux
3. Ocrad
4. GOCR
5. Tesseract
6. Unicode Text Segmentation
7. HTML5 parser
03/10/2012 | ocrodjvu 0.7.9 |