DJVU2HOCR(1)

djvu2hocr manual

NAME

djvu2hocr - DjVu to hOCR converter

SYNOPSIS

djvu2hocr [option...] djvu-file

djvu2hocr {--version | --help | -h}

DESCRIPTION

djvu2hocr converts hidden text from a DjVu file to the hOCR[1] format.

OPTIONS

Input selection options

-p, --pages=page-range

Specifies pages to convert. page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Pages are numbered from 1.

The default is to convert all pages.
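The page-range syntax described above can be illustrated with a minimal Python sketch. The function name parse_page_range is hypothetical and this is not djvu2hocr's own parser; it only shows how a string such as 17,37-42 expands into 1-based page numbers, assuming that an empty or reversed sub-range is an error.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a page-range string such as "17,37-42" into page numbers."""
    pages: list[int] = []
    for sub_range in spec.split(","):
        # A sub-range is either "N" or "N-M".
        first, sep, last = sub_range.partition("-")
        start = int(first)
        end = int(last) if sep else start
        # Pages are numbered from 1; a reversed range is rejected (assumption).
        if start < 1 or end < start:
            raise ValueError(f"invalid sub-range: {sub_range!r}")
        pages.extend(range(start, end + 1))
    return pages


print(parse_page_range("17,37-42"))  # [17, 37, 38, 39, 40, 41, 42]
```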

Text segmentation options

--word-segmentation=simple

Use the same word segmentation as found in the DjVu file.

This is the default.

--word-segmentation=uax29

Use the Unicode Text Segmentation[2] algorithm to break lines into words, possibly fixing word segmentation found in the DjVu file.

HTML output options

--title=title

Specifies the document title.

The default is “DjVu hidden text layer”.

--css=style

Add the specified CSS style to the document.

For example, --css='.ocrx_line { display: block; }' can be used to visually preserve line breaks.
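As a rough illustration of how the options documented so far combine, the following Python sketch assembles a djvu2hocr argument list and prints it in shell-quoted form. The helper build_command and the file name book.djvu are hypothetical; only the option names come from this manual, and the sketch does not run the program.

```python
import shlex


def build_command(djvu_file, pages=None, segmentation=None, title=None, css=None):
    """Build an argument list using only options documented in this manual."""
    argv = ["djvu2hocr"]
    if pages is not None:
        argv.append(f"--pages={pages}")
    if segmentation is not None:
        argv.append(f"--word-segmentation={segmentation}")
    if title is not None:
        argv.append(f"--title={title}")
    if css is not None:
        argv.append(f"--css={css}")
    # The DjVu file is the final, mandatory argument.
    argv.append(djvu_file)
    return argv


argv = build_command(
    "book.djvu",  # hypothetical input file
    pages="1-3,7",
    segmentation="uax29",
    title="My book",
    css=".ocrx_line { display: block; }",
)
# shlex.join quotes arguments containing spaces or braces for a POSIX shell.
print(shlex.join(argv))
```

Passing the options as a list, rather than as one shell string, avoids quoting problems with CSS values that contain spaces and braces.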

Other options

--version

Output version information and exit.

-h, --help

Display help and exit.

PORTABILITY

djvu2hocr uses a custom extension to hOCR to retain characters that cannot be directly represented in an HTML/XML document. For example, the control character BEL (^G, U+0007) is converted into the following HTML chunk: <span class="djvu_char" title="#x07"> </span>
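The following Python sketch shows one way such an encoding could work; it is not djvu2hocr's actual code. It assumes that the affected characters are the C0 control characters other than tab, line feed and carriage return (the ones XML 1.0 does not allow), and that the code point is written as two lowercase hex digits, which matches the BEL example above. The function names are hypothetical.

```python
from html import escape


def is_unrepresentable(ch: str) -> bool:
    # Assumption: only C0 controls other than tab, LF and CR need the extension.
    return ord(ch) < 0x20 and ch not in "\t\n\r"


def encode_text(text: str) -> str:
    """Escape text for HTML, wrapping unrepresentable characters in a span."""
    parts = []
    for ch in text:
        if is_unrepresentable(ch):
            # The title attribute carries the code point; the content is a space.
            parts.append(f'<span class="djvu_char" title="#x{ord(ch):02x}"> </span>')
        else:
            parts.append(escape(ch))
    return "".join(parts)


print(encode_text("ding\x07dong"))
# ding<span class="djvu_char" title="#x07"> </span>dong
```

A consumer of the hOCR output could reverse the mapping by reading the title attribute of each djvu_char span.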

BUGS

Please report bugs at: https://github.com/jwilk/ocrodjvu/issues

NOTES

1. hOCR

https://docs.google.com/View?docid=dfxcv4vc_67g844kf

2. Unicode Text Segmentation

https://unicode.org/reports/tr29/

2018-07-12

djvu2hocr 0.10.4
