HOCR2DJVUSED(1)

hocr2djvused manual


NAME

hocr2djvused - hOCR to djvused script converter

SYNOPSIS

hocr2djvused [option...]

DESCRIPTION

hocr2djvused reads an hOCR[1] file (as produced by OCRopus[2], Cuneiform[3], or Tesseract[4]) from standard input and converts it to a djvused script.
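To make the input side concrete: hOCR is HTML in which layout elements carry a bounding box in their title attribute, in the form "bbox x0 y0 x1 y1". The following is a minimal sketch of pulling those boxes out with Python's standard library. It is not how hocr2djvused is implemented; the names HocrBoxes, extract_boxes and SAMPLE are hypothetical and exist only for this illustration.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrBoxes(HTMLParser):
    """Collect (class, bbox) pairs from elements that carry an hOCR bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        match = BBOX_RE.search(attrs.get("title") or "")
        if match:
            box = tuple(int(value) for value in match.groups())
            self.boxes.append((attrs.get("class"), box))


def extract_boxes(hocr_text):
    """Return every (class, bbox) pair found in the hOCR markup."""
    parser = HocrBoxes()
    parser.feed(hocr_text)
    return parser.boxes


# Hypothetical one-line page, made up for this example.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 50 100 550 140'>
    <span class='ocrx_word' title='bbox 50 100 200 140'>Hello</span>
    <span class='ocrx_word' title='bbox 220 100 550 140'>world</span>
  </span>
</div>
"""

for css_class, box in extract_boxes(SAMPLE):
    print(css_class, box)
```

The page, line and word boxes recovered this way are the kind of information that the text segmentation options below decide whether to keep.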

OPTIONS

Text segmentation options

-t lines, --details=lines

Record the location of every line. Don't record the locations of individual words or characters.

-t words, --details=words

Record the location of every line and every word. Don't record the locations of individual characters.

This is the default.

-t chars, --details=chars

Record the location of every line, every word, and every character.
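The three detail levels can be pictured as three depths of nesting in the text zones that a djvused script carries. The sketch below renders one line at each level as a parenthesised expression of the form (kind x0 y0 x1 y1 ...). It is an illustration only and makes no claim about the exact output of hocr2djvused; the helper names (sexpr, quote, render_line) and the sample data are hypothetical.

```python
def quote(text):
    """Wrap text in double quotes, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def sexpr(kind, bbox, children):
    """Format one zone as (kind x0 y0 x1 y1 child...)."""
    parts = [kind] + [str(value) for value in bbox] + children
    return "(" + " ".join(parts) + ")"


def render_line(line, details):
    """Render a line at one of the detail levels: lines, words or chars."""
    if details not in ("lines", "words", "chars"):
        raise ValueError("unknown detail level: " + details)
    words = line["words"]
    if details == "lines":
        # Only the line box is kept; its text is the words joined by spaces.
        text = " ".join(word["text"] for word in words)
        return sexpr("line", line["bbox"], [quote(text)])
    rendered = []
    for word in words:
        if details == "chars":
            # One zone per character, nested inside the word zone.
            children = [
                sexpr("char", box, [quote(char)])
                for char, box in zip(word["text"], word["chars"])
            ]
        else:
            children = [quote(word["text"])]
        rendered.append(sexpr("word", word["bbox"], children))
    return sexpr("line", line["bbox"], rendered)


# Hypothetical sample line with per-word and per-character boxes.
LINE = {
    "bbox": (50, 660, 150, 700),
    "words": [
        {"text": "Hi", "bbox": (50, 660, 90, 700),
         "chars": [(50, 660, 70, 700), (70, 660, 90, 700)]},
        {"text": "you", "bbox": (100, 660, 150, 700),
         "chars": [(100, 660, 115, 700), (115, 660, 130, 700),
                   (130, 660, 150, 700)]},
    ],
}

for level in ("lines", "words", "chars"):
    print(render_line(LINE, level))
```

Each deeper level adds zones, so the script grows with the amount of detail requested.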

--word-segmentation=simple

Consider each non-empty sequence of non-whitespace characters a single word.

This is the default, despite being linguistically incorrect.

--word-segmentation=uax29

Use the Unicode Text Segmentation[5] algorithm to break lines into words.

This option breaks the assumption made by some DjVu tools that words are separated by spaces, and is therefore not recommended.
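The difference between the two segmentation modes can be sketched in a few lines of Python. The simple mode matches the description above directly: every maximal run of non-whitespace characters is one word. The second function is only a crude stand-in that separates punctuation from word characters; it is not an implementation of the Unicode Text Segmentation algorithm, and both function names are hypothetical.

```python
import re


def segment_simple(line):
    """Each maximal run of non-whitespace characters is one word."""
    return line.split()


def segment_rough(line):
    """Crude stand-in, NOT UAX #29: split word characters from punctuation."""
    return re.findall(r"\w+|[^\w\s]", line)


LINE = "Wait, what?"

simple = segment_simple(LINE)
rough = segment_rough(LINE)

print(simple)
print(rough)

# Joining simple words with spaces reproduces the line; joining the
# finer-grained tokens does not, which is the assumption that breaks.
print(" ".join(simple) == LINE)
print(" ".join(rough) == LINE)
```

With the simple mode, punctuation stays attached to the neighbouring word, so a tool can rebuild the line by joining words with single spaces. Once punctuation becomes a word of its own, that reconstruction inserts spaces that were never in the text.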

Other options

--rotation=n

Assume that DjVu pages are rotated by n degrees.

--page-size=widthxheight

Specify that the page size is width pixels × height pixels.

This option is required for hOCR generated by Cuneiform (< 0.8) and superfluous otherwise.
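As an illustration of the widthxheight syntax, the sketch below parses such a value and then uses the height to flip a box from a top-left origin to a bottom-left origin. The flip is an assumption about why a converter might need the page size (hOCR boxes are measured from the top-left corner, DjVu text zones from the bottom-left); the manual itself does not state the reason. The function names are hypothetical.

```python
def parse_page_size(spec):
    """Parse a 'WIDTHxHEIGHT' string such as '600x800' into two integers."""
    width, separator, height = spec.partition("x")
    if not separator:
        raise ValueError("expected WIDTHxHEIGHT, got: " + spec)
    width, height = int(width), int(height)
    if width <= 0 or height <= 0:
        raise ValueError("page dimensions must be positive")
    return width, height


def flip_bbox(bbox, page_height):
    """Convert a top-left-origin box (x0, y0, x1, y1) to bottom-left origin."""
    x0, y0, x1, y1 = bbox
    # The old bottom edge (y1) becomes the new lower y coordinate.
    return (x0, page_height - y1, x1, page_height - y0)


width, height = parse_page_size("600x800")
print(width, height)
print(flip_bbox((50, 100, 550, 140), height))
```

When the hOCR file already contains a page-level bounding box, the same height can be read from the file, which would be consistent with the option being superfluous for most inputs.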

--html5

Use an HTML5 parser[6], which is more robust but slower than the default parser.

--version

Output version information and exit.

-h, --help

Display help and exit.

AUTHOR

Jakub Wilk <jwilk@jwilk.net>

Author.

NOTES

1. hOCR

http://docs.google.com/View?docid=dfxcv4vc_67g844kf

2. OCRopus

http://ocropus.googlecode.com/

3. Cuneiform

http://launchpad.net/cuneiform-linux

4. Tesseract

http://tesseract-ocr.googlecode.com/

5. Unicode Text Segmentation

http://unicode.org/reports/tr29/

6. HTML5 parser

http://www.whatwg.org/specs/web-apps/current-work/#html-parser

03/10/2012

hocr2djvused 0.7.9

Source file:	hocr2djvused.1.en.gz (from ocrodjvu 0.7.9-1)
Source last updated:	2012-04-23T21:04:31Z