
HOCR2DJVUSED(1) hocr2djvused manual HOCR2DJVUSED(1)

NAME

hocr2djvused - hOCR to djvused script converter

SYNOPSIS

hocr2djvused [option...]

DESCRIPTION

hocr2djvused reads an hOCR[1] file (as produced by OCRopus[2], Cuneiform[3] or Tesseract[4]) from the standard input and converts it to a djvused script.
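
A minimal sketch of driving the converter from Python follows. It assumes that hocr2djvused is on PATH and that the generated script is written to the standard output; the function name run_converter and its command parameter are hypothetical, introduced only for illustration.

```python
import subprocess


def run_converter(hocr_bytes, command=("hocr2djvused",)):
    """Feed hOCR data to the converter on standard input.

    Assumption: the converter writes the djvused script to standard
    output. `command` is a hypothetical hook that lets a different
    program or extra options be substituted.
    """
    result = subprocess.run(
        list(command),
        input=hocr_bytes,        # hOCR document goes to stdin
        stdout=subprocess.PIPE,  # capture whatever the converter prints
        check=True,              # raise CalledProcessError on failure
    )
    return result.stdout
```

The returned bytes can be saved to a file and then applied to a document with djvused(1), for instance with its -f (script file) and -s (save) options.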

OPTIONS

Text segmentation options

-t lines, --details=lines
Record location of every line. Don't record locations of particular words or characters.
-t words, --details=words
Record location of every line and every word. Don't record locations of particular characters.
 
This is the default.
-t chars, --details=chars
Record location of every line, every word and every character.
--word-segmentation=simple
Consider each non-empty sequence of non-whitespace characters a single word.
 
This is the default, despite being linguistically incorrect.
--word-segmentation=uax29
Use the Unicode Text Segmentation[5] algorithm to break lines into words.
 
This option breaks the assumption made by some DjVu tools that words are separated by spaces, and therefore it is not recommended.
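
The "simple" rule described above can be sketched in a few lines of Python. This is an illustration of the rule as stated, not the program's actual implementation; the function name simple_words is hypothetical.

```python
import re


def simple_words(line):
    """Split a line into (offset, word) pairs.

    A word is any non-empty, maximal run of non-whitespace characters,
    so punctuation stays attached to the neighbouring letters.
    """
    return [(m.start(), m.group()) for m in re.finditer(r"\S+", line)]


print(simple_words("Hello, world!"))
# prints [(0, 'Hello,'), (7, 'world!')]
```

Under the Unicode Text Segmentation rules, by contrast, word boundaries also fall between letters and adjacent punctuation, so segments are no longer guaranteed to be delimited by spaces.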

Other options

--rotation=n
Assume that DjVu pages are rotated by n degrees.
--page-size=widthxheight
Specify that the page size is width pixels × height pixels.
 
This option is required for hOCR generated by Cuneiform (< 0.8) and superfluous otherwise.
--html5
Use an HTML5 parser[6], which is more robust but slower than the default parser.
--version
Output version information and exit.
-h, --help
Display help and exit.
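
As an illustration of how the options above combine, the following Python sketch assembles a command line from keyword arguments. The helper build_argv and its parameter names are hypothetical; only the option spellings are taken from this manual, and the values in the usage note are arbitrary examples.

```python
def build_argv(details="words", word_segmentation="simple",
               rotation=None, page_size=None, html5=False):
    """Assemble a hocr2djvused command line as a list of arguments.

    The defaults mirror the documented defaults (words, simple).
    page_size, if given, is a (width, height) pair in pixels.
    """
    argv = [
        "hocr2djvused",
        f"--details={details}",
        f"--word-segmentation={word_segmentation}",
    ]
    if rotation is not None:
        argv.append(f"--rotation={rotation}")
    if page_size is not None:
        width, height = page_size
        argv.append(f"--page-size={width}x{height}")
    if html5:
        argv.append("--html5")
    return argv
```

For example, build_argv(details="chars", rotation=90, page_size=(2480, 3508)) yields a list ending in --rotation=90 and --page-size=2480x3508, suitable for passing to subprocess.run.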

SEE ALSO

 
ocrodjvu(1), djvused(1)

AUTHOR

Jakub Wilk <jwilk@jwilk.net>
Author.

NOTES

1.
hOCR
http://docs.google.com/View?docid=dfxcv4vc_67g844kf
2.
OCRopus
http://ocropus.googlecode.com/
3.
Cuneiform
http://launchpad.net/cuneiform-linux
4.
Tesseract
http://tesseract-ocr.googlecode.com/
5.
Unicode Text Segmentation
http://unicode.org/reports/tr29/
6.
HTML5 parser
http://www.whatwg.org/specs/web-apps/current-work/#html-parser
03/10/2012 hocr2djvused 0.7.9