.\" [created by setup.py sdist]
'\" t
.\" Title: hocr2djvused
.\" Author: Jakub Wilk
.\" Generator: DocBook XSL Stylesheets v1.78.1
.\" Date: 04/21/2014
.\" Manual: hocr2djvused manual
.\" Source: hocr2djvused 0.7.18
.\" Language: English
.\"
.TH "HOCR2DJVUSED" "1" 2014-04-21 "hocr2djvused 0\&.7\&.18" "hocr2djvused manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
hocr2djvused \- hOCR to \fBdjvused\fR script converter
.SH "SYNOPSIS"
.HP \w'\fBhocr2djvused\fR\ 'u
\fBhocr2djvused\fR [\fIoption\fR...] [\fIhocr\-file\fR...]
.SH "DESCRIPTION"
.PP
hocr2djvused
reads one or more
\m[blue]\fIhOCR\fR\m[]\&\s-2\u[1]\d\s+2
files (as produced by
\m[blue]\fIOCRopus\fR\m[]\&\s-2\u[2]\d\s+2,
\m[blue]\fICuneiform\fR\m[]\&\s-2\u[3]\d\s+2
or
\m[blue]\fITesseract\fR\m[]\&\s-2\u[4]\d\s+2)
and converts them to a
\fBdjvused\fR
script\&.
.PP
Unless a filename is explicitly provided on the command line, hOCR is read from the standard input\&.
.SH "OPTIONS"
.SS "Text segmentation options"
.PP
\fB\-t lines\fR, \fB\-\-details=lines\fR
.RS 4
Record the location of every line\&. Don\*(Aqt record the locations of individual words or characters\&.
.RE
.PP
\fB\-t words\fR, \fB\-\-details=words\fR
.RS 4
Record the location of every line and every word\&. Don\*(Aqt record the locations of individual characters\&.
.sp
This is the default\&.
.RE
.PP
\fB\-t chars\fR, \fB\-\-details=chars\fR
.RS 4
Record the location of every line, every word and every character\&.
.RE
.PP
\fB\-\-word\-segmentation=simple\fR
.RS 4
Consider each non\-empty sequence of non\-whitespace characters a single word\&.
.sp
This is the default, despite being linguistically incorrect\&.
.RE
.PP
\fB\-\-word\-segmentation=uax29\fR
.RS 4
Use the
\m[blue]\fIUnicode Text Segmentation\fR\m[]\&\s-2\u[5]\d\s+2
algorithm to break lines into words\&.
.sp
This option breaks the assumption made by some DjVu tools that words are separated by spaces, and is therefore not recommended\&.
.RE
.SS "Other options"
.PP
\fB\-\-rotation=\fR\fB\fIn\fR\fR
.RS 4
Assume that DjVu pages are rotated by
\fIn\fR
degrees\&.
.RE
.PP
\fB\-\-page\-size=\fR\fB\fIwidth\fR\fR\fBx\fR\fB\fIheight\fR\fR
.RS 4
Specify that the page size is
\fIwidth\fR
pixels \(mu
\fIheight\fR
pixels\&.
.sp
This option is required for hOCR generated by Cuneiform (< 0\&.8) and superfluous otherwise\&.
.RE
.PP
\fB\-\-html5\fR
.RS 4
Use an
\m[blue]\fIHTML5 parser\fR\m[]\&\s-2\u[6]\d\s+2,
which is more robust but slower than the default parser\&.
.RE
.PP
\fB\-\-fix\-utf8\fR
.RS 4
Attempt to fix UTF\-8 encoding issues and eliminate unwanted control characters\&.
.sp
This option might be needed for hOCR generated by Cuneiform\&\s-2\u[7]\d\s+2
or Tesseract\&\s-2\u[8]\d\s+2\&.
.RE
.PP
\fB\-\-version\fR
.RS 4
Output version information and exit\&.
.RE
.PP
\fB\-h\fR, \fB\-\-help\fR
.RS 4
Display help and exit\&.
.RE
.SH "BUGS"
.PP
Please report bugs at:
\m[blue]\fI\%https://bitbucket.org/jwilk/ocrodjvu/issues\fR\m[]
.SH "SEE ALSO"
.PP
\fBdjvu\fR(1),
\fBocrodjvu\fR(1),
\fBdjvu2hocr\fR(1),
\fBdjvused\fR(1)
.SH "NOTES"
.IP " 1." 4
hOCR
.RS 4
\m[blue]\fI\%https://docs.google.com/View?docid=dfxcv4vc_67g844kf\fR\m[]
.RE
.IP " 2." 4
OCRopus
.RS 4
\m[blue]\fI\%https://code.google.com/p/ocropus/\fR\m[]
.RE
.IP " 3." 4
Cuneiform
.RS 4
\m[blue]\fI\%https://launchpad.net/cuneiform-linux\fR\m[]
.RE
.IP " 4." 4
Tesseract
.RS 4
\m[blue]\fI\%https://code.google.com/p/tesseract-ocr/\fR\m[]
.RE
.IP " 5." 4
Unicode Text Segmentation
.RS 4
\m[blue]\fI\%http://unicode.org/reports/tr29/\fR\m[]
.RE
.IP " 6." 4
HTML5 parser
.RS 4
\m[blue]\fI\%http://www.whatwg.org/specs/web-apps/current-work/#html-parser\fR\m[]
.RE
.IP " 7." 4
\m[blue]\fI\%https://bugs.launchpad.net/cuneiform-linux/+bug/585418\fR\m[]
.IP " 8." 4
\m[blue]\fI\%https://code.google.com/p/tesseract-ocr/issues/detail?id=690\fR\m[]
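.SH "EXAMPLES"
.PP
A minimal sketch of one possible workflow, assuming that the hOCR pages correspond, in order, to the pages of an existing DjVu document\&. The file names
page\&.html,
page\&.sed
and
page\&.djvu
are hypothetical placeholders, and the
\fB\-f\fR
and
\fB\-s\fR
options belong to
\fBdjvused\fR(1),
not to hocr2djvused:
.sp
.RS 4
.nf
# Convert hOCR to a djvused script, recording line locations only
hocr2djvused \-\-details=lines page\&.html > page\&.sed

# Run the script against the DjVu file and save the result
djvused \-f page\&.sed \-s page\&.djvu
.fi
.RE
.PP
Writing the script to an intermediate file, rather than piping it, makes it possible to inspect the generated commands before the document is modified\&.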