.\" [created by setup.py sdist]
'\" t
.\"     Title: hocr2djvused
.\"    Author: Jakub Wilk <jwilk@jwilk.net>
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\"      Date: 04/21/2014
.\"    Manual: hocr2djvused manual
.\"    Source: hocr2djvused 0.7.18
.\"  Language: English
.\"
.TH "HOCR2DJVUSED" "1" 2014-04-21 "hocr2djvused 0\&.7\&.18" "hocr2djvused manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
hocr2djvused \- hOCR to \fBdjvused\fR script converter
.SH "SYNOPSIS"
.HP \w'\fBhocr2djvused\fR\ 'u
\fBhocr2djvused\fR [\fIoption\fR...] [\fIhocr\-file\fR...]
.SH "DESCRIPTION"
.PP
hocr2djvused reads one or more
\m[blue]\fIhOCR\fR\m[]\&\s-2\u[1]\d\s+2
files (as produced by
\m[blue]\fIOCRopus\fR\m[]\&\s-2\u[2]\d\s+2
or
\m[blue]\fICuneiform\fR\m[]\&\s-2\u[3]\d\s+2
or
\m[blue]\fITesseract\fR\m[]\&\s-2\u[4]\d\s+2) and converts them to a
\fBdjvused\fR
script\&.
.PP
Unless a filename is explicitly provided on the command line, hOCR is read from the standard input\&.
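.PP
A typical workflow might look like the following sketch\&. It assumes that the generated script is written to the standard output; the file names
\fIpage\&.html\fR,
\fIpage\&.sed\fR
and
\fIpage\&.djvu\fR
are placeholders\&.
.sp
.if n \{\
.RS 4
.\}
.nf
# convert hOCR to a djvused script (assumed to go to standard output)
hocr2djvused page\&.html > page\&.sed
# apply the script to the DjVu file and save the result
djvused \-f page\&.sed \-s page\&.djvu
.fi
.if n \{\
.RE
.\}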
.SH "OPTIONS"
.SS "Text segmentation options"
.PP
\fB\-t lines\fR, \fB\-\-details=lines\fR
.RS 4
Record the location of every line\&. Don\*(Aqt record the locations of individual words or characters\&.
.RE
.PP
\fB\-t words\fR, \fB\-\-details=words\fR
.RS 4
Record the location of every line and every word\&. Don\*(Aqt record the locations of individual characters\&.
.sp
This is the default\&.
.RE
.PP
\fB\-t chars\fR, \fB\-\-details=chars\fR
.RS 4
Record location of every line, every word and every character\&.
.RE
.PP
\fB\-\-word\-segmentation=simple\fR
.RS 4
Consider each non\-empty sequence of non\-whitespace characters a single word\&.
.sp
This is the default, despite being linguistically incorrect\&.
.RE
.PP
\fB\-\-word\-segmentation=uax29\fR
.RS 4
Use the
\m[blue]\fIUnicode Text Segmentation\fR\m[]\&\s-2\u[5]\d\s+2
algorithm to break lines into words\&.
.sp
This option breaks the assumption made by some DjVu tools that words are separated by spaces, and is therefore not recommended\&.
.RE
.SS "Other options"
.PP
\fB\-\-rotation=\fR\fB\fIn\fR\fR
.RS 4
Assume that DjVu pages are rotated by
\fIn\fR
degrees\&.
.RE
.PP
\fB\-\-page\-size=\fR\fB\fIwidth\fR\fR\fBx\fR\fB\fIheight\fR\fR
.RS 4
Specify that the page size is
\fIwidth\fR
pixels \(mu
\fIheight\fR
pixels\&.
.sp
This option is required for hOCR generated by Cuneiform (< 0\&.8) and is superfluous otherwise\&.
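.sp
For example, assuming a hypothetical Cuneiform output file
\fIcuneiform\&.html\fR
for a page of 2480 \(mu 3508 pixels:
.sp
.if n \{\
.RS 4
.\}
.nf
hocr2djvused \-\-page\-size=2480x3508 cuneiform\&.html
.fi
.if n \{\
.RE
.\}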
.RE
.PP
\fB\-\-html5\fR
.RS 4
Use an
\m[blue]\fIHTML5 parser\fR\m[]\&\s-2\u[6]\d\s+2, which is more robust but slower than the default parser\&.
.RE
.PP
\fB\-\-fix\-utf8\fR
.RS 4
Attempt to fix UTF\-8 encoding issues and eliminate unwanted control characters\&.
.sp
This option might be needed for hOCR generated by Cuneiform\&\s-2\u[7]\d\s+2
or Tesseract\&\s-2\u[8]\d\s+2\&.
.RE
.PP
\fB\-\-version\fR
.RS 4
Output version information and exit\&.
.RE
.PP
\fB\-h\fR, \fB\-\-help\fR
.RS 4
Display help and exit\&.
.RE
.SH "BUGS"
.PP
Please report bugs at:
\m[blue]\fI\%https://bitbucket.org/jwilk/ocrodjvu/issues\fR\m[]
.SH "SEE ALSO"
.PP
\fBdjvu\fR(1),
\fBocrodjvu\fR(1),
\fBdjvu2hocr\fR(1),
\fBdjvused\fR(1)
.SH "NOTES"
.IP " 1." 4
hOCR
.RS 4
\m[blue]\fI\%https://docs.google.com/View?docid=dfxcv4vc_67g844kf\fR\m[]
.RE
.IP " 2." 4
OCRopus
.RS 4
\m[blue]\fI\%https://code.google.com/p/ocropus/\fR\m[]
.RE
.IP " 3." 4
Cuneiform
.RS 4
\m[blue]\fI\%https://launchpad.net/cuneiform-linux\fR\m[]
.RE
.IP " 4." 4
Tesseract
.RS 4
\m[blue]\fI\%https://code.google.com/p/tesseract-ocr/\fR\m[]
.RE
.IP " 5." 4
Unicode Text Segmentation
.RS 4
\m[blue]\fI\%http://unicode.org/reports/tr29/\fR\m[]
.RE
.IP " 6." 4
HTML5 parser
.RS 4
\m[blue]\fI\%http://www.whatwg.org/specs/web-apps/current-work/#html-parser\fR\m[]
.RE
.IP " 7." 4
\m[blue]\fI\%https://bugs.launchpad.net/cuneiform-linux/+bug/585418\fR\m[]
.IP " 8." 4
\m[blue]\fI\%https://code.google.com/p/tesseract-ocr/issues/detail?id=690\fR\m[]