.\" [created by setup.py sdist]
'\" t
.\"     Title: djvu2hocr
.\"    Author: Jakub Wilk <jwilk@jwilk.net>
.\" Generator: DocBook XSL Stylesheets v1.79.1 <http://docbook.sf.net/>
.\"      Date: 11/22/2016
.\"    Manual: djvu2hocr manual
.\"    Source: djvu2hocr 0.10.1
.\"  Language: English
.\"
.TH "DJVU2HOCR" "1" 2016-11-22 "djvu2hocr 0\&.10\&.1" "djvu2hocr manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
djvu2hocr \- DjVu to hOCR converter
.SH "SYNOPSIS"
.HP \w'\fBdjvu2hocr\fR\ 'u
\fBdjvu2hocr\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBdjvu2hocr\fR\ 'u
\fBdjvu2hocr\fR {\fB\-\-version\fR | \fB\-\-help\fR | \fB\-h\fR}
.SH "DESCRIPTION"
.PP
djvu2hocr converts hidden text from a DjVu file to the
\m[blue]\fIhOCR\fR\m[]\&\s-2\u[1]\d\s+2
format\&.
.SH "OPTIONS"
.SS "Input selection options"
.PP
\fB\-p\fR, \fB\-\-pages=\fR\fB\fIpage\-range\fR\fR
.RS 4
Specifies pages to convert\&.
\fIpage\-range\fR
is a comma\-separated list of sub\-ranges\&. Each sub\-range is either a single page (e\&.g\&.\ \&17) or a contiguous range of pages (e\&.g\&.\ \&37\-42)\&. Pages are numbered from 1\&.
.sp
The default is to convert all pages\&.
.RE
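.PP
The sub\-range syntax above can be illustrated with a short Python sketch\&. It is not taken from djvu2hocr: the function name parse_page_range and the page_count parameter are hypothetical, and keeping pages in the order given (without removing duplicates) is an assumption\&.
.sp
.if n \{\
.RS 4
.\}
.nf
```python
def parse_page_range(spec, page_count):
    """Expand a spec such as "1-3,17" into a list of 1-based page numbers.

    Hypothetical helper for illustration; not part of djvu2hocr.
    """
    pages = []
    for sub_range in spec.split(","):
        # A sub-range is either "N" (single page) or "N-M" (contiguous range).
        first, sep, last = sub_range.partition("-")
        start = int(first)
        end = int(last) if sep else start
        # Pages are numbered from 1; reject empty or out-of-bounds ranges.
        if not 1 <= start <= end <= page_count:
            raise ValueError("invalid sub-range: " + repr(sub_range))
        pages.extend(range(start, end + 1))
    return pages
```
.fi
.if n \{\
.RE
.\}
.PP
With this sketch, parse_page_range("1\-3,17", 50) yields pages 1, 2, 3 and 17\&.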
.SS "Text segmentation options"
.PP
\fB\-\-word\-segmentation=simple\fR
.RS 4
Use the same word segmentation as found in the DjVu file\&.
.sp
This is the default\&.
.RE
.PP
\fB\-\-word\-segmentation=uax29\fR
.RS 4
Use the
\m[blue]\fIUnicode Text Segmentation\fR\m[]\&\s-2\u[2]\d\s+2
algorithm to break lines into words, which may correct the word segmentation found in the DjVu file\&.
.RE
.SS "HTML output options"
.PP
\fB\-\-title=\fR\fB\fItitle\fR\fR
.RS 4
Specifies the document title\&.
.sp
The default is
\(lqDjVu hidden text layer\(rq\&.
.RE
.PP
\fB\-\-css=\fR\fB\fIstyle\fR\fR
.RS 4
Add the specified CSS style to the document\&.
.sp
For example,
\fB\-\-css=\*(Aq\&.ocrx_line { display: block; }\*(Aq\fR
can be used to visually preserve line breaks\&.
.RE
.SS "Other options"
.PP
\fB\-\-version\fR
.RS 4
Output version information and exit\&.
.RE
.PP
\fB\-h\fR, \fB\-\-help\fR
.RS 4
Display help and exit\&.
.RE
.SH "PORTABILITY"
.PP
djvu2hocr uses a custom extension to hOCR to retain characters that cannot be directly represented in an HTML/XML document\&. For example, the control character BEL (^G, U+0007) is converted into the following HTML chunk:
.sp
.if n \{\
.RS 4
.\}
.nf
<span class="djvu_char" title="#x07"> </span>
.fi
.if n \{\
.RE
.\}
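.PP
The conversion described above might look roughly like the following Python sketch\&. It is an illustration under stated assumptions, not djvu2hocr\*(Aqs actual code: the names is_xml_char and to_hocr_text are hypothetical, \(lqcannot be directly represented\(rq is taken to mean \(lqnot a valid XML 1\&.0 character\(rq, and the lowercase two\-digit hexadecimal form of the title attribute is extrapolated from the single BEL example\&.
.sp
.if n \{\
.RS 4
.\}
.nf
```python
from xml.sax.saxutils import escape

# Assumed template, modelled on the BEL example in this manual page.
SPAN = '<span class="djvu_char" title="#x{:02x}"> </span>'


def is_xml_char(ch):
    """Return True if ch matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def to_hocr_text(text):
    """Escape text for HTML/XML, wrapping unrepresentable characters."""
    parts = []
    for ch in text:
        if is_xml_char(ch):
            # Ordinary character: only &, < and > need escaping.
            parts.append(escape(ch))
        else:
            # Unrepresentable character: record its code point in the title.
            parts.append(SPAN.format(ord(ch)))
    return "".join(parts)
```
.fi
.if n \{\
.RE
.\}
.PP
A consumer that needs the original character back could, under the same assumptions, read the hexadecimal code point from the title attribute of each djvu_char span\&.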
.SH "BUGS"
.PP
Please report bugs at:
\m[blue]\fI\%https://github.com/jwilk/ocrodjvu/issues\fR\m[]
.SH "SEE ALSO"
.PP
\fBdjvu\fR(1),
\fBhocr2djvused\fR(1),
\fBocrodjvu\fR(1)
.SH "NOTES"
.IP " 1." 4
hOCR
.RS 4
\m[blue]\fI\%https://docs.google.com/View?docid=dfxcv4vc_67g844kf\fR\m[]
.RE
.IP " 2." 4
Unicode Text Segmentation
.RS 4
\m[blue]\fI\%http://unicode.org/reports/tr29/\fR\m[]
.RE