.\" [created by setup.py sdist]
'\" t
.\" Title: djvu2hocr
.\" Author: Jakub Wilk
.\" Generator: DocBook XSL Stylesheets v1.79.1
.\" Date: 11/22/2016
.\" Manual: djvu2hocr manual
.\" Source: djvu2hocr 0.10.1
.\" Language: English
.\"
.TH "DJVU2HOCR" "1" 2016-11-22 "djvu2hocr 0\&.10\&.1" "djvu2hocr manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
djvu2hocr \- DjVu to hOCR converter
.SH "SYNOPSIS"
.HP \w'\fBdjvu2hocr\fR\ 'u
\fBdjvu2hocr\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBdjvu2hocr\fR\ 'u
\fBdjvu2hocr\fR {\fB\-\-version\fR | \fB\-\-help\fR | \fB\-h\fR}
.SH "DESCRIPTION"
.PP
djvu2hocr converts hidden text from a DjVu file to the
\m[blue]\fIhOCR\fR\m[]\&\s-2\u[1]\d\s+2
format\&.
.SH "OPTIONS"
.SS "Input selection options"
.PP
\fB\-p\fR, \fB\-\-pages=\fR\fB\fIpage\-range\fR\fR
.RS 4
Specifies pages to convert\&.
\fIpage\-range\fR
is a comma\-separated list of sub\-ranges\&. Each sub\-range is either a single page (e\&.g\&.\ \&17) or a contiguous range of pages (e\&.g\&.\ \&37\-42)\&. Pages are numbered from 1\&.
.sp
The default is to convert all pages\&.
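.sp
As an illustration of the sub\-range syntax, the following invocation would select page 1 and pages 5 through 8 (book\&.djvu is a hypothetical file name used only for this example):
.sp
.nf
.\" book.djvu is a hypothetical input file, assumed for illustration only
\fBdjvu2hocr \-\-pages=1,5\-8 book\&.djvu\fR
.fi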
.RE
.SS "Text segmentation options"
.PP
\fB\-\-word\-segmentation=simple\fR
.RS 4
Use the same word segmentation as found in the DjVu file\&.
.sp
This is the default\&.
.RE
.PP
\fB\-\-word\-segmentation=uax29\fR
.RS 4
Use the
\m[blue]\fIUnicode Text Segmentation\fR\m[]\&\s-2\u[2]\d\s+2
algorithm to break lines into words, possibly fixing word segmentation found in the DjVu file\&.
.RE
.SS "HTML output options"
.PP
\fB\-\-title=\fR\fB\fItitle\fR\fR
.RS 4
Specifies the document title\&.
.sp
The default is
\(lqDjVu hidden text layer\(rq\&.
.RE
.PP
\fB\-\-css=\fR\fB\fIstyle\fR\fR
.RS 4
Add the specified CSS style to the document\&.
.sp
For example,
\fB\-\-css=\*(Aq\&.ocrx_line { display: block; }\*(Aq\fR
can be used to visually preserve line breaks\&.
.RE
.SS "Other options"
.PP
\fB\-\-version\fR
.RS 4
Output version information and exit\&.
.RE
.PP
\fB\-h\fR, \fB\-\-help\fR
.RS 4
Display help and exit\&.
.RE
.SH "PORTABILITY"
.PP
djvu2hocr uses a custom extension to hOCR to retain characters which cannot be directly represented in an HTML/XML document\&. For example, the control character BEL (^G, U+0007) is converted into the following HTML chunk:
.SH "BUGS"
.PP
Please report bugs at:
\m[blue]\fI\%https://github.com/jwilk/ocrodjvu/issues\fR\m[]
.SH "SEE ALSO"
.PP
\fBdjvu\fR(1),
\fBhocr2djvused\fR(1),
\fBocrodjvu\fR(1)
.SH "NOTES"
.IP " 1." 4
hOCR
.RS 4
\m[blue]\fI\%https://docs.google.com/View?docid=dfxcv4vc_67g844kf\fR\m[]
.RE
.IP " 2." 4
Unicode Text Segmentation
.RS 4
\m[blue]\fI\%http://unicode.org/reports/tr29/\fR\m[]
.RE