'\" t
.\"     Title: djvu2hocr
.\"    Author: Jakub Wilk
.\" Generator: DocBook XSL Stylesheets v1.76.1
.\"      Date: 03/10/2012
.\"    Manual: djvu2hocr manual
.\"    Source: djvu2hocr 0.7.9
.\"  Language: English
.\"
.TH "DJVU2HOCR" "1" "03/10/2012" "djvu2hocr 0\&.7\&.9" "djvu2hocr manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
djvu2hocr \- DjVu to hOCR converter
.SH "SYNOPSIS"
.HP \w'\fBdjvu2hocr\fR\ 'u
\fBdjvu2hocr\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBdjvu2hocr\fR\ 'u
\fBdjvu2hocr\fR {\fB\-\-version\fR | \fB\-\-help\fR | \fB\-h\fR}
.SH "DESCRIPTION"
.PP
djvu2hocr converts hidden text from a DjVu file to the
\m[blue]\fIhOCR\fR\m[]\&\s-2\u[1]\d\s+2
format\&.
.SH "OPTIONS"
.SS "Text segmentation options"
.PP
\fB\-\-word\-segmentation=simple\fR
.RS 4
Use the same word segmentation as found in the DjVu file\&.
.sp
This is the default\&.
.RE
.PP
\fB\-\-word\-segmentation=uax29\fR
.RS 4
Use the
\m[blue]\fIUnicode Text Segmentation\fR\m[]\&\s-2\u[2]\d\s+2
algorithm to break lines into words, possibly fixing word segmentation found in the DjVu file\&.
.RE
.SS "Other options"
.PP
\fB\-\-version\fR
.RS 4
Output version information and exit\&.
.RE
.PP
\fB\-h\fR, \fB\-\-help\fR
.RS 4
Display help and exit\&.
.RE
.SH "PORTABILITY"
.PP
djvu2hocr uses a custom extension to hOCR to retain characters which cannot be directly represented in an HTML/XML document\&. For example, the control character BEL (^G, U+0007) is converted into the following HTML chunk:
.SH "SEE ALSO"
.PP
\fBdjvu\fR(1)
.SH "AUTHOR"
.PP
\fBJakub Wilk\fR <\&jwilk@jwilk\&.net\&>
.RS 4
Author.
.RE
.SH "NOTES"
.IP " 1." 4
hOCR
.RS 4
\m[blue]\fI\%http://docs.google.com/View?docid=dfxcv4vc_67g844kf\fR\m[]
.RE
.IP " 2." 4
Unicode Text Segmentation
.RS 4
\m[blue]\fI\%http://unicode.org/reports/tr29/\fR\m[]
.RE
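.SH "EXAMPLES"
.PP
The PORTABILITY section refers to characters which cannot be directly represented in an HTML/XML document\&. As a rough illustration only, the Python sketch below tests code points against the Char production of XML 1\&.0, assuming that production is what makes a character such as BEL (U+0007) unrepresentable\&. The names is_xml_char and unrepresentable are hypothetical helpers introduced here; they are not part of djvu2hocr and do not describe how it is implemented\&.
.PP
.nf
```python
# Sketch only: both helpers are hypothetical and not part of djvu2hocr.
def is_xml_char(ch):
    """Return True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)        # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF         # BMP below the surrogate range
        or 0xE000 <= cp <= 0xFFFD       # BMP above the surrogate range
        or 0x10000 <= cp <= 0x10FFFF    # supplementary planes
    )


def unrepresentable(text):
    """List the code points in text that XML 1.0 cannot carry directly."""
    return ["U+%04X" % ord(ch) for ch in text if not is_xml_char(ch)]
```
.fi
.PP
Under that assumption, calling unrepresentable on a string containing BEL reports U+0007, the control character named in the PORTABILITY section\&.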