.\" Text automatically generated by txt2man
.TH PDFSANDWICH 1 "04 April 2020" "" ""
.SH NAME
\fBpdfsandwich \fP- A generator for sandwich OCR pdfs from scanned pdf files
.SH SYNOPSIS
.nf
.fam C
\fBpdfsandwich\fP [\fIoptions\fP] inputfile.pdf
.fam T
.fi
.SH DESCRIPTION
\fBpdfsandwich\fP generates "sandwich" OCR pdf files: pdf files which contain
only images (no text) are processed by optical character recognition (OCR),
and the recognized text is added to each page invisibly "behind" the images.
Note that \fBpdfsandwich\fP needs the following programs: unpaper, convert, gs,
hocr2pdf (for tesseract < 3.03), and tesseract.
As tesseract >= 3.03 can write pdf files, hocr2pdf is only needed for older
versions of tesseract.
For more information, visit http://www.tobias-elze.de/pdfsandwich.
.SH OPTIONS
.TP
.B
\fB-convert\fP
\fB-convert\fP filename : name of convert binary (default: convert)
.TP
.B
\fB-coo\fP
\fB-coo\fP \fIoptions\fP : additional convert \fIoptions\fP; make sure to quote; e.g.
\fB-coo\fP "\fB-normalize\fP \fB-black-threshold\fP 75%"
.br
call convert \fB--help\fP or man convert for all convert \fIoptions\fP
.TP
.B
\fB-debug\fP
keep all temporary files in /tmp (for debugging)
.TP
.B
\fB-enforcehocr2pdf\fP
use hocr2pdf even if tesseract >= 3.03
.TP
.B
\fB-first_page\fP
\fB-first_page\fP number : number of page to start OCR from (default: 1)
.TP
.B
\fB-gray\fP
use grayscale for images (default: black and white)
.TP
.B
\fB-grayfilter\fP
enable unpaper's gray filter; further \fIoptions\fP can be set by \fB-unpo\fP
.TP
.B
\fB-gs\fP
\fB-gs\fP filename : name of gs binary (default: gs); optional, only required for resizing
.TP
.B
\fB-hocr2pdf\fP
\fB-hocr2pdf\fP filename : name of hocr2pdf binary (default: hocr2pdf); ignored for tesseract >= 3.03 unless option \fB-enforcehocr2pdf\fP is set
.TP
.B
\fB-hoo\fP
\fB-hoo\fP \fIoptions\fP : additional hocr2pdf \fIoptions\fP; make sure to quote
.TP
.B
\fB-identify\fP
\fB-identify\fP filename : name of identify binary (default: identify)
.TP
.B
\fB-last_page\fP
\fB-last_page\fP number : number of page up to which to process OCR (default: number of pages in inputfile)
.TP
.B
\fB-lang\fP
\fB-lang\fP language : language of the text; option to tesseract (default: eng)
.br
e.g: eng, deu, deu-frak, fra, rus, swe, spa, ita, \.\.\.
.br
see option \fB-list_langs\fP; Multiple languages may be specified, separated by plus characters.
.TP
.B
\fB-layout\fP
\fB-layout\fP { single | double | none } : layout of the scanned pages; requires unpaper
.br
single: one page per sheet
.br
double: two pages per sheet
.br
none: no auto-layout (default)
.TP
.B
\fB-list_langs\fP
list currently available languages and exit; in case of custom binaries of tesseract, place this after the \fB-tesseract\fP option
.TP
.B
\fB-maxpixels\fP
\fB-maxpixels\fP NUM : maximal number of pixels allowed for the input file;
if (resolution/72)^2 * width * height > maxpixels, the page of the input file is scaled down prior to OCR so that its size in pixels corresponds to maxpixels;
default: 17415167 (A3 @ 300 dpi)
.TP
.B
\fB-noimage\fP
do not place the image over the text (requires hocr2pdf; ignored without \fB-enforcehocr2pdf\fP option)
.TP
.B
\fB-nopreproc\fP
do not preprocess with unpaper
.TP
.B
\fB-nthreads\fP
\fB-nthreads\fP number : number of parallel threads (default: guessed number of CPUs; if guessing fails: 1)
.TP
.B
\fB-o\fP
\fB-o\fP filename : output file; default: inputfile_ocr.pdf (if extension is different from .pdf, original extension is kept)
.TP
.B
\fB-omp_thread_limit\fP
\fB-omp_thread_limit\fP number : number of threads tesseract may use for each page (default: 1)
.TP
.B
\fB-pagesize\fP
\fB-pagesize\fP { original | NUMxNUM } : set page size of output pdf (requires ghostscript)
.br
original: same as input file (default)
.br
NUMxNUM: width x height in pixels (e.g.
for A4: \fB-pagesize\fP 595x842)
.TP
.B
\fB-pdfinfo\fP
\fB-pdfinfo\fP filename : name of pdfinfo binary (default: pdfinfo)
.TP
.B
\fB-pdfunite\fP
\fB-pdfunite\fP filename : name of pdfunite binary (default: pdfunite)
.TP
.B
\fB-resolution\fP
\fB-resolution\fP NUM : resolution (dpi) used for OCR (default: 300)
.TP
.B
\fB-rgb\fP
use RGB color space for images (default: black and white); use with care: causes problems with some color spaces
.TP
.B
\fB-sloppy_text\fP
sloppily place text, group words, do not draw single glyphs; ignored for tesseract >= 3.03 unless option \fB-enforcehocr2pdf\fP is set
.TP
.B
\fB-tesseract\fP
\fB-tesseract\fP filename : name of tesseract binary (default: tesseract)
.TP
.B
\fB-tesso\fP
\fB-tesso\fP \fIoptions\fP : additional tesseract \fIoptions\fP; make sure to quote
.TP
.B
\fB-unpaper\fP
\fB-unpaper\fP filename : name of unpaper binary (default: unpaper)
.TP
.B
\fB-unpo\fP
\fB-unpo\fP \fIoptions\fP : additional unpaper \fIoptions\fP; make sure to quote
.TP
.B
\fB-quiet\fP
suppress output
.TP
.B
\fB-verbose\fP
produce more output
.TP
.B
\fB-version\fP
print version and quit
.TP
.B
\fB-help\fP
Display this list of \fIoptions\fP
.TP
.B
\fB--help\fP
Display this list of \fIoptions\fP
.SH LANGUAGES
Via tesseract, numerous language packages are available; see
http://code.google.com/p/tesseract-ocr/downloads/list for a complete list.
Here is an incomplete selection of supported languages and their abbreviations:
.PP
ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces (Czech),
chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), chr (Cherokee),
dan (Danish), dan-frak (Danish (Fraktur)), deu (German), ell (Greek),
eng (English), enm (Middle English), epo (Esperanto), est (Estonian),
fin (Finnish), fra (French), frm (Middle French), glg (Galician), heb (Hebrew),
hin (Hindi), hrv (Croatian), hun (Hungarian), ind (Indonesian), ita (Italian),
jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch),
nor (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus (Russian),
slk (Slovakian), slv (Slovenian), sqi (Albanian), spa (Spanish), srp (Serbian),
swe (Swedish), tam (Tamil), tel (Telugu), tgl (Tagalog), tha (Thai),
tur (Turkish), ukr (Ukrainian), vie (Vietnamese)
.PP
Multiple languages may be specified, separated by plus characters.
Note that the respective tesseract language package needs to be installed on
your system to be usable by \fBpdfsandwich\fP.
Option \fB-list_langs\fP lists the languages which are available on your system.
.SH AVAILABILITY
Sources and packages as well as comprehensive help can be found at
http://www.tobias-elze.de/pdfsandwich.
.SH AUTHOR
Tobias Elze
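.SH EXAMPLES
The size check described under \fB-maxpixels\fP can be sketched in shell as follows.
This is only an illustration of the documented condition, not the code \fBpdfsandwich\fP itself runs.
The helper names page_pixels and exceeds_maxpixels are hypothetical, and width and height are assumed to be given in points (1/72 inch), as the factor 72 in the formula suggests.
.PP
.nf
.fam C
```shell
# Sketch only: these helpers are hypothetical and not part of pdfsandwich.

# Pixel count as in the -maxpixels description:
# (resolution/72)^2 * width * height, truncated to whole pixels.
# Arguments: WIDTH HEIGHT RESOLUTION (width and height assumed to be in points)
page_pixels() {
    awk -v w="$1" -v h="$2" -v r="$3" 'BEGIN { print int((r / 72) ^ 2 * w * h) }'
}

# Exit status 0 if the page exceeds the limit, non-zero otherwise.
# The optional fourth argument defaults to the documented -maxpixels default.
exceeds_maxpixels() {
    awk -v w="$1" -v h="$2" -v r="$3" -v m="${4:-17415167}" 'BEGIN { exit !((r / 72) ^ 2 * w * h > m) }'
}

# An A4 page (595x842, as in the -pagesize example) at 300 and at 600 dpi.
for dpi in 300 600; do
    if exceeds_maxpixels 595 842 "$dpi"; then
        echo "A4 at $dpi dpi: $(page_pixels 595 842 "$dpi") pixels, above the default limit"
    else
        echo "A4 at $dpi dpi: $(page_pixels 595 842 "$dpi") pixels, within the default limit"
    fi
done
```
.fam T
.fi
.PP
Under these assumptions an A4 page stays within the default limit at 300 dpi but exceeds it at 600 dpi.
In that case the page would be scaled down before OCR unless a larger value is passed with \fB-maxpixels\fP.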