'\" t .\" Title: combine_tessdata .\" Author: [see the "AUTHOR" section] .\" Generator: DocBook XSL Stylesheets v1.79.1 .\" Date: 01/21/2019 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" .TH "COMBINE_TESSDATA" "1" "01/21/2019" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- .\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .\" http://bugs.debian.org/507673 .\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html .\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" ----------------------------------------------------------------- .\" * set default formatting .\" ----------------------------------------------------------------- .\" disable hyphenation .nh .\" disable justification (adjust text to left margin only) .ad l .\" ----------------------------------------------------------------- .\" * MAIN CONTENT STARTS HERE * .\" ----------------------------------------------------------------- .SH "NAME" combine_tessdata \- combine/extract/overwrite/list/compact Tesseract data .SH "SYNOPSIS" .sp \fBcombine_tessdata\fR [\fIOPTION\fR] \fIFILE\fR\&... .SH "DESCRIPTION" .sp combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact tessdata components in [lang]\&.traineddata files\&. .sp To combine all the individual tessdata components (unicharset, DAWGs, classifier templates, ambiguities, language configs) located at, say, /home/$USER/temp/eng\&.* run: .sp .if n \{\ .RS 4 .\} .nf combine_tessdata /home/$USER/temp/eng\&. .fi .if n \{\ .RE .\} .sp The result will be a combined tessdata file /home/$USER/temp/eng\&.traineddata .sp Specify option \-e if you would like to extract individual components from a combined traineddata file\&. For example, to extract language config file and the unicharset from tessdata/eng\&.traineddata run: .sp .if n \{\ .RS 4 .\} .nf combine_tessdata \-e tessdata/eng\&.traineddata \e /home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset .fi .if n \{\ .RE .\} .sp The desired config file and unicharset will be written to /home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset .sp Specify option \-o to overwrite individual components of the given [lang]\&.traineddata file\&. For example, to overwrite language config and unichar ambiguities files in tessdata/eng\&.traineddata use: .sp .if n \{\ .RS 4 .\} .nf combine_tessdata \-o tessdata/eng\&.traineddata \e /home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharambigs .fi .if n \{\ .RE .\} .sp As a result, tessdata/eng\&.traineddata will contain the new language config and unichar ambigs, plus all the original DAWGs, classifier templates, etc\&. .sp Note: the file names of the files to extract to and to overwrite from should have the appropriate file suffixes (extensions) indicating their tessdata component type (\&.unicharset for the unicharset, \&.unicharambigs for unichar ambigs, etc)\&. See k*FileSuffix variable in ccutil/tessdatamanager\&.h\&. .sp Specify option \-u to unpack all the components to the specified path: .sp .if n \{\ .RS 4 .\} .nf combine_tessdata \-u tessdata/eng\&.traineddata /home/$USER/temp/eng\&. .fi .if n \{\ .RE .\} .sp This will create /home/$USER/temp/eng\&.* files with individual tessdata components from tessdata/eng\&.traineddata\&. .SH "OPTIONS" .sp \fB\-c\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Compacts the LSTM component in the \&.traineddata file to int\&. .sp \fB\-d\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Lists directory of components from the \&.traineddata file\&. .sp \fB\-e\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Extracts the specified components from the \&.traineddata file .sp \fB\-o\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Overwrites the specified components of the \&.traineddata file with those provided on the command line\&. .sp \fB\-u\fR \fI\&.traineddata\fR \fIPATHPREFIX\fR Unpacks the \&.traineddata using the provided prefix\&. .SH "CAVEATS" .sp \fIPrefix\fR refers to the full file prefix, including period (\&.) .SH "COMPONENTS" .sp The components in a Tesseract lang\&.traineddata file as of Tesseract 4\&.0 are briefly described below; For more information on many of these files, see \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[] and \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\-4\&.00\fR\m[] .PP lang\&.config .RS 4 (Optional) Language\-specific overrides to default config variables\&. For 4\&.0 traineddata files, lang\&.config provides control parameters which can affect layout analysis, and sub\-languages\&. .RE .PP lang\&.unicharset .RS 4 (Required \- 3\&.0x legacy tesseract) The list of symbols that Tesseract recognizes, with properties\&. See unicharset(5)\&. .RE .PP lang\&.unicharambigs .RS 4 (Optional \- 3\&.0x legacy tesseract) This file contains information on pairs of recognized symbols which are often confused\&. For example, \fIrn\fR and \fIm\fR\&. .RE .PP lang\&.inttemp .RS 4 (Required \- 3\&.0x legacy tesseract) Character shape templates for each unichar\&. Produced by mftraining(1)\&. .RE .PP lang\&.pffmtable .RS 4 (Required \- 3\&.0x legacy tesseract) The number of features expected for each unichar\&. Produced by mftraining(1) from \fB\&.tr\fR files\&. .RE .PP lang\&.normproto .RS 4 (Required \- 3\&.0x legacy tesseract) Character normalization prototypes generated by cntraining(1) from \fB\&.tr\fR files\&. .RE .PP lang\&.punc\-dawg .RS 4 (Optional \- 3\&.0x legacy tesseract) A dawg made from punctuation patterns found around words\&. The "word" part is replaced by a single space\&. .RE .PP lang\&.word\-dawg .RS 4 (Optional \- 3\&.0x legacy tesseract) A dawg made from dictionary words from the language\&. .RE .PP lang\&.number\-dawg .RS 4 (Optional \- 3\&.0x legacy tesseract) A dawg made from tokens which originally contained digits\&. Each digit is replaced by a space character\&. .RE .PP lang\&.freq\-dawg .RS 4 (Optional \- 3\&.0x legacy tesseract) A dawg made from the most frequent words which would have gone into word\-dawg\&. .RE .PP lang\&.fixed\-length\-dawgs .RS 4 (Optional \- 3\&.0x legacy tesseract) Several dawgs of different fixed lengths \(em useful for languages like Chinese\&. .RE .PP lang\&.shapetable .RS 4 (Optional \- 3\&.0x legacy tesseract) When present, a shapetable is an extra layer between the character classifier and the word recognizer that allows the character classifier to return a collection of unichar ids and fonts instead of a single unichar\-id and font\&. .RE .PP lang\&.bigram\-dawg .RS 4 (Optional \- 3\&.0x legacy tesseract) A dawg of word bigrams where the words are separated by a space and each digit is replaced by a \fI?\fR\&. .RE .PP lang\&.unambig\-dawg .RS 4 (Optional \- 3\&.0x legacy tesseract) \&. .RE .PP lang\&.params\-model .RS 4 (Optional \- 3\&.0x legacy tesseract) \&. .RE .PP lang\&.lstm .RS 4 (Required \- 4\&.0 LSTM) Neural net trained recognition model generated by lstmtraining\&. .RE .PP lang\&.lstm\-punc\-dawg .RS 4 (Optional \- 4\&.0 LSTM) A dawg made from punctuation patterns found around words\&. The "word" part is replaced by a single space\&. Uses lang\&.lstm\-unicharset\&. .RE .PP lang\&.lstm\-word\-dawg .RS 4 (Optional \- 4\&.0 LSTM) A dawg made from dictionary words from the language\&. Uses lang\&.lstm\-unicharset\&. .RE .PP lang\&.lstm\-number\-dawg .RS 4 (Optional \- 4\&.0 LSTM) A dawg made from tokens which originally contained digits\&. Each digit is replaced by a space character\&. Uses lang\&.lstm\-unicharset\&. .RE .PP lang\&.lstm\-unicharset .RS 4 (Required \- 4\&.0 LSTM) The unicode character set that Tesseract recognizes, with properties\&. Same unicharset must be used to train the LSTM and build the lstm\-*\-dawgs files\&. .RE .PP lang\&.lstm\-recoder .RS 4 (Required \- 4\&.0 LSTM) Unicharcompress, aka the recoder, which maps the unicharset further to the codes actually used by the neural network recognizer\&. This is created as part of the starter traineddata by combine_lang_model\&. .RE .PP lang\&.version .RS 4 (Optional) Version string for the traineddata file\&. First appeared in version 4\&.0 of Tesseract\&. Old version of traineddata files will report Version string:Pre\-4\&.0\&.0\&. 4\&.0 version of traineddata files may include the network spec used for LSTM training as part of version string\&. .RE .SH "HISTORY" .sp combine_tessdata(1) first appeared in version 3\&.00 of Tesseract .SH "SEE ALSO" .sp tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), unicharambigs(5) .SH "COPYING" .sp Copyright (C) 2009, Google Inc\&. Licensed under the Apache License, Version 2\&.0 .SH "AUTHOR" .sp The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.