'\" t
.\"     Title: unicharset_extractor
.\"    Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets vsnapshot
.\"      Date: 03/26/2024
.\"    Manual: \ \&
.\"    Source: \ \&
.\"  Language: English
.\"
.TH "UNICHARSET_EXTRACTOR" "1" "03/26/2024" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
unicharset_extractor \- Reads box or plain text files to extract the unicharset\&.
.SH "SYNOPSIS"
.sp
\fBunicharset_extractor\fR [\-\-output_unicharset filename] [\-\-norm_mode mode] box_or_text_file [\&...]
.sp
Where mode means:
.br
1=combine graphemes (use for Latin and other simple scripts)
.br
2=split graphemes (use for Indic/Khmer/Myanmar)
.br
3=pure Unicode (use for Arabic/Hebrew/Thai/Tibetan)
.SH "DESCRIPTION"
.sp
Tesseract needs to know the set of possible characters it can output\&. To generate the unicharset data file, run unicharset_extractor on the bounding box files of the training pages, or on a plain text file:
.sp
.if n \{\
.RS 4
.\}
.nf
unicharset_extractor fontfile_1\&.box fontfile_2\&.box \&.\&.\&.
.fi
.if n \{\
.RE
.\}
.sp
The unicharset is written to the file \fI\&./unicharset\fR if no output filename is provided\&.
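.sp
As an illustrative sketch, the two options listed in the SYNOPSIS can be combined to name the output file and select a normalization mode\&. The file names \fIhin\&.training_text\fR and \fIhin\&.unicharset\fR are hypothetical placeholders, and mode 2 is used here only because the SYNOPSIS suggests it for Indic scripts:
.sp
.if n \{\
.RS 4
.\}
.nf
# hypothetical file names; mode 2 = split graphemes
unicharset_extractor \-\-output_unicharset hin\&.unicharset \-\-norm_mode 2 hin\&.training_text
.fi
.if n \{\
.RE
.\}
.sp
With \-\-output_unicharset given, the result should be written to the named file rather than to \fI\&./unicharset\fR\&.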
.sp
\fBNOTE\fR: Choose the norm_mode that matches the script of the language being trained (see SYNOPSIS)\&.
.SH "SEE ALSO"
.sp
tesseract(1), unicharset(5)
.sp
\m[blue]\fBhttps://tesseract\-ocr\&.github\&.io/tessdoc/Training\-Tesseract\&.html\fR\m[]
.SH "HISTORY"
.sp
unicharset_extractor first appeared in Tesseract 2\&.00\&.
.SH "COPYING"
.sp
Copyright (C) 2006, Google Inc\&. Licensed under the Apache License, Version 2\&.0\&.
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.