NAME
string::token - Regex based iterative lexing
SYNOPSIS
package require Tcl 8.5
package require string::token ?1?
package require fileutil
::string token text lex string
::string token file lex path
::string token chomp lex startvar string resultvar
DESCRIPTION
This package provides commands for regular expression based lexing
(tokenization) of strings.
The complete set of procedures is described below.
- ::string token text lex string
- This command takes an ordered dictionary lex mapping regular
expressions to labels, and tokenizes the string according to this
dictionary.
The result of the command is a list of tokens, where each token is a
3-element list of label, start- and end-index in the string.
The command will throw an error if it is not able to tokenize the whole
string.
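As an illustration of the description above, here is a minimal sketch of calling ::string token text. The lexing dictionary, its labels, and the input string are made up for this example, and the sketch assumes that the end index in each token is inclusive, so that it can be passed directly to string range.

```tcl
package require Tcl 8.5
package require string::token

# Hypothetical ordered lexing dictionary: regular expression -> label.
set lex {
    {[0-9]+}                 number
    {[A-Za-z_][A-Za-z0-9_]*} ident
    {[ \t]+}                 space
    {[-+*/=]}                op
}

set input "x = 42"
set tokens [::string token text $lex $input]

# Each token is a 3-element list: label, start index, end index.
foreach token $tokens {
    lassign $token label start end
    # Assumption: the end index is inclusive.
    puts [format "%-7s %d..%d  %s" $label $start $end \
              [string range $input $start $end]]
}

# No pattern above covers "?", so the whole string cannot be
# tokenized and the command throws an error.
if {[catch {::string token text $lex "x = ?"} msg]} {
    puts "lexing failed: $msg"
}
```

Catching the error, as in the last step, is one way to report untokenizable input without aborting the calling script.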
- ::string token file lex path
- This command is a convenience wrapper around ::string token text
above, and fileutil::cat, enabling the easy tokenization of whole
files. Note that this command loads the file wholly into memory
before starting to process it.
If the file is too large for this mode of operation, a command based
directly on ::string token chomp below will be necessary.
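The following sketch shows how ::string token file might be used. The sample file, its content, and the lexing dictionary are hypothetical; the file is created with fileutil::tempfile and fileutil::writeFile only so that the example is self-contained. Because the token indices refer to the file content, the sketch reads the content once more with fileutil::cat to display the matched text, again assuming inclusive end indices.

```tcl
package require Tcl 8.5
package require fileutil
package require string::token

# Hypothetical ordered lexing dictionary: regular expression -> label.
set lex {
    {[0-9]+} number
    {[a-z]+} word
    {\s+}    space
}

# Create a small sample file (made-up content for this sketch).
set path [fileutil::tempfile lexdemo]
fileutil::writeFile $path "abc 123\n"

# Tokenize the whole file; it is loaded into memory in one go.
set tokens [::string token file $lex $path]

# Read the content again to show the text behind each token.
set content [fileutil::cat $path]
foreach token $tokens {
    lassign $token label start end
    puts [list $label [string range $content $start $end]]
}

file delete $path
```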
- ::string token chomp lex startvar string resultvar
- This command is the workhorse underlying ::string token text
above. It is exposed to enable users to write their own lexers, which may,
for example, apply different lexing dictionaries according to some internal
state.
The command takes an ordered dictionary lex mapping regular
expressions to labels, a variable startvar which indicates where to
start lexing in the input string, and a result variable
resultvar to extend.
The result of the command is a tri-state numeric code indicating one of:
- 0
- No token found.
- 1
- Token found.
- 2
- End of string reached.
Note that recognition of a token from lex is started at the
character index in startvar.
If a token was recognized (status 1) the command will update the
index in startvar to point to the first character of the
string past the recognized token, and it will further extend the
resultvar with a 3-element list containing the label associated
with the regular expression of the token, and the start- and
end-character-indices of the token in string.
Neither startvar nor resultvar will be updated if no token is
recognized at all.
Note that the regular expressions are applied (tested) in the order they are
specified in lex, and the first matching pattern stops the process.
Because of this, it is recommended to list the patterns in lex in order
from the most specific to the most general.
Further note that all regex patterns are implicitly prefixed with the
constraint escape \A to ensure that a match starts exactly at the
character index found in startvar.
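To make the status codes and the pattern ordering more concrete, here is a minimal sketch of a custom lexer loop built on ::string token chomp. The dictionary, labels, and input are invented for the example. Skipping an unrecognized character on status 0 is a choice made by this sketch, not something the package does by itself. The loop does not depend on how many tokens a single call consumes; it only reacts to the returned status.

```tcl
package require Tcl 8.5
package require string::token

# Hypothetical lexing dictionary. The specific keyword pattern comes
# before the general identifier pattern, because the first matching
# pattern wins.
set lex {
    {if\M}   keyword
    {[a-z]+} ident
    {[0-9]+} number
    {\s+}    space
}

set input   "if x 10 ?"
set start   0
set tokens  {}
set skipped {}

while {1} {
    set status [::string token chomp $lex start $input tokens]
    if {$status == 2} break      ;# end of string reached
    if {$status == 1} continue   ;# token appended, start advanced
    # Status 0: no pattern matched at $start and neither variable was
    # changed. Record the offending index and step over one character.
    lappend skipped $start
    incr start
}

puts "tokens:  $tokens"
puts "skipped: $skipped"
```

If the keyword and ident entries were swapped, the input "if" would be labeled ident, since the more general pattern would then match first. A stateful lexer could follow the same loop structure and pick a different dictionary for each call, depending on its current state.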
BUGS, IDEAS, FEEDBACK
This document, and the package it describes, will undoubtedly contain bugs and
other problems. Please report such in the category textutil of the
Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also
report any ideas for enhancements you may have for either package and/or
documentation.
KEYWORDS
lexing, regex, string, tokenization
CATEGORY
Text processing