cmdTokenizer  object  tokens.t[66]

Command tokenizer for US English. Other language modules should provide their own tokenizers to allow for differences in punctuation and other lexical elements.

[Required]

cmdTokenizer :   Tokenizer

Superclass Tree   (in declaration order)

cmdTokenizer
        Tokenizer
                object

Summary of Properties  

endAssert  patAlphaDashAlpha  patPunct  patSpelledTens  patSpelledUnits  punctChars  rules_  squote  wordPunct 

Summary of Methods  

acceptAbbrTok  buildOrigText  tokCvtAbbr  tokCvtApostropheS  tokCvtSpelledNumber 

Inherited from Tokenizer :
deleteRule  deleteRuleAt  insertRule  insertRuleAt  tokCvtLower  tokCvtSkip  tokenize 

Properties  

endAssert  tokens.t[199]

end-of-token assertion

patAlphaDashAlpha  tokens.t[258]
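pattern matching two alphabetic words joined by a hyphen; used when splitting a spelled-out number into the part before the hyphen and the part after it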

patPunct  tokens.t[375]
no description available

patSpelledTens  tokens.t[371]
some pre-compiled regular expressions

patSpelledUnits  tokens.t[373]
no description available

punctChars  tokens.t[196]
token-separating punctuation marks, as an <alpha|x|y> pattern

rules_  OVERRIDDEN  tokens.t[74]
The list of tokenizing rules. This isn't actually required to be defined by the language module, since you *could* just use the default rules inherited from the base Tokenizer class, but it's likely that each language will have some quirks that require custom rules.
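
As a rough illustration of what an ordered rule list does, here is a minimal sketch in Python. It is not the library's TADS 3 code: the rule layout (name, pattern, token type, converter), the patterns, and the type tags are all assumptions chosen for the example. The point it shows is that rules are tried in order at the current position and the first match wins.

```python
import re

def cvt_skip(txt, typ, toks):
    """Converter that discards the matched text (used here for whitespace)."""

def cvt_lower(txt, typ, toks):
    """Converter that stores the matched text in lower case."""
    toks.append((txt.lower(), typ))

# Hypothetical rule list: (name, pattern, token type, converter or None).
# Order matters: earlier rules take priority over later ones.
RULES = [
    ("whitespace", re.compile(r"\s+"), None, cvt_skip),
    ("punctuation", re.compile(r"[.,;:?!]"), "punct", None),
    ("word", re.compile(r"[A-Za-z][A-Za-z0-9'-]*"), "word", cvt_lower),
    ("integer", re.compile(r"[0-9]+"), "int", None),
]

def tokenize(text):
    """Scan text left to right, applying the first rule that matches."""
    toks, pos = [], 0
    while pos < len(text):
        for _name, pat, typ, cvt in RULES:
            m = pat.match(text, pos)
            if m:
                if cvt is not None:
                    cvt(m.group(), typ, toks)
                else:
                    toks.append((m.group(), typ))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return toks

result = tokenize("Take 2 lamps.")
```

A language with different punctuation or word-internal characters would swap in its own patterns or add rules ahead of the generic ones, which is the kind of customization the description above has in mind.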

squote  tokens.t[206]
List of characters that count as a single quote mark. This includes the regular ASCII straight quote as well as the Unicode curly quotes. This is for pasting into an <alpha|x|y> pattern.

wordPunct  tokens.t[212]
list of acceptable punctuation marks within words; this is for pasting into an <alpha|x|y> pattern

Methods  

acceptAbbrTok (txt)  tokens.t[270]

Check to see if we want to accept an abbreviated token - this is a token that ends in a period, which we use for abbreviated words like "Mr." or "Ave." We'll accept the token only if it appears as given - including the period - in the dictionary. Note that we ignore truncated matches, since the only way we'll accept a period in a word token is as the last character; there is thus no way that a token ending in a period could be a truncation of any longer valid token.
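
The acceptance test can be pictured with a small Python sketch. This is an illustrative model only, not the library's implementation: the DICTIONARY set and the case-insensitive comparison are assumptions made for the example.

```python
# Hypothetical vocabulary; abbreviations are stored with their trailing period.
DICTIONARY = {"mr.", "ave.", "lamp", "lantern"}

def accept_abbr_tok(txt):
    """Accept a period-terminated token only on an exact dictionary match.

    No truncated (prefix) matching is attempted: the token must appear
    in the dictionary exactly as given, period included.
    """
    return txt.lower() in DICTIONARY

accepted = accept_abbr_tok("Mr.")       # exact entry, period included
rejected = accept_abbr_tok("lamp.")     # "lamp" is a word, but "lamp." is not
truncated = accept_abbr_tok("lant.")    # prefix of "lantern", still rejected
```

When the check fails, the period is presumably left for an ordinary punctuation rule to pick up, so "lamp." would still tokenize as a word followed by a period.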

buildOrigText (toks)  tokens.t[311]
Given a list of token strings, rebuild the original input string. We can't recover the exact input string, because the tokenization process throws away whitespace information, but we can at least come up with something that will display cleanly and produce the same results when run through the tokenizer.

[Required]
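
A minimal Python sketch of the rebuilding idea follows. It is a model under stated assumptions, not the library's code: the three patterns are stand-ins named after patPunct, patSpelledTens and patSpelledUnits, and the spacing policy (one space between tokens, none before punctuation, spelled numbers rejoined around their hyphen) is a guess at what "display cleanly" requires.

```python
import re

# Stand-in patterns for this sketch; the library's own patterns may differ.
PAT_PUNCT = re.compile(r"[.,;:?!]")
PAT_SPELLED_TENS = re.compile(
    r"twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety", re.IGNORECASE)
PAT_SPELLED_UNITS = re.compile(
    r"one|two|three|four|five|six|seven|eight|nine", re.IGNORECASE)

def build_orig_text(toks):
    """Rebuild a displayable string from a list of token strings."""
    out = ""
    i = 0
    while i < len(toks):
        # Rejoin a tens / hyphen / units triple into one hyphenated word.
        if (i + 2 < len(toks)
                and PAT_SPELLED_TENS.fullmatch(toks[i])
                and toks[i + 1] == "-"
                and PAT_SPELLED_UNITS.fullmatch(toks[i + 2])):
            piece = toks[i] + "-" + toks[i + 2]
            i += 3
        else:
            piece = toks[i]
            i += 1
        # Separate tokens with one space, but never put one before punctuation.
        if out and not PAT_PUNCT.fullmatch(piece):
            out += " "
        out += piece
    return out

text = build_orig_text(
    ["take", "twenty", "-", "one", "coins", ",", "then", "go", "north", "."])
```

The original whitespace is gone by this point, so the result is a normalized form of the input rather than a copy of it.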

tokCvtAbbr (txt, typ, toks)  tokens.t[290]
Process an abbreviated token.

When we find an abbreviation, we'll enter it with the abbreviated word minus the trailing period, plus the period as a separate token. We'll mark the period as an "abbreviation period" so that grammar rules will be able to consider treating it as an abbreviation -- but since it's also a regular period, grammar rules that treat periods as regular punctuation will also be able to try to match the result. This will ensure that we try it both ways - as abbreviation and as a word with punctuation - and pick the one that gives us the best result.
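
The split described above can be sketched in Python as follows. This is an illustration only: the (value, type) tuple layout and the type tag names are assumptions, and the library's own token representation may carry more information.

```python
# Hypothetical type tags for this sketch.
TOK_WORD = "word"
TOK_ABBR_PERIOD = "abbr-period"

def cvt_abbr(txt, typ, toks):
    """Append the word without its trailing period, then the period itself.

    The period gets its own type so that a grammar can read it either as
    part of an abbreviation or as ordinary sentence punctuation.
    """
    toks.append((txt[:-1], typ))
    toks.append((".", TOK_ABBR_PERIOD))

abbr_toks = []
cvt_abbr("Mr.", TOK_WORD, abbr_toks)
```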

tokCvtApostropheS (txt, typ, toks)  tokens.t[220]
Handle an apostrophe-s word. We'll return this as two separate tokens: one for the word preceding the apostrophe-s, and one for the apostrophe-s itself.
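
A small Python sketch of the two-token split is given below. It is a model, not the library's code: the type tags, the (value, type) tuple layout, and the SQUOTE character set (straight plus curly quotes, in the spirit of the squote property) are assumptions.

```python
import re

# Hypothetical type tags for this sketch.
TOK_WORD = "word"
TOK_APOSTROPHE_S = "apostrophe-s"

# Straight ASCII quote plus the Unicode left and right curly single quotes.
SQUOTE = "'\u2018\u2019"

def cvt_apostrophe_s(txt, typ, toks):
    """Append the base word, then the apostrophe-s as a separate token."""
    m = re.fullmatch(rf"(.+)([{SQUOTE}]s)", txt, flags=re.IGNORECASE)
    if m is None:
        # Not an apostrophe-s word after all; keep it as a single token.
        toks.append((txt, typ))
        return
    toks.append((m.group(1), typ))
    toks.append((m.group(2), TOK_APOSTROPHE_S))

poss_toks = []
cvt_apostrophe_s("Bob's", TOK_WORD, poss_toks)
```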

tokCvtSpelledNumber (txt, typ, toks)  tokens.t[244]
Handle a spelled-out hyphenated number from 21 to 99. We'll return this as three separate tokens: a word for the tens name, a word for the hyphen, and a word for the units name.
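
The three-token split can be sketched in Python like this. It is illustrative only: the pattern is a stand-in named after patAlphaDashAlpha, and the (value, type) tuple layout is an assumption.

```python
import re

# Stand-in for patAlphaDashAlpha: two alphabetic runs joined by a hyphen.
PAT_ALPHA_DASH_ALPHA = re.compile(r"([A-Za-z]+)-([A-Za-z]+)$")

def cvt_spelled_number(txt, typ, toks):
    """Append tens name, hyphen, and units name as three tokens of type typ."""
    m = PAT_ALPHA_DASH_ALPHA.match(txt)
    if m is None:
        # Not of the form word-word; keep it as a single token.
        toks.append((txt, typ))
        return
    toks.append((m.group(1), typ))   # part before the hyphen
    toks.append(("-", typ))          # the hyphen, entered as a word
    toks.append((m.group(2), typ))   # part after the hyphen

num_toks = []
cvt_spelled_number("twenty-one", "word", num_toks)
```

Entering the hyphen as a word token keeps the whole number visible to grammar rules as an ordinary word sequence.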

Adv3Lite Library Reference Manual
Generated on 25/04/2024 from adv3Lite version 2.0