Totally Objects - REST - Version 5.5 [1.0]

Introduction

This document explains how to install and use the TORest product from Totally Objects for IBM's VisualAge for Smalltalk. The product provides tools that let developers and application end-users use the power of Regular Expressions for complex text pattern matching and processing. Regular expressions are most commonly associated with tools such as grep and programming languages such as Perl. Totally Objects have combined the Regular Expression functionality from these and other tools to make a powerful Regular Expression toolset for IBM Smalltalk. The parser and engine can be configured to behave similarly to your favourite flavour of Regular Expression.

About Regular Expressions

Regular Expressions (also known as regexs) use a special pattern notation to provide powerful and flexible pattern matching capabilities. They are at the heart of convenient mechanisms for complex text and data processing. For a full explanation along with techniques for solving real problems, Totally Objects recommends "Mastering Regular Expressions" by Jeffrey E. F. Friedl (O'Reilly ISBN: 1-56592-257-3). The rest of this section describes the default regex notation used in this product. This notation can be changed, as described later in "Configuring the Rules".

This manual writes a regex, or part of one, unquoted (for example h[eau]llo) and writes a string being tested with the regex in single quotes (for example 'A big hallo to you').

The parts of a regex give an indication as to which characters in a test string should or should not match. For example, the regex h[eau]llo will match a character sequence in a test string starting with 'h' followed by an 'e', 'a' or 'u' followed by the sequence 'llo'. So if we try to match using the test string 'A big hallo to you' the regex matches the substring 'hallo'.

As you can see the item [eau] indicates a match of a single character against a group of potential candidates. This is known as a character-class. They can also be constructed using a 'range' notation, such as [0-9] meaning any digit, [a-zA-Z] meaning any letter and [0-9a-zA-Z] meaning any alphanumeric. There is also a negated character-class that will match any character not specified. The following notation is used: [^...]. For example, [^0-9] will match any non-digit character. Some commonly used character-classes can be expressed using a shorthand notation, for example . will match any character and \s will match any whitespace character.
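
As a rough illustration of character-classes, the sketch below uses #asTobRegularExpression and #replaceMatches:with:, both of which appear in the examples later in this manual. The test strings are made up for illustration, and the default parser rules are assumed.

```smalltalk
"Explicit candidates: 'h', then one of 'e', 'a' or 'u', then 'llo'."
'h[eau]llo' asTobRegularExpression
   replaceMatches: 'A big hallo to you'
   with: 'HELLO'.

"Negated character-class: each character that is not a digit is replaced."
'[^0-9]' asTobRegularExpression
   replaceMatches: 'a1b2'
   with: '-'.
```

By the rules described above, the first expression would be expected to answer 'A big HELLO to you' and the second '-1-2'.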

Many characters are treated literally within the regex when matching (such as h, l and o above) but others have special meanings and are called metacharacters. When a metacharacter needs to be treated literally it is usually escaped by prefixing it with the \ (backslash) character. Conversely, some ordinary characters become metacharacters when escaped. The table below shows common uses:
Character-classes
. dot Matches any character
\s Matches any whitespace character (ASCII 9, 10, 12, 13 and 32)
\S Matches any non-whitespace character
\d Matches any digit. This is the same as [0-9].
\D Matches any non-digit. This is the same as [^0-9].
\w Matches any word character. This is the same as [0-9a-zA-Z_] i.e. alphanumerics and underscore.
\W Matches any non-word character. This is the same as [^0-9a-zA-Z_].
Quantifiers
{min,max}, {num} The preceding item will be matched a minimum of min times and a maximum of max times. The value max may be omitted if there is no maximum. For example, \s{3,5} will match 3, 4 or 5 consecutive whitespace characters; \d{4,} will match 4 or more digits. The matching process is 'greedy' and will try to match as many characters (up to the maximum) as it can. If an exact number of matches is required then the max value and preceding comma should be omitted, giving the form {num}. For example, \D{3} is the same as \D\D\D.
? question mark Match zero or one occurrences of the preceding item. This is the same as {0,1}.
* star Match zero or more occurrences of the preceding item. This is the same as {0,}.
+ plus Match one or more occurrences of the preceding item. This is the same as {1,}.
Position Matching
^ caret This will match the start of a string.
$ dollar This will match the end of a string.
\< word boundary This will match the start of a word. I.e. when the next character is a word character (alphanumeric or underscore) and the previous character (if there is one) is not.
\> word boundary This will match the end of a word. I.e. when the previous character is a word character and the next character (if there is one) is not.
\b word boundary This will match the start or end of a word.
\B non-word boundary This will match a position that is not the start or end of a word.
Other Metacharacters
| alternation This will match either of the expressions it separates. For example, abc|xyz will match 'abc' in the test string 'abcdef' and 'xyz' in the test string 'uvwxyz'. The alternatives are attempted in the order they are written. Be careful when using the * quantifier in conjunction with alternation as it will always give a match (even if the match is of zero length). Later alternatives will therefore never be attempted.
(...) parentheses Parentheses provide a means of limiting scope for alternation, grouping items to share a quantifier and "capturing" matched text for backreferences.

Examples:

a\s(slice|loaf)\sof\sbread will match 'Give me a loaf of bread please' and 'Give me a slice of bread please'

(ollie\s){1,2}oxenfree will match 'ollie ollie oxenfree' and 'ollie oxenfree!!'

\ref backreference This will match the captured text within the parentheses given by decimal number ref (made up from digits and not starting with 0). Pairs of parentheses are numbered according to the order of the open parenthesis.

Examples:

width=(\d+)\sheight=\1 will match 'width=15 height=15' but not 'width=15 height=20'

It's\sa\s((cat|dog)-eat-\2|small)\sworld. will match 'It's a dog-eat-dog world.', 'It's a cat-eat-cat world.' and 'It's a small world.' but not 'It's a dog-eat-cat world.'
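
A backreference can also be combined with a capture-based replacement. The following is a minimal sketch that removes an accidentally doubled word; it relies on #replaceMatches:withCapturesIn:, which is shown in the "Replacing" examples later in this manual, and on the default rules in which \< and \> are word boundaries. The test string is hypothetical.

```smalltalk
"\<(\w+) captures a whole word, \s+ skips the gap, and \1\> demands
 the same word again, ending at a word boundary."
'\<(\w+)\s+\1\>' asTobRegularExpression
   replaceMatches: 'This is is a test'
   withCapturesIn: '%1'.
```

Under those assumptions the doubled 'is is' is replaced by the single captured word, so the expected answer is 'This is a test'.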

Unprintables and ASCII codes
The following metacharacters (and only the following) will have special meanings within the [ and ] (or [^ and ]) of a character-class.
\a alarm Matches ASCII 7
\b backspace Matches ASCII 8
\e escape Matches ASCII 27
\f form feed Matches ASCII 12
\n newline Matches ASCII 10 (but can be reconfigured)
\r carriage return Matches ASCII 13 (but can be reconfigured)
\t tab Matches ASCII 9
\v vertical tab Matches ASCII 11
\xhex hexadecimal ASCII code The value hex must comprise one or two valid hexadecimal characters (i.e. digits or uppercase A to F). The item matches the ASCII character given by the hexadecimal number hex. Note that this hexadecimal notation can also be used outside character-classes.
\octal octal ASCII code The value octal must comprise one, two or three valid octal characters (i.e. digits from 0 to 7). The item matches the ASCII character given by the octal number octal. Note that this octal notation can also be used outside character-classes. To avoid confusion with backreferences, either start the octal number with a 0 (zero) or use the hexadecimal notation described above instead. If the regex parser is configured not to allow backreferences then items such as \1 will be treated as octal.
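
The hexadecimal notation can be tried out with a small sketch such as the one below. It assumes the default rules (escapes are honoured inside character-classes) and uses #split:, which is described in the "Splitting" section. Hexadecimal 2C is ASCII 44 (comma) and 3B is ASCII 59 (semicolon); the test string is made up.

```smalltalk
"A character-class built from two hexadecimal escapes: comma or semicolon."
'[\x2C\x3B]' asTobRegularExpression
   split: 'alpha,bravo;charlie'.
```

This should behave like the regex ,|; used in the "Splitting" examples, so the expected answer is the Array #('alpha' 'bravo' 'charlie').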

Installing TORest

This Totally Objects product has been packaged as a configuration map. To import it into your library select 'Browse Configuration Maps' from the 'Tools' menu of the 'System Transcript'. In the 'Configuration Maps Browser' select 'Import...' from the 'Names' menu and select the file tobrest5-5_1-0-*.dat. You should then select the configuration map contained within this file.

To load the configuration map into your image select it in the 'Configuration Maps Browser' and select 'Load With Required Maps' from the 'Editions' menu.

Creating Regular Expression Objects

Regexs are expressed as Strings that are converted into an instance of TobRepRegularExpression using one of the following methods:

The first method takes an argument (an instance of TobRepParserProperties) defining the rules that the parser should follow. The second method is the same as the first but the default rules will be adopted.

If the receiver contains a syntax error then the exception ExTobRepSyntaxError (in the pool dictionary TobRepExceptions) will be signalled. The signal will have four arguments:

For convenience, the following two methods can also be sent to Strings to create regexs and handle syntax errors; however, they provide no mechanism to access the arguments described above.

In these cases the zero-argument Block aBlock is executed if a syntax error occurs (and the result answered).
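
As a minimal sketch of creating a regex with the default rules, the code below uses #asTobRegularExpression, the conversion message that appears throughout the examples in this manual, and then reuses the resulting object for two operations described in later sections. The variable name and test strings are illustrative only.

```smalltalk
| digits |
"Parse the String once with the default rules."
digits := '\d+' asTobRegularExpression.

"Reuse the same regex object for a replacement and for a split."
digits replaceMatches: 'Room 12, floor 3' with: '#'.
digits split: 'a1b22c333d'.
```

Given the greedy + quantifier, the replacement would be expected to answer 'Room #, floor #' and the split the Array #('a' 'b' 'c' 'd').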

Configuring the Rules

If you are new to regexs and are reading this document for the first time, you should skip this section.

The way the parser behaves is dependent on an instance of TobRepParserProperties that can be created using the class method #new. The following instance methods can be used to configure the properties:

Methods Description
allowsBackreferences, allowsBackreferences: If true then items of the form \1, \13 and \20 would match text previously matched between the 1st, 13th and 20th set of parentheses respectively. If false the digits following the \ would be treated as ASCII code in octal (as they would if the first digit were 0).
carriageReturn, carriageReturn: Gets and sets a Character that will match the item \r. This Character is known as carriage return. Note that it is not necessarily ASCII 13.
dotMatchesNewLine, dotMatchesNewLine: If true then the item . will match any Character. If false the Character answered by the method #newLine will not match.
isEscapedbBackspace, escapedbIsBackspace
isEscapedbLiteral, escapedbIsLiteral
isEscapedbWordBoundary, escapedbIsWordBoundary These get and set what \b will match:
  • backspace - (ASCII 8)
  • literal - the letter 'b' (ASCII 98)
  • word boundary - this matches a position in the text (rather than a character) showing the start or end of a word. See #underscoreIsWordCharacter for a description of word characters. Also, \B will match a position not on a word boundary.
Note that, whatever the setting, \b will match only backspace when used in a character-class (assuming character-classes support escapes).
escapedLessThanAndGreaterThanAreWordBoundaries, escapedLessThanAndGreaterThanAreWordBoundaries: If true then the items \< and \> match the start and end of words. If false they will match the literal characters '<' (ASCII 60) and '>' (ASCII 62).
escapesInCharacterClasses, escapesInCharacterClasses: If true then escaped characters have a special meaning in a character-class. If false the item \ just matches '\' (ASCII 92).

The supported escaped items in a character-class are:

  • \a - matches alarm (ASCII 7)
  • \b - matches backspace (ASCII 8)
  • \e - matches escape (ASCII 27)
  • \f - matches form feed (ASCII 12)
  • \n - matches newline (see method #newLine)
  • \r - matches carriage return (see method #carriageReturn)
  • \t - matches tab (ASCII 9)
  • \v - matches vertical tab (ASCII 11)
  • \xhex - Hexadecimal
  • \octal - Octal
  • All other escaped characters will match the character itself. Useful sequences are \\ and \] which match '\' (ASCII 92) and ']' (ASCII 93).
matchCaretAfterAnyNewLine, matchCaretAfterAnyNewLine: If true then the item ^ will match the position at the start of a String and any position immediately after a newline (see method #newLine). If false then it will only match the start of a String.
matchDollarBeforeAnyNewLine, matchDollarBeforeAnyNewLine: If true then the item $ will match the position at the end of a String and any position immediately before a newline (see method #newLine). If false then it will only match the end of a String.
negatedCharacterClassMatchesNewLine, negatedCharacterClassMatchesNewLine: If true then character-classes of the form [^...] will match a newline (see method #newLine).
newLine, newLine: Gets and sets a Character that will match the item \n. This Character is known as newline.
parenthesesEscaped, parenthesesEscaped: If true then \( and \) will be used to parenthesise items within an expression: ( and ) will match '(' (ASCII 40) and ')' (ASCII 41). If false then meanings are reversed.
quantifierBracesEscaped, quantifierBracesEscaped: If true then \{ and \} will be used to parenthesise range quantifiers: { and } will match '{' (ASCII 123) and '}' (ASCII 125). If false then meanings are reversed.
quantifiersEscaped, quantifiersEscaped: If true then \?, \* and \+ will be used to indicate the quantifiers 'zero-or-one', 'zero-or-more' and 'one-or-more': ?, * and + will match '?' (ASCII 63), '*' (ASCII 42) and '+' (ASCII 43). If false then meanings are reversed.
alternationEscaped, alternationEscaped: If true then \| will be used to separate alternate items: | will match '|' (ASCII 124). If false then meanings are reversed.
underscoreIsWordCharacter, underscoreIsWordCharacter: If true then '_' (ASCII 95) will be considered a word character. This will impact any items relating to word boundaries (\b, \B, \< and \>) and the items \w and \W.
whitespaceCharacters, whitespaceCharacters: Gets and sets a Collection of Characters that are considered whitespace. This will impact the items \s and \S.
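
The sketch below shows how a rule set might be assembled. It builds a hypothetical grep-like configuration in which grouping, range quantifiers and alternation are written with backslashes; every selector used is listed in the table above, but the particular combination is only an illustration.

```smalltalk
| properties |
properties := TobRepParserProperties new.
properties
   parenthesesEscaped: true;       "\( and \) group; ( and ) are literal"
   quantifierBracesEscaped: true;  "\{ and \} delimit range quantifiers"
   alternationEscaped: true;       "\| separates alternatives; | is literal"
   escapedLessThanAndGreaterThanAreWordBoundaries: true;
   escapedbIsWordBoundary.         "\b matches a word boundary"

"The getters answer the values just set."
properties parenthesesEscaped.
```

With these settings a regex such as \(ab\)\{2\} would use the escaped forms for grouping and counting. The properties object is then handed to the creation method that accepts a TobRepParserProperties argument, as outlined in "Creating Regular Expression Objects"; that selector is not repeated here.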

Using Regular Expressions

The TobRepRegularExpression class contains many utility methods for matching, substituting and splitting text. The methods for each of these activities are described below.

Matching

A regex can be compared to a test String to see whether it matches any sequence of Characters (a substring) within that String. Several methods are provided for doing these matches. Some of these methods answer an instance (or collection of instances) of TobReeMatchResult. The instance methods of this class are explained first:

Matching methods in TobRepRegularExpression:

By default, matches are case-sensitive. However, it is possible to do case-insensitive matches by changing a flag in the regex itself. The following methods are used:

The setting of the flag can also be tested using the methods #isCaseInsensitive and #isCaseSensitive.

Case-insensitive matching can sometimes be confusing. Take particular care when using backreferences: the text matched by a backreference does not have to have the same case as the text originally captured. Note that all methods described below that are based on matching will be affected by this flag.
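
The state of the flag can be inspected as in the following small sketch, which uses only the two test methods named above and assumes a regex freshly created with #asTobRegularExpression.

```smalltalk
| regex |
regex := 'hallo' asTobRegularExpression.

"Matching is case-sensitive by default, so true is expected here..."
regex isCaseSensitive.

"...and false is expected from the opposite test."
regex isCaseInsensitive.
```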

Splitting

TobRepRegularExpression uses the matching methods above to provide convenient ways to split a String into substrings. Have you ever wanted to use the String method #subStrings: but with a String as the separator, or even with multiple alternative separators (e.g. commas and semicolons)?

Examples:

The following items show the results of splitting the String 'The quick brown fox jumps over the lazy dog.' using the regex brown|lazy with each of the above methods.

Splitting at commas and semicolons:
',|;' asTobRegularExpression
   split: 'alpha,bravo;charlie,delta,echo;foxtrot'

...answers the Array
   #('alpha' 'bravo' 'charlie' 'delta' 'echo' 'foxtrot').
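
A variation on the example above, offered as a sketch with a made-up test string: surrounding the separator with \s* lets the split also absorb any whitespace around each comma or semicolon.

```smalltalk
"Optional whitespace, then a comma or semicolon, then optional whitespace."
'\s*[,;]\s*' asTobRegularExpression
   split: 'alpha , bravo;charlie ,delta'.
```

Because the whitespace is part of each match, it is discarded with the separators, and the expected answer is the Array #('alpha' 'bravo' 'charlie' 'delta').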

Substituting

Methods for substituting matches, or captures from a single match, into another String provide a convenient mechanism for reformatting data.

Note: #bindWithArguments:
In addition to the matching methods described above the substitution methods use the String method #bindWithArguments: from the CLDT Application. This works by replacing the Character sequences %1, %2, ..., %9 in the receiver with the Strings given in the argument.

For example, executing
   '%2,%1' bindWithArguments: #('alpha' 'beta')
will answer
   'beta,alpha'.

The substitution methods are described below:

Examples:

Reordering the comma-separated elements in a String and removing the whitespace:
'([^,]*)\s*,\s*([^,]*)\s*,\s*([^,]*)' asTobRegularExpression
   substituteCaptures: 'Matt Sims, Totally Objects, UK'
   in: '%2,%1,%3'

...answers the String
   'Totally Objects,Matt Sims,UK'

Extracting email addresses:
'\w+@\w+(\.\w+)+' asTobRegularExpression
   substituteMatches: 'My email addresses, mattsims@totallyobjects.com, matt@totallyobjects.com and msims@totallyobjects.com.'
   in: 'Second email address is %2 and the first is %1'

...answers the String
   'Second email address is matt@totallyobjects.com and the first is mattsims@totallyobjects.com'

Replacing

The replacement methods replace matched substrings in the test String with another String. Unlike the substitution methods described above, they throw away the matches and keep the substrings between them.

Examples:

Replacing all digits with the letter X:
'\d' asTobRegularExpression
   replaceMatches: 'The credit card number is 1234 5678 9876 5432.'
   with: 'X'

...answers the String
   'The credit card number is XXXX XXXX XXXX XXXX.'

Place an underscore before and after every word (i.e. replacing the word boundary):
'\b' asTobRegularExpression
   replaceMatches: 'The cat sat on the mat.'
   with: '_'

...answers the String
   '_The_ _cat_ _sat_ _on_ _the_ _mat_.'

Replacing tags in HTML:
'(</?)[bB]>' asTobRegularExpression
   replaceMatches: '<html>Do <b>this</b> and <b>that</b></html>'
   withCapturesIn: '%1i>'

...answers the String
   '<html>Do <i>this</i> and <i>that</i></html>'
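
One more sketch in the same style, using a hypothetical test string: #replaceMatches:withCapturesIn: can reorder the captures of every match, here turning year-month-day dates into day/month/year.

```smalltalk
"Three captures: four-digit year, two-digit month, two-digit day."
'(\d{4})-(\d{2})-(\d{2})' asTobRegularExpression
   replaceMatches: 'Released 1999-03-15 and updated 1999-11-02.'
   withCapturesIn: '%3/%2/%1'.
```

Each date is rewritten from its own captures while the text between the matches is kept, so the expected answer is 'Released 15/03/1999 and updated 02/11/1999.'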

Technical Details and Limitations

The TORest regex engine follows the two main rules required of such engines:

This means that when matching the regex a+|b+ to the test string 'xxaaaxxbbbbbbxx' the match will be 'aaa' (the earlier substring) and not 'bbbbbb'. Note also that as + is greedy it will match as many times as it can (i.e. 3) rather than settling for the minimum of 1. Watch out when using the * quantifier as this might match a zero-length substring at the beginning of the test string if the rest of the regex allows. For example, a*|b* matches a zero-length string at the very beginning of the test string 'xxaaaxxbbbbbbxx', before the first 'x'. TORest does not support the 'non-greedy' versions of the quantifiers *?, +?, ?? and friends.
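
The greedy behaviour can be observed with a short sketch that uses #replaceMatches:with: from the "Replacing" section. Note that this method replaces every match in turn, not just the first.

```smalltalk
"Each run of 'a' or run of 'b' is consumed whole by the greedy + quantifier."
'a+|b+' asTobRegularExpression
   replaceMatches: 'xxaaaxxbbbbbbxx'
   with: '#'.
```

If + matched only one character at a time, every 'a' and 'b' would be replaced separately. With greedy matching each run is a single match, so the expected answer is 'xx#xx#xx'.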

TORest is a Traditional NFA (or Nondeterministic Finite Automaton) Engine. This means that the way the regexs are constructed can have a significant impact on the speed of execution. It does not support 'look-ahead' or 'negative-look-ahead' matching.

TORest does not support the POSIX character-class metasequences [:alnum:] and friends. It also does not support Perl5 non-capturing parentheses of the form (?:...).


Further Information