Backreferences

One of the most important features of regular expressions is the ability to store a part of a matched pattern for later reuse. As you'll recall, placing parentheses around a regular expression pattern or part of a pattern causes the text matched by that part of the expression to be stored in a temporary buffer. You can suppress the saving of that part of the match by using the non-capturing metacharacters '?:', '?=', or '?!'.
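As a minimal sketch of the difference between a capturing and a non-capturing group, the following snippet runs the same pattern both ways. The variable names and the sample string "ababc" are illustrative only, and the code assumes a JavaScript engine in which exec returns an array holding the whole match followed by the captured submatches.

```javascript
// Capturing group: the text matched by the parentheses is saved as submatch 1.
var capturing = /(ab)+c/.exec("ababc");

// Non-capturing group: '?:' groups the pattern without saving a submatch.
var nonCapturing = /(?:ab)+c/.exec("ababc");

console.log(capturing.length);     // 2 (whole match plus one submatch)
console.log(capturing[1]);         // "ab"
console.log(nonCapturing.length);  // 1 (whole match only)
```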

Each captured submatch is stored as it is encountered from left to right in a regular expression pattern. The buffer numbers where the submatches are stored begin at 1 and continue up to a maximum of 99 subexpressions. Each buffer can be accessed using '\n', where n is one or two decimal digits identifying a specific buffer.
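The numbering can be sketched with a pattern that uses two buffers. The pattern and sample text below are illustrative assumptions, not part of the original examples: the first parenthesis fills buffer 1, the second fills buffer 2, and '\2\1' then requires the same two characters in reverse order.

```javascript
// Buffers are numbered left to right by opening parenthesis.
// (\w) -> buffer 1, (\w) -> buffer 2, then \2 and \1 reuse them in reverse.
var mirrored = /(\w)(\w)\2\1/;
var m = mirrored.exec("an abba song");

console.log(m[0]);  // "abba"
console.log(m[1]);  // "a" (buffer 1)
console.log(m[2]);  // "b" (buffer 2)
```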

One of the simplest, most useful applications of backreferences is locating two identical words that occur together in a text. Take the following sentence:

Is is the cost of of gasoline going up up?

As written, the sentence shown above clearly has a problem with several duplicated words. It would be nice to devise a way to fix that sentence without having to look for duplicates of every single word. The following JScript regular expression uses a single subexpression to do that.

/\b([a-z]+) \1\b/gi

The equivalent VBScript expression is:

"\b([a-z]+) \1\b"

The subexpression, in this case, is everything between the parentheses. That captured expression includes one or more alphabetic characters, as specified by '[a-z]+'. The second part of the regular expression is the reference to the previously captured submatch, that is, the second occurrence of the word just matched by the parenthetical expression. '\1' is used to specify the first submatch. The word boundary metacharacters ensure that only separate words are detected. If they were omitted, a phrase such as "is issued" or "this is" would be incorrectly identified by this expression.
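To make the role of the word boundaries concrete, the following sketch compares the pattern with and without '\b' on the two phrases mentioned above. The variable names are illustrative, and the global flag is left off so that test can be called repeatedly without carrying state between calls.

```javascript
var withBoundaries = /\b([a-z]+) \1\b/i;
var withoutBoundaries = /([a-z]+) \1/i;   // same pattern, '\b' removed

// A genuine duplicate is found either way.
console.log(withBoundaries.test("of of"));          // true

// Without '\b', partial words are reported as duplicates.
console.log(withoutBoundaries.test("is issued"));   // true  (false positive)
console.log(withoutBoundaries.test("this is"));     // true  (false positive)

// With '\b', only whole repeated words match.
console.log(withBoundaries.test("is issued"));      // false
console.log(withBoundaries.test("this is"));        // false
```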

In the JScript expression, the global flag ('g') following the regular expression indicates that the expression is applied to as many matches as it can find in the input string. Case insensitivity is specified by the 'i' flag at the end of the expression. A third flag, the multiline flag ('m'), makes '^' and '$' match at the beginning and end of each line rather than only at the beginning and end of the whole string; because this pattern uses neither anchor, the flag does not change its results. For VBScript, the flags cannot be set in the expression but must be explicitly set using properties of the RegExp object.
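The effect of each flag can be sketched by applying the same pattern to the sample sentence with different flag combinations. The variable names below are illustrative only.

```javascript
var text = "Is is the cost of of gasoline going up up?";

// 'i' only: replace stops after the first match.
var firstOnly = text.replace(/\b([a-z]+) \1\b/i, "$1");

// 'g' and 'i': every duplicated pair is replaced.
var all = text.replace(/\b([a-z]+) \1\b/gi, "$1");

// 'g' only: "Is is" is skipped because 'I' is not in [a-z].
var caseSensitive = text.replace(/\b([a-z]+) \1\b/g, "$1");

console.log(firstOnly);      // "Is the cost of of gasoline going up up?"
console.log(all);            // "Is the cost of gasoline going up?"
console.log(caseSensitive);  // "Is is the cost of gasoline going up?"
```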

Using the regular expression shown above, the following JScript code can use the submatch information to replace an occurrence of two consecutive identical words in a string of text with a single occurrence of the same word:

var ss = "Is is the cost of of gasoline going up up?.\n";
var re = /\b([a-z]+) \1\b/gim;       //Create regular expression pattern.
var rv = ss.replace(re,"$1");   //Replace two occurrences with one.

The closest equivalent VBScript code appears as follows:

Dim ss, re, rv
ss = "Is is the cost of of gasoline going up up?." & vbNewLine
Set re = New RegExp
re.Pattern = "\b([a-z]+) \1\b"
re.Global = True
re.IgnoreCase = True
re.MultiLine = True
rv = re.Replace(ss,"$1")

In the VBScript code, notice that the global, case-insensitivity, and multiline flags are set using the appropriately named properties of the RegExp object.

The $1 in the replacement string passed to the replace method refers to the first saved submatch. If you had more than one submatch, you'd refer to them consecutively by $2, $3, and so on.
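As a small sketch of a replacement that uses more than one submatch, the following snippet swaps two captured words. The name "Doe, Jane" and the variable names are made up for illustration.

```javascript
var name = "Doe, Jane";

// $1 is the first captured word, $2 the second; the replacement reverses them.
var swapped = name.replace(/(\w+), (\w+)/, "$2 $1");

console.log(swapped);  // "Jane Doe"
```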

Another way that backreferences can be used is to break down a Uniform Resource Identifier (URI) into its component parts. Assume that you want to break the following URI down into the protocol (ftp, http, etc.), the domain address, and the page/path:

http://msdn.microsoft.com:80/scripting/default.htm

The following regular expression provides that functionality. For JScript:

/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/

For VBScript:

"(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)"

The first parenthetical subexpression is designed to capture the protocol part of the web address. That subexpression matches any word that precedes a colon and two forward slashes. The second parenthetical subexpression captures the domain address part of the address. That subexpression matches one or more characters other than '/' or ':' (the leading '^' inside the brackets negates the character set; it is not itself one of the excluded characters). The third parenthetical subexpression captures a website port number, if one is specified. That subexpression matches zero or more digits following a colon. And finally, the fourth parenthetical subexpression captures the path and/or page information specified by the web address. That subexpression matches zero or more characters other than '#' or the space character.

Applying the regular expression to the URI shown above, the submatches contain the following:

RegExp.$1 contains "http"
RegExp.$2 contains "msdn.microsoft.com"
RegExp.$3 contains ":80"
RegExp.$4 contains "/scripting/default.htm"
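
The same submatches can also be read from the array returned by exec, where index 1 corresponds to $1, index 2 to $2, and so on. The following is a minimal sketch under that assumption; the variable names and the second URI (ftp://example.com/pub/file.txt) are hypothetical and serve only to show what happens when the optional port is absent.

```javascript
var pattern = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/;

var parts = pattern.exec("http://msdn.microsoft.com:80/scripting/default.htm");
console.log(parts[1]);  // "http"
console.log(parts[2]);  // "msdn.microsoft.com"
console.log(parts[3]);  // ":80"
console.log(parts[4]);  // "/scripting/default.htm"

// Hypothetical URI with no port: the optional third group does not participate.
var noPort = pattern.exec("ftp://example.com/pub/file.txt");
console.log(noPort[1]);  // "ftp"
console.log(noPort[2]);  // "example.com"
console.log(noPort[4]);  // "/pub/file.txt"
```

Because the third subexpression is optional, code that uses the port should check that the submatch is present before reading it.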