C# Regular Expressions

by Brad Merrill
01/18/2001

Introduction

Regular expressions have been used in various programming languages and tools for many years. The .NET Base Class Libraries include a namespace and a set of classes for utilizing the power of regular expressions. They are designed to be compatible with Perl 5 regular expressions whenever possible.

The regexp classes also implement some additional functionality, such as named capture groups, right-to-left pattern matching, and expression compilation.

In this article, I'll provide a quick overview of the classes and methods of the System.Text.RegularExpressions namespace, some examples of matching and replacing strings, a more detailed walk-through of a grouping structure, and finally, a set of cookbook expressions for use in your own applications.

Presumed Knowledge Base

Regular expression knowledge seems to be one of those topics that most programmers have learned and forgotten, more than once. For the purposes of this article, I will presume some previous use of regular expressions, and specifically, some experience with their use within Perl 5, as a reference point. The .NET regexp classes are a superset of Perl 5 functionality, so this will serve as a good conceptual starting point.

I'm also presuming a basic knowledge of C# syntax and the .NET Framework environment.

If you are new to regular expressions, I suggest starting with some of the basic Perl 5 introductions. The perl.com site has some great resource materials and introductory tutorials.

The definitive work on regular expressions is Mastering Regular Expressions, by Jeffrey E. F. Friedl. For those who want to get the most out of working with regular expressions, I highly recommend this book.

The RegularExpressions Assembly

The regexp classes are contained in the System.Text.RegularExpressions.dll assembly, and you will have to reference the assembly at compile time in order to build your application. For example:

    csc /r:System.Text.RegularExpressions.dll foo.cs

will build the foo.exe assembly, with a reference to the System.Text.RegularExpressions assembly. In released versions of the .NET Framework, these classes ship in System.dll, which the C# compiler references by default, so the /r option is not needed there.

A Brief Overview of the Namespace

There are actually only six classes and one delegate definition in the assembly namespace. These are:

Capture: Contains the result of a single subexpression capture

CaptureCollection: A sequence of Capture objects

Group: The result of a single group capture, inherits from Capture

Match: The result of a single expression match, inherits from Group

MatchCollection: A sequence of Match objects

MatchEvaluator: A delegate for use during replacement operations

Regex: An instance of a compiled regular expression

The Regex class also contains several static methods:

Escape: Escapes regex metacharacters within a string

IsMatch: Returns a boolean indicating whether the supplied regular expression matches within the string

Match: Returns a Match instance for the first match

Matches: Returns all matches as a MatchCollection

Replace: Replaces the text matched by the regular expression with a replacement string

Split: Returns an array of strings, splitting the input at each match of the expression

Unescape: Unescapes any escaped characters within a string
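
As a quick illustration of a few of the static methods listed above, here is a minimal sketch. The class name StaticMethodsDemo and the helper SplitList are made up for this example; only the Regex calls themselves come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the Regex calls come from the framework.
public static class StaticMethodsDemo
{
    // Split a comma-separated list, discarding whitespace around the commas.
    public static string[] SplitList(string input)
    {
        return Regex.Split(input, @"\s*,\s*");
    }

    public static void Main()
    {
        // Escape makes metacharacters such as '+' match literally.
        string escaped = Regex.Escape("1+1=2");
        Console.WriteLine(escaped);                     // 1\+1=2
        Console.WriteLine(Regex.Unescape(escaped));     // 1+1=2

        // IsMatch only reports whether the pattern occurs in the input.
        Console.WriteLine(Regex.IsMatch("abracadabra", "cad"));     // True

        Console.WriteLine(string.Join("|", SplitList("a, b ,c")));  // a|b|c
    }
}
```

Escape is mostly useful when part of a pattern comes from user input and should be matched literally.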

Simple Matches

Let's start with simple expressions using the Regex and Match classes.

    Match m = Regex.Match("abracadabra", "(a|b|r)+");

You now have an instance of a Match that can be tested for success, as in:

    if (m.Success)
        ...

without even looking at the contents of the matched string.

If you want to use the matched string, you can simply convert it to a string:

    Console.WriteLine("Match="+m.ToString());

This example gives us the output:

Match=abra

which is the portion of the string that has been successfully matched.

Replacing Strings

Simple string replacements are very straightforward. For example, the statement:

  string s = Regex.Replace("abracadabra", "abra", "zzzz");

returns the string zzzzcadzzzz, in which all occurrences of the matching pattern are replaced by the replacement string zzzz.

Now let's look at a more complex expression:

  string s = Regex.Replace("  abra  ", @"^\s*(.*?)\s*$", "$1");

This returns the string abra, with leading and trailing spaces removed.

The above pattern is generally useful for removing leading and trailing whitespace from any string. It is written as a C# verbatim string literal. Within a verbatim string (@"..."), the compiler does not process \ as an escape character, which makes this form very useful for regular expressions that contain metacharacters escaped with a \. Also of note is the use of $1 as the replacement string. Besides literal text, a replacement string can contain substitutions such as $1, which refer to capture groups in the regular expression.
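
Substitutions can also refer to named capture groups. The following sketch wraps the trimming pattern in a method and adds a second replacement that uses named groups. The class ReplaceDemo, its method names, and the date format are assumptions made for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class illustrating numbered and named substitutions.
public static class ReplaceDemo
{
    // $1 refers to the first (numbered) capture group.
    public static string Trim(string input)
    {
        return Regex.Replace(input, @"^\s*(.*?)\s*$", "$1");
    }

    // (?<name>...) defines a named group; ${name} refers to it
    // in the replacement string.
    public static string ToIsoDate(string input)
    {
        return Regex.Replace(input,
            @"(?<month>\d{2})/(?<day>\d{2})/(?<year>\d{4})",
            "${year}-${month}-${day}");
    }

    public static void Main()
    {
        Console.WriteLine("[" + Trim("  abra  ") + "]");   // [abra]
        Console.WriteLine(ToIsoDate("01/18/2001"));        // 2001-01-18
    }
}
```

Named groups make the replacement string easier to read when an expression has more than two or three groups.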

Engine Details

Now let's try to understand a slightly more complex sample by doing a walk-through of a grouping structure. Given the following sample:

string text = "abracadabra1abracadabra2abracadabra3";
string pat = @"
    (		# start the first group
      abra	# match the literal 'abra'
      (		# start the second (inner) group
      cad	# match the literal 'cad'
      )?	# end the second (optional) group
    )		# end the first group
    +		# match one or more occurrences
    ";
// IgnorePatternWhitespace is the 'x' modifier: it ignores
// unescaped whitespace and # comments in the pattern
Regex r = new Regex(pat, RegexOptions.IgnorePatternWhitespace);
// get the list of group numbers
int[] gnums = r.GetGroupNumbers();
// get first match
Match m = r.Match(text);
while (m.Success)
  {
// start at group 1
  for (int i = 1; i < gnums.Length; i++)
    {
// get the group for this match
    Group g = m.Groups[gnums[i]];
    Console.WriteLine("Group"+gnums[i]+"=["+g.ToString()+"]");
// get caps for this group
    CaptureCollection cc = g.Captures;
    for (int j = 0; j < cc.Count; j++)
      {
      Capture c = cc[j];
      Console.WriteLine("\tCapture" + j + "=["+c.ToString()
         + "] Index=" + c.Index + " Length=" + c.Length);
      }
    }
// get next match
  m = m.NextMatch();
  }

The output of this sample is:

Group1=[abra]
        Capture0=[abracad] Index=0 Length=7
        Capture1=[abra] Index=7 Length=4
Group2=[cad]
        Capture0=[cad] Index=4 Length=3
Group1=[abra]
        Capture0=[abracad] Index=12 Length=7
        Capture1=[abra] Index=19 Length=4
Group2=[cad]
        Capture0=[cad] Index=16 Length=3
Group1=[abra]
        Capture0=[abracad] Index=24 Length=7
        Capture1=[abra] Index=31 Length=4
Group2=[cad]
        Capture0=[cad] Index=28 Length=3

Let's start by examining the string pat, which contains the regular expression. The first capture group begins at the first opening parenthesis, and the expression then matches the literal abra. The second capture group begins at the second opening parenthesis, while the first group is still open. Without the ?, this would mean that the first group must match abracad, and the second group just the cad. Because the ? metacharacter makes the cad group optional, the first group matches either abra or abracad. Finally, the first group is closed, and the + metacharacter asks the expression to match one or more occurrences of it.

Now let's examine what happens during the matching process. First, create an instance of the expression by calling the Regex constructor, which is also where you specify your options. In this case, I'm using the x option (RegexOptions.IgnorePatternWhitespace in released versions of the framework), because I have included comments and some formatting whitespace in the regular expression itself. With this option turned on, the engine ignores the comments and all whitespace that I have not explicitly escaped.
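
Options can be combined, and they are also where features such as right-to-left matching are switched on. Here is a minimal sketch, assuming the RegexOptions enumeration of the released framework. The class OptionsDemo and its method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the RegexOptions values are from the framework.
public static class OptionsDemo
{
    // Options are combined with the bitwise OR operator.
    public static string FirstWord(string input)
    {
        Regex r = new Regex(@"
            [a-z]+   # one or more letters (IgnoreCase also admits capitals)
            ", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
        return r.Match(input).Value;
    }

    // RightToLeft starts scanning at the end of the input,
    // so the first match found is the last number in the string.
    public static string LastNumber(string input)
    {
        return Regex.Match(input, @"\d+", RegexOptions.RightToLeft).Value;
    }

    public static void Main()
    {
        Console.WriteLine(FirstWord("  Hello world"));   // Hello
        Console.WriteLine(LastNumber("abc123def456"));   // 456
    }
}
```

RegexOptions.Compiled can be added in the same way when an expression is reused often enough to justify the extra start-up cost.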

Next, get the list of group numbers (gnums) defined in this regular expression. You could also have used these numbers explicitly, but this provides you with a programmatic method. This method is also useful if you have specified named groups, as a way of quickly indexing through the set of groups.

Next, perform the first match. Then enter a loop testing for success of the current match. The next step is to iterate through the list of groups starting at group 1. The reason you do not use group 0 in this sample is that group 0 is the fully captured match string, and what you usually (but not always) want to pick out of a string is a subgroup. You might use group 0 if you wanted to collect the fully matched string as a single string.

Within each group, iterate through the CaptureCollection. There is usually only one capture per match, per group, but in this case Group1 shows two captures: Capture0 and Capture1. If you had asked for only the ToString of Group1, you would have received abra, even though the group also matched abracad. The group's ToString value is always the value of the last Capture in its CaptureCollection. This is the expected behavior. If you want each match to cover a single occurrence (abracad or abra) instead, remove the + from the expression, which tells the regex engine to match the group just once.
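
The difference the + makes can be seen in a small sketch that reports the match, the value of group 1, and all of its captures for a given pattern. The class CaptureDemo and its Describe helper are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class comparing a group's value with its captures.
public static class CaptureDemo
{
    public static string Describe(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        Group g = m.Groups[1];

        // Collect every capture recorded for group 1 during this match.
        string[] caps = new string[g.Captures.Count];
        for (int i = 0; i < caps.Length; i++)
            caps[i] = g.Captures[i].Value;

        return "match=" + m.Value + " group1=" + g.Value
            + " captures=" + string.Join(",", caps);
    }

    public static void Main()
    {
        // With +, the group captures twice and reports the last capture.
        Console.WriteLine(Describe("(abra(cad)?)+", "abracadabra1"));
        // match=abracadabra group1=abra captures=abracad,abra

        // Without +, the group captures once per match.
        Console.WriteLine(Describe("(abra(cad)?)", "abracadabra1"));
        // match=abracad group1=abracad captures=abracad
    }
}
```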

Procedural-Based vs. Expression-Based

Generally, the users of regular expressions will tend to fall into one of two groups.

The first group tends to use minimal regular expressions that provide matching or grouping behaviors, and then write procedural code to perform some iterative behavior.

The second group tries to utilize the maximum power and functionality of the expression-processing engine itself, with as little procedural logic as possible.

For most of us, the best answer is somewhere in between, and I hope this article outlines both the capabilities of the .NET regexp classes and the trade-offs in complexity and performance between the two approaches.

Procedural-Based Patterns

A common processing need is to match certain parts of a string and perform some processing. So, here's an example that matches words within a string and capitalizes them:

string text = "the quick red fox jumped over the lazy brown dog.";
System.Console.WriteLine("text=[" + text + "]");
string result = "";
string pattern = @"\w+|\W+";

foreach (Match m in Regex.Matches(text, pattern))
  {
// get the matched string
  string x = m.ToString();	
// if the first char is lower case
  if (char.IsLower(x[0]))	
// capitalize it
    x = char.ToUpper(x[0]) + x.Substring(1, x.Length-1); 
// collect all text
  result += x;			
  }
System.Console.WriteLine("result=[" + result + "]");

As you can see, the C# foreach statement processes the set of matches found, in this case building up a new result string.

The output of the sample is:

text=[the quick red fox jumped over the lazy brown dog.]
result=[The Quick Red Fox Jumped Over The Lazy Brown Dog.]

Expression-Based Patterns

Another way to implement the above example is to provide a MatchEvaluator. Regex.Replace calls the delegate once for each match and substitutes the string it returns into the result.

So the new sample looks like:

class Test
  {
  static string CapText(Match m)
    {
// get the matched string
    string x = m.ToString();
// if the first char is lower case
    if (char.IsLower(x[0]))
// capitalize it
      return char.ToUpper(x[0]) + x.Substring(1, x.Length-1);
    return x;
    }

  static void Main()
    {
    string text = "the quick red fox jumped over the lazy brown dog.";
    System.Console.WriteLine("text=[" + text + "]");
    string pattern = @"\w+";
    string result = Regex.Replace(text, pattern,
      new MatchEvaluator(Test.CapText));
    System.Console.WriteLine("result=[" + result + "]");
    }
  }

The output is the same as before. Also of note is that the pattern was simplified, since I only needed to modify the words. Replace copies the non-word text through unchanged.
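
In later versions of C# that support lambda expressions, the evaluator can be written inline instead of as a separate method. This is a sketch under that assumption. The class LambdaDemo and the method Capitalize are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; assumes a C# compiler with lambda support.
public static class LambdaDemo
{
    public static string Capitalize(string text)
    {
        // The lambda is converted to a MatchEvaluator and runs once per word.
        return Regex.Replace(text, @"\w+",
            m => char.IsLower(m.Value[0])
                ? char.ToUpper(m.Value[0]) + m.Value.Substring(1)
                : m.Value);
    }

    public static void Main()
    {
        Console.WriteLine(Capitalize("the quick red fox"));   // The Quick Red Fox
    }
}
```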

Cookbook Expressions

To wrap up this overview of how regular expressions are used in the C# environment, I'll leave you with a set of useful expressions that have been used in other environments. I got them from a great book, the Perl Cookbook, by Tom Christiansen and Nathan Torkington, and updated them for C# programmers. I hope you find them useful.

Roman Numbers

string p1 = "^m*(d?c{0,3}|c[dm])"
  + "(l?x{0,3}|x[lc])(v?i{0,3}|i[vx])$";
string t1 = "vii";
Match m1 = Regex.Match(t1, p1);

Swapping First Two Words

string t2 = "the quick brown fox";
string p2 = @"(\S+)(\s+)(\S+)";
Regex x2 = new Regex(p2);
string r2 = x2.Replace(t2, "$3$2$1", 1);

Keyword = Value

string t3 = "myval = 3";
string p3 = @"(\w+)\s*=\s*(.*)\s*$";
Match m3 = Regex.Match(t3, p3);

Line of at Least 80 Characters

string t4 = "********************"
  + "******************************"
  + "******************************";
string p4 = ".{80,}";
Match m4 = Regex.Match(t4, p4);

MM/DD/YY HH:MM:SS

string t5 = "01/01/01 16:10:01";
string p5 =
  @"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)";
Match m5 = Regex.Match(t5, p5);
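
Once a match like this succeeds, the individual fields are available through the Groups collection. The following sketch pulls the six fields out as integers. The class TimestampDemo and the method ParseTimestamp are assumptions made for this example, and no range checking is attempted.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class that reads the six captured fields as integers.
public static class TimestampDemo
{
    // Returns month, day, year, hour, minute, second, or null if no match.
    public static int[] ParseTimestamp(string input)
    {
        Match m = Regex.Match(input,
            @"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)");
        if (!m.Success)
            return null;

        int[] fields = new int[6];
        // Group 0 is the whole match, so the fields start at group 1.
        for (int i = 0; i < fields.Length; i++)
            fields[i] = int.Parse(m.Groups[i + 1].Value);
        return fields;
    }

    public static void Main()
    {
        int[] f = ParseTimestamp("01/01/01 16:10:01");
        Console.WriteLine(f[3] + ":" + f[4] + ":" + f[5]);   // 16:10:1
    }
}
```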

Changing Directories (for Windows)

string t6 =
  @"C:\Documents and Settings\user1\Desktop\";
// only $ is special in a replacement string,
// so the backslashes there are not doubled
string r6 = Regex.Replace(t6,
  @"\\user1\\",
  @"\user2\");

Expanding (%nn) Hex Escapes

string t7 = "%41"; // capital A
string p7 = "%([0-9A-Fa-f][0-9A-Fa-f])";
// uses a MatchEvaluator delegate
string r7 = Regex.Replace(t7, p7,
  HexConvert);
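
The HexConvert method is not shown above. One possible version, offered only as a sketch of what such an evaluator might look like, converts the two captured hex digits into the corresponding character. The class HexDemo and the wrapper Expand are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; HexConvert here is one possible implementation.
public static class HexDemo
{
    public static string HexConvert(Match m)
    {
        // Groups[1] holds the two hex digits that follow the '%'.
        int code = Convert.ToInt32(m.Groups[1].Value, 16);
        return ((char)code).ToString();
    }

    public static string Expand(string input)
    {
        return Regex.Replace(input, "%([0-9A-Fa-f][0-9A-Fa-f])",
            new MatchEvaluator(HexConvert));
    }

    public static void Main()
    {
        Console.WriteLine(Expand("%41"));      // A
        Console.WriteLine(Expand("a%20b"));    // a b
    }
}
```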

Deleting C Comments (Imperfectly)

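string t8 = @"
/*
 * this is an old cstyle comment block
 */
";
string p8 = @"
  /\*  # match the opening delimiter
  .*?	 # match a minimal number of characters
  \*/	 # match the closing delimiter
";
// IgnorePatternWhitespace and Singleline are the 'x' and 's' modifiers
string r8 = Regex.Replace(t8, p8, "",
  RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);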

Removing Leading and Trailing Whitespace

string t9a = "   leading";
string p9a = @"^\s+";
string r9a = Regex.Replace(t9a, p9a, "");

string t9b = "trailing  ";
string p9b = @"\s+$";
string r9b = Regex.Replace(t9b, p9b, "");

Turning '\' Followed by 'n' Into a Real Newline

string t10 = @"\ntest\n";
string r10 = Regex.Replace(t10, @"\\n", "\n");

IP Address

string t11 = "55.54.53.52";
// \d\d? lets an octet have one digit as well as two or three
string p11 = "^" +
  @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
  @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
  @"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
  @"([01]?\d\d?|2[0-4]\d|25[0-5])" +
  "$";
Match m11 = Regex.Match(t11, p11);

Removing Leading Path from Filename

string t12 = @"c:\file.txt";
string p12 = @"^.*\\";
string r12 = Regex.Replace(t12, p12, "");

Joining Lines in Multiline Strings

string t13 = @"this is 
a split line";
string p13 = @"\s*\r?\n\s*";
string r13 = Regex.Replace(t13, p13, " ");

Extracting All Numbers from a String

string t14 = @"
test 1
test 2.3
test 47
";
string p14 = @"(\d+\.?\d*|\.\d+)";
MatchCollection mc14 = Regex.Matches(t14, p14);

Finding All Caps Words

string t15 = "This IS a Test OF ALL Caps";
string p15 = @"(\b[^\Wa-z0-9_]+\b)";
MatchCollection mc15 = Regex.Matches(t15, p15);

Finding All Lowercase Words

string t16 = "This is A Test of lowercase";
string p16 = @"(\b[^\WA-Z0-9_]+\b)";
MatchCollection mc16 = Regex.Matches(t16, p16);

Finding All Initial Caps

string t17 = "This is A Test of Initial Caps";
string p17 = @"(\b[^\Wa-z0-9_][^\WA-Z0-9_]*\b)";
MatchCollection mc17 = Regex.Matches(t17, p17);

Finding Links in Simple HTML

string t18 = @"
<html>
<a href=""first.htm"">first tag text</a>
<a href=""next.htm"">next tag text</a>
</html>
";
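string p18 = @"<A[^>]*?HREF\s*=\s*[""']?"
  + @"([^'"" >]+?)[ '""]?>";
// Singleline and IgnoreCase are the 's' and 'i' modifiers
MatchCollection mc18 = Regex.Matches(t18, p18,
  RegexOptions.Singleline | RegexOptions.IgnoreCase);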

Finding Middle Initial

string t19 = "Hanley A. Strappman";
string p19 = @"^\S+\s+(\S)\S*\s+\S";
Match m19 = Regex.Match(t19, p19);

Changing Inch Marks to Quotes

string t20 = @"2' 2"" ";
string p20 = "\"([^\"]*)";
string r20 = Regex.Replace(t20, p20, "``$1''");

Brad Merrill is a compiler technology engineer in the Microsoft Developer Relations Group. He has worked with the C# team for more than two years. He also has worked with other non-Microsoft language partners, including ActiveState, Perl, and Python, to help make additional languages available on the Microsoft .NET Framework. You can reach Brad at brad_merrill@hotmail.com.


O'Reilly & Associates will soon release (February 2001) C# Essentials.