Jun 19, 2011

Regular Expressions as a Language

Unless you've had some experience with regular expressions, you won't understand the regular expression ^(From|Subject): from the last example, but there's nothing magic about it. For that matter, there is nothing magic about magic. The magician merely understands something simple which doesn't appear to be simple or natural to the untrained audience. Once you learn how to hold a card while making your hand look empty, you only need practice before you, too, can "do magic." Like a foreign language, it stops sounding like gibberish once you learn it.

1.2.1. The Filename Analogy

Since you have decided to use this book, you probably have at least some idea of just what a "regular expression" is. Even if you don't, you are almost certainly already familiar with the basic concept.
You know that report.txt is a specific filename, but if you have had any experience with Unix or DOS/Windows, you also know that the pattern "*.txt" can be used to select multiple files. With filename patterns like this (called file globs or wildcards), a few characters have special meaning. The star means "match anything," and a question mark means "match any one character." So, with the file glob "*.txt", we start with a match-anything * and end with the literal .txt, giving a pattern that means "select the files whose names start with anything and end with .txt".
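The star and question-mark behavior described above can be tried out with a small Python sketch. It uses the standard library's fnmatch module as a stand-in for a shell's globbing; the filenames and the select helper are made up for illustration, and real shells differ in details such as how they treat names that begin with a dot.

```python
from fnmatch import fnmatchcase

# Hypothetical filenames, invented for this illustration.
names = ["report.txt", "notes.txt", "report.doc", "a.txt", "ab.txt"]

def select(pattern, candidates):
    """Return the candidates that the file glob matches in full."""
    # fnmatchcase compares case-sensitively on every platform.
    return [name for name in candidates if fnmatchcase(name, pattern)]

star_matches = select("*.txt", names)      # * matches anything
question_matches = select("?.txt", names)  # ? matches exactly one character

print(star_matches)      # ['report.txt', 'notes.txt', 'a.txt', 'ab.txt']
print(question_matches)  # ['a.txt']
```

Note that "?.txt" selects only a.txt, because the question mark stands for exactly one character, while the star accepts a run of any length.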
Most systems provide a few additional special characters, but, in general, these filename patterns are limited in expressive power. This is not much of a shortcoming because the scope of the problem (to provide convenient ways to specify groups of files) is limited, well, simply to filenames.
On the other hand, dealing with general text is a much larger problem. Prose and poetry, program listings, reports, HTML, code tables, word lists... you name it. If a particular need is specific enough, such as "selecting files," you can develop some kind of specialized scheme or tool to help you accomplish it. However, over the years, a generalized pattern language has developed, which is powerful and expressive for a wide variety of uses. Each program implements and uses it differently, but in general, this powerful pattern language and the patterns themselves are called regular expressions.

1.2.2. The Language Analogy

Full regular expressions are composed of two types of characters. The special characters (like the * from the filename analogy) are called metacharacters, while the rest are called literal, or normal text characters. What sets regular expressions apart from filename patterns are the advanced expressive powers that their metacharacters provide. Filename patterns provide limited metacharacters for limited needs, but a regular expression "language" provides rich and expressive metacharacters for advanced uses.
It might help to consider regular expressions as their own language, with literal text acting as the words and metacharacters as the grammar. The words are combined with grammar according to a set of rules to create an expression that communicates an idea. In the email example, the expression I used to find lines beginning with 'From:' or 'Subject:' was ^(From|Subject): . The metacharacters in it are ^, the two parentheses, and |; we'll get to their interpretation soon.
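As a rough illustration of words and grammar working together, the following Python sketch applies the email expression ^(From|Subject): to a few header lines. The sample lines are invented for this example, and Python's re module is only one of many tools that accept this expression.

```python
import re

# Literal text: From, Subject, and the colon.
# Metacharacters: ^ (start of line), ( ) (grouping), | (alternation).
header = re.compile(r"^(From|Subject):")

# Hypothetical header lines, invented for this illustration.
lines = [
    "From: elvis@example.com",
    "Subject: be seen in town",
    "To: someone@example.com",
    "Reply-From: nobody@example.com",
]

matching = [line for line in lines if header.search(line)]
print(matching)  # ['From: elvis@example.com', 'Subject: be seen in town']
```

The last sample line contains "From:" but is not selected, because ^ requires the match to begin at the start of the line.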
As with learning any other language, regular expressions might seem intimidating at first. This is why they seem like magic to those with only a superficial understanding, and perhaps completely unapproachable to those who have never seen them at all. But, just as the Japanese sentence for "Regular expressions are easy!"† would soon become clear to a student of Japanese, the regular expression in
s!<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>!<inet>$1</inet>!

will soon become crystal clear to you, too.
† A somewhat humorous comment about this: as Chapter 3 explains, the term regular expression originally comes from formal algebra. When people ask me what my book is about, the answer "regular expressions" draws a blank face if they are not already familiar with the concept. The Japanese word for regular expression, 正規表現, means as little to the average Japanese as its English counterpart, but my reply in Japanese usually draws a bit more than a blank stare. You see, the "regular" part is unfortunately pronounced identically to a much more common word, a medical term for "reproductive organs." You can only imagine what flashes through their minds until I explain!

The substitution shown above is from a Perl language script that my editor used to modify a manuscript. The author had mistakenly used the typesetting tag <emphasis> to mark Internet IP addresses (which are sets of periods and numbers that look like 209.204.146.22). The incantation uses Perl's text-substitution command with the regular expression
<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>

to replace such tags with the appropriate <inet> tag, while leaving other uses of <emphasis> alone. In later chapters, you'll learn all the details of exactly how this type of incantation is constructed, so you'll be able to apply the techniques to your own needs, with your own application or programming language.
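For readers who would like to try the same idea outside Perl, here is a minimal sketch in Python. The regular expression is the one quoted above, unchanged; the retag function name and the sample sentence are assumptions made for this illustration and are not part of the editor's script.

```python
import re

# The pattern from the text: four dot-separated runs of digits wrapped in
# <emphasis> tags. Group 1 captures the whole address.
ip_in_emphasis = re.compile(r"<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>")

def retag(text):
    """Swap <emphasis> for <inet> around anything shaped like an IP address."""
    # \1 plays the role of Perl's $1. Unlike the Perl command as shown,
    # which replaces only the first match, re.sub replaces every match.
    return ip_in_emphasis.sub(r"<inet>\1</inet>", text)

# Hypothetical manuscript fragment, invented for this illustration.
sample = "Connect to <emphasis>209.204.146.22</emphasis>, it is <emphasis>very</emphasis> fast."
print(retag(sample))
# Connect to <inet>209.204.146.22</inet>, it is <emphasis>very</emphasis> fast.
```

The second <emphasis> pair is left alone because its content does not have the digits-and-periods shape the expression requires.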

1.2.2.1. The goal of this book

The chance that you will ever want to replace <emphasis> tags with <inet> tags is small, but it is very likely that you will run into similar "replace this with that" problems. The goal of this book is not to teach solutions to specific problems, but rather to teach you how to think regular expressions so that you will be able to conquer whatever problem you may face.
