There Is No Such Thing As The Regex

By Artyom Bologov Cold

I have a new challenge for myself: solving RosettaCode only using ed scripts. Not even sed, plain old ed. This makes me try to run as many versions of this venerable editor. Or at least read the manuals—it's hard to get BSD/Mac ed while on Guix. Several things I noticed:

I'm saying "essential", but who's to decide? Apparently, GNU, BSD, MacOS, and Plan 9 ed maintainers chose their regex version consciously. And, apparently, they thought their version is The Regex.

But there's no such thing as The Regex. There are only heaps of slightly incompatible syntaxes and extensions to the basic idea. And there's no basic idea either, as I'll try to show.

I'll use the chrestomathic example: parsing email addresses. Don't expect the RFC-compliant pattern, rather let's simplify this to essentials. Email address comes in three parts:

This should be enough to showcase the features of most dialects. But (here's a spoiler) it might be too simple for some regex to shine.

Regular Languages and Kleene Star

So they wanted to describe languages in the middle of previous century. The definition from Wikipedia is: "[Regular language] is a formal language that can be defined by a regular expression." Not making things clearer. Luckily, Wikipedia defines it more rigorously below. Here's my summary:

Ø
Empty language, nothingness.
a
Some symbol.
{a...}
A set of symbols.
X*
Something with Kleene star, repeated zero or more times.
∪ and •
Union and concatenation, set operations on sub-languages.

Quite different from the popular idea of regex as we know it, right? The only comforting resemblance is Kleene star. Let's try to define a regular expression describing emails in it.

{a..z, A..Z, 0..9, ...}*•{@}•{a..z, A..Z, 0..9, -}*•({.}•{a..z, A..Z, 0..9, -})*
Email addresses as regular language

Notice that these ellipses are just me being lazy and not wanting to list out every possible character. Because that's what happens when you mess with math people. No mercy.

"Regular expressions" as they initially were... look quite alien. But they'll fix it in the course of history, right?

Thompson & Basic POSIX/UNIX Regex

Ken Thompson fixed regex and turned them into something we're familliar with. The transitionary state can be seen in the Regular Expression Search Algorithm paper. There's still • operator, even if used internally. ∪ turns into a vertical bar |. And Kleene star remains as is. That's it really.

What followed is so-called Basic Regular Expressions. Notable additions:

One problematic thing is that one has to mark braces and parentheses by backslash. To highlight their special status. What about square brackets? No backslash. Inconsistent and error-prone.

Here's how POSIX Basic Regular Expressions version of email address might look like:

.{1,}@[-[:alnum:]]*(.[-[:alnum:]]*)*
Email address as POSIX Basic Regex

A lot of backslashes, ugh. It's only intuitive that one might want less backslashes. One superpower of BRE, though: backreferences! \n (backslash with a number) means referencing the previous regex group. The parenthesized one. Most of the flawors below lost this ability for some reason. A shame.

Plan 9 Regex

These are closer to initial Thompson ones, but add some quality of life features:

.+@[-a-zA-Z0-9]+(.[-a-zA-Z0-9]+)+
Email address as Plan 9

What Plan 9 regex miss is curly braces and spelled-out char classes. Yes, no quantified matches and convenient Unicode-aware matching. So overall Plan 9 regex are a syntactic evolution over BRE, but they miss a lot of semantic necessities. For whatever reason.

Extended POSIX Regex

The synthesis (not necessarily historical, but rather ideological) of BREs and Plan 9: ERE. Extended Regular Expressions. No backslashes, curly braces are back, as are char classes. That's where we get closer to conventional regex as we know them:

.+@[-[:alnum:]]+(.[-[:alnum:]]+)+
Email address as POSIX Basic Regex

GNU & Emacs Regex

GNU regex (as present in GNU ed, which I'm currently using, and in GNU Emacs as the main dissemination source) are extending over POSIX ones. Mainly via backslash sequences. Lots of them. Notable highlights:

\w
For any word (alphanumeric and underscore) char.
\W
For any NON-word char.
\b
For word boundary.
\B
For inside-word char boundary.
\s
For spaces.
\S
For NON-spaces.

You probably got the idea—some letter for char class, and its capital version for inversion or special case. No, wait, there are lots of special cases. I'm not even sure what's the heuristic. But still, lots of special sequences for fun and profit:

.+@w+(.w+)+
Email address as GNU regex

Our email pattern gets shorter and shorter, which is a win. (Ignore the fact that I'm excluding dashes from domain names now.) But it's also confined to GNU software, which sucks. Because there is other software. Like...

SQL & Postgres

SQL has these weird LIKE patterns:

an_thing -- _ stands for any single char
%prefixed -- % is any set of chars
suffixed% -- Same %, but ending a string now
%wrapped% -- A string containing 'wrapped' in any position
LIKE patterns in SQL

But these are unrelated to regex, right? They look so alien, after all. Unfortunately, there are systems where this alien syntax is valid regex. Like Postgres.

Postgres regex, as they call it, is a frivolous mix of POSIX and SQL LIKE. Period (.), our trusted friend, is missing in Postgres syntax, and is replaced by underscore (_). A soul-calming .* sequence is % instead. The rest is plain POSIX EREs.

%@[-a-zA-Z0-9]+(.[-a-zA-Z0-9]+)+
Email address as Postgres regex

Scheme Regex (SRFI 115)

Scheme (and Lisps in general) are famous for the Domain-Specific Languages. You can build a subset of a language that maps 1-to-1 to the problem domain. Which is a blessing, isn't it? But this also gives rise to really complex and Turing-complete languages one can conjure. Like Scheme Regular Expressions (SRE) described in SRFI 115.

These encode regular expressions as parenthesised lists. The patterns form trees of expressions that are parseable but somewhat verbose. And, I bet, they don't look like the idea of The Regex:

(: (* any) #@ (* (or #- alnum)) (+ (: #. (+ (or #- alnum)))))
(Untested) email pattern as a Scheme regex

Here's your portion of parentheses disgust for today. No need to thank me.

Perl Regex

I'll skip this section, because I'm unfamiliar with Perl. But what I know is that Perl regex/string processing it Turing-complete. They basically are Extended Extended Extended POSIX regex. Pretty with the basics, yet syntactically horrifying with advanced things (like lookaheads and recursive matching.)

I'm not listing an email example here, because it's going to look the same as GNU one. See? No longer a useful litmus test!

Modern Regex: Python, JS, .NET

Most modern languages and runtimes have their own regex libraries/syntax. Some (JS) even go as far as having a dedicated language syntax for regex literals. Given the ubiquity of regex, they must be similar, transferable, and still powerful, right? Only the latter is true.

See this StackOverflow question for a staggering showcase of the syntaxes modern RE systems boast. I'm linking to it in part because my example with email is no longer useful. Most modern RE systems will have something similar(ly short) to Perl/GNU regex.

Post Modern Regex: Any Pattern-Matching Language?

What about PEG? Parsing Expression Grammar, I mean. The thing specifying parsers. A parser for email might read:

.*@[-a-zA-Z0-9]+(.[-a-zA-Z0-9]+)+
A potential PEG parser for emails

Which looks almost like POSIX ERE, right? Yet it's called PEG. So one has two options here:

Either way, regex proves to be a complicated concept with no set syntax/definition. And any other pattern-matching language has a chance to be called regex, really. Be it scanf patterns, BNF/yacc parsers, or templating engines.

The Regex, Really?

Given the variety of syntaxes listed in this post, there seem to be no basic properties to regex. The only thing making appearance everywhere is Kleene star. The only one. Other "obvious" and "basic" things like . and {} are often missing or re-imagined. So, is there The Regex?

No.

A longer answer would be: There's no definitive spec/version/implementation. Some implementations, like Perl and GNU, are more successful/useful/influential. Some syntactic features appear more than the others. But there's no consensus and there are lots of weird outliers.

Use whatever you like and don't let anyone tell you it is not regex. (Just remember to mention which syntax/implementation you use when sharing your sed one-liners with people.)

Leave feedback! (via email)