5 (Wrong) Regex To Parse Parentheses

By Artyom Bologov

IMAGE_ALT

I'm trying to write Modal.ed, a compiler from Modal to ed(1) scripts. Which is quite a story of its own, I guess? I'll leave it for later.

One part of this project is reliably parsing Modal syntax. Here are some examples of Modal rules:

<> ((unwrap ?x)) (unwrap ?x)
<> (q ?x) (q ?x)
<> (`?: ?0 ?1) ((Int ?:))
<> (?: print ') (?:)
<> (iterating ?x ?y ?n (1) ?a ?b) (?((?:) ?:) \s\s)
<> (?0 ?1 `?:) ?:
<> rule0 data
Modal rules in all their diversity

The general pattern (hopefully apparent from the snippets) is:

So all we need is to parse balanced parentheses in ed. Easy, right? This post is an exercise of desperation trying to parse parentheses with regex. A journey that has a happily ever after, unfortunately. But here's a bit of boring stuff first:

Pragmatic: Recursive Regex

The search for perfect regex led me to StackOverflow. To the "Regular expression to match balanced parentheses" question in particular. A place where the best minds of the industry were either:

.NET and Perl regex are suggested multiple times, with a recurring form of

\((?:[^)(]+|(?R))*+\)
Perl recursive regex for parentheses matching

Perl is nice, but There Is No Such Thing As The Regex, and I have a soft spot for ed regex, and POSIX BREs in particular. These don't have recursive matching. So I was bored out of my mind reading all these StackOverflow exercises in conformity. And I decided to prove them wrong: One does not need .NET or Perl to parse parentheses properly. POSIX BREs and text substitution are more than enough to work with parentheses and Modal rules.

Naive: It Is Balanced, Right?

So the simplest way to match parentheses is to... match parentheses. And nothing/anything else in between:

(.*)
Implying parentheses are balanced, because they must be?

There are multiple cases where this regex breaks:

(...) (...)
Is matched as one piece due to regex greediness.
()))))))
Is matched too, because we actually don't look for balance.
()(((((()
A variation on the previous too, but ugly enough to include.

We need more structure to paren matching. And we'll have it:

Restrictive: I Will Be Correct Whatever the Cost

So we don't want stray parens and we need balanced parentheses. We can arrange exactly that:

([^()]*)
A regex to match a parenthesised list with no nested lists

So we match two parentheses and some number of non-parentheses inside. Which doesn't give us false positives like in the first iteration. But doesn't scale to more nested patterns.

Pedantic: Hardcoding Patterns

There has to be a way that covers most of the practical cases. I'm ready to cut corners, like too deeply nested parentheses. Which means most cases can be hard-coded:

 # Level one
([^()]*)
 # Level two
(\([^() ]\{0,1\}\(([^()]*)\)\{0,1\}\)*)
 # Level three
(\([^() ]\{0,1\}\((\([^() ]\{0,1\}\(([^()]*)\)\{0,1\}\)*)\)\{0,1\}\)*)
 # Et cetera
Hard-coded nested parentheses regex

The second and third ones look weird, but it's a way of saying: there must be a monolythic symbol (like hello) or a parenthesised list of symbols (like (hello there)) inside these two parentheses. This is due to BREs having no alternation pattern, |. These rules can be conveniently rewritten in POSIX EREs (that's the last time I employ such a bloated regex format in this post!)

\(([^()]+|\([^()]*\))*\)
POSIX ERE for two-levels-deep parentheses

Elegant, but less portable than BREs. I need to run it on BSD, UNIX, and GNU ed, so begone EREs!

Lispy: Looking for Tiger Tails

Another way to ensure the balance is to restrict the inputs. Say, we can require Modal rules to be Lisp-like in only having prefix operators. With all the argument at the tail. This would result in a dreaded "tiger tail" or parentheses at the end. But it also makes the parsing problem irrelevant: just match the tail and you're golden:

((*[^)]*)*)
Lispy parentheses matching

Most of Modal codebases are somewhat lispy already. So this regex has a higher success/LoC ratio than, for example, hardcoded rules. But I can see why one might be uneasy about it—it's not really structural. Unlike...

Recursive, But Not That One

The structure of the Modal rules is:

So it dawned on me: we can isolate symbols, then use these to highlight the simple lists and then propagate up the list ladder. So the algorithm I came up with is:

And here's (an arguably ugly) code that does this in my ed script:

 # Wrap every symbol/nil into quotes
g/[^() ]\{1,\}/s//«&»/g
g/()/s//«&»/g
 # Actual parsing: if the current layer of parens is all quoted, quote it and unquote contents
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6 \7 \8 \9)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6 \7 \8)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6 \7)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5 \6)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4 \5)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3 \4)»/g
g/(«\([^»]*\)» «\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2 \3)»/g
g/(«\([^»]*\)» «\([^»]*\)»)/s//«(\1 \2)»/g
g/(«\([^»]*\)»)/s//«(\1)»/g
 # Repeating second time for recursive case
 # Ad infinitum
Recursive parentheses matching

You may say I'm cheating, because it's not only regex, it's also substitution rules. But I warned you! The results are relatively good:

So I have my parsing rules happily chewing on Modal code. In Modal.ed, check it out! Thanks for lending me a ear and embracing the power of regex 🖤

Leave feedback! (via email)