5 (Wrong) Regex To Parse Parentheses
By Artyom BologovI'm trying to write Modal.ed, a compiler from Modal to ed(1) scripts. Which is quite a story of its own, I guess? I'll leave it for later.
One part of this project is reliably parsing Modal syntax. Here are some examples of Modal rules:
The general pattern (hopefully apparent from the snippets) is:
- Rules are starting with
<>
- Rules have a left hand side (pattern) and a right hand side (replacement.)
- Both sides are wrapped in parentheses, but parens are sometimes omitted for obvious single-symbol cases.
- The pattern/replacement can get as complex as the programmer wishes. But (hopefully) the parentheses are reliably balanced.
So all we need is to parse balanced parentheses in ed. Easy, right? This post is an exercise of desperation trying to parse parentheses with regex. A journey that has a happily ever after, unfortunately. But here's a bit of boring stuff first:
Pragmatic: Recursive Regex
The search for perfect regex led me to StackOverflow. To the "Regular expression to match balanced parentheses" question in particular. A place where the best minds of the industry were either:
- "No, it's not possible.", or
- Suggesting non-portable vendor-specific regex extensions.
.NET and Perl regex are suggested multiple times, with a recurring form of
Perl is nice, but There Is No Such Thing As The Regex, and I have a soft spot for ed regex, and POSIX BREs in particular. These don't have recursive matching. So I was bored out of my mind reading all these StackOverflow exercises in conformity. And I decided to prove them wrong: One does not need .NET or Perl to parse parentheses properly. POSIX BREs and text substitution are more than enough to work with parentheses and Modal rules.
Naive: It Is Balanced, Right?
So the simplest way to match parentheses is to... match parentheses. And nothing/anything else in between:
There are multiple cases where this regex breaks:
- (...) (...)
- Is matched as one piece due to regex greediness.
- ()))))))
- Is matched too, because we actually don't look for balance.
- ()(((((()
- A variation on the previous too, but ugly enough to include.
We need more structure to paren matching. And we'll have it:
Restrictive: I Will Be Correct Whatever the Cost
So we don't want stray parens and we need balanced parentheses. We can arrange exactly that:
So we match two parentheses and some number of non-parentheses inside. Which doesn't give us false positives like in the first iteration. But doesn't scale to more nested patterns.
Pedantic: Hardcoding Patterns
There has to be a way that covers most of the practical cases. I'm ready to cut corners, like too deeply nested parentheses. Which means most cases can be hard-coded:
The second and third ones look weird, but it's a way of saying:
there must be a monolythic symbol (like hello
)
or a parenthesised list of symbols (like (hello there)
) inside these two parentheses.
This is due to BREs having no alternation pattern, |
.
These rules can be conveniently rewritten in POSIX EREs
(that's the last time I employ such a bloated regex format in this post!)
Elegant, but less portable than BREs. I need to run it on BSD, UNIX, and GNU ed, so begone EREs!
Lispy: Looking for Tiger Tails
Another way to ensure the balance is to restrict the inputs. Say, we can require Modal rules to be Lisp-like in only having prefix operators. With all the argument at the tail. This would result in a dreaded "tiger tail" or parentheses at the end. But it also makes the parsing problem irrelevant: just match the tail and you're golden:
Most of Modal codebases are somewhat lispy already. So this regex has a higher success/LoC ratio than, for example, hardcoded rules. But I can see why one might be uneasy about it—it's not really structural. Unlike...
Recursive, But Not That One
The structure of the Modal rules is:
- It's a symbol or register.
- Or it's a list of symbols/registers.
- Or it's a list of symbols/registers and other lists.
- And so on.
So it dawned on me: we can isolate symbols, then use these to highlight the simple lists and then propagate up the list ladder. So the algorithm I came up with is:
- Wrap all simple symbols in Russian quotes: «». (Because they are distinctive and unique enough for parsing, and because I'm ethnically Russian.)
- If the list consists entirely of quoted forms or is empty, remove all quotes inside and wrap the list itself into quotes.
- Repeat until there's nothing to substitute and the quoted list is the toplevel one.
And here's (an arguably ugly) code that does this in my ed script:
You may say I'm cheating, because it's not only regex, it's also substitution rules. But I warned you! The results are relatively good:
- The parsed depth is at worst the number of copy-pasted ladders.
- And at best the number of inidividual substitutions.
- The average lies somewhere around the number I don't know how to compute, but it should be good enough.
- The algorithm would be perfect if there were looping constructs in ed. But there aren't. I might've used sed(1) for looping, but... I despise sed.
So I have my parsing rules happily chewing on Modal code. In Modal.ed, check it out! Thanks for lending me a ear and embracing the power of regex 🖤