# 📎 Towards Better Word
By Artyom Bologov <aartaka.me>

<assets/better-word.png> (Image with hand-drawn text, mostly repeating “WORD”. The “WORD” in the center is big and highlighted. The ones around it are smaller and scattered around the canvas. Every one of them has some special symbols (“!?_-$” etc.) appended. Brighter color “ARTYOM”, “BOLOGOV”, and “AARTAKA.ME” are also spread around.)

Armenian has 39 letters.
And even more if you count the historic ones.
(My favorite one is ‘ԱՒ’, making a sound in between ‘A’ and ‘W’.)
It’s a lot.
With that many, even a language as phonetically diverse as Armenian… has some letters unused.
Well, not unused, but rare.
Some are almost exclusively allocated to represent loan word sounds, like English soft ‘p’ or plain ‘o’.
(The letters in question are ‘Փ’ and ‘Օ’.)

I don’t find that bad or worthy of a language reform.
More than anything, I’m ready to embrace this diversity.
(However hard learning the Armenian alphabet and phonetics was.)

I find it more surprising that some languages don’t do that.
Toki Pona is extremely minimalist in its sound set, which might be good for a conlang.
But it’s terrible for loan words.
And loan words are an important way to extend the language, almost for free.
I realize that this is not a goal for Toki Pona, but the problem is there.
Inflexibility when exposed to other languages.

Now where am I going with that…
Right! Text editing!

Open your code editor and input this sequence of symbols into the buffer (copy-pasting is allowed):

===================================  ===================================
ident_name-re-enacted@here
==================== test test test ====================

What happens if you put the cursor at the start of this sequence and do forward-word (vi <kbd>w</kbd>, Emacs <kbd>M-f</kbd> key)?
It will almost inevitably land on the first dash.
Press it again? Next dash.
Press again? At-sign.
But… why didn’t it stop at the underscore?

That’s the concept of “word.”
It’s quite rich in linguistic associations, so it must be something smart.
Yet it’s boring in text editing: a “word” is a sequence of alphanumerics and underscores.
That’s it.
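To see that definition bite, here is a minimal Python sketch that splits the sample string the way a classic “word” would.
The function name `classic_words` is made up for illustration, and the explicit ASCII character class is an assumption: real editors and regex engines differ in the details (Unicode letters, configurable word characters, and so on).

```python
import re

# The classic "word": one or more ASCII alphanumerics or underscores.
# (Assumption: ASCII only; Python's own \w is Unicode-aware by default.)
CLASSIC_WORD = re.compile(r"[A-Za-z0-9_]+")


def classic_words(text):
    """Return every classic "word" found in text, left to right."""
    return CLASSIC_WORD.findall(text)


sample = "ident_name-re-enacted@here"
print(classic_words(sample))
# ['ident_name', 're', 'enacted', 'here']
```

The underscore glues `ident` and `name` together, while the dashes and the at-sign act as separators.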

You can see this notion of “word” everywhere:

• Most text editing widgets in most GUI frameworks
• Browser text boxes (I’ll let you in on a secret: these are the same as GUI text widgets)
• Regex (Perl et al. `\w` char class)
• Hashtags
• Usernames

This awful notion permeates everywhere.
It doesn’t account for any language but C (which vi was apparently written in/for).
And even most C-family languages have diverging notions of identifier / word / movement unit.
Here’s what’s allowed in identifiers in some programming languages:

JavaScript:
C-like, but with `$` allowed (remember jQuery?)
Perl:
Sigils starting identifiers! Like `$scalar`, `@array`, `%hash`, and `&fn`. (As an added irony, these are not recognized as integral words by Perl’s own `\w` regex)
BASIC:
Type-postfixed vars: `string$`, `double#`, `float!`, and `integer%`.
Lisp-family:
`|Anything|`, but conventionally `d-a-s-h-e-d`, `destructive!`, `predicate?`, `%internal%`, and `*global*`
CSS:
@nγ✝нiΝ9! <developer.mozilla.org/en-US/docs/Web/CSS/custom-ident>

See where I’m going with this Armenian vs. Toki Pona argument?
The text-editing concept of “word” is inflexible when it encounters the diversity of naming systems.
We need better.

“S-expressions!” shouts a Lisp freak in me.
And that’s a really good solution, because it implies a structure that “words” don’t have.
All the text you work on has structure, and yet you’re not using it.
A waste of resources.

But I think that s-expressions are not enough to dismantle “words.”
We need a better concept for atomic symbols / identifiers in our code / texts.
I introduce you to a Better Word™:

===================================  ed ===================================
\([[:alnum:]!#$%&*+/:;=?@^_`|~-]\{1,\}\)
==================== POSIX regex for a Better Word ====================
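For those who would rather poke at it outside of ed, here is a rough Python translation of the regex above.
It is a sketch under assumptions: `[[:alnum:]]` is approximated as ASCII letters and digits (in POSIX it depends on the locale), and the names `BETTER_WORD` and `better_words` are invented for this example.

```python
import re

# Rough Python rendering of the POSIX Better Word regex.
# Assumption: [[:alnum:]] is narrowed to ASCII letters and digits.
# The trailing "-" inside the class is a literal dash, as in the original.
BETTER_WORD = re.compile(r"[A-Za-z0-9!#$%&*+/:;=?@^_`|~-]+")


def better_words(text):
    """Return every Better Word found in text, left to right."""
    return BETTER_WORD.findall(text)


print(better_words("ident_name-re-enacted@here"))
# ['ident_name-re-enacted@here']

print(better_words("(set! *global* (predicate? $scalar))"))
# ['set!', '*global*', 'predicate?', '$scalar']
```

The whole sample identifier is now a single unit, and the Lisp, Perl, and BASIC-style decorations stay attached to their names, while parentheses and whitespace still separate words.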

I expect multiple complaints about it:

• “But why include exclamation and question marks? These are terminators in prose”
• “Arithmetic operators should separate words. Yes, even dash, even in Lisp”
• “Words cannot consist entirely of digits, these are numbers”
• “Why not include dots and commas?”

But the mere fact that we’re thinking about that shows: the concept of “word” is up for debate.
(If the “word” is indisputable to you, then you’re likely a vi+C—two separate words, no doubt!—extremist and I’m sorry for you.)
“Words” are inflexible, much like the phonetic simplicity of Toki Pona.
Let’s change that.
If not with Better Word™, then with something else.
Because languages and tools are ours to shape, not vice versa.


Copyright 2022-2025 Artyom Bologov (aartaka).
Any and all opinions listed here are my own and not representative of my employers; future, past and present.