Reverse-engineering Prose From Internet Lingo

Amateurly hand-drawn thumbnail. On it, there’s regular text: “2day im rcnstrctin ltrlly proz f notin,” and on top of it, in brighter color, “Today I’m Reconstructing Literary Proze from Nothing,” as if correcting the text below it. In the corners, attributions to “Artёm / Artyom Bologov” and “aartaka.me” are visible.

So we all communicate over the Internet. And we don’t really care much about punctuation or capitalization. Which is fine, I guess? But what if someone (like me) wanted a proper literary text instead of this all-lowercase-no-punctuation soup? Well, why not reverse-engineer that soup into prose?

Alternative villain origin story: I used to make this website in Lisp. Like, text too, as raw symbols in nested lists denoting HTML. But Lisp is case-insensitive and has too much syntax for my taste. So I had to come up with a number of heuristics to manage that. And reconstruct proper prose from code.

Anyway, I’m into building literary prose from nothing. In this post, I’m building some from raw text with ed(1) scripts. You can preview both: the original text of this post in the re-prose.phtm file, and the script reconstructing the text in scripts/re-prose.ed. It's all built with ed(1), still.

Escaping Literals #

Now, my painful experience taught me that algorithms on text need one thing. Escaping. Because if your algorithm eats something that should’ve stayed literal… It’s bad.

So, for this post, I’m prefixing lines that should not be converted with a greater-than sign. I also allow myself to ignore HTML tags. Because I don’t want to accidentally break the markup.

So this regex pattern repeats throughout this post. Because we only need to act on the lines intended for conversion.

g/^[^<>]...

Only act on non-tag lines not prefixed by a greater-than sign
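
For readers without ed(1) at hand, the same guard can be sketched in Python. This is only an illustration of the idea, not the script the site is built with; the name `is_convertible` is made up here.

```python
import re

# Same idea as the ed address pattern: the line must start with a
# character that is neither "<" (an HTML tag) nor ">" (the escape prefix).
CONVERTIBLE = re.compile(r"^[^<>]")


def is_convertible(line: str) -> bool:
    """Return True for lines the heuristics are allowed to rewrite."""
    return bool(CONVERTIBLE.match(line))


if __name__ == "__main__":
    for sample in ["<p>", "so this is prose", ">keep me literal", ""]:
        print(repr(sample), is_convertible(sample))
```

Note that an empty line is not convertible either, because the pattern needs at least one character to look at.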

Reconstructing Sentences #

That’s an easy one (at least if we observe semantic line breaks): just add a period to every line that has no terminator.

g/^[^<>].*[^;—:,.!?…]$/s//&./

Appending periods everywhere suitable
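
A rough Python equivalent of that command, as a sketch under the assumption that each line arrives without its trailing newline. `add_period` is a hypothetical helper name.

```python
import re

# Mirrors the ed pattern: first character is not "<" or ">",
# last character is not already a terminator or joining punctuation.
NEEDS_PERIOD = re.compile(r"^[^<>].*[^;—:,.!?…]$")


def add_period(line: str) -> str:
    """Append a period to convertible lines that lack a terminator."""
    return line + "." if NEEDS_PERIOD.match(line) else line


if __name__ == "__main__":
    for sample in ["that's an easy one", "already done!", "<p>", ">g/foo/"]:
        print(add_period(sample))
```

Like the ed pattern, this needs at least two characters on the line, so a single-letter line is left alone.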

Then comes capitalization. The heuristic is simple: for every line ending with a sentence terminator, capitalize the first letter of the next line:

g/^[^<>].*[.!?…]$/+1s/^š/Š/g\
s/^č/Č/g\
s/^ž/Ž/g\
s/^a/A/g\
s/^b/B/g\
s/^c/C/g\
s/^d/D/g\
s/^e/E/g\
s/^f/F/g\
s/^g/G/g\
s/^h/H/g\
s/^i/I/g\
s/^j/J/g\
s/^k/K/g\
s/^l/L/g\
s/^m/M/g\
s/^n/N/g\
s/^o/O/g\
s/^p/P/g\
s/^q/Q/g\
s/^r/R/g\
s/^s/S/g\
s/^t/T/g\
s/^u/U/g\
s/^v/V/g\
s/^w/W/g\
s/^x/X/g\
s/^y/Y/g\
s/^z/Z/g

Long-ish literal capitalization limited by ed(1) capabilities

Repeat that a couple of times for lines that were capitalized on the first pass. And for lines after a paragraph-initiating tag. (I’m betraying my own setup here.) And now there are properly capitalized sentences.
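
The heuristic itself fits in a few lines of Python. This is a sketch, not a port: it leans on `str.upper()` instead of a literal letter table, so it capitalizes any letter that has an uppercase form, and `capitalize_after_terminators` is a name invented for the example.

```python
import re

# A convertible line that ends with a sentence terminator.
ENDS_SENTENCE = re.compile(r"^[^<>].*[.!?…]$")


def capitalize_after_terminators(lines):
    """Capitalize the first character of each line following a finished sentence."""
    out = list(lines)
    for i in range(len(out) - 1):
        if ENDS_SENTENCE.match(out[i]):
            nxt = out[i + 1]
            # "<" and ">" have no uppercase form, so tag and
            # escaped lines pass through unchanged.
            out[i + 1] = nxt[:1].upper() + nxt[1:]
    return out


if __name__ == "__main__":
    text = ["first line ends here.", "second line continues,", "third one."]
    print(capitalize_after_terminators(text))
```

As in the ed version, the very first line and lines following a tag are not covered by this rule and need separate handling.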

But there are some more words needing capitalization. Not only sentence-initiating ones:

g/^[^<>].*internet.*/s/internet/Internet/
g/^[^<>].*unicode.*/s/unicode/Unicode/
g/^[^<>].*interslavic.*/s/interslavic/Interslavic/
g/^[^<>].*artyom.*/s/artyom/Artyom/
g/^[^<>].*latin.*/s/latin/Latin/
g/^[^<>].*romaji.*/s/romaji/Romaji/
g/^[^<>].*lisp.*/s/lisp/Lisp/
g/^[^<>].*armascii.*/s/armascii/ArmASCII/
g/^[^<>].*html.*/s/html/HTML/

Capitalizing frequent capitalized / all-caps words

Latin Unicode Lisp in HTML. Cool, right?
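
The per-word commands above can also be expressed as a lookup table. A minimal Python sketch, assuming whole-word matching is wanted (the ed commands match substrings, and only the first one on each line); `PROPER_NOUNS` and `capitalize_proper` are names made up here, and the table holds only a subset of the words.

```python
import re

# Lowercase spelling -> intended spelling (subset of the list above).
PROPER_NOUNS = {
    "internet": "Internet",
    "unicode": "Unicode",
    "lisp": "Lisp",
    "armascii": "ArmASCII",
    "html": "HTML",
}

# One alternation of all the keys, anchored on word boundaries.
WORD = re.compile(r"\b(" + "|".join(map(re.escape, PROPER_NOUNS)) + r")\b")


def capitalize_proper(line: str) -> str:
    """Fix the casing of known words on convertible lines."""
    if not line or line[0] in "<>":
        return line
    return WORD.sub(lambda m: PROPER_NOUNS[m.group(1)], line)


if __name__ == "__main__":
    print(capitalize_proper("i write lisp that emits html"))
```

The word-boundary anchors keep words such as "lisping" intact, which is a deliberate difference from the substring-based commands.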

Reconstructing Words #

The Internet learned to read phrases like "u r". So why not process ’em too?

g/^[^<>].* i .*/s/ i / I /
g/^[^<>].* i'm .*/s/ i'm / I'm /
g/^[^<>].* i'd .*/s/ i'd / I'd /
g/^[^<>].* i'll .*/s/ i'll / I'll /
g/^\([^<>].* \)u r\([ ,;:—].*\)/s//\1you are\2/
g/^\([^<>].* \)u\([ ,;:—].*\)/s//\1you\2/
g/^\([^<>].* \)v r\([ ,;:—].*\)/s//\1we are\2/

Capitalizing and pre-processing pronoun contractions
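
Here is the same idea as a Python sketch. It tries the longer abbreviations first, because once a bare "u" becomes "you", a following "u r" can no longer be found. The rule table and `expand_pronouns` are illustrative names, and the patterns keep the same limits as the ed ones: a space before the abbreviation and a delimiter after it.

```python
import re

# Order matters: multi-word abbreviations before their prefixes.
RULES = [
    (re.compile(r"(?<= )u r(?=[ ,;:—])"), "you are"),
    (re.compile(r"(?<= )v r(?=[ ,;:—])"), "we are"),
    (re.compile(r"(?<= )u(?=[ ,;:—])"), "you"),
    # Covers " i ", " i'm ", " i'd ", " i'll " in one rule.
    (re.compile(r"(?<= )i(?=[ '])"), "I"),
]


def expand_pronouns(line: str) -> str:
    """Expand pronoun abbreviations on convertible lines."""
    if not line or line[0] in "<>":
        return line
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


if __name__ == "__main__":
    print(expand_pronouns("so u r right and i think i'm done, u see"))
```

Abbreviations at the very start or end of a line are not touched, just as with the ed commands.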

But abbreviations don’t stop at pronouns. There are other interesting occurrences:

g/^[^<>].* abt .*/s/ abt / about /
g/^\([^<>].* \)rlly\([ ,;:—].*\)/s//\1really\2/
g/^\([^<>].* \)dunno\([ ,;:—].*\)/s//\1don’t know\2/

Other contractions

Having that, one might start feverishly expanding wtf-s and lol-s. But I’ll not do that.

Reconstructing Characters #

So not all languages use the Latin alphabet. And, of those that do, many contain additional letters. Computer systems used to accept only Latin characters. Thus the existence of transliteration schemes such as Romaji and ArmASCII.

So why not reverse-engineer one such scheme? Say, for Interslavic:

g/cz/s//č/g
g/sz/s//š/g
g/zs/s//ž/g
g/Cz/s//Č/g
g/Sz/s//Š/g
g/Zs/s//Ž/g

Common Interslavic romanization reversal
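
A Python sketch of the same reversal, assuming only the six digraphs listed above. The replacements run one after another in the listed order, so overlapping sequences resolve the same way as with the consecutive ed commands. `DIGRAPHS` and `unromanize` are names invented for this example, and the sample words are made up.

```python
# ASCII digraph -> accented letter, in the order the ed commands apply them.
# Plain dicts keep insertion order (Python 3.7+), which this sketch relies on.
DIGRAPHS = {
    "cz": "č",
    "sz": "š",
    "zs": "ž",
    "Cz": "Č",
    "Sz": "Š",
    "Zs": "Ž",
}


def unromanize(line: str) -> str:
    """Replace every listed digraph with its accented letter."""
    for ascii_pair, letter in DIGRAPHS.items():
        line = line.replace(ascii_pair, letter)
    return line


if __name__ == "__main__":
    print(unromanize("czas i Szkola, zsena"))
```

Like the commands above, this runs on every line it is given, with no guard for tags or escaped lines.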

Yet another character-level thing is smart punctuation. These curly quotes and apostrophes no one uses. And "AI" em dashes, of course.

g/^[^<>].*"\([^"]*\)"/s/"\([^"]*\)"/“\1”/g
g/^[^<>].*'/s/'/’/g
g/^[^<>].*---/s/---/—/g

Using prettier characters for punctuation
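
And the punctuation pass as a Python sketch, under the assumption that double quotes come in matched pairs on a single line. `smarten` is a hypothetical name.

```python
import re

# A pair of straight double quotes with no quote in between.
QUOTED = re.compile(r'"([^"]*)"')


def smarten(line: str) -> str:
    """Swap straight quotes, apostrophes and triple hyphens for typographic ones."""
    if not line or line[0] in "<>":
        return line
    line = QUOTED.sub(r"“\1”", line)   # "text" -> curly double quotes
    line = line.replace("'", "’")      # straight apostrophe -> curly
    line = line.replace("---", "—")    # triple hyphen -> em dash
    return line


if __name__ == "__main__":
    print(smarten('he said "u r late" --- and i\'m not'))
```

An unmatched double quote is left as it is, since there is no reliable way to tell whether it opens or closes.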

Next Steps #

Okay, I admit, I ran out of things to optimize here. This post itself is processed into readable prose already. And I don’t have many more samples to reverse-engineer into literary prose. Help will be appreciated?