Reverse-engineering Prose From Internet Lingo
By Artyom Bologov
So we all communicate over the Internet. And we don’t really care much about punctuation or capitalization. Which is fine, I guess? But what if someone (like me) wanted a proper literary text instead of this all-lowercase-no-punctuation soup? Well, why not reverse-engineer that soup into prose?
Alternative villain origin story: I used to make this website in Lisp. Like, text too, as raw symbols in nested lists denoting HTML. But Lisp is case-insensitive and has too much syntax for my taste. So I had to come up with a number of heuristics to manage that. And reconstruct proper prose from code.
Anyway, I’m into building literary prose from nothing. In this post, I’m building some from raw text with ed(1) scripts. You can preview both: the original text of this post in the re-prose.phtm file, and the script reconstructing the text in scripts/re-prose.ed. It’s all built with ed(1), still.
Escaping Literals
Now, my painful experience taught me that algorithms on text need one thing. Escaping. Because if your algorithm eats something that should’ve stayed literal… It’s bad.
So, for this post, I’m prefixing lines that should not be converted with a greater-than sign. I also allow myself to ignore HTML tags. Because I don’t want to accidentally break the markup.
So this regex pattern repeats throughout this post. It matches lines whose first character is neither < nor >. Because we only need to act on lines intended for conversion.
g/^[^<>]...
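For anyone who wants to poke at this guard outside ed, here is a minimal Python sketch of the same test. The name is_convertible is made up for illustration and is not part of the ed script; the sketch also assumes lines arrive without their trailing newline.

```python
import re

# Mirrors the ed guard g/^[^<>]: a line qualifies when it is non-empty
# and its first character is neither "<" (HTML tag) nor ">" (escaped literal).
GUARD = re.compile(r"^[^<>]")


def is_convertible(line: str) -> bool:
    """Return True if the heuristics are allowed to touch this line."""
    return GUARD.match(line) is not None


if __name__ == "__main__":
    for sample in ["so we all communicate", "<p>", "> g/foo/s//bar/", ""]:
        print(repr(sample), is_convertible(sample))
```

Note that empty lines are skipped too, since the pattern needs at least one character to match.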
Reconstructing Sentences
That’s an easy one (at least if we observe semantic line breaks): just add periods to all the lines where there’s no terminator.
g/^[^<>].*[^;—:,.!?…]$/s//&./
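The same rule can be sketched in Python, reusing the character class from the ed command above. The function name add_period is hypothetical, and lines are assumed to come without a trailing newline.

```python
import re

# Same shape as the ed pattern: guarded first character, anything in between,
# then a last character that is not already a punctuation mark.
NEEDS_PERIOD = re.compile(r"^[^<>].*[^;—:,.!?…]$")


def add_period(line: str) -> str:
    """Append a period to unterminated lines; leave everything else alone."""
    return line + "." if NEEDS_PERIOD.match(line) else line


if __name__ == "__main__":
    print(add_period("so we all communicate over internet"))
    print(add_period("which is fine, i guess?"))
```

Like the ed pattern, this needs at least two characters on the line, so a lone letter stays as it is.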
Then comes capitalization. The heuristic is simple: for every line ending with a sentence terminator, capitalize the first letter of the next line:
g/^[^<>].*[.!?…]$/+1s/^š/Š/g\
s/^č/Č/g\
s/^ž/Ž/g\
s/^a/A/g\
s/^b/B/g\
s/^c/C/g\
s/^d/D/g\
s/^e/E/g\
s/^f/F/g\
s/^g/G/g\
s/^h/H/g\
s/^i/I/g\
s/^j/J/g\
s/^k/K/g\
s/^l/L/g\
s/^m/M/g\
s/^n/N/g\
s/^o/O/g\
s/^p/P/g\
s/^q/Q/g\
s/^r/R/g\
s/^s/S/g\
s/^t/T/g\
s/^u/U/g\
s/^v/V/g\
s/^w/W/g\
s/^x/X/g\
s/^y/Y/g\
s/^z/Z/g
Repeat that a couple of times for lines that were capitalized on first pass. And for lines after a paragraph-initiating tag. (I’m betraying my own setup here.) And now there are properly capitalized sentences.
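A rough Python equivalent of the capitalization pass, as a sketch only. The name capitalize_after_terminators is invented here, and str.upper() stands in for the explicit letter list, so it covers more letters than the ed command does.

```python
import re

# A guarded line that ends with a sentence terminator.
ENDS_SENTENCE = re.compile(r"^[^<>].*[.!?…]$")


def capitalize_after_terminators(lines):
    """Capitalize the first letter of every line that follows a finished sentence."""
    out = list(lines)
    for i in range(len(out) - 1):
        if ENDS_SENTENCE.match(out[i]):
            nxt = out[i + 1]
            # Upper-casing "<" or ">" is a no-op, so tags and literals survive.
            out[i + 1] = nxt[:1].upper() + nxt[1:]
    return out


if __name__ == "__main__":
    text = ["which is fine.", "but what if?", "<p>", "well"]
    print(capitalize_after_terminators(text))
```

In the sample, "well" stays lowercase because the line before it is a tag. That is the case the extra pass for paragraph-initiating tags has to cover.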
But there are some more words needing capitalization. Not only sentence-initiating ones:
g/^[^<>].*internet.*/s/internet/Internet/
g/^[^<>].*unicode.*/s/unicode/Unicode/
g/^[^<>].*interslavic.*/s/interslavic/Interslavic/
g/^[^<>].*artyom.*/s/artyom/Artyom/
g/^[^<>].*latin.*/s/latin/Latin/
g/^[^<>].*romaji.*/s/romaji/Romaji/
g/^[^<>].*lisp.*/s/lisp/Lisp/
g/^[^<>].*armascii.*/s/armascii/ArmASCII/
g/^[^<>].*html.*/s/html/HTML/
Latin Unicode Lisp in HTML. Cool, right?
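Here is a hedged Python take on the same word list. It departs from the ed commands in two labeled ways: it matches whole words only, and it replaces every occurrence on a line rather than the first one. The names PROPER and fix_proper_nouns are made up for this sketch.

```python
import re

# Lowercase spelling -> proper spelling, taken from the ed commands above.
PROPER = {
    "internet": "Internet", "unicode": "Unicode", "interslavic": "Interslavic",
    "artyom": "Artyom", "latin": "Latin", "romaji": "Romaji",
    "lisp": "Lisp", "armascii": "ArmASCII", "html": "HTML",
}
# Assumption: whole-word matching, which the ed version does not enforce.
WORDS = re.compile(r"\b(" + "|".join(map(re.escape, PROPER)) + r")\b")


def fix_proper_nouns(line: str) -> str:
    """Capitalize known proper nouns on lines that pass the guard."""
    if not re.match(r"^[^<>]", line):
        return line
    return WORDS.sub(lambda m: PROPER[m.group(1)], line)


if __name__ == "__main__":
    print(fix_proper_nouns("latin unicode lisp in html"))
```

Whole-word matching keeps words like "platinum" or "lisping" intact, at the cost of being a little stricter than the original substitutions.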
Reconstructing Words
The Internet learned to read phrases like "u r". So why not process ’em too?
g/^[^<>].* i .*/s/ i / I /
g/^[^<>].* i'm .*/s/ i'm / I'm /
g/^[^<>].* i'd .*/s/ i'd / I'd /
g/^[^<>].* i'll .*/s/ i'll / I'll /
g/^\([^<>].* \)u r\([ ,;:—].*\)/s//\1you are\2/
g/^\([^<>].* \)u\([ ,;:—].*\)/s//\1you\2/
g/^\([^<>].* \)v r\([ ,;:—].*\)/s//\1we are\2/
But abbreviations don’t stop at pronouns. There are other interesting occurrences:
g/^[^<>].* abt .*/s/ abt / about /
g/^\([^<>].* \)rlly\([ ,;:—].*\)/s//\1really\2/
g/^\([^<>].* \)dunno\([ ,;:—].*\)/s//\1don’t know\2/
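The expansions can also be sketched as a table-driven pass in Python. This is an illustration under assumptions: EXPANSIONS and expand are invented names, every entry gets the same boundary rule (a space before, one of space/comma/semicolon/colon/dash after), and all occurrences on a line are replaced. Longer phrases come first so that "u r" is handled before the bare "u".

```python
import re

# Order matters: "u r" has to be expanded before "u" gets a chance to eat it.
EXPANSIONS = [
    ("u r", "you are"), ("v r", "we are"), ("u", "you"),
    ("abt", "about"), ("rlly", "really"), ("dunno", "don’t know"),
]


def expand(line: str) -> str:
    """Expand abbreviations on guarded lines, using the post's boundary rule."""
    if not re.match(r"^[^<>]", line):
        return line
    for short, full in EXPANSIONS:
        pattern = r"(?<= )" + re.escape(short) + r"(?=[ ,;:—])"
        line = re.sub(pattern, full, line)
    return line


if __name__ == "__main__":
    print(expand("so u r right, i rlly dunno abt that"))
```

As with the ed commands, an abbreviation at the very start or very end of a line is left alone, because the boundary rule wants a character on both sides.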
Having that, one might start feverishly expanding wtf-s and lol-s. But I’ll not do that.
Reconstructing Characters
So not all languages use the Latin alphabet. And, out of those that use it, many contain additional letters. Computer systems used to accept only Latin characters. Thus the existence of transliteration schemes such as Romaji and ArmASCII.
So why not reverse-engineer one such scheme? Say, for Interslavic:
g/cz/s//č/g
g/sz/s//š/g
g/zs/s//ž/g
g/Cz/s//Č/g
g/Sz/s//Š/g
g/Zs/s//Ž/g
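In Python the same digraph table could look like the sketch below. DIGRAPHS and restore_letters are hypothetical names, and, like the ed commands above, this version applies to every line without the tag guard.

```python
# Digraph -> letter, in the same order as the ed commands.
DIGRAPHS = [
    ("cz", "č"), ("sz", "š"), ("zs", "ž"),
    ("Cz", "Č"), ("Sz", "Š"), ("Zs", "Ž"),
]


def restore_letters(line: str) -> str:
    """Replace ASCII digraphs with the letters they stand for, one rule at a time."""
    for digraph, letter in DIGRAPHS:
        line = line.replace(digraph, letter)
    return line


if __name__ == "__main__":
    print(restore_letters("czas"))
```

The rules run sequentially, so overlapping input such as "szs" is resolved by whichever rule comes first in the list.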
Yet another character-level thing is smart punctuation. These curly quotes and apostrophes no one uses. And "AI" em dashes, of course.
g/^[^<>].*"\([^"]*\)"/s/"\([^"]*\)"/“\1”/g
g/^[^<>].*'/s/'/’/g
g/^[^<>].*---/s/---/—/g
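A small Python sketch of the three punctuation rules, in the same order. The function name smarten is made up, and the quote rule assumes quotes come in matched pairs on a single line, as the ed pattern does.

```python
import re


def smarten(line: str) -> str:
    """Curl quotes and apostrophes and turn --- into an em dash on guarded lines."""
    if not re.match(r"^[^<>]", line):
        return line
    line = re.sub(r'"([^"]*)"', r"“\1”", line)  # paired straight quotes
    line = line.replace("'", "’")               # apostrophes
    line = line.replace("---", "—")             # triple hyphen to em dash
    return line


if __name__ == "__main__":
    print(smarten('he said "u r" --- and i\'m fine'))
```

Because tag lines are skipped, attribute quotes in the markup stay straight.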
Next Steps
Okay, I admit, I ran out of things to optimize here. This post itself is processed into readable prose already. And I don’t have many more samples to reverse-engineer into literary prose. Help will be appreciated?