Reverse-engineering Prose From Internet Lingo

> Internet learned to speak gibberish that doesn’t always coincide with literary text. But it can be converted back to that. Here’s my experiment along these lines.
>
> assets/re-prose.png: Amateurly hand-drawn thumbnail. On it, there’s regular text: “2day im rcnstrctin ltrlly proz f notin,” and on top of it, in brighter color, “Today I’m Reconstructing Literary Proze from Nothing,” as if correcting the text below it. In the corners, attributions to “Artёm / Artyom Bologov” and “aartaka.me” are visible.

So we all communicate over the Internet, and we don’t really care much about punctuation or capitalization. Which is fine, I guess? But what if someone (like me) wanted a proper literary text instead of this all-lowercase-no-punctuation soup? Well, why not reverse-engineer that text into prose?

Alternative villain origin story: I used to make this website in Lisp. Like, the text too, as raw symbols in nested lists denoting HTML. But Lisp is case-insensitive and has too much syntax for my taste. So I had to come up with a number of heuristics to manage that and reconstruct proper prose from code.

Anyway, I’m into building literary prose from nothing. In this post, I’m building some from raw text with ed(1) scripts. You can preview both: the original text of this post in the re-prose.phtm file, and the script reconstructing the text in scripts/re-prose.ed. It’s all built with ed(1), still.

Escaping Literals

Now, my painful experience taught me that algorithms on text need one thing: escaping. Because if your algorithm eats something that should’ve stayed literal… it’s bad.

So, for this post, I’m prefixing lines that should not be converted with a greater-than sign. I also allow myself to ignore HTML tags, because I don’t want to accidentally break the markup.

So this regex pattern repeats throughout this post, because we only need to act on lines intended for conversion:

> g/^[^<>]...
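
For readers who don’t speak ed(1), here is a minimal Python sketch of what that guard checks. It is only an illustration: the function name `is_convertible` is made up, and the post’s real processing is done by the ed script.

```python
def is_convertible(line: str) -> bool:
    """Return True if the prose heuristics may touch this line.

    Mirrors the ^[^<>] guard: the line must start with some character
    that is neither '<' (an HTML tag) nor '>' (the literal-line prefix).
    An empty line has no first character, so it is skipped too.
    """
    return bool(line) and line[0] not in "<>"


if __name__ == "__main__":
    for sample in ["so we all communicate", "> g/cz/s//č/g", "<p>", ""]:
        print(repr(sample), is_convertible(sample))
```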

Reconstructing Sentences

That’s an easy one (at least if we observe semantic line breaks): just add periods to all the lines where there’s no terminator.

> g/^[^<>].*[^;—:,.!?…]$/s//&./
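
The same rule, restated as a hedged Python sketch. `add_periods` is a hypothetical helper name, and the input is assumed to be a list of lines without trailing newlines; the character class is copied from the ed command above.

```python
import re

# Same shape as the ed pattern: a first character that is not < or >,
# anything in between, and a last character that is not already one of
# ; em-dash : , . ! ? ellipsis (\u2014 is the em dash).
NEEDS_PERIOD = re.compile(r"[^<>].*[^;\u2014:,.!?…]")


def add_periods(lines):
    """Append '.' to every convertible line that lacks a terminator."""
    result = []
    for line in lines:
        if NEEDS_PERIOD.fullmatch(line):
            line += "."
        result.append(line)
    return result


if __name__ == "__main__":
    print(add_periods(["so we talk", "which is fine?", "> literal", "<p>"]))
```

Like the ed pattern, this needs at least two characters on a line before it will add anything.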

Then comes capitalization. The heuristic is simple: for every line ending with a sentence terminator, capitalize the first letter of the next line:

> g/^[^<>].*[.!?…]$/+1s/^š/Š/g\
> s/^č/Č/g\
> s/^ž/Ž/g\
> s/^a/A/g\
> s/^b/B/g\
> s/^c/C/g\
> s/^d/D/g\
> s/^e/E/g\
> s/^f/F/g\
> s/^g/G/g\
> s/^h/H/g\
> s/^i/I/g\
> s/^j/J/g\
> s/^k/K/g\
> s/^l/L/g\
> s/^m/M/g\
> s/^n/N/g\
> s/^o/O/g\
> s/^p/P/g\
> s/^r/R/g\
> s/^s/S/g\
> s/^t/T/g\
> s/^u/U/g\
> s/^v/V/g\
> s/^w/W/g\
> s/^x/X/g\
> s/^y/Y/g\
> s/^z/Z/g
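
A rough Python equivalent of this pass, as a sketch only. It swaps the letter-by-letter substitutions for `str.upper()`, which covers more letters than the ed list does, and the name `capitalize_after_terminators` is invented for the example.

```python
import re

# A convertible line that ends with a sentence terminator.
ENDS_SENTENCE = re.compile(r"[^<>].*[.!?…]")


def capitalize_after_terminators(lines):
    """Capitalize the first letter of each line following a finished sentence."""
    result = list(lines)
    # Stop one short of the end: the last line has no "next line".
    for index in range(len(result) - 1):
        if ENDS_SENTENCE.fullmatch(lines[index]):
            following = result[index + 1]
            # upper() leaves '<', '>' and other non-letters untouched.
            result[index + 1] = following[:1].upper() + following[1:]
    return result


if __name__ == "__main__":
    sample = ["that's an easy one.", "then comes capitalization", "the heuristic"]
    print(capitalize_after_terminators(sample))
```

Note that the very first line of the input is never capitalized by this rule, since nothing precedes it.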

Repeat that a couple of times, for lines that were capitalized on the first pass and for lines after a paragraph-initiating tag (I’m betraying my own setup here). And now there are properly capitalized sentences.

But there are more words needing capitalization, not only sentence-initiating ones:

> g/^[^<>].*internet.*/s/internet/Internet/
> g/^[^<>].*unicode.*/s/unicode/Unicode/
> g/^[^<>].*interslavic.*/s/interslavic/Interslavic/
> g/^[^<>].*artyom.*/s/artyom/Artyom/
> g/^[^<>].*latin.*/s/latin/Latin/
> g/^[^<>].*romaji.*/s/romaji/Romaji/
> g/^[^<>].*lisp.*/s/lisp/Lisp/
> g/^[^<>].*armascii.*/s/armascii/ArmASCII/
> g/^[^<>].*html.*/s/html/HTML/

Latin Unicode Lisp in HTML. Cool, right?

Reconstructing Words

Internet learned to read phrases like “u r”, so why not process ’em too?

> g/^[^<>].* i .*/s/ i / I /
> g/^[^<>].* i'm .*/s/ i'm / I'm /
> g/^[^<>].* i'd .*/s/ i'd / I'd /
> g/^[^<>].* i'll .*/s/ i'll / I'll /
> g/^\([^<>].* \)u r\([ ,;:—].*\)/s//\1you are\2/
> g/^\([^<>].* \)v r\([ ,;:—].*\)/s//\1we are\2/
> g/^\([^<>].* \)u\([ ,;:—].*\)/s//\1you\2/

But abbreviations don’t stop at pronouns. There are other interesting occurrences:

> g/^[^<>].* abt .*/s/ abt / about /
> g/^\([^<>].* \)rlly\([ ,;:—].*\)/s//\1really\2/
> g/^\([^<>].* \)dunno\([ ,;:—].*\)/s//\1don’t know\2/
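
The expansions can also be sketched as one table-driven pass in Python. This is an illustration under assumptions, not the post’s script: `EXPANSIONS` and `expand_lingo` are made-up names, and unlike the ed commands (one substitution per line per command) this version replaces every occurrence.

```python
import re

# Longer phrases first, so "u r" is handled before the bare "u".
EXPANSIONS = [
    ("u r", "you are"),
    ("v r", "we are"),
    ("u", "you"),
    ("abt", "about"),
    ("rlly", "really"),
    ("dunno", "don’t know"),
]


def expand_lingo(line):
    """Expand abbreviations on a convertible line; leave literal lines alone."""
    if not line or line[0] in "<>":
        return line
    for short, full in EXPANSIONS:
        # As in the ed patterns: a space before the abbreviation and a space
        # or one of , ; : em-dash (\u2014) after it.
        pattern = r"(?<= )" + re.escape(short) + r"(?=[ ,;:\u2014])"
        line = re.sub(pattern, full, line)
    return line


if __name__ == "__main__":
    print(expand_lingo("so u r here, and i dunno abt that"))
```

Ordering the table from longest to shortest phrase is the design choice that matters here: expand the bare “u” first and “u r” turns into “you r”, which no later rule recognizes.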

Having that, one might start feverishly expanding wtf-s and lol-s. But I’ll not do that.

Reconstructing Characters

So not all languages use the Latin alphabet. And, out of those that use it, many contain additional letters. Computer systems used to accept only Latin characters. Thus the existence of transliteration schemes such as Romaji and ArmASCII.

So why not reverse-engineer one of these schemes? Say, for Interslavic:

> g/cz/s//č/g
> g/sz/s//š/g
> g/zs/s//ž/g
> g/Cz/s//Č/g
> g/Sz/s//Š/g
> g/Zs/s//Ž/g
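
As a sketch, the same digraph table in Python. The mapping is copied from the commands above; `DIGRAPHS` and `restore_diacritics` are names invented for the example, and the sample strings are made up rather than checked Interslavic vocabulary.

```python
# Applied in this order, one digraph at a time, like the ed commands.
DIGRAPHS = {
    "cz": "č",
    "sz": "š",
    "zs": "ž",
    "Cz": "Č",
    "Sz": "Š",
    "Zs": "Ž",
}


def restore_diacritics(text):
    """Replace every transliteration digraph with its single letter."""
    for digraph, letter in DIGRAPHS.items():
        text = text.replace(digraph, letter)
    return text


if __name__ == "__main__":
    print(restore_diacritics("czlovek"))
    print(restore_diacritics("Szkola zsena"))
```

Just like the ed version, this has no way to tell a digraph from two letters that merely happen to sit next to each other.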

Yet another character-level thing is smart punctuation. These curly quotes and apostrophes no one uses. And “AI” em dashes, of course.

> g/^[^<>].*"\([^"]*\)"/s/"\([^"]*\)"/“\1”/g
> g/^[^<>].*'/s/'/’/g
> g/^[^<>].*---/s/---/—/g
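
And a hedged Python sketch of the same three substitutions, in the same order. `smarten` is a hypothetical name; the code spells the em dash as `\u2014` only to make it visible.

```python
import re


def smarten(line):
    """Turn straight quotes, apostrophes and --- into typographic ones."""
    if not line or line[0] in "<>":
        return line  # literal line or HTML tag: leave untouched
    # Pair up straight double quotes and curl them.
    line = re.sub(r'"([^"]*)"', "“\\1”", line)
    # Every remaining straight apostrophe becomes a curly one.
    line = line.replace("'", "’")
    # Three hyphens become an em dash.
    line = line.replace("---", "\u2014")
    return line


if __name__ == "__main__":
    print(smarten('these "AI" dashes---don\'t'))
```

An unpaired double quote is left as it is, since the first substitution only fires on matched pairs.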

Next Steps

Okay, I admit, I ran out of things to optimize here. This post itself is processed into readable prose already. And I don’t have many more samples to reverse-engineer into literary prose. Help will be appreciated?