
\documentclass[12pt]{article}
\usepackage[T2A,OT1]{fontenc}
\usepackage[default]{cantarell}
\usepackage[a4paper, top=20mm, bottom=20mm, left=20mm, right=20mm]{geometry}
\usepackage[utf8]{inputenc}
\usepackage[russian, english]{babel}
\usepackage{tabu}
\usepackage{hyperref}
\usepackage{parskip}
\usepackage{graphicx}
\usepackage{tabularx}
\usepackage[normalem]{ulem}
\usepackage{float}
\floatstyle{boxed}
\restylefloat{figure}
\usepackage{setspace}
\onehalfspacing
\author{Artyom Bologov \href{mailto:re-prose@aartaka.me}{(email)}}
\date{\today}
\title{Reverse-engineering Prose From Internet Lingo}
\makeatletter
\def\endenv{\expandafter\end\expandafter{\@currenvir}}
\makeatother
\begin{document}
\maketitle

\includegraphics[width=\textwidth,height=\textheight,keepaspectratio]{./assets/re-prose.png}

So we all communicate over the Internet.
And we don't really care much about punctuation or capitalization.
Which is fine, I guess?
But what if someone (like me) wanted a proper literary text instead of this all-lowercase-no-punctuation soup?
Well, why not reverse-engineer text from that into prose?

Alternative villain origin story:
\href{run:this-post-is-lisp}{I used to make this website in Lisp}.
Like, text too, as raw symbols in nested lists denoting HTML.
But Lisp is case-insensitive and has too much syntax for my taste.
So I had to come up with a number of heuristics to manage that.
And reconstruct proper prose from code.

Anyway, I'm into building literary prose from nothing.
In this post, I'm building some from raw text with ed(1) scripts.
You can preview both:
\href{re-prose.phtm}{the original text of this post in the re-prose.phtm file},
\href{scripts/re-prose.ed}{and the script reconstructing the text in scripts/re-prose.ed}.
\href{run:this-post-is-ed}{It's all built with ed(1), still}.

\section*{Escaping Literals} \label{escaping}

Now, my painful experience taught me that algorithms on text need one thing.
Escaping.
Because if your algorithm eats something that should've stayed literal…
It's bad.

So, for this post, I'm prefixing lines that should \emph{not} be converted with a greater-than sign.
I also allow myself to ignore HTML tags.
Because I don't want to accidentally break the markup.

So this regex pattern repeats throughout this post.
Because we only need to act on lines intended for conversion.

\begin{figure}[h!]\begin{verbatim}
g/^[^<>]...
\end{verbatim}\caption{Only act on non-tag lines not prefixed by a greater-than sign}\end{figure}
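
A quick way to check what the guard selects is to print the matching lines instead of editing them.
This is a minimal sketch, not part of the post's script:
the \texttt{n} command lists every non-empty line that starts with neither a tag nor the escape prefix, together with its line number.

\begin{figure}[h!]\begin{verbatim}
g/^[^<>]/n
\end{verbatim}\caption{Sketch: listing the lines the guard would act on}\end{figure}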

\section*{Reconstructing Sentences} \label{sentences}

That's an easy one,
\href{https://sembr.org}{at least if we observe semantic line breaks}.
Just add a period to every line that lacks a terminator.

\begin{figure}[h!]\begin{verbatim}
g/^[^<>].*[^;—:,.!?…]$/s//&./
\end{verbatim}\caption{Appending periods everywhere suitable}\end{figure}

Then comes capitalization.
The heuristic is simple.
For every line ending with a sentence terminator, capitalize the first letter of the next line:

\begin{figure}[h!]\begin{verbatim}
g/^[^<>].*[.!?…]$/+1s/^š/Š/g\
s/^č/Č/g\
s/^ž/Ž/g\
s/^a/A/g\
s/^b/B/g\
s/^c/C/g\
s/^d/D/g\
s/^e/E/g\
s/^f/F/g\
s/^g/G/g\
s/^h/H/g\
s/^i/I/g\
s/^j/J/g\
s/^k/K/g\
s/^l/L/g\
s/^m/M/g\
s/^n/N/g\
s/^o/O/g\
s/^p/P/g\
s/^q/Q/g\
s/^r/R/g\
s/^s/S/g\
s/^t/T/g\
s/^u/U/g\
s/^v/V/g\
s/^w/W/g\
s/^x/X/g\
s/^y/Y/g\
s/^z/Z/g
\end{verbatim}\caption{Long-ish literal capitalization limited by ed(1) capabilities}\end{figure}

Repeat that a couple of times for lines that were capitalized on the first pass.
And for lines after a paragraph-initiating tag.
\href{run:pidgin}{(I'm betraying my own setup here.)}
And now there are properly capitalized sentences.

But there are some more words needing capitalization.
Not only sentence-initiating ones:

\begin{figure}[h!]\begin{verbatim}
g/^[^<>].*internet.*/s/internet/Internet/
g/^[^<>].*unicode.*/s/unicode/Unicode/
g/^[^<>].*interslavic.*/s/interslavic/Interslavic/
g/^[^<>].*artyom.*/s/artyom/Artyom/
g/^[^<>].*latin.*/s/latin/Latin/
g/^[^<>].*romaji.*/s/romaji/Romaji/
g/^[^<>].*lisp.*/s/lisp/Lisp/
g/^[^<>].*armascii.*/s/armascii/ArmASCII/
g/^[^<>].*html.*/s/html/HTML/
\end{verbatim}\caption{Capitalizing frequent capitalized / all-caps words}\end{figure}

Latin Unicode Lisp in HTML.
Cool, right?
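
One caveat: without a suffix, \texttt{s} replaces only the first match on a line.
So a line mentioning the same word twice keeps its second occurrence lowercase.
A hedged variant, assuming every occurrence on a non-escaped line should be capitalized, adds the \texttt{g} suffix.
Like the commands above, it does not tell prose from inline markup on the same line.

\begin{figure}[h!]\begin{verbatim}
g/^[^<>].*lisp/s/lisp/Lisp/g
\end{verbatim}\caption{Sketch: capitalizing every occurrence on a line}\end{figure}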

\section*{Reconstructing Words} \label{words}

The Internet learned to read phrases like ``u r''.
So why not process 'em too?

\begin{figure}[h!]\begin{verbatim}
g/^[^<>].* i .*/s/ i / I /
g/^[^<>].* i'm .*/s/ i'm / I'm /
g/^[^<>].* i'd .*/s/ i'd / I'd /
g/^[^<>].* i'll .*/s/ i'll / I'll /
g/^\([^<>].* \)u r\([ ,;:—].*\)/s//\1you are\2/
g/^\([^<>].* \)v r\([ ,;:—].*\)/s//\1we are\2/
g/^\([^<>].* \)u\([ ,;:—].*\)/s//\1you\2/
\end{verbatim}\caption{Capitalizing and pre-processing pronoun contractions}\end{figure}
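
The patterns above require a space before the abbreviation, so a line that begins with one is left alone.
A possible addition, shown here as a sketch and not taken from the post's script, handles the line-initial case.
It assumes a capitalized result is wanted, since such a line starts a sentence or follows one.
The longer phrase goes first so that the shorter pattern does not consume its first letter.

\begin{figure}[h!]\begin{verbatim}
g/^u r /s//You are /
g/^u /s//You /
\end{verbatim}\caption{Sketch: expanding pronoun abbreviations at the start of a line}\end{figure}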

But abbreviations don't stop at pronouns.
There are other interesting occurrences:

\begin{figure}[h!]\begin{verbatim}
g/^[^<>].* abt .*/s/ abt / about /
g/^\([^<>].* \)rlly\([ ,;:—].*\)/s//\1really\2/
g/^\([^<>].* \)dunno\([ ,;:—].*\)/s//\1don't know\2/
\end{verbatim}\caption{Other contractions}\end{figure}

Having that, one might start feverishly expanding wtf-s and lol-s.
But I'll not do that.

\section*{Reconstructing Characters} \label{characters}

So not all languages use the Latin alphabet.
And, out of those that use it, many contain additional letters.
Computer systems used to accept only Latin characters.
Thus the existence of transliteration schemes such as Romaji and ArmASCII.

So why not reverse-engineer one such scheme?
Say, for Interslavic:

\begin{figure}[h!]\begin{verbatim}
g/cz/s//č/g
g/sz/s//š/g
g/zs/s//ž/g
g/Cz/s//Č/g
g/Sz/s//Š/g
g/Zs/s//Ž/g
\end{verbatim}\caption{Common Interslavic romanization reversal}\end{figure}

Yet another character-level thing is smart punctuation.
These curly quotes and apostrophes no one uses.
And ``AI'' em dashes, of course.

\begin{figure}[h!]\begin{verbatim}
g/^[^<>].*"\([^"]*\)"/s/"\([^"]*\)"/“\1”/g
g/^[^<>].*'/s/'/’/g
g/^[^<>].*---/s/---/—/g
\end{verbatim}\caption{Using prettier characters for punctuation}\end{figure}

\section*{Next Steps} \label{next}

Okay, I admit, I ran out of things to optimize here.
This post itself is processed into readable prose already.
And I don't have many more samples to reverse-engineer into literary prose.
Help will be appreciated?


\par\noindent\rule{\textwidth}{0.4pt}
\href{https://creativecommons.org/licenses/by/4.0}{CC-BY 4.0} 2022-2026 by Artyom Bologov (aartaka),
\href{https://codeberg.org/aartaka/pages/commit/a91befa}{with one commit remixing Claude-generated code}.
Any and all opinions listed here are my own and not representative of my employers; future, past and present.
\end{document}
