Oh no, I've done it again. I've heard your cries, and this time I have a decoder idea too.
Random Removal for Lossy Compression
The idea this time is to remove random letters from the text to make it shorter, and then 'decode' it using autocorrect.
Encoder
import random
import re
import string


def encode(text: str, remove_rate: float) -> str:
    """
    Remove random letters from the string.

    We shouldn't remove any spaces, to give autocorrect a fighting chance.

    Args:
        text (str): String to be 'encoded'
        remove_rate (float): [0, 1] proportion of characters to be removed.

    Returns:
        str: lossily encoded string
    """
    if remove_rate > 1:
        raise ValueError("Remove rate greater than 1")
    if remove_rate < 0:
        raise ValueError("Remove rate less than 0")
    text_list = list(text)
    # Strip out spaces and punctuation so we only count removable letters
    stripped_punc = text.translate(str.maketrans("", "", string.punctuation + " "))
    letters_to_remove = int(remove_rate * len(stripped_punc))
    for _ in range(letters_to_remove):
        choice = random.randint(0, len(text_list) - 1)
        # Re-roll until we land on a letter, not punctuation or a space
        while text_list[choice] in string.punctuation + " ":
            choice = random.randint(0, len(text_list) - 1)
        del text_list[choice]
    removed_letters = "".join(text_list)
    # Collapse runs of spaces into one
    return re.sub(" +", " ", removed_letters)
Here we can provide a string and a proportion of the text to randomly remove. I opted to keep punctuation and spaces, but at the end I collapse consecutive spaces into just one (this keeps the string tidy if a whole word is removed). This also keeps the decoder simple.
Some examples using the text
The Quick Brown Fox Jumps Over The Lazy Dog
With a 0.1 rate
Th Quick Bown Fox Jump Over The Lazy Dog
and a 0.9 rate
Qu u g
Okay, so there's not much chance that a text with 90% of its letters removed will be of much use to the decoder.
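For reference, the examples above came from calls like the following. Since removal is random, your output will differ run to run; seeding random (the 42 here is arbitrary) pins it down:

import random

random.seed(42)  # optional: make the random removal repeatable
phrase = "The Quick Brown Fox Jumps Over The Lazy Dog"
print(encode(phrase, 0.1))
print(encode(phrase, 0.9))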
Decoder
from spellchecker import SpellChecker


def decode(text: str) -> str:
    """
    Tries to auto-correct a 'lossily' encoded string.

    Args:
        text (str): lossy string

    Returns:
        str: Auto-corrected string, 'decoded'
    """
    spell = SpellChecker()
    words = text.split(" ")
    results = []
    for word in words:
        # correction() can return None when no candidate is found
        # (in recent pyspellchecker versions), so fall back to the input word
        results.append(spell.correction(word) or word)
    return " ".join(results)
We can try our 0.1 rate text and see if this produces a useful output
the Quick down Fox Jump Over The Lazy Dog
Okay, so we're slightly wrong, but 'Bown' is probably a more likely misspelling of 'down' than of 'brown', so I think this is understandable.
Let's try the 0.9 rate text for kicks
Qu u i
We notice a slightly interesting thing: the spell checker will actually try to turn any lone consonant into 'i' (and not 'I'). I'm not really sure why it's lowercase.
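If you want to poke at this behaviour yourself, a quick probe of pyspellchecker's correction() looks like this (results may differ between versions and word frequency lists):

from spellchecker import SpellChecker

spell = SpellChecker()
for letter in "bcdfg":
    # Per the observation above, lone consonants tend to come back as 'i'
    print(letter, "->", spell.correction(letter))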
Different Approach: Remove Only Vowels
Vowels are a bit more of a 'gluey' letter; removing them from a word tends to leave something more readable than removing letters at random. A word that loses most of its consonants turns into 'primordial soup', whereas one that just loses its vowels generally keeps its skeleton.
import random
import re

VOWELS = ["a", "e", "i", "o", "u"]


def encode(text: str, remove_rate: float) -> str:
    """
    Remove vowels from the string.

    We shouldn't remove any spaces, to give autocorrect a fighting chance.

    Args:
        text (str): String to be 'encoded'
        remove_rate (float): [0, 1] proportion of vowels to be removed.

    Returns:
        str: lossily encoded string
    """
    if remove_rate > 1:
        raise ValueError("Remove rate greater than 1")
    if remove_rate < 0:
        raise ValueError("Remove rate less than 0")
    text_list = list(text)
    # Count the vowels so we know how many to remove
    vowels_in_text = [letter for letter in text_list if letter in VOWELS]
    letters_to_remove = int(remove_rate * len(vowels_in_text))
    for _ in range(letters_to_remove):
        choice = random.randint(0, len(text_list) - 1)
        # Re-roll until we land on a vowel
        while text_list[choice] not in VOWELS:
            choice = random.randint(0, len(text_list) - 1)
        del text_list[choice]
    removed_letters = "".join(text_list)
    # Collapse runs of spaces into one
    return re.sub(" +", " ", removed_letters)
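As before, here's roughly how the examples that follow were generated (outputs vary from run to run):

phrase = "The Quick Brown Fox Jumps Over The Lazy Dog"
print(encode(phrase, 0.5))  # drop half the vowels
print(encode(phrase, 1.0))  # drop all of them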
Again using that phrase:
The Quick Brown Fox Jumps Over The Lazy Dog
We can see the results for 0.5
Th Qck Brown Fx Jmps Over The Lazy Dog
This makes much more sense in my opinion, but it's obviously a smaller 'saving' than the first approach.
We can even remove all the vowels
Th Qck Brwn Fx Jmps Ovr Th Lzy Dg
And this is still pretty much intelligible, even without 'decoding'.
Using the same decoding scheme we end up with something like
the ack brown fix jumps or the lay do
However, this is probably because the autocorrect favours words of the same length. It's also worth noting that the spell checker has no contextual information for each word: 'brown fix' makes no sense where 'fox' would be the better option.
Demo on Longer Text
We can take Genesis 1:1 as our text to be encoded:
In the beginning God created the heavens and the earth. Now the earth was formless and empty, darkness was over the surface of the deep, and the Spirit of God was hovering over the waters.
Encoded with 50% of the vowels dropped:
In th bgnning Gd cratd th hevns and the arth. Nw the earth was formlss nd empty, darkness was ovr th surfac of th dep, and the Sprt of God was hverng ver th wtrs.
Decoded:
In the banning go crate the hens and the earth no the earth was formless and empty darkness was or the surface of the dept and the sort of God was hovering ver the worse
We get a compression ratio of 1.16, which is not great considering how different the decoded text is! However, perhaps this shows some promise of a sort of poetic re-imagining of the input text.
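The ratio here is just the character count of the original over the character count of the encoded text (no entropy coding involved); something like:

def compression_ratio(original: str, encoded: str) -> float:
    # Naive character-count ratio; greater than 1 means the encoded text is shorter
    return len(original) / len(encoded)

genesis = (
    "In the beginning God created the heavens and the earth. "
    "Now the earth was formless and empty, darkness was over the surface "
    "of the deep, and the Spirit of God was hovering over the waters."
)
# The run above came out at about 1.16 with half the vowels dropped
print(round(compression_ratio(genesis, encode(genesis, 0.5)), 2))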
Removing all of the vowels from the text ends up with this after decoding:
In the boning go crud the hans and the the no the ruth is frills and empty dress is or the safe i the do and the sort i go is having or the worse
Which I think is at least somewhat interesting: it keeps the structure of the original text but reads very differently.