
Silly Lossy Text Compression Idea

Oh no, I've done it again. I've heard your cries; this time I have a decoder idea too.

Random Removal for Lossy Compression

The idea this time is to remove random letters from the text to make it shorter, and then 'decode' using autocorrect.

Encoder

import random
import re
import string


def encode(text: str, remove_rate: float) -> str:
    """
    Remove random letters from the string

    We shouldn't remove any spaces, to give autocorrect a fighting chance

    Args:
        text (str): String to be 'encoded'
        remove_rate (float): [0,1] proportion of characters to be removed.

    Returns:
        str: lossily encoded string
    """
    if remove_rate > 1:
        raise ValueError("Remove rate greater than 1")
    if remove_rate < 0:
        raise ValueError("Remove rate less than 0")

    text_list = list(text)

    # Strip out spaces and punctuation so we only count the actual letters
    stripped_punc = text.translate(str.maketrans("", "", string.punctuation + " "))
    letters_to_remove = int(remove_rate * len(stripped_punc))

    for _ in range(letters_to_remove):
        choice = random.randint(0, len(text_list) - 1)
        while text_list[choice] in string.punctuation + " ":
            choice = random.randint(0, len(text_list) - 1)

        del text_list[choice]

    removed_letters = "".join(text_list)

    # Make multiple spaces just one
    return re.sub(" +", " ", removed_letters)

Here we can provide a string and a proportion of the letters to remove at random. I opted to keep punctuation and spaces, but at the end I collapse consecutive spaces into just one (this keeps the string tidy if whole words are removed). It also keeps the decoder simple.
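
The examples below come from calls like these (just a sketch; the exact output changes between runs since the removal is random):

text = "The Quick Brown Fox Jumps Over The Lazy Dog"

print(encode(text, 0.1))
print(encode(text, 0.9))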

Some examples using the text

The Quick Brown Fox Jumps Over The Lazy Dog

With a 0.1 rate

Th Quick Bown Fox Jump Over The Lazy Dog

and a 0.9 rate

Qu u g

Okay, so removing 90% of the letters from a text is unlikely to leave much that's useful for decoding.

Decoder

from spellchecker import SpellChecker


def decode(text: str) -> str:
    """
    Tries to auto-correct a 'lossily' encoded string

    Args:
        text (str): lossy string

    Returns:
        str: Auto-corrected string, 'decoded'
    """
    spell = SpellChecker()
    words = text.split(" ")

    results = []
    for word in words:
        # correction() can return None when it finds no candidate,
        # so fall back to the original word in that case
        results.append(spell.correction(word) or word)

    return " ".join(results)
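
As a quick sketch of the round trip (the SpellChecker import comes from the pyspellchecker package, and results depend on its bundled English dictionary):

text = "The Quick Brown Fox Jumps Over The Lazy Dog"
print(decode(encode(text, 0.1)))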

We can try our 0.1 rate text and see if this produces a useful output

the Quick down Fox Jump Over The Lazy Dog

Okay, so we're slightly wrong, but 'Bown' is probably more likely to be a misspelling of 'down' than of 'brown', so I think this is understandable.

Let's try the 0.9 rate text for kicks

Qu u i

One slightly interesting thing we notice is that the spell checker will actually turn any lone consonant into 'i' (and not 'I'); I'm not really sure why it's lowercase.
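
We can poke at that behaviour directly; a quick sketch, with the caveat that the exact result depends on the word-frequency list pyspellchecker ships with:

from spellchecker import SpellChecker

spell = SpellChecker()
print(spell.correction("g"))  # comes back as 'i' here, matching the decoded text above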

Different Approach: Remove Only Vowels

Vowels are a bit more of a 'glue' letter; removing them from words tends to make more sense than removing random letters. A word that loses most of its consonants generally turns into more of a 'primordial soup' than one that loses just its vowels.

import random
import re

VOWELS = ["a", "e", "i", "o", "u"]


def encode(text: str, remove_rate: float) -> str:
    """
    Remove vowels from the string

    We shouldn't remove any spaces, to give autocorrect a fighting chance

    Args:
        text (str): String to be 'encoded'
        remove_rate (float): [0,1] proportion of vowels to be removed.

    Returns:
        str: lossily encoded string
    """

    if remove_rate > 1:
        raise ValueError("Remove rate greater than 1")
    if remove_rate < 0:
        raise ValueError("Remove rate less than 0")

    text_list = list(text)

    # Count the vowels so we know how many to remove
    stripped = [char for char in text_list if char in VOWELS]
    letters_to_remove = int(remove_rate * len(stripped))

    for _ in range(letters_to_remove):
        choice = random.randint(0, len(text_list) - 1)
        while text_list[choice] not in VOWELS:
            choice = random.randint(0, len(text_list) - 1)

        del text_list[choice]

    removed_letters = "".join(text_list)

    # Make multiple spaces just one
    return re.sub(" +", " ", removed_letters)
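
The interface is unchanged, so the examples below come from calls like these (again a sketch; the 0.5 output varies per run):

text = "The Quick Brown Fox Jumps Over The Lazy Dog"

print(encode(text, 0.5))
print(encode(text, 1.0))  # removes every lowercase vowel; capitals like the 'O' in 'Over' survive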

Again using that phrase:

The Quick Brown Fox Jumps Over The Lazy Dog

We can see the results for 0.5

Th Qck Brown Fx Jmps Over The Lazy Dog

This makes much more sense in my opinion, but it's obviously a smaller 'saving' than the first approach.

We can even remove all the vowels

Th Qck Brwn Fx Jmps Ovr Th Lzy Dg

And this is still pretty much intelligible, even without 'decoding'.

Using the same decoding scheme we end up with something like

the ack brown fix jumps or the lay do

However, this is probably because the autocorrect favours candidates that are the same length as the word it's given. It's also worth noting there's no real contextual information for each word: 'brown fix' makes no sense where 'fox' would be the better option.

Demo on Longer Text

We can take Genesis 1:1 as our text to be encoded:

In the beginning God created the heavens and the earth. Now the earth was formless and empty, darkness was over the surface of the deep, and the Spirit of God was hovering over the waters.

Encoded with 50% of the vowels dropped:

In th bgnning Gd cratd th hevns and the arth. Nw the earth was formlss nd empty, darkness was ovr th surfac of th dep, and the Sprt of God was hverng ver th wtrs.

Decoded:

In the banning go crate the hens and the earth no the earth was formless and empty darkness was or the surface of the dept and the sort of God was hovering ver the worse
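
To put a number on the 'saving' we can just compare character counts; a minimal sketch, assuming the ratio is simply the original length over the encoded length:

def compression_ratio(original: str, encoded: str) -> float:
    """Ratio of original length to encoded length, in characters."""
    return len(original) / len(encoded)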

We get a compression ratio of 1.16, which is not great considering how different the decoded text is! However, perhaps this shows some promise of a sort of poetic re-imagining of the input text.

Removing all of the vowels from the text ends up with this after decoding:

In the boning go crud the hans and the the no the ruth is frills and empty dress is or the safe i the do and the sort i go is having or the worse

Which I think is at least somewhat interesting: it keeps the same structure as the original text but reads very differently.
