
Compression, encryption, and data codecs are all related fields that most programmers rely on ready-made solutions for. This recipe is a shallow adventure into writing original code and algorithms that combines those topics. Based on the work of recipe 502202, the code here is compliant with Python 3.1 and runs a test on itself when executed. From the program's report, one can see that the novel procedures compress the source and accurately decompress it again. For those who wish to experiment further with the concept, note that fewer unique characters yield better compression (a lower compressed-to-uncompressed ratio).

Python, 82 lines
#! /usr/bin/env python

import random
import sys

################################################################################

def compress(string):
    # Get the unique characters and numeric base.
    unique = set(string)
    base = len(unique)
    # Create a key that will encode data properly.
    key = random.sample(unique, base)
    mapping = dict(map(reversed, enumerate(key)))
    while not mapping[string[-1]]:
        key = random.sample(unique, base)
        mapping = dict(map(reversed, enumerate(key)))
    # Create a compressed numeric representation.
    value = 0
    for place, char in enumerate(string):
        value += mapping[char] * base ** place
    # Return the number as a string with the table.
    return decode(value), bytes(key)

def decode(value):
    # Change a number into a string.
    array = bytearray()
    while value:
        value, byte = divmod(value, 256)
        array.append(byte)
    return bytes(array)

################################################################################

def decompress(string, mapping):
    # Get the numeric value of the string.
    value = encode(string)
    # Find the numeric base and prepare storage.
    base = len(mapping)
    data = bytearray()
    # Decode the value into the original string.
    while value:
        value, key = divmod(value, base)
        data.append(mapping[key])
    # Return the "string" as a bytes object.
    return bytes(data)

def encode(array):
    # Change a string into a number.
    assert array and array[-1], 'Array has ambiguous value!'
    value = 0
    for shift, byte in enumerate(array):
        value += byte << 8 * shift
    return value

################################################################################

def test():
    # Get this program's source.
    txt = open(sys.argv[0], 'r').read().encode()
    
    print('Length of data:', len(txt))

    # Compress the source numerically.
    data, table = compress(txt)
    
    print('Length after compression:', len(data))
    print('Length of the table:', len(table))
    print('Total compressed size:', len(data + table))
    print('Compression ratio: {:%}'.format(len(data + table) / len(txt)))

    # Decompress the data using the table.
    new = decompress(data, table)
    
    print('Decompression was {}successful.'.format(('not ', '')[txt == new]))
    print('Showing the decompressed data:')
    print('==============================')
    print(new.decode())

# Test this program if run directly.
if __name__ == '__main__':
    test()

To follow the algorithm's operation, it helps to understand that an array of 8-bit bytes can be treated as a base-256 number. As such, any array of bytes can be translated into an actual number for manipulation purposes. Doing this in Python is quite easy since integers are not limited to a fixed number of bits, only by the amount of memory available. The encode and decode functions above rely on this fact and can be used on their own to turn bytes and bytearray objects into int objects and back.
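
As a quick illustration, the two helpers round-trip cleanly on their own. The snippet below is a small sketch that assumes encode and decode are available in the current session (for example, pasted into an interactive interpreter):

# Interpret the bytes as base-256 digits, least significant byte first.
data = b'Hi!'                      # 0x48, 0x69, 0x21
number = encode(data)              # 0x48 + 0x69*256 + 0x21*256**2
assert number == 2189640
assert decode(number) == data      # decode() reverses the process exactly

Note the assert in encode: a trailing zero byte would vanish when the integer is rebuilt into bytes, which is the same leading-digit ambiguity that makes compress reshuffle its key until the final character maps to a nonzero digit.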

The text compressor takes advantage of the fact that written communications usually contain fewer than 256 unique bytes. Such text could be stored more efficiently if it were written in a lower base, one that needs fewer than the 8 bits per symbol required by base-256. That is just what the compression algorithm does in its second phase of operation (the first phase creates a base mapping for the characters). Once the text has been converted into a plain integer, which carries no base of its own, the value is written back out as an array of base-256 bytes.
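
To make the lower-base idea concrete, here is a hand-worked sketch of that second phase for a tiny alphabet; the key ordering shown is just one of the orderings the random shuffle could produce:

# Suppose the text is b'ABBA' and the shuffled key came out as b'BA',
# giving the mapping {ord('B'): 0, ord('A'): 1} and a base of 2.
text = b'ABBA'
mapping = {ord('B'): 0, ord('A'): 1}
base = len(mapping)

value = 0
for place, char in enumerate(text):            # little-endian digits
    value += mapping[char] * base ** place     # 1 + 0 + 0 + 8
assert value == 9                              # fits in a single base-256 byte

Four bytes of input collapse into one byte of output here because only one bit per character is actually needed; decode(9) would return the single byte 0x09.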

Decompression follows the reverse of the compression algorithm's operations in order to reconstitute the original text. The base-256 array of bytes is changed back into the number formed in the second phase of compression. The base that the original data was encoded in is determined by the size of the compression table, and the number is repeatedly divided by that base, with each remainder serving as an index into that table. The bytes produced by that operation are appended together to form the original character array that was passed to compression.
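
A complete round trip through both halves of the recipe might look like the following usage sketch (the sample text is arbitrary; note that newer Python releases no longer let random.sample draw from a set, so compress would need list(unique) there, whereas the recipe's target of Python 3.1 accepts the set directly):

original = b'an example string for the recipe'
data, table = compress(original)
restored = decompress(data, table)
assert restored == original
print(len(original), 'bytes in,', len(data) + len(table), 'bytes out')

Because the key is shuffled randomly, data and table will differ from run to run, but the round trip always restores the original bytes.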

Compression ratios (compressed size divided by uncompressed size) can be approximated with the formula log(N) / log(256), where N is the number of unique bytes in the data. For example, if only the characters A-Z are used, the ratio of compressed to uncompressed data sizes would be about 58.75%. Compressing binary data that had been spelled out as two hexadecimal digits per byte would give a ratio of 50% against the hex text, exactly cancelling the doubling from the hex expansion, so nothing is lost or gained relative to the original binary. Since the text compressor is primarily meant for data encoded in 7-bit ASCII, at most 128 unique bytes and therefore a ratio of 87.5%, a size reduction of at least about 12.5% can be expected.
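
The formula is easy to check numerically; the following small sketch prints the expected ratios for the alphabets mentioned above:

from math import log

def expected_ratio(unique_bytes):
    # Approximate compressed/uncompressed ratio for an alphabet of this size.
    return log(unique_bytes) / log(256)

print(format(expected_ratio(26), '.2%'))    # A-Z only: about 58.8%
print(format(expected_ratio(16), '.2%'))    # hex digits: exactly 50%
print(format(expected_ratio(128), '.2%'))   # 7-bit ASCII: exactly 87.5%

The report from test() will come out slightly worse than the formula suggests, since the table of unique characters is shipped along with the compressed data.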

It is not advisable to use the text compression on binary data, since such an array can contain nearly the full range of 256 byte values. Trying to compress encrypted or already-compressed data would most likely be futile. Furthermore, the performance of the algorithm degrades as the input grows, because the arithmetic on ever-larger integers becomes more expensive. The compression algorithm in recipe 502202 compensated for this by splitting its text into manageable chunks joined on null bytes. As a result, any serious use of this recipe would require similar optimizations.
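
A minimal sketch of the chunking idea, assuming nothing beyond the functions defined above; the chunk size and helper names are choices made here for illustration, not part of recipe 502202's null-byte scheme:

CHUNK = 1024  # an arbitrary slice size chosen for this sketch

def compress_chunks(data):
    # Compress each slice separately so the big-integer math stays small.
    return [compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def decompress_chunks(pieces):
    # Rebuild the original data from the per-chunk (data, table) pairs.
    return b''.join(decompress(piece, table) for piece, table in pieces)

Keeping a separate table per chunk costs a little extra space, but it bounds the size of the numbers being manipulated and lets each chunk be processed independently.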