Welcome, guest | Sign In | My Account | Store | Cart

A simple but slow way of compression using arithmetic coding.

Python, 113 lines
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
#  A very slow arithmetic coder for Python.
#
#  "Rationals explode quickly in term of space and ... time."
#              -- comment in Rational.py (probably Tim Peters)
#
# Really.  Don't use this for real work.  Read Mark Nelson's
# Dr. Dobb's article on the topic at
#    http://dogma.net/markn/articles/arith/part1.htm
# It's readable, informative and even includes clean sample code.
#
# Contributed to the public domain
# Andrew Dalke < dalke @ dalke scientific . com >


import sys

import Rational, math
R = Rational.rational

def train(text):
    """text -> 0-order probability statistics as a dictionary

    Text must not contain the NUL (0x00) character because that's
    used to indicate the end of data.
    """
    assert "\x00" not in text
    counts = {}
    for c in text:
        counts[c]=counts.get(c,0)+1
    counts["\x00"] = 1
    tot_letters = sum(counts.values())

    tot = 0
    d = {}
    prev = R(0)
    for c, count in counts.items():
        next = R(tot + count, tot_letters)
        d[c] = (prev, next)
        prev = next
        tot = tot + count
    assert tot == tot_letters

    return d


def encode(text, probs):
    """text and the 0-order probability statistics -> longval, nbits

    The encoded number is rational(longval, 2**nbits)
    """
    minval = R(0)
    maxval = R(1)
    for c in text + "\x00":
        prob_range = probs[c]
        delta = maxval - minval
        maxval = minval + prob_range[1] * delta
        minval = minval + prob_range[0] * delta

    # I tried without the /2 just to check.  Doesn't work.
    # Keep scaling up until the error range is >= 1.  That
    # gives me the minimum number of bits needed to resolve
    # down to the end-of-data character.
    delta = (maxval - minval)/2
    nbits = 0L
    while delta < 1:
        nbits = nbits + 1
        delta = delta << 1
    if nbits == 0:
        return 0, 0
    else:
        avg = (maxval + minval)<<(nbits-1)  # using -1 instead of /2
    # Could return a rational instead ...
    return avg.n//avg.d, nbits  # the division truncation is deliberate


def decode(longval, nbits, probs):
    """decode the number to a string using the given statistics"""
    val = R(longval, 1L<<nbits)
    letters = []
    probs_items = [(c, minval, maxval) for (c, (minval, maxval))
                                 in probs.items()]

    while 1:
        for (c, minval, maxval) in probs_items:
            if minval <= val < maxval:
                break
        else:
            raise AssertionError("not found")

        if c == "\x00":
            break
        letters.append(c)
        delta = maxval - minval
        val = (val - minval)/delta
    return "".join(letters)

if __name__ == "__main__":
    # getopt? optparse? What are they?
    import sys
    trainfilename = sys.argv[1]  # must give a filename
    phrase = sys.argv[2]  # must give text to compress (slowly!)
    probs = train(open(trainfilename).read())
    n, nbits = encode(phrase, probs)
    # +1 for the NUL terminator or equivalent
    print "Orig. %d bits, compr. %d bits, ratio = %3.f%%" % (
        (len(phrase)+1)*8, nbits, (100.*nbits) / (len(phrase)*8+1))
    print n
    new_phrase = decode(n, nbits, probs)
    print "Was it '%s'?" % (new_phrase,)
    if phrase == new_phrase:
        print "Guess so."
    else:
        print "Why not?!"

I wanted to understand how to do arithmetic coding. It's somewhat tricky in C because of fixed width numeric types. Python has a rational data type which makes those problems go away, though at much higher computation cost.

The result is that this example is very easy to understand, but not that useful for compressing more than a few dozen characters.

To use this code, first use train() it on a block of text to gather statistics on character use. This creates a frequency table used by the encode() function. The encoded result is two numbers, 'N' and 'Nbits'. The coded number is N/2**Nbits. To decode the number, pass those two and the statistics table to the decode() function.

There are many places that describe the algorithm. The clearest one I found which also included easy to use source code is from Mark Nelson's paper in Dr. Dobb's Journal. Text and code can be found in http://dogma.net/markn/articles/arith/part1.htm .