LZ77 is a compression algorithm upon which most popular compression formats are based; PKZIP, GZIP, LHA, RAR, etc.
| Python |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 | # lz77.py
#
# This code is Public Domain.
#
# By Nelson Rush
#
# Boyer-Moore-Horspool fast string search.
class BMH:
def pattern(self, data):
self.skip = []
self.m = len(data)
for k in range(256): self.skip.append(self.m)
for k in range(self.m - 1): self.skip[ord(data[k])] = self.m - k - 1
self.skip = tuple(self.skip)
self.data = data
def find(self, text):
n = len(text)
if self.m > n: return -1
k = self.m - 1
while k < n:
j = self.m - 1; i = k
while j >= 0 and text[i] == self.data[j]:
j -= 1; i -= 1
if j == -1: return i + 1
k += self.skip[ord(text[k])]
return -1
# Note this currently doesn't work for blocks larger than 256 bytes, its just an example.
# Normally you would use a larger 'pointer' to access 4k to 64k sized blocks.
# And this should really be implemented using bit manipulation for smaller size.
class LZ77:
def __init__(self, data):
self.position = 0
self.window = ""
self.stream = data
self.streamSize = len(self.stream)
self.search = BMH()
def Encode(self):
# Initialize pointer, last char, and last result to nothing.
p = 0
c = ''
lastresult = 0
found = 0
# Loop through the lookahead buffer.
for i in range(self.streamSize):
# Set pattern to find the longest match.
self.search.pattern(self.stream[self.position:self.position+i+1])
# Find the pattern match in the window.
result = self.search.find(self.window)
# If match failed, we've found the longest.
if result < 0: break
# Set the last result.
lastresult = result
found = 1
# Set c to the last character after result.
c = self.stream[self.position+i]
# Set pointer to the last result that worked.
p = lastresult
# Set the number of bytes backwards to travel.
B = 0
if i > 0: B = self.position - p
L = i
# Check if the lookahead buffer is not empty.
if self.streamSize > 0:
# Increase the position of the lookahead buffer.
self.position += i + 1
# Decrease the size of the lookahead buffer to match.
self.streamSize -= i + 1
# The window encompasses everything up to the lookahead.
self.window = self.stream[:self.position]
return ((B, L), c)
def Encoder(self):
print 'Encoding...'
output = ""
length = self.streamSize
# Loop through and save the code information.
while self.streamSize > 0:
((B, L), C) = self.Encode()
output += chr(B) + chr(L) + C
return (output, length)
def Decoder(self, length):
sz = len(self.stream)
P = []
# Set up the codes and the window.
for i in range(0, sz, 3):
B,L,C = self.stream[i:i+3]
P.append((ord(B),ord(L)))
self.window += C
P = tuple(P)
print 'Decoding...'
output = ""
i = 0
# Loop through executing the code instructions
# (move back B, copy length L, drop in C)
for p in P:
B = p[0]
L = p[1]
C = self.window[i]
if L > 0:
if i-B < 0:
c = self.window[0:L]
else:
pos = i-B
c = self.window[pos:pos+L]
c += C
else:
c = self.window[i]
output += c
i += 1
return output
if __name__ == '__main__':
original = 'AABCBBABC'
lz = LZ77(original)
stream, streamSize = lz.Encoder()
lz = LZ77(stream)
s = lz.Decoder(streamSize)
print 'Original string:',original
print 'Decoded string:',s
|
Discussion
A sliding dictionary is used on the stream of data itself to detect redundancy and store positions to reduce the number of redundant characters.
The string we are compressing is: A A B C B B A B C
The window and the look ahead buffer consist, the window is empty in the beginning and the look ahead buffer holds the stream.
To encode: Find the longest match from the look ahead buffer in the window buffer. For instance, first we check if the first character in the look ahead is in the window. If it isn't we set the (B)ackup/(L)ength code to (0,0) and then drop the character we were at (in this case 'A') in with the code.
If we, on the other hand, did find it in the window, then we would check the first and second character and so on until hit a no-match at which point we would set up the B to the position in the window at which they occur and the L for the length and then drop in the character that didn't match.
Keep it up until we've encoded the entire look ahead buffer, ie. look ahead is empty, it has shifted all over to the window. Then we can just store the resulting character/code table into a file and the size of the original stream which will be used by the decoder.
To decode just perform the code operations on the table, moving back and copying the length and so on until we have unpacked our original sized result.
This algorithm provides pretty good compression. LZSS is a more advanced version of this algorithm that most modern compression programs use these days. LZSS improves the compression ratio by only only storing the codes if the number of bytes being encoded is greater than the size of the code required to reference it.


Comments
Bug. I set the original string to 'lempel ziv text compression'. The decoded output was 'lempel ziv tlxv colerlssitn'.
Sign in to comment