Welcome, guest | Sign In | My Account | Store | Cart

Given a message encoded with a shift/rotation cipher, such as rot13, this recipe recovers the most probable plain text for the message. It does this by using statistics of bigram (2-character sequence) counts from a sample of text.

Python, 51 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
class ShiftDecoder:
    """Decode text encoded with a shift cypher, a code that, like rot13,
    maps character[i] of the alphabet to character[(i+n)%26], for some n.
    The algorithm is simple: first train the decoder on some sample text.
    Training means keeping counts of how often each 2-character sequence
    occurs.  Then, when given a text to decode, try each of the 26 possible
    shifts (rotations) and choose the one that has the highest probability
    according to the counts of 2-character sequences."""
    def __init__(self, training_text):
        self.counts = DefaultDict(1)
        for bi in bigrams(text):
            self.counts[bi] += 1

    def score(self, plaintext):
        "Return a score for text based on how common letters pairs are."
        s = 1.0
        for bi in bigrams(plaintext):
            s = s * self.counts[bi]
        return s

    def decode(self, ciphertext):
        "Return the shift decoding of text with the best score."
        all_shifts = [shift_encode(ciphertext, n) for n in range(len(alphabet))]
        return argmax(all_shifts, self.score)


alphabet = 'abcdefghijklmnopqrstuvwxyz'

def shift_encode(plaintext, n):
    "Encode text with a shift cipher that moves each letter up by n letters."
    from string import maketrans
    code = alphabet[n:] + alphabet[:n]
    trans = maketrans(alphabet + alphabet.upper(), code + code.upper())
    return plaintext.translate(trans)

def bigrams(text):
    "Return a list of all consecutive pairs of letters in text."
    return [text[i:i+2] for i in range(len(text) - 1)]

def argmax(sequence, fn):
    "Return the element e of sequence with maximum fn(e) value."
    best_fn, best_e = max([(fn(e), e) for e in sequence])
    return best_e

class DefaultDict(dict):
    "Dictionary with a default value for unknown keys."
    def __init__(self, default):
        self.default = default

    def __getitem__(self, key):
        return self.get(key, self.default)
First you need some text to train the decoder.
Find a plain text file of at least a couple of pages:
>>> d = ShiftDecoder(file('something.txt').read())
Now encode a message:
>>> message = shift_encode('This is a test.', 8)
>>> message
'Bpqa qa i bmab.'
Use the decoder to decode it:
>>> d.decode(message)
'This is a test.'
Note that if the message is too long you'll get arithmetic overflow.
You can forestall that problem by changing the 1.0 to 1, but
ultimately you'd want to be adding logs rather than multiplying counts.
Note that the default count is 1, not 0. If it were 0, then any novel
sequence would have a 0 score, not what we want.

2 comments

Wai Yip Tung 15 years, 10 months ago  # | flag

by the way. are you the Peter Norvig from Google?

Cristian Consonni 9 years, 11 months ago  # | flag

There's a little bug in the code. In the constructor of ShiftDecoder the function bigrams should be called with argument 'training_text' instead of 'text' which is undefined.

Cristian (arrived here for a AI-class assignment ;-) )

Created by Peter Norvig on Fri, 14 Oct 2005 (PSF)
Python recipes (4591)
Peter Norvig's recipes (3)

Required Modules

Other Information and Tasks