Welcome, guest | Sign In | My Account | Store | Cart

How to use letter frequencies analysis to decipher French and English texts.

Python, 90 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
import string

#
# Relative letter frequencies for the french and english languages. 
# First item is letter 'A' frequency and so on.
#
ENGLISH = (0.0749, 0.0129, 0.0354, 0.0362, 0.1400, 0.0218, 0.0174, 0.0422, 0.0665, 0.0027, 0.0047, 0.0357, 
           0.0339, 0.0674, 0.0737, 0.0243, 0.0026, 0.0614, 0.0695, 0.0985, 0.0300, 0.0116, 0.0169, 0.0028, 
           0.0164, 0.0004)
           
FRENCH =  (0.0840, 0.0106, 0.0303, 0.0418, 0.1726, 0.0112, 0.0127, 0.0092, 0.0734, 0.0031, 0.0005, 0.0601,
           0.0296, 0.0713, 0.0526, 0.0301, 0.0099, 0.0655, 0.0808, 0.0707, 0.0574, 0.0132, 0.0004, 0.0045, 
           0.0030, 0.0012)          
FREQUENCIES = (ENGLISH, FRENCH)

# This program can be used to decipher Caesar encoding as produced by the
# function bellow
def cipher(s, key):
    r = ''
    ASC_A = ord('a')
    for c in s.lower():
        if 'a'<=c<='z':
             r += chr(ASC_A+(ord(c)-ASC_A+key)%26)
        else:
             r += c 
    return r

#compute letter frequencies delta
def delta(source, dest):
    N = 0.0
    for f1,f2 in zip(source, dest):
        N += abs(f1-f2)
    return N
    
# compute letter frequencies from a text
def frequency(s):
    D = dict([(c,0) for c in string.lowercase])
    N = 0.0
    for c in s:
        if 'a'<=c<='z':
            N += 1
            D[c] += 1
    L = D.items()
    L.sort()
    return [f/N for (l,f) in L]

# deciphering caesar code by letter frequencies analysis 
def decipher(s):
    deltamin = 1000
    bestrot = 0
    freq = frequency(s)
    for key in range(26):
        d = min([delta(freq[key:]+freq[:key], x) for x in FREQUENCIES])
        if d<deltamin:
            deltamin = d
            bestrot = key
    return cipher(s, -bestrot)
    


#
# Some tests
#

T1 = """
Python is an easy to learn, powerful programming language. It has 
efficient high-level data structures and a simple but effective approach 
to object-oriented programming. Python's elegant syntax and dynamic
typing, together with its interpreted nature, make it an ideal language 
for scripting and rapid application development in many areas on most platforms. 

Python tutorial
"""

T2 = """
He quoi ? charmante Elise, vous devenez melancolique, apres
les obligeantes assurances que vous avez eu la bonte de me donner de
votre foi ? Je vous vois soupirer, helas ! au milieu de ma joie. Est-ce
du regret, dites-moi, de m'avoir fait heureux, et vous repentez-vous de
cet engagement ou mes feux ont pu vous contraindre ?

Moliere (l'Avare)
"""

import random
for text in (T1,T2):
    key = random.randrange(26)
    X = cipher(text, key)
    print X
    print decipher(X)

Caesar used a cypher when he wanted to transmit coded messages on his military campaigns. In those days, as nowadays, security was a problem, and he couldn't rely on the fact that an enemy who intercepted one of his messengers would be some hairy, ignorant barbarian who couldn't read Latin!

He used a simple shift technique to replace each letter of the alphabet with another letter further down the alphabet. In the first century B.C. it was enough to ensure that his messages couldn't be read unless the receiver knew the secret. But python make it really easy to crack.

The algorithm is rather primitive, it only compute letter frequencies and use the letter permutation which is the nearest from frequencies references. So it's amazing to see that only some few lines of text are enough to find the key.

The cipher function is quite suboptimal. A faster implementation could use the translate string method.

1 comment

W.J. van der Laan 20 years, 5 months ago  # | flag

Interesting... For even better cracking, you could use digrams as well or even trigrams, and things like most appearing beginning/final letters.

1) Most frequently appearing letters overall:

eiaorn tslcup mdhygb fvkwzx qj

2) Most frequently appearing letters BEGINNING words:

spcaut mbdrhi eofgnl wvkjqz yx

3) Most frequent final letters:

eysndr ltacmg hkopif xwubzv jq

4) Most frequent digrams (ordered pairs of letters)

er in ti on te al an at ic en is re ra le ri ro st ne ar