
The misspell class takes a string and slightly mangles it by randomly transposing two adjacent characters in each word while leaving the first and last characters intact. The resulting text is almost completely misspelled but still readable. Words shorter than four characters, numbers, email addresses and URLs are left untouched. Because the transpositions are random, each run is likely to produce a message with a different signature (checksum).
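
The core step described above, taken in isolation, might look like the following minimal sketch. It is written for Python 3, and `transpose_once` is a hypothetical helper name used only for illustration; it is not part of the recipe and it skips the handling of punctuation, numbers, email addresses and URLs.

```python
import random


def transpose_once(word, rng=random):
    """Swap one randomly chosen pair of adjacent interior characters.

    Hypothetical helper for illustration only: it ignores punctuation,
    numbers, email addresses and URLs.
    """
    if len(word) < 4:
        return word  # too short: there is no interior pair to swap
    chars = list(word)
    # choose i so that chars[i] and chars[i + 1] are both interior characters
    i = rng.randint(1, len(chars) - 3)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)


rng = random.Random(0)  # seeded only so the demo is repeatable
for word in ["the", "read", "important"]:
    print(word, "->", transpose_once(word, rng))
```

A four-letter word has only one interior pair, so "read" always becomes "raed". If the two chosen characters happen to be identical (the "tt" in "letters", for example), the word comes out unchanged.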

Python, 69 lines
import random
import re

class misspell(object):
    def __init__(self):
        # regex matching a word that ends with a punctuation character
        self.punctuation = re.compile(r'\S+[' + re.escape(",'.:;!?") + r']$')

    def misspell(self, text):
        self.text = text.splitlines()
        misspelled = []
        for line in self.text:
            # split hyphenated words into independent words
            line = re.sub(r'(\S+)\-(\S+)', r'\1 \2', line)

            # split each line into a list of words
            tokens = line.split()

            for token in tokens:
                # don't misspell a number
                if token.isdigit():
                    misspelled.append(token + ' ')
                    continue

                # don't misspell an email address or URL
                if '@' in token or '://' in token:
                    misspelled.append(token + ' ')
                    continue

                # does the word end with punctuation?
                has_punc = self.punctuation.match(token)

                # explode the word into a list of characters
                token = list(token)

                # no trailing punctuation and at least 4 chars: swap an interior pair
                if not has_punc and len(token) >= 4:
                    start = random.randint(1, len(token) - 3)
                    stop = start + 2
                    f, s = token[start:stop]
                    token[start:stop] = s, f

                # trailing punctuation and at least 5 chars: also keep the last letter
                elif has_punc and len(token) >= 5:
                    start = random.randint(1, len(token) - 4)
                    stop = start + 2
                    f, s = token[start:stop]
                    token[start:stop] = s, f

                # add the word to the output
                misspelled.append(''.join(token) + ' ')

            # end the line
            misspelled.append('\n')

        return ''.join(misspelled)

if __name__ == '__main__':
    # example usage of the misspell class
    message = """
    According to research at an English University, it doesn't matter
    in what order the letters in a word are, the only important thing is
    that the first and last letters be in the right places. The rest can
    be a total mess and you can still read it without problem. This is
    because the human mind does not read every letter by itself, but
    the word as a whole."""

    msg = misspell()
    print(msg.misspell(message))

Sample output of this example (it differs from run to run):

“Accoridng to reseacrh at an Engilsh Univeristy, it does'nt matetr in waht odrer the lettres in a wrod are, the olny imoprtant thnig is taht the frist and lsat lteters be in the rihgt palces. The rset can be a ttoal mses and you can stlil raed it wihtout prolbem. Tihs is becasue the hmuan mnid deos not raed evrey lteter by istelf, but the wrod as a whloe.”

Why would you want to do this? Well, it could be used to avoid detection by a Bayesian or signature-based message filter. No, I’m not a spammer; I just found this interesting.
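
To make the signature point concrete, here is a small Python 3 sketch. It assumes the filter's signature is a plain SHA-256 digest of the message text, which is a simplification, and `signature` is a hypothetical helper name. `variant_a` is taken from the sample output above; `variant_b` is another transposition made up for this example.

```python
import hashlib


def signature(text):
    """Hex SHA-256 digest of the text (a stand-in for a filter's checksum)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = "According to research at an English University"
variant_a = "Accoridng to reseacrh at an Engilsh Univeristy"
variant_b = "Accordnig to resaerch at an Enlgish Uinversity"  # made-up variant

for label, text in [("original", original),
                    ("variant a", variant_a),
                    ("variant b", variant_b)]:
    # print a short prefix of each digest; the three values differ
    print(label, signature(text)[:16])
```

Any exact-match checksum behaves this way, since a single swapped pair of characters changes the digest. Fuzzy or content-based signatures may still group the variants together.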

4 comments

Bibha Tripathi 18 years, 6 months ago

nice. and I read the mangled paragraph with ease too!

Raymond Hettinger 18 years, 6 months ago

Naive Bayes still pretty smart. I doubt the algorithm's ability to fool a Bayesian classifier armed with a reasonably large corpus. At best, the algorithm will render many words unrecognizable as spam words, but it won't do anything to raise the count of non-spam words. At worst, a few of the misspellings of shorter words will be unique to spam messages (for instance, "raed" is a typo that would only show up in a mangled message). Further, a high percentage of unrecognizable words is its own cue that the message is spam (you can throw a blanket over the camel but can't hide its humps).

Bob Costello 18 years, 6 months ago

Other use... Some methods of cryptanalysis look for common letter combinations such as 'th' and 'qu'. This can be used to throw a little "speed bump" into that.
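
A rough way to check the letter-combination claim is to count digraphs before and after mangling. The Python 3 sketch below is illustrative only: `bigram_counts` is a hypothetical helper, and the two strings are a sentence fragment from the recipe's example message and its mangled form from the sample output.

```python
from collections import Counter


def bigram_counts(text):
    """Count adjacent letter pairs inside each whitespace-separated word."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(a + b for a, b in zip(word, word[1:]))
    return counts


plain = "that the first and last letters be in the right places"
mangled = "taht the frist and lsat lteters be in the rihgt palces"

print("'th' in plain text:  ", bigram_counts(plain)["th"])
print("'th' in mangled text:", bigram_counts(mangled)["th"])
```

Words shorter than four characters, such as "the", are never changed, so their digraphs survive. On this fragment the effect is a reduction in the count rather than its removal.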

Walter Brunswick 18 years, 6 months ago

Well done. Well done. A nice class which performs the algorithm efficiently.