The misspell class takes a string and slightly mangles it by randomly transposing two adjacent characters in each word while leaving the first and last characters intact. The resulting text is almost completely misspelled but still readable. Words shorter than four characters, numbers, email addresses and URLs are left untouched. Each run will produce a message with a different signature (checksum).
    import random
    import re


    class misspell(object):

        def __init__(self):
            # create a regex to match a word with ending punctuation
            self.punctuation = re.compile(r'\S+[' + re.escape(",'.:;!?") + r']$')

        def misspell(self, text):
            # split the text into lines
            self.text = text.splitlines()
            misspelled = []

            for line in self.text:
                # split hyphenated words into independent words
                line = re.sub(r'(\S+)\-(\S+)', r'\1 \2', line)

                # split each line into a list of words
                tokens = line.split()

                for token in tokens:
                    # don't misspell a number
                    if token.isdigit():
                        misspelled.append(token + ' ')
                        continue

                    # don't misspell an email address or URL
                    if '@' in token or '://' in token:
                        misspelled.append(token + ' ')
                        continue

                    # does the word end with punctuation?
                    has_punc = self.punctuation.match(token)

                    # explode the word to a list
                    token = list(token)

                    # word doesn't end in punctuation and has at least 4 chars
                    if not has_punc and len(token) >= 4:
                        start = random.randint(1, len(token) - 3)
                        stop = start + 2
                        f, s = token[start:stop]
                        token[start:stop] = s, f

                    # word does end in punctuation and has at least 5 chars
                    elif has_punc and len(token) >= 5:
                        start = random.randint(1, len(token) - 4)
                        stop = start + 2
                        f, s = token[start:stop]
                        token[start:stop] = s, f

                    # add the word to the line
                    misspelled.append(''.join(token) + ' ')

                # end the line
                misspelled.append('\n')

            return ''.join(misspelled)


    if __name__ == '__main__':
        # example usage of the misspell class
        message = """
    According to research at an English University, it doesn't matter
    in what order the letters in a word are, the only important thing is
    that the first and last letters be in the right places. The rest can
    be a total mess and you can still read it without problem. This is
    because the human mind does not read every letter by itself, but
    the word as a whole."""

        msg = misspell()
        print(msg.misspell(message))
The output of this example:
Accoridng to reseacrh at an Engilsh Univeristy, it does'nt matetr in waht odrer the lettres in a wrod are, the olny imoprtant thnig is taht the frist and lsat lteters be in the rihgt palces. The rset can be a ttoal mses and you can stlil raed it wihtout prolbem. Tihs is becasue the hmuan mnid deos not raed evrey lteter by istelf, but the wrod as a whloe.
Why would you want to do this? Well, it could be used to avoid detection by a Bayesian or signature-based message filter. No, I'm not a spammer; I just found this interesting.
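To make the signature point concrete, here is a minimal sketch of how repeated runs yield different checksums. It is not the recipe's class: `swap_adjacent`, `mangle` and `signature` are hypothetical helpers, tokens that are not purely alphabetic are simply skipped (the recipe handles trailing punctuation properly), and SHA-256 stands in for whatever checksum a real filter might use.

```python
import hashlib
import random


def swap_adjacent(word, rng):
    """Swap one random pair of adjacent interior letters (hypothetical helper)."""
    # Simplifying assumption: skip short words and anything non-alphabetic.
    if len(word) < 4 or not word.isalpha():
        return word
    chars = list(word)
    i = rng.randint(1, len(chars) - 3)  # never touches the first or last letter
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)


def mangle(text, rng):
    """Mangle every whitespace-separated word of the text."""
    return ' '.join(swap_adjacent(word, rng) for word in text.split())


def signature(text):
    """Hex digest of the text; SHA-256 is an assumed stand-in for a filter checksum."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


rng = random.Random()
message = "the first and last letters stay in the right places"

# Each pass usually picks different swaps, so the digests usually differ.
for _ in range(3):
    variant = mangle(message, rng)
    print(signature(variant)[:16], variant)
```

Any single changed character gives an unrelated digest, which is why an exact-match signature filter would likely treat each variant as a new message.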
nice. and I read the mangled paragraph with ease too!
Naive Bayes still pretty smart. I doubt the algorithm's ability to fool a Bayesian classifier armed with a reasonably large corpus. At best, the algorithm will render many words unrecognizable as spam words, but it won't do anything to raise the count of non-spam words. At worst, a few of the misspellings of shorter words will be unique to spam messages (for instance, "raed" is a typo that would only show up in a mangled message). Further, a high percentage of unrecognizable words is its own cue that the message is spam (you can throw a blanket over the camel but can't hide its humps).
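The argument above can be illustrated with a toy scorer. This is only a sketch under loud assumptions: the token probabilities in `TOKEN_SPAM_PROB` are invented for illustration, unseen tokens are treated as neutral (p = 0.5), and a real naive Bayes filter would learn its probabilities from a corpus and handle priors and smoothing.

```python
import math

# Hypothetical per-token spam probabilities, as if learned from a corpus.
TOKEN_SPAM_PROB = {
    "free": 0.95,
    "money": 0.90,
    "offer": 0.85,
    "meeting": 0.10,
    "report": 0.05,
}


def spam_log_odds(text):
    """Sum the log-odds of each token; unknown tokens (p = 0.5) add nothing."""
    total = 0.0
    for token in text.lower().split():
        p = TOKEN_SPAM_PROB.get(token, 0.5)
        total += math.log(p / (1.0 - p))
    return total


def unknown_fraction(text):
    """Share of tokens the scorer has never seen."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    unknown = [t for t in tokens if t not in TOKEN_SPAM_PROB]
    return len(unknown) / len(tokens)


plain = "free money offer"
mangled = "fere mnoey ofefr"  # one adjacent swap per word

for text in (plain, mangled):
    print(text, spam_log_odds(text), unknown_fraction(text))
```

Mangling moves the score from clearly spammy to neutral rather than toward non-spam, and the fraction of unknown tokens jumps, which a filter could use as a feature in its own right.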
Other use... Some methods of cryptanalysis look for common letter combinations such as 'th' and 'qu'. This can be used to throw a little "speed bump" into that.
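A rough way to check the size of that speed bump is to count letter pairs before and after mangling. The sketch below uses one sentence from the recipe's example and its mangled form from the sample output; `digram_counts` is a hypothetical helper, not part of the recipe.

```python
from collections import Counter


def digram_counts(text):
    """Count adjacent letter pairs inside each whitespace-separated word."""
    counts = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        for first, second in zip(letters, letters[1:]):
            counts[first + second] += 1
    return counts


original = ("the only important thing is that the first and last "
            "letters be in the right places")
mangled = ("the olny imoprtant thnig is taht the frist and lsat "
           "lteters be in the rihgt palces")

for label, text in (("original", original), ("mangled", mangled)):
    counts = digram_counts(text)
    print(label, "th:", counts["th"], "top:", counts.most_common(3))
```

Because three-letter words such as "the" are never touched and the first letter always stays put, many word-initial pairs survive, so the bump really is a small one.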
Well done. Well done. A nice class which performs the algorithm efficiently.