The misspell class takes a string and slightly mangles each word by randomly transposing two adjacent characters while leaving the first and last characters intact. The resulting text is almost completely misspelled but still readable. Words shorter than four characters, numbers, email addresses and URLs are left untouched. Each run will produce a message with a different signature (checksum).
import random
import re


class misspell(object):
    def __init__(self):
        # create a regex to match a word with ending punctuation
        self.punctuation = re.compile(r'\S+[' + re.escape(",'.:;!?") + r']$')

    def misspell(self, text):
        # split the text into lines
        self.text = text.splitlines()
        misspelled = []
        for line in self.text:
            # split hyphenated words into independent words
            line = re.sub(r'(\S+)\-(\S+)', r'\1 \2', line)
            # split each line into a list of words
            tokens = line.split()
            for token in tokens:
                # don't misspell a number
                if token.isdigit():
                    misspelled.append(token + ' ')
                    continue
                # don't misspell an email address or URL
                if '@' in token or '://' in token:
                    misspelled.append(token + ' ')
                    continue
                # does the word end with punctuation?
                has_punc = re.match(self.punctuation, token)
                # explode the word to a list
                token = list(token)
                # word doesn't end in punctuation and has at least 4 chars
                if not has_punc and len(token) >= 4:
                    start = random.randint(1, len(token) - 3)
                    stop = start + 2
                    f, s = token[start:stop]
                    token[start:stop] = s, f
                # word does end in punctuation and has at least 5 chars
                elif has_punc and len(token) >= 5:
                    start = random.randint(1, len(token) - 4)
                    stop = start + 2
                    f, s = token[start:stop]
                    token[start:stop] = s, f
                # add the word to the line
                misspelled.append(''.join(token) + ' ')
            # end the line
            misspelled.append('\n')
        return ''.join(misspelled)


if __name__ == '__main__':
    # example usage of the misspell class
    message = """
According to research at an English University, it doesn't matter
in what order the letters in a word are, the only important thing is
that the first and last letters be in the right places. The rest can
be a total mess and you can still read it without problem. This is
because the human mind does not read every letter by itself, but
the word as a whole."""
    msg = misspell()
    print(msg.misspell(message))
The output of this example:
Accoridng to reseacrh at an Engilsh Univeristy, it does'nt matetr in waht odrer the lettres in a wrod are, the olny imoprtant thnig is taht the frist and lsat lteters be in the rihgt palces. The rset can be a ttoal mses and you can stlil raed it wihtout prolbem. Tihs is becasue the hmuan mnid deos not raed evrey lteter by istelf, but the wrod as a whloe.
Why would you want to do this? Well, it could be used to avoid detection by a Bayesian or signature-based message filter. No, I'm not a spammer; I just found this interesting.
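To make the checksum point concrete, here is a minimal sketch in Python 3 rather than the recipe itself. The helper names `swap_inner_pair`, `mangle` and `checksum` are made up for this illustration, and SHA-256 is only an assumed stand-in for whatever signature a filter might actually compute. It mangles the same sentence several times and prints a digest for each variant.

```python
import hashlib
import random


def swap_inner_pair(word, rng):
    """Swap one random adjacent pair of inner characters (hypothetical helper)."""
    if len(word) < 4:
        return word  # too short: leave untouched
    chars = list(word)
    # pick a start index so that neither the first nor the last char moves
    i = rng.randint(1, len(chars) - 3)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)


def mangle(text, rng):
    """Mangle every whitespace-separated word of text."""
    return ' '.join(swap_inner_pair(word, rng) for word in text.split())


def checksum(text):
    """Hex digest used here as an assumed stand-in for a message signature."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


original = "the first and last letters stay in the right places"
rng = random.Random(0)  # seeded only so the demo is repeatable
variants = [mangle(original, rng) for _ in range(5)]

print(checksum(original), original)
for variant in variants:
    print(checksum(variant), variant)
```

Every variant keeps the same letters per word, yet its digest differs from that of the original sentence, which is all an exact-match signature filter compares.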
nice. and I read the mangled paragraph with ease too!
Naive Bayes is still pretty smart. I doubt the algorithm's ability to fool a Bayesian classifier armed with a reasonably large corpus. At best, the algorithm will render many words unrecognizable as spam words, but it won't do anything to raise the count of non-spam words. At worst, a few of the misspellings of shorter words will be unique to spam messages (for instance, "raed" is a typo that would only show up in a mangled message). Further, a high percentage of unrecognizable words is its own cue that the message is spam (you can throw a blanket over the camel but can't hide its humps).
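The last cue can be sketched in a few lines. This is not a Bayesian classifier, just a rough illustration of the "too many unknown words" signal; the function name `unknown_ratio` and the tiny vocabulary are assumptions made up for the example, and the mangled phrase is taken from the sample output above.

```python
def unknown_ratio(text, vocabulary):
    """Fraction of alphabetic tokens that are not in a known-word set."""
    # strip trailing/leading punctuation and normalise case
    tokens = [t.strip(",'.:;!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t.isalpha()]
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in vocabulary)
    return unknown / float(len(tokens))


# toy vocabulary; a real filter would use a large corpus
vocabulary = set("you can still read it without problem".split())

clean = "you can still read it without problem"
mangled = "you can stlil raed it wihtout prolbem"

print(unknown_ratio(clean, vocabulary))    # every word is known
print(unknown_ratio(mangled, vocabulary))  # 4 of 7 words are unknown
```

A filter could treat a high ratio as a feature in its own right, no matter which individual words were hidden.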
Other use... Some methods of cryptanalysis look for common letter combinations such as 'th' and 'qu'. This can be used to throw a little "speed bump" into that.
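As a rough illustration of that idea, the sketch below counts letter pairs within words before and after mangling. The helper name `bigram_counts` is hypothetical, and the mangled phrase is copied from the sample output above.

```python
from collections import Counter


def bigram_counts(text):
    """Count adjacent letter pairs inside each word (hypothetical helper)."""
    counts = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts


clean = "the only important thing is that the first and last letters"
mangled = "the olny imoprtant thnig is taht the frist and lsat lteters"

for pair in ("th", "ht"):
    print(pair, bigram_counts(clean)[pair], bigram_counts(mangled)[pair])
```

Here "th" drops from 4 occurrences to 3 and the unusual pair "ht" appears once. Short words such as "the" are never touched, so the most common pairs largely survive, which fits the "speed bump" description.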
Well done. Well done. A nice class which performs the algorithm efficiently.