This recipe shows a simple approach to using the Python email package to strip attachments with potentially dangerous content types or filename extensions out of an email message. This is particularly relevant in Python 2.4, as the email Parser is now much more robust in handling malformed messages (which are typical for virus and worm emails).
ReplaceString = """

This message contained an attachment that was stripped out.

The original type was: %(content_type)s
The filename was: %(filename)s,
(and it had additional parameters of:
%(params)s)

"""

import re
BAD_CONTENT_RE = re.compile('application/(msword|msexcel)', re.I)
BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$')

def sanitise(msg):
    # Strip out all payloads of a particular type
    ct = msg.get_content_type()
    # We also want to check for bad filename extensions
    fn = msg.get_filename()
    # get_filename() returns None if there's no filename
    if BAD_CONTENT_RE.search(ct) or (fn and BAD_FILEEXT_RE.search(fn)):
        # Ok. This part of the message is bad, and we're going to stomp
        # on it. First, though, we pull out the information we're about to
        # destroy so we can tell the user about it.

        # This returns the parameters to the content-type. The first entry
        # is the content-type itself, which we already have.
        params = msg.get_params()[1:]
        # The parameters are a list of (key, value) pairs - join the
        # key-value with '=', and the parameter list with ', '
        params = ', '.join([ '='.join(p) for p in params ])
        # Format up the replacement text, telling the user we ate their
        # email attachment.
        replace = ReplaceString % dict(content_type=ct,
                                       filename=fn,
                                       params=params)
        # Install the text body as the new payload.
        msg.set_payload(replace)
        # Now we manually strip away any parameters to the content-type
        # header. Again, we skip the first parameter, as it's the
        # content-type itself, and we'll stomp that next.
        for k, v in msg.get_params()[1:]:
            msg.del_param(k)
        # And set the content-type appropriately.
        msg.set_type('text/plain')
        # Since we've just stomped the content-type, we also kill these
        # headers - they make no sense otherwise.
        del msg['Content-Transfer-Encoding']
        del msg['Content-Disposition']
    else:
        # Now we check for any sub-parts to the message
        if msg.is_multipart():
            # Call the sanitise routine on any subparts
            payload = [ sanitise(x) for x in msg.get_payload() ]
            # We replace the payload with our list of sanitised parts
            msg.set_payload(payload)
    # Return the sanitised message
    return msg

# And a simple driver to show how to use this.
# The first command-line argument is the path to the message file.
import email, sys
m = email.message_from_file(open(sys.argv[1]))
# print(...) with a single argument behaves the same on Python 2 and 3
print(sanitise(m))
I've seen this come up a few times on comp.lang.python, so here's a cookbook entry for it. This recipe shows how to read in an email message, strip out any dangerous or suspicious attachments, and replace them with a harmless text message informing the user of this.
This is particularly important if the end-users are using something like Outlook, which is targeted by unpleasant virus and worm messages on a daily basis.
The email parser in Python 2.4 has been completely rewritten to be robust first, correct second - prior to this, the parser was written for correctness first. This was a problem, because many virus and worm messages are broken and non-conformant, which made the old email parser choke and die. The new parser is designed to never actually break when reading a message - instead it tries its best to fix up whatever it can in the message. (If you have a message that causes the parser to crash, please let us know - that's a bug, and we'll fix it.)
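To see this robustness in action, here is a minimal sketch that parses a deliberately broken message: it claims to be multipart but declares no boundary. It assumes a current Python 3 interpreter with the default parser policy, and the sample message text is made up for illustration. Rather than raising an exception, the parser hands back a Message object and records what it noticed in the message's defects list.

```python
import email
import email.errors

# A made-up malformed message: multipart/mixed with no boundary parameter.
BROKEN = (
    "From: worm@example.com\n"
    "Subject: broken\n"
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "This body has no MIME boundaries at all.\n"
)

msg = email.message_from_string(BROKEN)

# Parsing did not raise; the problems are recorded on the message instead.
defect_names = [type(d).__name__ for d in msg.defects]
print(defect_names)

# With no usable boundary, the body is kept as a plain string payload.
print(msg.is_multipart())
```

A sanitiser could treat a non-empty defects list as one more warning sign about a message.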
The code itself is heavily commented, and should be easy enough to follow. A mail message consists of one or more parts - these can each contain nested parts. We call the 'sanitise()' function on the top level Message object, and it calls itself recursively on the sub-objects. The sanitise() function checks the Content-Type of the part and, if there's a filename, checks that too, each against a known-to-be-bad list.
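The two checks can be tried out on a hand-built message. This is only a sketch, assuming Python 3 and the email.mime helper classes. The sample message and the describe_parts() helper are hypothetical, and it uses Message.walk() to visit the parts instead of the recipe's explicit recursion.

```python
import re
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Same patterns as the recipe.
BAD_CONTENT_RE = re.compile('application/(msword|msexcel)', re.I)
BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$')


def describe_parts(msg):
    """Return (content_type, filename, is_bad) per part (hypothetical helper)."""
    rows = []
    # walk() yields the message itself, then every sub-part depth-first.
    for part in msg.walk():
        ct = part.get_content_type()
        fn = part.get_filename()  # None when the part has no filename
        bad = bool(BAD_CONTENT_RE.search(ct)
                   or (fn and BAD_FILEEXT_RE.search(fn)))
        rows.append((ct, fn, bad))
    return rows


# A made-up message: a text body plus an attachment named like an executable.
outer = MIMEMultipart()
outer.attach(MIMEText("Hello, see attached.", "plain"))
attachment = MIMEApplication(b"not really a program", "octet-stream")
attachment.add_header("Content-Disposition", "attachment",
                      filename="setup.exe")
outer.attach(attachment)

rows = describe_parts(outer)
for row in rows:
    print(row)
```

Note that the attachment's content type is harmless here. It is caught only by the filename check, which is why the recipe tests both.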
If the message part is bad, we replace its payload with a short text description of the now-removed part, and clean out the headers that are relevant. We set this message part's Content-Type to 'text/plain', and remove the other headers that relate to the now-removed content.
Otherwise, if the part is not itself bad, we check whether it is a multipart message. This means it has sub-parts, so we recursively call the sanitise function on each of those. We then replace the payload with our list of sanitised sub-parts.
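To watch the recursion work end to end, the sketch below runs a condensed variant of the routine over a message whose bad attachment sits two levels deep. It assumes Python 3. The names strip_bad_parts and NOTICE and the sample message are hypothetical, and the replacement text is shortened compared to the recipe.

```python
import re
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

BAD_CONTENT_RE = re.compile('application/(msword|msexcel)', re.I)
BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$')
NOTICE = "An attachment was stripped out (type %s, filename %s).\n"


def strip_bad_parts(msg):
    """Condensed variant of the recipe's sanitise(), for illustration only."""
    ct = msg.get_content_type()
    fn = msg.get_filename()
    if BAD_CONTENT_RE.search(ct) or (fn and BAD_FILEEXT_RE.search(fn)):
        # Bad part: swap the body for a notice and reset the headers.
        msg.set_payload(NOTICE % (ct, fn))
        for key, _value in msg.get_params()[1:]:
            msg.del_param(key)
        msg.set_type('text/plain')
        del msg['Content-Transfer-Encoding']
        del msg['Content-Disposition']
    elif msg.is_multipart():
        # Container: recurse into each sub-part, then store the results.
        msg.set_payload([strip_bad_parts(p) for p in msg.get_payload()])
    return msg


# A made-up message with an attachment nested two levels deep.
inner = MIMEMultipart()
inner.attach(MIMEText("Forwarded note.", "plain"))
exe = MIMEApplication(b"not really a program", "octet-stream")
exe.add_header("Content-Disposition", "attachment", filename="update.scr")
inner.attach(exe)

outer = MIMEMultipart()
outer.attach(MIMEText("Top-level body.", "plain"))
outer.attach(inner)

clean = strip_bad_parts(outer)
summary = [(p.get_content_type(), p.get_filename()) for p in clean.walk()]
print(summary)
```

The structure of the message is preserved: the nested container is still there, and only the offending leaf has been turned into a text/plain notice.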
Extensions, further work, etc:
Instead of destroying the attachment, it would be a small amount of work to instead store the attachment away in a directory, and supply the user with a link to the file.
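One possible shape for that is sketched below, assuming Python 3 and a local directory that some web server exposes. The quarantine_part() helper and the QUARANTINE_URL_BASE address are hypothetical. The file is stored under a generated name, because the sender-supplied filename is untrusted input and should not be used to build a path.

```python
import os
import tempfile
import uuid
from email.mime.application import MIMEApplication

# Hypothetical base URL under which the quarantine directory is served.
QUARANTINE_URL_BASE = "https://mail.example.com/quarantine/"


def quarantine_part(part, directory):
    """Save a part's decoded bytes under a generated name; return path and link.

    The sender-supplied filename is deliberately not used for the path.
    """
    # get_payload(decode=True) undoes base64/quoted-printable encoding;
    # it returns None for multipart containers, hence the fallback.
    data = part.get_payload(decode=True) or b""
    stored_name = uuid.uuid4().hex + ".bin"
    path = os.path.join(directory, stored_name)
    with open(path, "wb") as handle:
        handle.write(data)
    return path, QUARANTINE_URL_BASE + stored_name


# A made-up attachment to exercise the helper.
part = MIMEApplication(b"not really a program", "octet-stream")
part.add_header("Content-Disposition", "attachment", filename="setup.exe")

quarantine_dir = tempfile.mkdtemp()
saved_path, link = quarantine_part(part, quarantine_dir)
print(saved_path, link)
```

The helper would be called just before the payload is replaced, and the returned link could be added to the replacement text next to the original filename.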
You could add other filters into the sanitise() code - for instance, checking other headers for known signs of worm or virus messages, or removing all large PowerPoint files sent to you by your marketing department, if that's what you want to do.
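As a sketch of what such an extra filter might look like, here is a size check for presentation files, assuming Python 3. The is_oversized_presentation() helper, the MAX_BYTES limit and the list of extensions are all hypothetical choices, not part of the recipe.

```python
import re
from email.mime.application import MIMEApplication

# Hypothetical policy: presentations over this many decoded bytes are unwanted.
MAX_BYTES = 1024 * 1024
PRESENTATION_RE = re.compile(r'\.(ppt|pptx|pps)$', re.I)


def is_oversized_presentation(part, limit=MAX_BYTES):
    """Return True for a PowerPoint-looking part larger than limit bytes."""
    fn = part.get_filename()
    looks_like_ppt = (
        part.get_content_type() == "application/vnd.ms-powerpoint"
        or bool(fn and PRESENTATION_RE.search(fn))
    )
    if not looks_like_ppt:
        return False
    # Measure the decoded attachment, not its base64 text.
    data = part.get_payload(decode=True) or b""
    return len(data) > limit


# Two made-up attachments: one over the limit, one well under it.
big = MIMEApplication(b"x" * (2 * MAX_BYTES), "vnd.ms-powerpoint")
big.add_header("Content-Disposition", "attachment", filename="roadmap.ppt")
small = MIMEApplication(b"x" * 10, "vnd.ms-powerpoint")
small.add_header("Content-Disposition", "attachment", filename="slide.ppt")

print(is_oversized_presentation(big), is_oversized_presentation(small))
```

A predicate like this could be or'ed into the condition at the top of sanitise(), alongside the content-type and extension checks.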