This recipe shows a simple approach to using the Python email package to strip potentially dangerous attachments and file types out of an email message. This is particularly relevant in Python 2.4, as the email Parser is now much more robust in handling malformed messages (which are typical of virus and worm emails).

Python, 64 lines
ReplaceString = """

This message contained an attachment that was stripped out. 

The original type was: %(content_type)s
The filename was: %(filename)s, 
(and it had additional parameters of:
%(params)s)

"""

import re
BAD_CONTENT_RE = re.compile('application/(msword|msexcel)', re.I)
BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$', re.I)

def sanitise(msg):
    # Strip out all payloads of a particular type
    ct = msg.get_content_type()
    # We also want to check for bad filename extensions
    fn = msg.get_filename()
    # get_filename() returns None if there's no filename
    if BAD_CONTENT_RE.search(ct) or (fn and BAD_FILEEXT_RE.search(fn)):
        # Ok. This part of the message is bad, and we're going to stomp
        # on it. First, though, we pull out the information we're about to
        # destroy so we can tell the user about it.

        # This returns the parameters to the content-type. The first entry
        # is the content-type itself, which we already have.
        params = msg.get_params()[1:] 
        # The parameters are a list of (key, value) pairs - join the
        # key-value with '=', and the parameter list with ', '
        params = ', '.join([ '='.join(p) for p in params ])
        # Format up the replacement text, telling the user we ate their
        # email attachment.
        replace = ReplaceString % dict(content_type=ct, 
                                       filename=fn, 
                                       params=params)
        # Install the text body as the new payload.
        msg.set_payload(replace)
        # Now we manually strip away any parameters to the content-type
        # header. Again, we skip the first parameter, as it's the 
        # content-type itself, and we'll stomp that next.
        for k, v in msg.get_params()[1:]:
            msg.del_param(k)
        # And set the content-type appropriately.
        msg.set_type('text/plain')
        # Since we've just stomped the content-type, we also kill these
        # headers - they make no sense otherwise.
        del msg['Content-Transfer-Encoding']
        del msg['Content-Disposition']
    else:
        # Now we check for any sub-parts to the message
        if msg.is_multipart():
            # Call the sanitise routine on any subparts
            payload = [ sanitise(x) for x in msg.get_payload() ]
            # We replace the payload with our list of sanitised parts
            msg.set_payload(payload)
    # Return the sanitised message
    return msg

# And a simple driver to show how to use this
import email, sys
m = email.message_from_file(open(sys.argv[1]))
print(sanitise(m))

I've seen this come up a few times on comp.lang.python, so here's a cookbook entry for it. This recipe shows how to read in an email message, strip out any dangerous or suspicious attachments, and replace them with a harmless text message informing the user of this.

This is particularly important if the end-users are using something like Outlook, which is targeted by unpleasant virus and worm messages on a daily basis.

The email parser in Python 2.4 has been completely rewritten to be robust first and correct second; before this, the parser was written for correctness first. That was a problem, because many viruses and worms send messages that are broken and non-conformant, which made the old email parser choke and die. The new parser is designed never to fail while reading a message; instead, it tries its best to fix up whatever it can. (If you have a message that causes the parser to crash, please let us know: that's a bug, and we'll fix it.)
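As a rough illustration of "robust first", the sketch below feeds the parser a message that claims to be multipart but never declares a boundary. It assumes a current Python 3 email package (the recipe itself targets Python 2.4), and the sample message is made up. Rather than raising, the parser is expected to note the problem in the message's defects list and carry on.

```python
import email
from email import errors

# A deliberately broken message (invented for this sketch): it claims to
# be multipart but never declares a boundary parameter.
raw = (
    "From: worm@example.com\n"
    "Subject: broken\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "this body can never be split into parts\n"
)

# Parsing does not raise; problems are recorded on the message object.
msg = email.message_from_string(raw)

has_boundary_defect = any(
    isinstance(d, errors.NoBoundaryInMultipartDefect) for d in msg.defects
)
print("boundary defect recorded:", has_boundary_defect)
print("treated as multipart:", msg.is_multipart())
```

A filter could treat a non-empty defects list as one more sign that a message deserves suspicion.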

The code itself is heavily commented, and should be easy enough to follow. A mail message consists of one or more parts, each of which can contain nested parts. We call the sanitise() function on the top-level Message object, and it calls itself recursively on the sub-objects. The sanitise() function checks the Content-Type of the part and, if the part has a filename, checks that too against a known-to-be-bad list.
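The per-part test can be tried on its own. The sketch below is written for Python 3 and uses a hypothetical helper, is_bad(), together with an invented three-part message; the patterns are in the spirit of the recipe's, and matching extensions case-insensitively is an assumption made here. It uses walk() only to list the parts that would be flagged, without modifying anything.

```python
import re
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

BAD_CONTENT_RE = re.compile('application/(msword|msexcel)', re.I)
BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$', re.I)

def is_bad(part):
    """Hypothetical helper: the check sanitise() applies to a single part."""
    fn = part.get_filename()  # None when the part carries no filename
    return bool(BAD_CONTENT_RE.search(part.get_content_type())
                or (fn and BAD_FILEEXT_RE.search(fn)))

# An invented message: one harmless text part and two suspicious ones.
outer = MIMEMultipart()
outer.attach(MIMEText("hello"))
exe = MIMEApplication(b"MZ", _subtype="octet-stream")
exe.add_header("Content-Disposition", "attachment", filename="game.exe")
outer.attach(exe)
outer.attach(MIMEApplication(b"not really a document", _subtype="msword"))

# walk() visits the container first, then every nested part in order.
flagged = [(p.get_content_type(), p.get_filename())
           for p in outer.walk() if is_bad(p)]
print(flagged)
```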

If the message part is bad, we replace its payload with a short text description of the now-removed part, and clean out the headers that are relevant. We set this message part's Content-Type to 'text/plain', and remove the other headers that relate to the now-removed content.

Otherwise, we check whether the message is a multipart message. If so, it has sub-parts, and we recursively call the sanitise function on each of those. We then replace the payload with our list of sanitised sub-parts.
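To see the recursion on a message that mixes good and bad attachments, here is a trimmed-down sketch for Python 3. It is not the recipe's full function: it filters on filename extension only, the notice text is shortened, and the attachment() helper and sample parts are invented for the demonstration.

```python
import re
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Assumption: a reduced filter that looks only at filename extensions.
BAD_FILEEXT_RE = re.compile(r'\.(exe|zip|pif|scr|ps)$', re.I)
NOTICE = "An attachment named %s was stripped from this message.\n"

def sanitise(msg):
    fn = msg.get_filename()
    if fn and BAD_FILEEXT_RE.search(fn):
        # Bad part: swap the payload for a notice and reset the headers.
        msg.set_payload(NOTICE % fn)
        for key, _value in msg.get_params()[1:]:
            msg.del_param(key)
        msg.set_type('text/plain')
        del msg['Content-Transfer-Encoding']
        del msg['Content-Disposition']
    elif msg.is_multipart():
        # Container: recurse into each sub-part and keep the results.
        msg.set_payload([sanitise(part) for part in msg.get_payload()])
    return msg

def attachment(data, subtype, filename):
    """Hypothetical helper that builds one attachment part for the demo."""
    part = MIMEApplication(data, _subtype=subtype)
    part.add_header('Content-Disposition', 'attachment', filename=filename)
    return part

outer = MIMEMultipart()
outer.attach(MIMEText("See attached."))
outer.attach(attachment(b"MZ", "octet-stream", "game.exe"))
outer.attach(attachment(b"%PDF-1.4", "pdf", "report.pdf"))

sanitise(outer)
summary = [(p.get_content_type(), p.get_filename())
           for p in outer.get_payload()]
print(summary)
```

Only the part that matches the filter is rewritten; its siblings keep their original type, filename, and payload.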

Extensions, further work, etc:

Instead of destroying the attachment, it would be a small amount of work to instead store the attachment away in a directory, and supply the user with a link to the file.
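One way this might look is sketched below for Python 3. The quarantine() helper, the filename clean-up rule, and the wording of the notice are all assumptions made for illustration; a real deployment would also need to decide where the directory lives and how the saved path becomes a link the user can follow.

```python
import os
import re
import tempfile
from email.mime.application import MIMEApplication

def quarantine(part, directory):
    """Hypothetical helper: save one non-multipart part, return its path."""
    # The sender controls the filename, so keep only harmless characters.
    safe_name = re.sub(r'[^A-Za-z0-9._-]', '_',
                       part.get_filename() or 'unnamed')
    # mkstemp() picks a unique name, so repeated filenames never collide.
    fd, path = tempfile.mkstemp(prefix='stripped-', suffix='-' + safe_name,
                                dir=directory)
    with os.fdopen(fd, 'wb') as out:
        # decode=True undoes base64 or quoted-printable transfer encoding.
        out.write(part.get_payload(decode=True))
    return path

# Demo with a throwaway directory and a made-up attachment.
data = b"MZ\x90\x00 pretend executable"
part = MIMEApplication(data)
part.add_header('Content-Disposition', 'attachment',
                filename='../../game.exe')

demo_dir = tempfile.mkdtemp()
saved_path = quarantine(part, demo_dir)
notice = "The attachment was removed and stored at: %s\n" % saved_path
print(notice)
```

Inside sanitise(), such a helper would be called just before set_payload(), and the returned path would go into the replacement text.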

You could add other filters to the sanitise() code, for instance checking other headers for known signs of worm or virus messages, or removing all large PowerPoint files sent to you by your marketing department, if that's what you want to do.
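A size-based filter of that kind could be sketched as follows, again for Python 3. The helper name, the byte limit, and the choice to match only the application/vnd.ms-powerpoint type are assumptions; the limit is kept tiny here so the demonstration triggers.

```python
from email.mime.application import MIMEApplication

MAX_BYTES = 1024  # hypothetical limit, deliberately small for the demo

def is_oversized_presentation(part, limit=MAX_BYTES):
    """Hypothetical extra check that sanitise() could apply to each part."""
    if part.is_multipart():
        return False
    if part.get_content_type() != 'application/vnd.ms-powerpoint':
        return False
    # Measure the decoded attachment, not its base64 text.
    data = part.get_payload(decode=True) or b''
    return len(data) > limit

big = MIMEApplication(b'x' * 5000, _subtype='vnd.ms-powerpoint')
small = MIMEApplication(b'x' * 10, _subtype='vnd.ms-powerpoint')

print(is_oversized_presentation(big))
print(is_oversized_presentation(small))
```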

8 comments

Hans-Peter Jansen 19 years, 8 months ago

some good and some bad attachments. Being in the python email filter business for quite some years now, I have to admit, that this is a very nice receipt!

While studying it, I asked myself what will happen if a multipart mail contains some good and some bad attachments (given that one wants to keep the good ones...).

As far as I understand the script, any good attachment will be lost in presence of a bad one, since then the type is forced to text/plain or am I plain wrong?

Dan Perl 19 years, 8 months ago

Re: some good and some bad attachments. I'm not sure I understand your example and I am not familiar with processing email. But anyway, thanks for the 'nice receipt' comment, I understood that.

If I get it right, you are thinking of a handler stack filtering one email, with a different handler for each attachment type. This is as opposed to the handler stack filtering each attachment at a time.

With that assumption, a filter that detects a bad attachment doesn't have to return 'False' and thus stop all the other filters above it from processing the email. It can return 'True' and if it cannot just remove the bad attachment (here is where my lack of knowledge on processing email shows) it can at least pass some data up to the other handlers. You then need a handler at the top of the stack that uses that data and knows what to do with bad attachments, independently of their type.

Does this answer your comment?

Hamish Lawson 19 years, 8 months ago

Mixed-up comments. "I'm not sure I understand your example and I am not familiar with processing email." I think Peter-Hans's comments actually belong to another recipe (#302086). There have been a number of occasions recently where the Cookbook system has mixed up comments in this way.

Hamish Lawson 19 years, 8 months ago

Sorry, Hans-Peter. ... for getting your name the wrong way round.

Dan Perl 19 years, 8 months ago

Re: Mixed-up comments. You're right. Hans-Peter's comment showed up in my recipe (#302422). All these comments now show in both recipes.

And I thought I got a positive comment. I'm crushed. ;-)

Hans-Peter Jansen 19 years, 8 months ago

#302422 is very nice, too. Obviously, all comments appear in both recipes: #302422 and #302086 (at least). Until ActiveState fix up this mess, I urge everybody to mention the commented recipe in the title.

@Hamish: Thanks for uncovering this problem. I'm not sure I would have figured this out myself.

@Dan: This hazard brought your recipe to my attention, and will soon fit some of my problems to solve, I'm sure ;-)

I very much like the idea of concatenating handler objects via an overloaded __add__ method. Well done! I vote for including your recipe into the second edition and hope, this comment will rectify the confusion a bit.

Pete

anthony baxter (author) 19 years, 8 months ago

multipart. I assume you mean what will happen if there's a multipart/alternative with one good and one bad subpart. In that case, the bad subpart will be replaced with a text/plain saying "Neener neener I ate your attachment" (or whatever value you choose to use for the ReplaceString).

You _could_ make it modify the message so that if there's only a single "good" alternative left, it gets rid of the multipart/alternative and moves the good subpart into the enclosing Message directly. I'm not sure that's a good idea, as it hides information (that the message was modified).

Barry Jover 18 years, 11 months ago

what about not deleting the attachment?? How would one go about saving the attachment instead of replacing it with text?