
For good reasons, the email module's new feed parser can return a message that's internally inconsistent. This recipe fixes up one sort of inconsistency that I've seen in the wild.

Python, 89 lines
import email
import email.FeedParser
import re
import sys
import sgmllib

# How much of the text must be outside the ASCII range
# before we guess that it's a binary part. Threshold
# picked almost at random.
kGuessBinaryThreshold=0.2
kGuessBinaryRE=re.compile("[\\000-\\025\\200-\\377]") # Low control characters and non-ASCII bytes (three-digit octal escapes)

# How much of the text must be HTML tags before we guess
# that it's HTML. Threshold picked almost at random.
kGuessHTMLThreshold=0.05


# For stripping HTML tags. Very slightly modified from
# Alex Martelli's news post <9cpm4202cv1@news1.newsguy.com>
# of May 2, 2001, Subject: Stripping HTML tags from a string

class Cleaner(sgmllib.SGMLParser):
  entitydefs={"nbsp": " "} # I'll break if I want to

  def __init__(self):
    sgmllib.SGMLParser.__init__(self)
    self.result = []
  def do_p(self, *junk):
    self.result.append('\n')
  def do_br(self, *junk):
    self.result.append('\n')
  def handle_data(self, data):
    self.result.append(data)
  def cleaned_text(self):
    return ''.join(self.result)

def stripHTML(text):
  c=Cleaner()
  try:
    c.feed(text)
  except sgmllib.SGMLParseError:
    return text
  else:
    t=c.cleaned_text()
    return t


def guessIsBinary(text):
  lt=len(text)
  if lt==0:
    return False
  nMatches=float(len(kGuessBinaryRE.findall(text)))
  return nMatches/lt>=kGuessBinaryThreshold

# This does some relatively expensive parsing to
# try to figure out if the text is HTML. In cases
# in which it's used often, a simple regular
# expression would be faster and might be
# sufficiently accurate.
def guessIsHTML(text):
  lt=len(text)
  if lt==0:
    return False
  textWithoutTags=stripHTML(text)
  tagsChars=float(lt-len(textWithoutTags))
  if tagsChars==0:
    return False
  return tagsChars/lt>=kGuessHTMLThreshold

def getMungedMessage(openFile):
  openFile.seek(0)
  p=email.FeedParser.FeedParser()
  p.feed(openFile.read())
  m=p.close()

  # Fix up multipart content-type when message isn't multi-part
  if m.get_content_maintype()=="multipart" and not m.is_multipart():
    
    t=m.get_payload(decode=1)

    if guessIsBinary(t):
      # Use generic "opaque" type
      m.set_type("application/octet-stream")
    elif guessIsHTML(t):
      m.set_type("text/html")
    else:
      m.set_type("text/plain")

  return m

The feed parser is new in Python 2.4's email module. Its name comes from the fact that it maintains a buffer so that you don't have to give it all the text at once. Possibly more interesting is that it doesn't raise an error when it's called on malformed messages and instead tries to make some sense of them and return a useful email.Message object. That's useful because so much of mail is spam and so much of spam is malformed.
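The buffering described above can be sketched briefly. This is a minimal illustration, assuming a current Python 3, where the class is spelled email.parser.FeedParser rather than the 2.4-era email.FeedParser used in the recipe; the addresses and message text are made up for the example.

```python
from email.parser import FeedParser

# A small, well-formed message split at deliberately awkward places,
# as it might arrive from a socket. The content is hypothetical.
chunks = [
    "From: alice@example.com\nSub",
    "ject: hello\n",
    "\nFirst line of the body.\n",
    "Second line.\n",
]

parser = FeedParser()
for chunk in chunks:
    parser.feed(chunk)    # partial lines are buffered until completed
message = parser.close()  # close() returns the finished message object

print(message["Subject"])
print(message.get_payload())
print(message.defects)    # problems noticed while parsing, if any
```

Nothing is parsed into a final message until close() is called, so the caller never has to hold the whole text in one string.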

The flip side of the feed parser's tolerance for malformed messages is that it can return an email.Message object that's internally inconsistent. This recipe tries to make sense of one kind of inconsistency I've seen in the wild: a message whose content-type header says it is multipart but whose body isn't.
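To see the inconsistency directly, here is a small sketch, again assuming Python 3's email.parser.FeedParser. The message is a made-up one whose header claims multipart/mixed but whose body contains no boundary lines.

```python
from email.parser import FeedParser

# Hypothetical malformed message: a multipart header, a single plain body.
raw = (
    "Subject: claims to be multipart\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "Just one plain body, with no boundary lines at all.\n"
)

parser = FeedParser()
parser.feed(raw)
message = parser.close()  # no exception, despite the bad structure

# The header and the parsed structure disagree with each other.
print(message.get_content_maintype())  # what the header claims
print(message.is_multipart())          # what the parser actually built

# The parser records what it found wrong instead of raising.
defect_names = [type(d).__name__ for d in message.defects]
print(defect_names)
```

The combination tested by the recipe, a multipart main type together with is_multipart() returning False, is exactly what this message produces.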

The heuristics that the recipe uses to guess at the correct content-type are inevitably messy. Better that they live in examples than in Python's library.
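For readers on Python 3, where sgmllib no longer exists, the same heuristics might be sketched as follows. This is an adaptation under stated assumptions, not the recipe itself: html.parser.HTMLParser stands in for the sgmllib-based Cleaner, the payload is handled as bytes, and the names TextCollector, guess_is_binary, guess_is_html and fix_content_type are hypothetical. The thresholds and byte ranges are copied from the recipe.

```python
import re
from email import message_from_bytes
from html.parser import HTMLParser

BINARY_THRESHOLD = 0.2  # thresholds copied from the recipe
HTML_THRESHOLD = 0.05
# Same byte ranges as the recipe's pattern (octal 000-025 and 200-377).
BINARY_RE = re.compile(rb"[\x00-\x15\x80-\xff]")


class TextCollector(HTMLParser):
    """Collect character data only, discarding tags (stand-in for Cleaner)."""

    def __init__(self):
        super().__init__()
        self.result = []

    def handle_data(self, data):
        self.result.append(data)


def guess_is_binary(data):
    if not data:
        return False
    return len(BINARY_RE.findall(data)) / len(data) >= BINARY_THRESHOLD


def guess_is_html(data):
    if not data:
        return False
    text = data.decode("latin-1")  # one character per byte, never fails
    collector = TextCollector()
    collector.feed(text)
    collector.close()  # flush any character data still buffered
    tag_chars = len(text) - len("".join(collector.result))
    return tag_chars / len(text) >= HTML_THRESHOLD


def fix_content_type(message):
    """Retype a message that claims to be multipart but has a single body."""
    if message.get_content_maintype() == "multipart" and not message.is_multipart():
        data = message.get_payload(decode=True)
        if guess_is_binary(data):
            message.set_type("application/octet-stream")
        elif guess_is_html(data):
            message.set_type("text/html")
        else:
            message.set_type("text/plain")
    return message


HEADER = b"Content-Type: multipart/mixed\n\n"
html_msg = fix_content_type(
    message_from_bytes(HEADER + b"<html><body><p>Hello</p></body></html>\n"))
plain_msg = fix_content_type(
    message_from_bytes(HEADER + b"Just some ordinary text, nothing more.\n"))
print(html_msg.get_content_type(), plain_msg.get_content_type())
```

As in the recipe, the binary check runs first, so a payload dominated by high-bit bytes is never handed to the HTML parser.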

Edited to use now-preferred get_content_maintype().

1 comment

Matthew Cowles (author) 17 years, 2 months ago

It's not actually necessary to use FeedParser explicitly. Python 2.4's email module uses the FeedParser by default, so if you use the Parser class's parse() method or the message_from_file() or message_from_string() functions, you'll get the FeedParser's functionality for free.
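A quick way to check the point made in this comment, sketched for Python 3 (where the explicit class lives in email.parser), is to parse the same made-up malformed message both ways and compare the results:

```python
import email
from email.parser import FeedParser

# Hypothetical message: multipart header, but no boundary lines in the body.
raw = "Content-Type: multipart/mixed\n\nNo boundary lines here.\n"

# Convenience function; no FeedParser is mentioned anywhere.
via_function = email.message_from_string(raw)

# Explicit FeedParser, as in the recipe.
explicit = FeedParser()
explicit.feed(raw)
via_feed_parser = explicit.close()

for message in (via_function, via_feed_parser):
    print(message.get_content_maintype(), message.is_multipart())
```

Both routes accept the malformed input without raising and leave the body as a single string payload.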