
The Python mailbox.mbox class requires a real file to initialize, which was a problem in my case. These simple functions let you iterate through a mailbox read from a read-only file descriptor (like sys.stdin).

This script uses generators, which were introduced in Python 2.2. Let me know if you are interested in similar functionality for older Python versions.

'''
A simple mbox read-only mailbox generator.

Usage:
import sys
for msg in mboxo_generator(sys.stdin):
        print(msg['Subject'])
'''
# uncomment the following line on Python 2.2
#from __future__ import generators

import email.parser


def mboxo_generator(input, parser=email.parser.Parser()):
        '''Yield each message found in ``input`` in ``mboxo`` / ``mboxrd`` format
        '''
        # ``input`` may be any iterable of lines (an open file, sys.stdin, ...)
        data = []
        for line in input:
                # a "From " at the start of a line begins a new message
                if line[:5] == 'From ' and data:
                        yield parser.parsestr(''.join(data))
                        data = []
                data.append(line)
        # iterating over a file never yields '', so flush the last message here
        if data:
                yield parser.parsestr(''.join(data))


def mboxcl_generator(input, parser=email.parser.Parser()):
        '''Yield each message found in ``input`` in ``mboxcl`` / ``mboxcl2`` format

        Do *not* use the "From " delimiter but *only* the ``Content-Length``
        header; if this field appears several times in the headers, the last
        one prevails, and if the field is missing an AssertionError is raised.
        '''
        # ``input`` may be any iterable of lines (an open file, sys.stdin, ...)
        content_length = None
        length = 0
        in_header = None
        data = []
        for line in input:
                if in_header is None:
                        if line == '\n':
                                # eat empty lines before headers
                                # (usually between messages)
                                continue
                        in_header = True

                data.append(line)

                if in_header:
                        if line == '\n':
                                assert content_length is not None, 'header Content-Length not found (not an mboxcl file?)'
                                in_header = False
                        elif line[:16] == 'Content-Length: ':
                                content_length = int(line[16:].rstrip())
                else:
                        length += len(line)
                        assert length <= content_length, 'body longer than Content-Length'
                # also true right after the headers when Content-Length is 0
                if in_header is False and length == content_length:
                        yield parser.parsestr(''.join(data))
                        data = []
                        in_header = None
                        content_length = None
                        length = 0
        assert not length, 'truncated message body'

After a few years I had to parse some mbox files again, and this time they were not in the plain old mboxo/mboxrd format (the only one supported by mailbox.mbox in the Python stdlib); so here it is, an mboxcl/mboxcl2 generator!

Please note that while these functions emit messages from an mbox, they do not transform the message body (for instance, they do not undo "From quoting"); this exercise is left to the parser, which you can subclass.
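As an illustration of the "parser you can subclass" idea, here is a minimal sketch of a parser that undoes mboxrd-style "From quoting" after parsing. The class name UnquotingParser and the regular expression are my own assumptions, not part of the recipe; it targets Python 3, only touches plain string payloads (multipart messages are left alone), and would be ambiguous on mboxo files, where a body line that originally read ">From " cannot be told apart from a quoted one.

```python
import email.parser
import re

# mboxrd quoting adds one ">" in front of every body line matching ">*From ";
# stripping exactly one ">" from such lines reverses it.
_QUOTED_FROM = re.compile(r'^>(>*From )', re.MULTILINE)


class UnquotingParser(email.parser.Parser):
    '''Hypothetical Parser subclass that removes mboxrd "From quoting".'''

    def parsestr(self, text, headersonly=False):
        msg = super().parsestr(text, headersonly)
        payload = msg.get_payload()
        # Only plain string bodies are rewritten; multipart payloads
        # (lists of sub-messages) are returned untouched.
        if isinstance(payload, str):
            msg.set_payload(_QUOTED_FROM.sub(r'\1', payload))
        return msg


raw = (
    'From nobody Fri Nov 28 14:54:33 2008\n'
    'Subject: test\n'
    '\n'
    'Hello,\n'
    '\n'
    '>From a friend for a test\n'
    '>>From an already quoted line\n'
    '\n'
    'Cheers\n'
)
msg = UnquotingParser().parsestr(raw)
print(msg.get_payload())
```

Such a parser could then be passed to the generators above, e.g. mboxo_generator(sys.stdin, parser=UnquotingParser()).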

For more information on these different formats, and the body transformations which may apply, you can refer to: http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/mail-mbox-formats.html

2 comments

Matthias Kluwe 15 years, 4 months ago

The separation of messages by the string "From " at the beginning of the line may not be reliable: A message body can contain this string as well.

Romain Dartigues (author) 15 years, 4 months ago

Hi Matthias.

The separation of messages by the string "From " at the beginning of the line may not be reliable: A message body can contain this string as well.

As far as I know, no. My understanding of the mbox(5) format is that any line which begins with the four characters "From" followed by a space must be "quoted" (or escaped if you prefer) by adding a leading ">" character.

Example:

>>> import email.message
>>> msg = email.message.Message()
>>> msg['Subject'] = 'test'
>>> msg.set_payload('Hello,\n\nFrom a friend for a test\n\nCheers')
>>> print msg
From nobody Fri Nov 28 14:54:33 2008
Subject: test

Hello,

>From a friend for a test

Cheers
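
The session above is from Python 2. On Python 3, print(msg) no longer includes the "From " envelope line, so a rough equivalent has to ask for it explicitly. The sketch below is one way to do that, assuming the standard email.generator.Generator with mangle_from_=True; the fixed envelope line is set only to keep the output reproducible.

```python
import io
import email.generator
import email.message

msg = email.message.Message()
msg['Subject'] = 'test'
msg.set_payload('Hello,\n\nFrom a friend for a test\n\nCheers')
# Assumption for reproducibility: reuse the envelope line from the session
# above instead of letting the generator stamp the current time.
msg.set_unixfrom('From nobody Fri Nov 28 14:54:33 2008')

buf = io.StringIO()
# mangle_from_=True requests the ">From " quoting of body lines;
# unixfrom=True requests the leading "From " envelope line.
gen = email.generator.Generator(buf, mangle_from_=True)
gen.flatten(msg, unixfrom=True)
text = buf.getvalue()
print(text)
```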

Well, to make it short, I copied the mailbox.mbox._generate_toc() function :)