I found a bug when importing EML files into Thunderbird with the ImportExportTools add-on: for each imported EML file, a 'From' line is added to the mbox, followed by the contents of the EML file. For messages fetched from mail servers, Thunderbird itself writes a correctly formatted 'From' line:

From - Tue Apr 27 19:42:22 2010

ImportExportTools formats this line incorrectly. I suppose it calls some system function with the default format specifier, because this is what I saw in the mbox file:

From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)

So there are two errors: 1) the 'time year' sequence is reversed into 'year time'; 2) a GMT offset and a time zone name are appended after the time.

This prevents parsing the mbox file with, for example, the Python standard library, because it uses a hardcoded regexp to match the From line (file lib/mailbox.py, class UnixMailbox):

_fromlinepattern = r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+" \
                   r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
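To make the mismatch concrete, here is a small check that runs the pattern quoted above against both sample lines. It is only a sketch: the names FROM_LINE_PATTERN and is_valid_from_line are made up for illustration and do not exist in mailbox.py.

```python
import re

# Pattern copied from the text above (lib/mailbox.py, class UnixMailbox).
# FROM_LINE_PATTERN and is_valid_from_line are illustrative names only.
FROM_LINE_PATTERN = (
    r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+"
    r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
)
from_line_re = re.compile(FROM_LINE_PATTERN)

# The two sample lines shown earlier in this recipe.
good_line = "From - Tue Apr 27 19:42:22 2010"
bad_line = "From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)"


def is_valid_from_line(line):
    """Return True if the line matches the standard library's From pattern."""
    return from_line_re.match(line) is not None


print(is_valid_from_line(good_line))
print(is_valid_from_line(bad_line))
```

The pattern expects the time right after the day of month, so the line written by ImportExportTools fails as soon as the parser meets the year in that position.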

The attached script fixes the incorrect From lines so that those mboxes can be parsed with the Python standard library.
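The core of the fix is a single regexp rewrite per line. The sketch below isolates that step. It is based on the bad_pattern_text from the script, with one assumption added: it accepts GMT- as well as GMT+ offsets, although only a GMT+0400 sample is shown here. The names BAD_FROM_RE and fix_from_line are hypothetical.

```python
import re

# Based on bad_pattern_text from the script. Accepting "GMT-" in addition to
# "GMT+" is an assumption; the recipe only shows a GMT+0400 sample.
BAD_FROM_RE = re.compile(
    r"^(From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d)\s+"  # group 1: 'From - Sat May 01'
    r"(\d\d\d\d)\s+(\d?\d:\d\d(:\d\d)?)\s+"            # group 2: year, group 3: time
    r"GMT[+-]\d\d\d\d\s+\([^\)]+\)\s*$"                # trailing offset and zone name
)


def fix_from_line(line):
    """Return the line with 'year time' swapped back to 'time year'.

    Lines that do not look like a broken From line are returned unchanged.
    """
    m = BAD_FROM_RE.match(line)
    if m is None:
        return line
    # Reassemble as: prefix, time, year. The GMT suffix is dropped.
    return "%s %s %s" % m.group(1, 3, 2)


broken = "From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)"
print(fix_from_line(broken))
```

For the broken sample line this should produce `From - Sat May 01 15:07:31 2010`, which has the same shape as the line Thunderbird writes itself.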

Python, 64 lines
# -*- coding: Windows-1251 -*-
'''
fix_mbox_from.py

  Utility for fixing incorrect 'From' lines after batch import of .EML files
  via Thunderbird's ImportExportTools, version 2.3.2.1 and earlier.

  2010-05-01 bug report sent to addon author.

  
mailbox.py (Python 2.4.5) pattern for matching 'From' line:

    _fromlinepattern = r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+" \
                       r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"

Correct 'From':
From - Tue Apr 27 19:42:22 2010

Broken 'From':
From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)
'''
import sys
import re
import os

__author__ = 'Denis Barmenkov <denis.barmenkov@gmail.com>'
__source__ = 'http://code.activestate.com/recipes/577214-fix-mbox-files-after-importing-eml-into-tb-using-i/'

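# Groups: 1 = 'From - Sat May 01', 2 = year, 3 = time.
# The offset sign is [+-] so that zones west of GMT are handled too.
bad_pattern_text = r"^(From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d)\s+" \
                   r"(\d\d\d\d)\s+(\d?\d:\d\d(:\d\d)?)\s+" \
                   r"GMT[+-]\d\d\d\d\s+\([^\)]+\)\s*$"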

bad_pattern = re.compile(bad_pattern_text)

mbox_fn = sys.argv[1]
print('File: %s' % mbox_fn)
temp_fn = mbox_fn + '.temp'
orig_fn = mbox_fn + '.source'
assert not os.path.exists(orig_fn)

#src_size = os.path.getsize(mbox_fn)

fsrc = open(mbox_fn, 'r')
fdest = open(temp_fn, 'w')

fix_count = 0
for line_index, rawline in enumerate(fsrc):
    #if line_index % 100 == 0:
    #    pos = fsrc.tell()
    #    print '%d%%,' % (100 * pos // src_size),
    line = rawline.splitlines()[0]
    m = bad_pattern.match(line)
    if m:
        line = '%s %s %s' % m.group(1, 3, 2)
        fix_count += 1
    fdest.write(line + '\n')
print('')
print('Fixed %s "From" lines' % fix_count)

fdest.close()
fsrc.close()

os.rename(mbox_fn, orig_fn)
os.rename(temp_fn, mbox_fn)
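The script above was written for Python 2. For readers on Python 3, here is a hedged sketch of the same line-by-line filter. It works on bytes so that message bodies in any encoding pass through untouched, and, unlike the script, it keeps each line's original line ending. The function name fix_mbox_stream is hypothetical, and accepting GMT- offsets is an assumption beyond the GMT+0400 sample shown in the recipe.

```python
import io
import re

# Bytes version of the broken-From pattern; [+-] for the offset sign is an
# assumption, since the recipe only shows a GMT+0400 example.
BAD_FROM_RE = re.compile(
    rb"^(From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d)\s+"
    rb"(\d\d\d\d)\s+(\d?\d:\d\d(:\d\d)?)\s+"
    rb"GMT[+-]\d\d\d\d\s+\([^\)]+\)\s*$"
)


def fix_mbox_stream(src, dest):
    """Copy binary stream src to dest, repairing broken From lines.

    Returns the number of lines that were fixed.
    """
    fix_count = 0
    for raw in src:
        body = raw.rstrip(b"\r\n")
        eol = raw[len(body):]  # keep the original line ending as-is
        m = BAD_FROM_RE.match(body)
        if m:
            # Reorder to: prefix, time, year. The GMT suffix is dropped.
            body = b" ".join(m.group(1, 3, 2))
            fix_count += 1
        dest.write(body + eol)
    return fix_count


# Usage with in-memory streams; real files would be opened with 'rb' and 'wb'.
sample = io.BytesIO(
    b"From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)\r\n"
    b"Subject: hi\r\n"
    b"\r\n"
    b"body\r\n"
)
out = io.BytesIO()
fixed = fix_mbox_stream(sample, out)
print(fixed)
```

Renaming the original file to a backup and moving the fixed copy into place, as the script does with os.rename, is left out of this sketch.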

1 comment

Denis Barmenkov (author) 11 years, 7 months ago

Simple line-by-line filter :).

Created by Denis Barmenkov on Sun, 2 May 2010 (GPL3)

Other Information and Tasks

  • Licensed under the GPL 3
  • Revision 2 (updated 11 years ago)