I found a bug in importing EML files into Thunderbird with the ImportExportTools add-on: when an EML file is imported into TB, a 'From' line is added to the mbox, followed by the contents of the EML file. For messages fetched from mail servers, TB itself writes a correctly formatted 'From' line:
From - Tue Apr 27 19:42:22 2010
ImportExportTools formats this line incorrectly. I suppose it uses a system date function with the default format specifier, so this is what I saw in the mbox file:
From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)
So there are two errors: 1) the 'time year' sequence is swapped to 'year time'; 2) the time is followed by extra data: the GMT offset and the time zone name.
This prevents the mbox file from being parsed with the Python standard library (for example), because the regexp for matching the 'From' line is hardcoded there (file lib/mailbox.py, class UnixMailbox):
_fromlinepattern = r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+" \
r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
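To see the effect of that pattern, here is a minimal sketch that applies it to the two sample lines from this description. It is written for Python 3; the names FROM_LINE_PATTERN and is_valid_from_line are only illustrative and are not part of mailbox.py.

```python
import re

# Pattern text copied from the description (lib/mailbox.py, class UnixMailbox).
# FROM_LINE_PATTERN and is_valid_from_line are hypothetical names for this sketch.
FROM_LINE_PATTERN = (
    r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+"
    r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
)
_from_line_re = re.compile(FROM_LINE_PATTERN)


def is_valid_from_line(line):
    """Return True if the line matches the hardcoded 'From' pattern."""
    return _from_line_re.match(line) is not None


good = "From - Tue Apr 27 19:42:22 2010"
bad = "From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)"

print(is_valid_from_line(good))  # True
print(is_valid_from_line(bad))   # False
```

The broken line fails because the pattern expects the time right after the day of month, but finds the year there instead.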
The attached script fixes the incorrect 'From' lines so that these mbox files can be parsed with the Python standard library.
# -*- coding: Windows-1251 -*-
'''
fix_mbox_from.py
Utility for fixing incorrect 'From' line after batch .EML files import
via Thunderbird's ImportExportTools, versions up to and including 2.3.2.1.
2010-05-01 bug report sent to addon author.
mailbox.py (Python 2.4.5) pattern for matching 'From' line:
_fromlinepattern = r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+" \
r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
Correct 'From':
From - Tue Apr 27 19:42:22 2010
Broken 'From':
From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)
'''
import sys
import re
import os
__author__ = 'Denis Barmenkov <denis.barmenkov@gmail.com>'
__source__ = 'http://code.activestate.com/recipes/577214-fix-mbox-files-after-importing-eml-into-tb-using-i/'
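# Groups: 1 = 'From' prefix through day of month, 2 = year, 3 = time.
# The trailing 'GMT+hhmm (zone name)' part is matched but not captured;
# [+-] also accepts zones west of GMT (GMT-hhmm).
bad_pattern_text = r"^(From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d)\s+" \
                   r"(\d\d\d\d)\s+(\d?\d:\d\d(:\d\d)?)\s+" \
                   r"GMT[+-]\d\d\d\d\s+\([^\)]+\)\s*$"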
bad_pattern = re.compile(bad_pattern_text)
mbox_fn = sys.argv[1]
print 'File: %s' % mbox_fn
temp_fn = mbox_fn + '.temp'
orig_fn = mbox_fn + '.source'
assert not os.path.exists(orig_fn)
#src_size = os.path.getsize(mbox_fn)
fsrc = open(mbox_fn, 'r')
fdest = open(temp_fn, 'w')
fix_count = 0
for line_index, rawline in enumerate(fsrc):
    #if line_index % 100 == 0:
    #    pos = fsrc.tell()
    #    print '%d%%,' % (100 * pos // src_size),
    line = rawline.splitlines()[0]
    m = bad_pattern.match(line)
    if m:
        line = '%s %s %s' % m.group(1, 3, 2)
        fix_count += 1
    fdest.write(line + '\n')
print
print 'Fixed %s "From" lines' % fix_count
fdest.close()
fsrc.close()
os.rename(mbox_fn, orig_fn)
os.rename(temp_fn, mbox_fn)
Simple line-by-line filter :).
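For readers on Python 3, the same filter could be sketched as a function that works on any pair of text streams, which makes it easy to try on in-memory data before touching a real mbox file. This is only a sketch and not the script itself: the names BAD_FROM and fix_from_lines are hypothetical, and the [+-] sign class is an assumption meant to also cover zones west of GMT.

```python
import io
import re

# Same structure as the script's bad pattern; BAD_FROM is a hypothetical name.
# Assumption: the offset may be GMT+hhmm or GMT-hhmm.
BAD_FROM = re.compile(
    r"^(From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d)\s+"
    r"(\d\d\d\d)\s+(\d?\d:\d\d(:\d\d)?)\s+"
    r"GMT[+-]\d\d\d\d\s+\([^\)]+\)\s*$"
)


def fix_from_lines(src, dest):
    """Copy lines from src to dest, repairing broken 'From' lines.

    Returns the number of lines that were rewritten.
    """
    fixed = 0
    for rawline in src:
        line = rawline.rstrip("\r\n")
        m = BAD_FROM.match(line)
        if m:
            # Reorder to: prefix, time, year; the GMT tail is dropped.
            line = "%s %s %s" % m.group(1, 3, 2)
            fixed += 1
        dest.write(line + "\n")
    return fixed


sample = io.StringIO(
    "From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
out = io.StringIO()
count = fix_from_lines(sample, out)
print(count)
print(out.getvalue().splitlines()[0])
```

With real files, src and dest would be opened file objects, and the rename step from the script would still be needed afterwards.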