Fix mbox files after importing EML into TB using ImportExportTools

I found a bug when importing EML files into Thunderbird (TB) with the ImportExportTools add-on: for each imported EML file, a 'From' line is added to the mbox, followed by the contents of the EML file. TB itself writes a correct 'From' line for messages fetched from mail servers:

From - Tue Apr 27 19:42:22 2010

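ImportExportTools formats this line incorrectly. I suppose it uses some system function with its default format specifier (the result looks like the default output of JavaScript's Date.toString()), so this is what I saw in the mbox file: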

From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)

So there are two errors: 1) the 'time year' sequence is reversed into 'year time'; 2) extra data follows the time: the GMT offset and the time zone name in parentheses.
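
To make the two differences concrete, here is a minimal sketch (Python 3, not part of the original recipe) that parses the date portion of the broken line and writes it back in the order TB uses. The helper name `reorder_date` is hypothetical, and the sketch assumes English day and month abbreviations, which is what Python's default C locale expects.

```python
from datetime import datetime

BROKEN = "Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)"

def reorder_date(broken_date):
    """Hypothetical helper: convert the broken date text to TB's order."""
    # Error 2: everything from ' GMT' onward is extra, so drop it.
    date_part = broken_date.split(" GMT", 1)[0]
    # Error 1: the year precedes the time, so parse it in that order ...
    parsed = datetime.strptime(date_part, "%a %b %d %Y %H:%M:%S")
    # ... and write it back with the time before the year.
    return parsed.strftime("%a %b %d %H:%M:%S %Y")

print("From - " + reorder_date(BROKEN))
```

For the sample line this prints `From - Sat May 01 15:07:31 2010`. The script further down does the same reordering with a regular expression instead of date parsing.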

This prevents parsing the mbox file with the Python standard library (for example), because it uses a hardcoded regexp to match the From line (file lib/mailbox.py, class UnixMailbox):

_fromlinepattern = r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+" \
                   r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
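
The effect of this pattern can be checked directly. The sketch below copies the pattern text quoted above into a standalone Python 3 snippet and tests it against the two sample lines; the name `from_line_re` is made up for the example.

```python
import re

# Pattern text copied from the quotation of lib/mailbox.py above.
_fromlinepattern = (r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+"
                    r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$")
from_line_re = re.compile(_fromlinepattern)

good = "From - Tue Apr 27 19:42:22 2010"
bad = "From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)"

# The pattern expects the time right after the day of month,
# so the line with the year in that position is rejected.
print(bool(from_line_re.match(good)))
print(bool(from_line_re.match(bad)))
```

The first check prints `True` and the second `False`: a reader that relies on this pattern does not recognize the broken line as the start of a message.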

The attached script fixes the incorrect From lines so that these mboxes can be parsed with the Python standard library.

# -*- coding: Windows-1251 -*-
'''
fix_mbox_from.py

  Utility for fixing incorrect 'From' lines after batch import of .EML files
  via Thunderbird's ImportExportTools, versions up to and including 2.3.2.1.

  2010-05-01 bug report sent to addon author.

  
mailbox.py (Python 2.4.5) pattern for matching 'From' line:

    _fromlinepattern = r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+" \
                       r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"

Correct 'From':
From - Tue Apr 27 19:42:22 2010

Broken 'From':
From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)
'''
import sys
import re
import os

__author__ = 'Denis Barmenkov <denis.barmenkov@gmail.com>'
__source__ = 'http://code.activestate.com/recipes/577214-fix-mbox-files-after-importing-eml-into-tb-using-i/'

bad_pattern_text = r"^(From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d)\s+" \
                   r"(\d\d\d\d)\s+(\d?\d:\d\d(:\d\d)?)\s+" \
                   r"GMT[+-]\d\d\d\d\s+\([^\)]+\)\s*$"  # offset may be + or -

bad_pattern = re.compile(bad_pattern_text)

mbox_fn = sys.argv[1]
print('File: %s' % mbox_fn)  # parenthesized form works in Python 2 and 3
temp_fn = mbox_fn + '.temp'
orig_fn = mbox_fn + '.source'
assert not os.path.exists(orig_fn)

#src_size = os.path.getsize(mbox_fn)

fsrc = open(mbox_fn, 'r')
fdest = open(temp_fn, 'w')

fix_count = 0
for line_index, rawline in enumerate(fsrc):
    #if line_index % 100 == 0:
    #    pos = fsrc.tell()
    #    print '%d%%,' % (100 * pos // src_size),
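    # Strip only the line ending; splitlines()[0] would also cut the line
    # at a stray '\r' in the middle and silently drop the rest.
    line = rawline.rstrip('\r\n')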
    m = bad_pattern.match(line)
    if m:
        line = '%s %s %s' % m.group(1, 3, 2)
        fix_count += 1
    fdest.write(line + '\n')
print('')
print('Fixed %s "From" lines' % fix_count)

fdest.close()
fsrc.close()

os.rename(mbox_fn, orig_fn)
os.rename(temp_fn, mbox_fn)
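
The same line-by-line filter can also be packaged as functions, which makes it easy to test without touching real files. The following is only a sketch under stated assumptions, not the recipe's code: it targets Python 3, works on bytes so that message bodies in any encoding pass through unchanged, and accepts both `GMT+hhmm` and `GMT-hhmm`. The names `BAD_FROM`, `fix_from_line` and `fix_mbox_stream` are hypothetical.

```python
import io
import re

# Same idea as bad_pattern in the script, compiled for bytes.
# Accepting a negative GMT offset is an assumption of this sketch.
BAD_FROM = re.compile(
    rb"^(From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d)\s+"
    rb"(\d\d\d\d)\s+(\d?\d:\d\d(?::\d\d)?)\s+"
    rb"GMT[+-]\d\d\d\d\s+\([^)]+\)\s*$"
)

def fix_from_line(line):
    """Return (line, changed); only broken 'From' lines are rewritten."""
    body = line.rstrip(b"\r\n")
    m = BAD_FROM.match(body)
    if m is None:
        return line, False
    ending = line[len(body):]  # keep the original line ending
    # Groups: 1 = 'From - Sat May 01', 2 = year, 3 = time.
    return b" ".join(m.group(1, 3, 2)) + ending, True

def fix_mbox_stream(src, dest):
    """Copy src to dest line by line and return the number of fixed lines."""
    fixed = 0
    for raw in src:
        new_line, changed = fix_from_line(raw)
        dest.write(new_line)
        fixed += changed
    return fixed

sample = (b"From - Sat May 01 2010 15:07:31 GMT+0400 (Russian Daylight Time)\r\n"
          b"Subject: test\r\n"
          b"\r\n"
          b"From here on the body is untouched.\r\n")
out = io.BytesIO()
count = fix_mbox_stream(io.BytesIO(sample), out)
print(count, out.getvalue().splitlines()[0])
```

With real files, `src` and `dest` would be objects opened with `open(name, 'rb')` and `open(name, 'wb')`; the rename steps of the script stay the same.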

      

Tags: eml, from, import, importexporttools, mbox, tb, thunderbird

1 comment

Denis Barmenkov (author) 13 years, 11 months ago

Simple line-by-line filter :).

Fix mbox files after importing EML into TB using ImportExportTools (Python recipe) by Denis Barmenkov
ActiveState Code (http://code.activestate.com/recipes/577214/)
