
extract email addresses from files (Python recipe) by carl scharenberg
ActiveState Code (http://code.activestate.com/recipes/138889/)

A quick way to find valid email addresses in text files using a regular expression search. It then removes duplicate entries and returns the results in a list.

import re

def grab_email(files=None):
    # If passed a list of text files, return a list of the
    # email addresses found in them, matched according to
    # basic address conventions. Note: supports most possible
    # names, but not all valid ones.

    found = []
    if files is not None:
        mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

        for file in files:
            # close each file promptly instead of leaking the handle
            with open(file, 'r') as handle:
                for line in handle:
                    found.extend(mailsrch.findall(line))

    # remove duplicate elements
    # borrowed from Tim Peters' algorithm on ASPN Cookbook
    u = {}
    for item in found:
        u[item] = 1

    # return list of unique email addresses
    # (list() is needed on Python 3, where keys() returns a view)
    return list(u.keys())

      

The regular expression is by no means complete or perfect. The email address naming conventions allow other valid characters and forms, but this expression should cover most common addresses. It's my first try at regular expressions; doubtless I will improve it in the future.
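To make the "most, but not all" caveat concrete, here is a small sketch that runs the recipe's pattern, copied unchanged, against two sample strings. The sample addresses are made up for illustration. The second one uses a `+` in the local part, which is legal in an address but is not in the pattern's character class.

```python
import re

# The recipe's pattern, copied unchanged.
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

# A plain address inside a sentence: the trailing period is not captured,
# because the match has to end on a letter.
plain = mailsrch.findall("Contact: carl@example.org.")

# '+' is not in the character class, so matching can only start after it
# and just the tail of the address is returned.
tagged = mailsrch.findall("user+tag@example.com")

print(plain)
print(tagged)
```

The second result shows the kind of silent truncation to watch for when the input may contain tagged addresses.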

As mentioned in the code, the duplicate-element code is Tim Peters'. All email addresses should be hashable, so dictionary conversion is the best way to go.
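The same dictionary idea can be written more compactly on current Python. This is a sketch, not part of the original recipe: `unique_addresses` is a hypothetical helper name, and it assumes Python 3.7 or later, where dictionaries keep insertion order.

```python
def unique_addresses(found):
    # Hypothetical helper: dictionary keys are unique, so building a dict
    # from the list drops repeats. Since Python 3.7 dicts also remember
    # insertion order, so addresses come back in first-seen order.
    return list(dict.fromkeys(found))


print(unique_addresses(["a@x.com", "b@y.org", "a@x.com"]))
```

Unlike the loop in the recipe on older interpreters, this keeps the order in which addresses were first encountered.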

I've only tested the code on a few text and html files on a Windows machine. If the code breaks in unix let me know.

Tags: search

3 comments

Peter Bengtsson 21 years, 1 month ago

Not quite right.

>>> mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
>>> mailsrch.findall("peter@grenna. net")
['peter@grenna']

And do you know an equivalent regular expression for words that start with www., http:// etc?
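The `peter@grenna. net` case above is accepted because the recipe's pattern never requires a dot before the final letters. One possible tightening, offered only as a sketch, is to demand a literal dot followed by at least two letters. The name `strict` and the exact pattern are assumptions for illustration, not something from the recipe, and the variant is still far from a full address grammar.

```python
import re

# Hypothetical variant: require a literal '.' before a final label of
# two or more letters, and allow a one-character local part.
strict = re.compile(r'[\w\-][\w\-\.]*@[\w\-][\w\-\.]*\.[a-zA-Z]{2,}')

# The broken address from the comment above no longer yields a partial match.
broken = strict.findall("peter@grenna. net")

# The well-formed version is still found in full.
intact = strict.findall("peter@grenna.net")

print(broken)
print(intact)
```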

Dan O'Huiginn 16 years, 11 months ago

Couple of points:

1) Since Python 2.4, you can use a set to collect the results without duplicates. Something like:

found=set()
...
for file in files:
  for line in open(file,'r'):
    found.update(mailsrch.findall(line))

2) The regex doesn't match email addresses like 'a@somewhere.com', where the part before the @ is just one letter. You can fix it by changing a '+' to a '*': r'[\w\-][\w\-\.]*@[\w\-][\w\-\.]+[a-zA-Z]{1,4}'
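Both points can be tried together in a few lines. This sketch assumes the intended change in point 2 is relaxing the first `+` to `*`. The names `original`, `relaxed`, and the sample `lines` are made up for the demonstration.

```python
import re

# The recipe's pattern: the local part needs at least two characters.
original = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

# Assumed fix: the first '+' becomes '*', so one character is enough.
relaxed = re.compile(r'[\w\-][\w\-\.]*@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

# Stand-in for the lines read from a file.
lines = ["mail a@somewhere.com", "or a@somewhere.com again"]

# A set drops the duplicate as the matches are collected.
found = set()
for line in lines:
    found.update(relaxed.findall(line))

print(original.findall("a@somewhere.com"))
print(found)
```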

Alan Miller 16 years, 3 months ago

It should verify that each path is actually a file. Adding this check before it tries to open the file helps:

if os.path.isfile(file):
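Placed in context, the check might look like the sketch below. `grab_email_checked` is a hypothetical variant written for this illustration, not the recipe's function. It reuses the recipe's pattern, silently skips directories and missing paths, and returns a sorted list so the result is predictable.

```python
import os
import re

# The recipe's pattern, copied unchanged.
MAILSRCH = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')


def grab_email_checked(files):
    # Hypothetical variant of grab_email: anything that is not a regular
    # file (a directory, a missing path) is skipped instead of raising.
    found = set()
    for file in files:
        if os.path.isfile(file):
            with open(file, 'r') as handle:
                for line in handle:
                    found.update(MAILSRCH.findall(line))
    return sorted(found)
```

Whether to skip bad paths quietly or report them is a design choice. A caller that needs to know about missing files may prefer to let `open` raise.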

Created by carl scharenberg on Wed, 10 Jul 2002 (PSF)


Required Modules

(none specified)

Other Information and Tasks

Licensed under the PSF License
Viewed 38518 times
Revision 1
