A quick way to find valid email addresses in text files using a regular expression search. It then removes duplicate entries and returns the results in a list.
import re

def grab_email(files=None):
    # If passed a list of text files, returns a list of the
    # email addresses found in them, matched according to
    # basic address conventions. Note: supports most possible
    # names, but not all valid ones.
    found = []
    if files is not None:
        mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
        for file in files:
            # the with-block closes each file once it has been read
            with open(file, 'r') as f:
                for line in f:
                    found.extend(mailsrch.findall(line))
    # remove duplicate elements
    # borrowed from Tim Peters' algorithm on ASPN Cookbook
    u = {}
    for item in found:
        u[item] = 1
    # return list of unique email addresses
    return list(u.keys())
The regular expression is by no means complete and perfect. The email address naming conventions allow other valid characters and forms, but this expression should cover most areas. It's my first try at RE's - doubtless I will improve it in the future.
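To make the limits of the expression concrete, here is a small sketch that runs the same pattern over a few made-up strings. The sample addresses are illustrative assumptions, not data from the recipe.

```python
import re

# Same pattern as in the recipe.
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

# Hypothetical sample strings, chosen to show one hit and two misses.
samples = [
    "Contact bob.smith@example.com for details.",  # ordinary address
    "Short one: a@somewhere.com",                  # one-character local part
    "Tagged: first+tag@example.com",               # '+' is valid but not in the pattern
]

results = [mailsrch.findall(text) for text in samples]
for text, hits in zip(samples, results):
    print(text, "->", hits)
```

The first string yields the full address, the second yields nothing because the pattern needs at least two characters before the @, and the third yields only 'tag@example.com' because the match cannot cross the '+'.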
As mentioned in the code, the duplicate-element code is Tim Peters'. All email addresses should be hashable, so dictionary conversion is the best way to go.
I've only tested the code on a few text and HTML files on a Windows machine. If the code breaks on Unix, let me know.
Not quite right.
And do you know an equivalent regular expression for words that start with www., http:// etc?
Couple of points:
1) Since Python 2.4, you can use a set to keep a collection without duplicates.
2) The regex doesn't match email addresses like 'a@somewhere.com', where the part before the @ is just one letter. You can fix it by changing a '+' to a '*': r'[\w\-][\w\-\.]*@[\w\-][\w\-\.]+[a-zA-Z]{1,4}'
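The two points above can be sketched together. This is only a rough illustration, not the snippet originally posted with the comment: the helper name unique_emails is hypothetical, and it takes an iterable of strings rather than file names to keep the example short.

```python
import re

# Relaxed pattern from point 2: '*' instead of '+' before the '@'
# lets the local part be a single character.
mailsrch = re.compile(r'[\w\-][\w\-\.]*@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

def unique_emails(lines):
    # Hypothetical helper: a set drops duplicates as matches are added (point 1).
    found = set()
    for line in lines:
        found.update(mailsrch.findall(line))
    # Sets are unordered, so sort to get a predictable list back.
    return sorted(found)
```

Sorting is a design choice here rather than a requirement; return list(found) instead if the order does not matter.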
Should verify that the file is a file. Adding this in before it tries to open the file helps:
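The exact check is not preserved in this comment, so the following is a minimal sketch that assumes os.path.isfile is what was meant. The function name grab_email_checked is hypothetical, and it silently skips anything that is not a regular file.

```python
import os
import re

# Same pattern as in the recipe.
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

def grab_email_checked(paths):
    # Hypothetical variant of grab_email that ignores directories
    # and missing paths instead of raising an error on open().
    found = []
    for path in paths:
        if not os.path.isfile(path):
            continue  # assumption: skipping quietly is acceptable
        with open(path, 'r') as f:
            for line in f:
                found.extend(mailsrch.findall(line))
    return found
```

Whether to skip, warn, or raise on a bad path depends on how the function is used; skipping is just the simplest choice for a sketch.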