
Spits out sorted, deduplicated lines.

Python, 10 lines
# Unique lines case insensitive
filename = r"h:\keywords.txt"
li = list(file(filename))
# Note: listifying file() leaves \n at end of each list element
st = "".join(li)
# comment out next line to get case-sensitive version
st = st.lower()
se = set(st.split("\n"))
result = "\n".join(sorted(se))
print result

Working with websites, I often end up with a bunch of keywords from different sources that I want to deduplicate. I feed the words to this Python script as a text file, one phrase per line.

Deduplication is done using the excellent Python set() type. Well worth investigating.
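
As a minimal sketch of how set() does the deduplication, here is the same idea applied to a plain list of strings. It assumes Python 3 (the recipe above is written for Python 2), and the helper name unique_sorted is hypothetical, not part of the recipe:

```python
def unique_sorted(lines, case_sensitive=False):
    """Return the distinct lines in sorted order.

    Hypothetical helper for illustration; not part of the original recipe.
    """
    if not case_sensitive:
        # Lower-casing first makes "Python" and "python" count as one entry.
        lines = (line.lower() for line in lines)
    # A set keeps a single copy of each value; sorted() returns a list.
    return sorted(set(lines))


keywords = ["Python", "python", "recipes", "Recipes", "set"]
print("\n".join(unique_sorted(keywords)))
```

Passing case_sensitive=True skips the lower-casing step, which mirrors commenting out the st.lower() line in the recipe.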

4 comments

Daniel Lepage 12 years, 8 months ago

Converting it to a list, joining that, and then removing the newlines seems wasteful - why not just read the whole file at once and call splitlines()?

txt = file(filename).read()
txt = txt.lower()
print '\n'.join(sorted(set(txt.splitlines())))

nick (author) 12 years, 8 months ago

Thanks Daniel!

Matteo Dell'Amico 12 years, 8 months ago

To iterate through a file line by line, you can simply use "for line in f":

with open(filename) as f:
    lines = sorted(set(line.strip('\n').lower() for line in f))
for line in lines:
    print line

Eric-Olivier LE BIGOT 12 years, 8 months ago

Matteo's solution seems to be the way to go, for me: it's clean (the file is closed), it's memory efficient (not all lines are read at once), and it should be reasonably fast (the set object does all the work).

Now, I would even use rstrip() instead of strip(); this may slightly speed things up, and is more appropriate anyway (we only want to remove trailing newlines).
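
A small sketch of the rstrip() variant, assuming Python 3 and using an in-memory io.StringIO in place of the real file so that the example is self-contained. The function name unique_lines is hypothetical. For lines obtained by iterating over a file, a newline can only appear at the end of each line, so rstrip('\n') and strip('\n') should give the same result here; rstrip() simply states the intent more precisely.

```python
import io


def unique_lines(f):
    """Return sorted, lower-cased, distinct lines from an open text file.

    Hypothetical helper for illustration; not part of the original recipe.
    """
    # Iterating over f yields one line at a time, each ending in "\n"
    # (except possibly the last), so only the trailing newline is removed.
    return sorted(set(line.rstrip("\n").lower() for line in f))


# io.StringIO stands in for open(filename) in this sketch.
sample = io.StringIO("Widgets\nwidgets\ngadgets\nGadgets\n")
with sample as f:
    for line in unique_lines(f):
        print(line)
```

Note that the set still has to hold every distinct line, so the memory saving comes from not keeping a second full copy of the file's text around.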

Created by nick on Sun, 5 Apr 2009 (MIT)