Spits out sorted, deduplicated lines.
# Unique lines case insensitive
filename = r"h:\keywords.txt"
li = list(file(filename))
# Note: listifying file() leaves \n at end of each list element
st = "".join(li)
# comment out next line to get case-sensitive version
st = st.lower()
se = set(st.split("\n"))
result = "\n".join(sorted(se))
print result
Working with websites, I often end up with a bunch of keywords from different sources that I want to deduplicate. I drop the words onto this Python script as a text file, one phrase per line.
Deduplication is done using the excellent Python set() type. Well worth investigating.
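For anyone new to it, here is a quick illustration (the phrases are just made-up examples):

# set() keeps only one copy of each distinct item.
phrases = ["seo tools", "SEO Tools", "keyword research", "seo tools"]
unique = set(p.lower() for p in phrases)   # lowercase first for a case-insensitive match
print(sorted(unique))                      # ['keyword research', 'seo tools']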
Tags: uniq, unique_lines
Converting it to a list, joining that, and then removing the newlines seems wasteful - why not just read the whole file at once and call splitlines()?
Thanks Daniel!
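For reference, that suggestion might look something like this (a sketch, not Daniel's actual code):

# Read the whole file in one go and split it into lines, skipping the
# intermediate list and join from the original script.
filename = r"h:\keywords.txt"

with open(filename) as f:
    text = f.read()

text = text.lower()   # comment out for the case-sensitive version
result = "\n".join(sorted(set(text.splitlines())))
print(result)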
To iterate through a file line by line, you can simply use "for line in f".
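Something along these lines (a sketch; strip() drops the trailing newline from each line as it is read):

# Build the set line by line instead of reading the whole file into memory.
filename = r"h:\keywords.txt"

unique = set()
with open(filename) as f:
    for line in f:
        unique.add(line.strip().lower())   # drop .lower() for the case-sensitive version

print("\n".join(sorted(unique)))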
Matteo's solution seems to be the way to go, for me: it's clean (the file is closed), it's memory efficient (not all lines are read at once), and it should be reasonably fast (the set object does all the work).
Now, I would even use rstrip() instead of strip(); this may slightly speed things up, and is more appropriate anyway (we only want to remove trailing newlines).
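In the line-by-line sketch above, that is a one-line change:

unique.add(line.rstrip().lower())   # rstrip() only trims trailing whitespace, including the newline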