
Unique lines from a text file, case sensitive or insensitive (Python recipe) by nick
ActiveState Code (http://code.activestate.com/recipes/576713/)

Spits out sorted, deduplicated lines.

# Unique lines, case-insensitive
filename = r"h:\keywords.txt"
with open(filename) as f:
    li = list(f)
# Note: listifying a file leaves \n at the end of each list element
st = "".join(li)
# Comment out the next line to get the case-sensitive version
st = st.lower()
se = set(st.split("\n"))
# A trailing newline leaves an empty string in the set; drop it
se.discard("")
result = "\n".join(sorted(se))
print(result)

      

Working with websites, I've often got a bunch of keywords from different sources that I want deduplicating. I drop words onto this Python script as a text file, one phrase to a line.

Deduplication is done using the excellent Python set() type. Well worth investigating.
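As a rough illustration of what set() is doing here, the sketch below deduplicates an in-memory list of phrases instead of a file. The function name unique_lines and its case_sensitive flag are made up for this example and are not part of the recipe.

```python
def unique_lines(lines, case_sensitive=False):
    """Return the sorted, deduplicated strings from an iterable of lines."""
    if not case_sensitive:
        # Lowercasing first makes "Python" and "python" compare equal
        lines = (line.lower() for line in lines)
    # A set keeps one copy of each distinct string; sorted() returns an ordered list
    return sorted(set(lines))


keywords = ["Python", "python", "set", "Set", "recipe"]
print(unique_lines(keywords))                       # ['python', 'recipe', 'set']
print(unique_lines(keywords, case_sensitive=True))  # ['Python', 'Set', 'python', 'recipe', 'set']
```

In the case-sensitive result, capitalised entries sort ahead of lowercase ones because sorted() compares strings by code point.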

Tags: uniq, unique_lines

4 comments

Daniel Lepage 15 years ago

Converting it to a list, joining that, and then removing the newlines seems wasteful - why not just read the whole file at once and call splitlines()?

with open(filename) as f:
    txt = f.read()
txt = txt.lower()
result = '\n'.join(sorted(set(txt.splitlines())))

nick (author) 15 years ago

Thanks Daniel!

Matteo Dell'Amico 15 years ago

To iterate through a file line by line, you can simply use "for line in f":

with open(filename) as f:
    lines = sorted(set(line.strip('\n').lower() for line in f))
for line in lines:
    print(line)

Eric-Olivier LE BIGOT 15 years ago

Matteo's solution seems to be the way to go, for me: it's clean (the file is closed), it's memory efficient (not all lines are read at once), and it should be reasonably fast (the set object does all the work).

Now, I would even use rstrip('\n') instead of strip('\n'); this may slightly speed things up, and is more appropriate anyway (we only want to remove trailing newlines).
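Putting the suggestions in this thread together might look something like the sketch below: the file is read line by line inside a with block, only the trailing newline is removed, and the set does the deduplication. The function name unique_lines_in_file and the case_sensitive flag are assumptions added for illustration, not something from the recipe or the comments.

```python
def unique_lines_in_file(path, case_sensitive=False):
    """Read `path` line by line and return its sorted, deduplicated lines."""
    with open(path) as f:
        # rstrip('\n') removes only the trailing newline; other whitespace is kept
        lines = (line.rstrip('\n') for line in f)
        if not case_sensitive:
            lines = (line.lower() for line in lines)
        # Consume the generator here, while the file is still open
        return sorted(set(lines))
```

It could then be called as unique_lines_in_file(r"h:\keywords.txt") and the result printed one line at a time. Note that the set still holds every distinct line in memory; what is avoided is keeping the whole raw file and a full list of lines around at the same time.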

Created by nick on Sun, 5 Apr 2009 (MIT)


Required Modules

(none specified)

Other Information and Tasks

Licensed under the MIT License
Viewed 11613 times
Revision 1
