Python word frequency count using sets and lists (Python)
ActiveState Code Recipes, 2009-03-26T23:00:54-07:00
http://code.activestate.com/recipes/576699-python-word-frequency-count-using-sets-and-lists/
<p style="color: grey">
Python
recipe 576699
by <a href="/recipes/users/4169647/">nick</a>
(<a href="/recipes/tags/text/">text</a>, <a href="/recipes/tags/unique_words/">unique_words</a>, <a href="/recipes/tags/word_frequency/">word_frequency</a>).
</p>
<p>This recipe lists the unique words in a Python string, along with how often each one occurs. It can either ignore or respect letter case when distinguishing words, and you can pass it your own inclusion list of characters allowed in words (for example, is "import123" the kind of word you want to list? It might be if you're a programmer). By default, only alphabetic characters are allowed in words.</p>
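<p>The recipe's own code is not reproduced here, so the following is only a minimal sketch of the behaviour described above, under a few assumptions: the function name <code>word_frequencies</code> and the parameters <code>ignore_case</code> and <code>allowed</code> are hypothetical, and any character outside the inclusion list is treated as a word separator.</p>

```python
import string


def word_frequencies(text, ignore_case=True, allowed=string.ascii_letters):
    """Return (word, count) pairs sorted by word.

    Hypothetical helper: the name and parameters are illustrative,
    not taken from the original recipe.
    """
    if ignore_case:
        text = text.lower()
    allowed = set(allowed)
    # Assumption: every character outside the inclusion list separates words.
    cleaned = "".join(ch if ch in allowed else " " for ch in text)
    words = cleaned.split()        # list of all words, duplicates included
    unique = sorted(set(words))    # the set drops duplicates
    # list.count rescans the whole list for each unique word: simple, not fast.
    return [(word, words.count(word)) for word in unique]


print(word_frequencies("The cat and the hat"))
# [('and', 1), ('cat', 1), ('hat', 1), ('the', 2)]
print(word_frequencies("import123 os", allowed=string.ascii_letters + string.digits))
# [('import123', 1), ('os', 1)]
```

<p>With the default inclusion list, "import123" would be listed as "import", because the digits count as separators.</p>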
<p>At first glance, holding the whole text and the intermediate results in memory at once looks like a problem for large files. But it's zippy: it found 1600 unique words in a 7 MB SQL script (472,000 words in the original) in 20 seconds, and it hardly notices a 4000-word document cut and pasted from a word processor.</p>
<p>With a bit of extra work, the algorithm could be fed a very large file in chunks. Anyone?</p>
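<p>One possible way to take up that suggestion is sketched below. It is not the recipe's algorithm: it swaps the sets and lists for a <code>collections.Counter</code>, because running totals are easier to merge across chunks. The function name <code>count_words_in_chunks</code> and its parameters are hypothetical, and it assumes a text-mode file object and the same "disallowed characters separate words" rule.</p>

```python
import io
import string
from collections import Counter


def count_words_in_chunks(stream, chunk_size=65536, ignore_case=True,
                          allowed=string.ascii_letters):
    """Count words from a text stream without loading it all at once.

    Hypothetical helper: a sketch of chunked processing, not the recipe's code.
    """
    allowed = set(allowed)
    counts = Counter()
    tail = ""  # partial word carried over from the end of the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        if ignore_case:
            chunk = chunk.lower()
        # Assumption: every character outside the inclusion list separates words.
        cleaned = tail + "".join(ch if ch in allowed else " " for ch in chunk)
        words = cleaned.split()
        # If the chunk stops mid-word, hold that word back until more text arrives.
        if words and not cleaned[-1].isspace():
            tail = words.pop()
        else:
            tail = ""
        counts.update(words)
    if tail:
        counts[tail] += 1  # the final word had no separator after it
    return counts


# A tiny chunk size forces words to straddle chunk boundaries.
sample = io.StringIO("The cat and the hat")
print(sorted(count_words_in_chunks(sample, chunk_size=3).items()))
# [('and', 1), ('cat', 1), ('hat', 1), ('the', 2)]
```

<p>For a real file, the stream would come from something like <code>open(path, encoding="utf-8")</code>. Only one chunk and the running counts are held in memory at a time.</p>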