This lists unique words and word frequencies occurring in a Python string. You can ignore or take account of letter case in distinguishing words, and you can pass it your own inclusion list of characters allowed in words (e.g. is "import123" the kind of word you want to list, or not? It might be if you're a programmer.) By default only alpha chars are allowed in words.
At first glance having the whole piece of text, and intermediate results, in memory at once is a problem for large files. But it's zippy: it found 1600 unique words in a 7M SQL script (472,000 words in original) in 20 seconds, and hardly notices a 4000-word document cut and pasted across from a word processor.
With a bit of extra work, the algorithm could be fed a very large file in chunks. Anyone?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
import re def wordList(text, wordChars = None, ignoreCase = True): """converts string simple-mindedly to a list of words""" if wordChars is None: lcLetters = "abcdefghijklmnopqrstuvwxyz" ucLetters = lcLetters.upper() otherLegalChars = """-""" wordChars = set(lcLetters + ucLetters + otherLegalChars) # make sure spaces are always allowed-- they are word boundaries wordChars.add(" ") # while still a string, convert word boundaries to spaces whiteSpace = "\n\r\f\t\v" wordBoundaries = whiteSpace + ".," if ignoreCase: text = text.lower() li = list(text) # drop unwanted chars li = [x for x in li if x in wordChars] # convert back to string text = "".join(li) for s in wordBoundaries: text = text.replace(s, " ") # need to retain at least one space for word boundaries # collapse groups of spaces to a single space # Method: replace 64 spaces with a space, 32 spaces with a space, ..., 2 spaces with a space for i in [x*x for x in [6,5,4,3,2,1]]: text = text.strip(" " * i) # split into list again at word boundaries text = text.split(" ") # if any "empty words" have been manufactured by previous processes, drop them text = [x for x in text if x != ""] return text def reWordList(text, wordChars = None, ignoreCase = True): """converts string simple-mindedly to a list of words""" if wordChars is None: lcLetters = "abcdefghijklmnopqrstuvwxyz" ucLetters = lcLetters.upper() wordChars = lcLetters + ucLetters # make sure spaces are always allowed-- they are word boundaries extraWordBoundaries = """.",""" # convert all word boundaries (white spaces and additional chars specified above) to "spaces" rx = re.compile('[\s' + extraWordBoundaries + ']') text = rx.sub(" ", text) # collapse groups of spaces to a single "space" # Method: replace 64 spaces with a space, 32 spaces with a space, ..., 2 spaces with a "space" for i in [x*x for x in [6,5,4,3,2,1]]: text = text.strip(" " * i) if ignoreCase: text = text.lower() li = list(text) # drop unwanted chars (which do not delimit words or form a part of them) # note "space" is retained rx = re.compile("[^" + wordChars + " ]") text = rx.sub("", text) # split into list at word boundaries text = text.split(" ") # if any "empty words" have been manufactured by previous processes, drop them text = [x for x in text if x != ""] return text def wordFreqs(text, wordChars = None): """takes a list of words and returns list of unique words and their frequencies""" words = wordList(text, wordChars) uniqueWords = set(words) return sorted([(word, words.count(word)) for word in uniqueWords]) if __name__ == "__main__": # to load a string from a file here, loadFile = True loadFile = True if loadFile: a = list(file(r"d:\partdb.sql")) a = "\n".join(a) else: a = """ Nor again is there anyone who loves or pursues or desires to obtain pain of itself because it is pain, but because occasionally circumstances occur in which toil and pain can procure him some great pleasure. To take a trivial example, which of us ever undertakes laborious physical exercise, except to obtain some advantage from it?""" b = wordFreqs(a) # example: find integers only #b = wordFreqs(a, wordChars = "0123456789") print b print "\n" * 5 print "number of words in original: ", len(wordList(a)) print "Number of unique words: ", len(b)
This solution uses Python sets as an easy and efficient way to deduplicate a list (see def wordFreqs). Converting a string to a list of one-char elements is another powerful (and fast) feature of Python, and has been used in conjunction with list comprehensions to drop any characters not allowed in words from the original string.
There are two versions of the wordList function, one using regular expressions, the other string and list operations. Interestingly the regular expression version is much slower, although it's more memory-efficient.
It is assumed that "not allowed" characters are neither part of a word, nor represent a word boundary. A few chars, including white space, do represent word boundaries, and are converted to spaces. Blocks of word boundary characters are converted to a single space each, and then single spaces are used to split the text into a list of words.
An inefficient part of the algorithm is:
[(word, words.count(word)) for word in uniqueWords]
which serially searches the undeduplicated text for each unique word in order to count frequencies. If anyone can think of a way to speed this up without sorting 472,000 words into alphabetical order, I'd be interested to see it!