This recipe lists the unique words, and their frequencies, occurring in a Python string. You can ignore or respect letter case when distinguishing words, and you can pass in your own inclusion list of characters allowed in words (e.g. is "import123" the kind of word you want listed, or not? It might be if you're a programmer). By default only letters and the hyphen are allowed in words.
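
A small usage sketch (the sample string is made up; wordFreqs and the wordChars argument are defined in the recipe below):

text = "Import123 widgets, import123 Widgets; IMPORT123!"
# default: letters and the hyphen only, case ignored
print wordFreqs(text)                                   # [('import', 3), ('widgets', 2)]
# allow digits as part of words as well
print wordFreqs(text, wordChars = "abcdefghijklmnopqrstuvwxyz0123456789")
                                                        # [('import123', 3), ('widgets', 2)]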

At first glance, holding the whole piece of text, and the intermediate results, in memory at once looks like a problem for large files. But it's zippy: it found 1,600 unique words in a 7 MB SQL script (472,000 words in the original) in 20 seconds, and it hardly notices a 4,000-word document cut and pasted across from a word processor.

With a bit of extra work, the algorithm could be fed a very large file in chunks. Anyone?
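
One direction, sketched only roughly and not part of the recipe below: read the file in fixed-size chunks, hold back any trailing partial word between chunks, and tally counts into a dictionary. This assumes the wordList function defined in the recipe; the chunk size is an arbitrary choice.

def wordFreqsChunked(path, chunkSize = 1024 * 1024, wordChars = None):
    """counts word frequencies in a large file, reading it in chunks"""
    counts = {}
    carry = ""                        # partial word left over from the previous chunk
    f = open(path)
    while True:
        chunk = f.read(chunkSize)
        if not chunk:
            break
        chunk = carry + chunk
        if chunk[-1].isspace():
            carry = ""
        else:
            # hold back the trailing fragment in case it is only half a word
            chunk, sep, carry = chunk.rpartition(" ")
        for word in wordList(chunk, wordChars):
            counts[word] = counts.get(word, 0) + 1
    # count whatever was still held back at end of file
    for word in wordList(carry, wordChars):
        counts[word] = counts.get(word, 0) + 1
    f.close()
    return sorted(counts.items())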

Python, 81 lines
import re

def wordList(text, wordChars = None, ignoreCase = True):
    """converts string simple-mindedly to a list of words"""
    if wordChars is None:
        lcLetters = "abcdefghijklmnopqrstuvwxyz"
        ucLetters = lcLetters.upper()
        otherLegalChars = "-"
        wordChars = set(lcLetters + ucLetters + otherLegalChars)
    else:
        # accept a plain string of allowed characters as well as a set
        wordChars = set(wordChars)
    # make sure spaces are always allowed -- they are word boundaries
    wordChars.add(" ")
    if ignoreCase: text = text.lower()
    # while still a string, convert word boundaries to spaces
    # (this must happen before unwanted chars are dropped, otherwise words
    # would fuse together across line breaks and full stops)
    whiteSpace = "\n\r\f\t\v"
    wordBoundaries = whiteSpace + ".,"
    for s in wordBoundaries: text = text.replace(s, " ")
    # drop unwanted chars
    li = [x for x in list(text) if x in wordChars]
    # convert back to string
    text = "".join(li)
    # collapse groups of spaces to a single space
    # Method: replace 64 spaces with a space, 32 spaces with a space, ..., 2 spaces with a space
    for i in [2 ** x for x in [6, 5, 4, 3, 2, 1]]: text = text.replace(" " * i, " ")
    # split into list again at word boundaries
    text = text.split(" ")
    # if any "empty words" have been manufactured by previous processes, drop them
    text = [x for x in text if x != ""]
    return text

def reWordList(text, wordChars = None, ignoreCase = True):
    """converts string simple-mindedly to a list of words (regular-expression version)"""
    if wordChars is None:
        lcLetters = "abcdefghijklmnopqrstuvwxyz"
        ucLetters = lcLetters.upper()
        wordChars = lcLetters + ucLetters
    # convert all word boundaries (white space and the additional chars below) to spaces
    extraWordBoundaries = ".,"
    rx = re.compile(r"[\s" + re.escape(extraWordBoundaries) + r"]")
    text = rx.sub(" ", text)
    # collapse groups of spaces to a single space
    # Method: replace 64 spaces with a space, 32 spaces with a space, ..., 2 spaces with a space
    for i in [2 ** x for x in [6, 5, 4, 3, 2, 1]]: text = text.replace(" " * i, " ")
    if ignoreCase: text = text.lower()
    # drop unwanted chars (which neither delimit words nor form a part of them)
    # note that the space is retained as the word boundary
    rx = re.compile("[^" + re.escape(wordChars) + " ]")
    text = rx.sub("", text)
    # split into list at word boundaries
    text = text.split(" ")
    # if any "empty words" have been manufactured by previous processes, drop them
    text = [x for x in text if x != ""]
    return text

def wordFreqs(text, wordChars = None):
    """takes a string and returns a sorted list of (unique word, frequency) pairs"""
    words = wordList(text, wordChars)
    uniqueWords = set(words)
    return sorted([(word, words.count(word)) for word in uniqueWords])

if __name__ == "__main__":
    # to load the string from a file, set loadFile = True
    loadFile = True
    if loadFile:
        a = open(r"d:\partdb.sql").read()
    else:
        a = """ Nor again is there anyone who loves or pursues or desires to obtain pain of itself
                because it is pain, but because occasionally circumstances occur in which toil and pain can
                procure him some great pleasure. To take a trivial example, which of us ever undertakes
                laborious physical exercise, except to obtain some advantage from it?"""
    b = wordFreqs(a)
    # example: find integers only
    #b = wordFreqs(a, wordChars = "0123456789")
    print b
    print "\n" * 5
    print "number of words in original: ", len(wordList(a))
    print "Number of unique words: ", len(b)

This solution uses Python sets as an easy and efficient way to deduplicate a list (see def wordFreqs). Converting a string to a list of one-char elements is another powerful (and fast) feature of Python; here it is used with a list comprehension to drop from the original string any characters not allowed in words.
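
In isolation the two ideas look like this (a toy illustration, not lifted from the recipe):

chars = list("hello, world")                              # one-char elements
allowed = set("abcdefghijklmnopqrstuvwxyz ")
cleaned = "".join([c for c in chars if c in allowed])     # 'hello world'
uniqueWords = set(cleaned.split())                        # deduplicated: set(['hello', 'world'])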

There are two versions of the wordList function: one uses regular expressions, the other string and list operations. Interestingly, the regular expression version is much slower, although it is more memory-efficient.
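
That comparison was informal; to time the two versions on your own data, something along these lines would do (it assumes a module-level sample string named sampleText, and the repeat count is arbitrary):

import timeit

setup = "from __main__ import wordList, reWordList, sampleText"
# number=10 keeps the runs quick; increase it for steadier figures
print timeit.timeit("wordList(sampleText)", setup=setup, number=10)
print timeit.timeit("reWordList(sampleText)", setup=setup, number=10)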

It is assumed that "not allowed" characters neither form part of a word nor mark a word boundary. A few characters, including white space, do mark word boundaries and are converted to spaces. Runs of word-boundary characters are then collapsed to a single space each, and those single spaces are used to split the text into a list of words.

An inefficient part of the algorithm is:

[(word, words.count(word)) for word in uniqueWords]

which scans the full, undeduplicated word list once for each unique word in order to count frequencies. If anyone can think of a way to speed this up without sorting 472,000 words into alphabetical order, I'd be interested to see it!
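
The comments below point at the usual fix: tally into a dictionary in a single pass, which avoids both the repeated scans and any sort of the full word list. A minimal sketch of what that counting step could look like:

freqs = {}
for word in words:
    freqs[word] = freqs.get(word, 0) + 1
# same sorted (word, count) pairs as the list comprehension, built in one pass
wordCounts = sorted(freqs.items())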

8 comments

Eric-Olivier LE BIGOT 15 years ago

How about counting words with the following?

from collections import defaultdict wordFreqs = defaultdict(lambda: 0) for word in words: wordFreqs[word] += 1

And how about splitting the text into words with re.split(), which would automatically take care of repeated spaces?

I haven't tried any of these suggestions, but they might be interesting to explore!
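
For the re.split() idea, something like this (untested against the recipe's full rules):

import re
print re.split(r"[\s.,]+", "one  two,\nthree... four")
# ['one', 'two', 'three', 'four']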

Eric-Olivier LE BIGOT 15 years ago

PS: the correctly formatted code is:

from collections import defaultdict
wordFreqs = defaultdict(lambda: 0)
for word in words:
    wordFreqs[word] += 1
Eric-Olivier LE BIGOT 15 years ago

PPS: with the "dict approach" above, you don't have to go through the list for each word, as you do with words.count(word)... This might speed things up.

Matteo Dell'Amico 15 years ago

I think this is the right application for a bag:

#!/usr/bin/env python3

import re
import sys

from bag import bag

wordre = re.compile(r'\w+')

def words(f):
    for line in f:
        for word in wordre.findall(line):
            yield word

with open(sys.argv[1]) as f:
    b = bag(words(f))

print(b.mostcommon(10))

This way you get this:

della@raker:~/src/py3$ ./wordcount.py /usr/share/common-licenses/GPL
[('the', 309), ('of', 210), ('to', 177), ('a', 171), ('or', 138), ('you', 106), ('work', 97), ('that', 91), ('and', 91), ('in', 76)]
david.gaarenstroom 15 years ago

My 2 cents:

  • I would recommend using a generator (for example utilizing "yield") for each word, so you won't need to keep a list of all words in memory. You can then support much larger files.
  • I would recommend using a regular expression to either split the words or to select them, so "finditer" would be a nice generator.
  • You could try using a memory mapped file.
  • For keeping track of the occurrences you can use a dict this way:

    worddict[word] = 1 + worddict.setdefault(word, 0)
    

With all that, the code could be significantly simpler/smaller; see the sketch below.
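
Putting the generator, finditer and dict suggestions together, an untested sketch (the word pattern and the ten-result cutoff are just placeholder choices):

import re
import sys

def words(f):
    # generator: yields one word at a time, so the whole file never sits in memory
    wordre = re.compile(r"[A-Za-z]+")
    for line in f:
        for match in wordre.finditer(line.lower()):
            yield match.group(0)

worddict = {}
with open(sys.argv[1]) as f:
    for word in words(f):
        worddict[word] = 1 + worddict.setdefault(word, 0)

# ten most frequent words, most frequent first
print sorted(worddict.items(), key=lambda item: item[1], reverse=True)[:10]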

Raymond Hettinger 15 years ago

The collections module in Py2.7 provides a Counter object: http://docs.python.org/dev/library/collections.html#counter-objects. The Py2.5 code for it is in recipe 576611.
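
For example, on Py2.7 or later (the file name here is only a placeholder):

from collections import Counter
import re

text = open("somefile.txt").read().lower()
# ten most common words and their counts
print Counter(re.findall(r"[a-z]+", text)).most_common(10)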

David Lambert 15 years ago

I'd write this in awk, which you could pipe to sort or sort -n +1.

$  gawk '{for (i=1; i<=NF; ++i) {a[$i]+=1}} END{for(word in a){print word,a[word]}}'

You could use translit or the gawk tolower function for the case stuff. Writing trite programs on the command line is powerful. Learn gawk.

Michael Thamm 12 years, 6 months ago

Word Count

from operator import itemgetter

class words:
    def __init__(self,fh):
        self.fh = fh
    def read(self):
        for line in self.fh:
            yield line.split()

if __name__ == "__main__":
    wordCnt = {}
    fh = open("trans.txt", "r")
    fileHand = words(fh)
    for k in fileHand.read():
        for j in k:
            if j in wordCnt:
                wordCnt[j] += 1
            else:
                wordCnt[j] = 1

    print sorted(wordCnt.items(), key=itemgetter(1))       

    fh.close()