Welcome, guest | Sign In | My Account | Store | Cart

This is a program Jon Bentley asked Donald Knuth to write, and is one that’s become familiar to people who use languages with serious text-handling capabilities: Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies. I wrote 2 solutions for it earlier, in Python and in Unix shell. Also see the comment by a user on the post, giving another solution.

Python, 80 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
# bentley_knuth.py
# Author: Vasudev Ram - http://www.dancingbison.com
# Version: 0.1

# The problem this program tries to solve is from the page:
# http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

# Description: The program Bentley asked Knuth to write:

# Read a file of text, determine the n most frequently 
# used words, and print out a sorted list of those words 
# along with their frequencies.

import sys
import os
import string

sys_argv = sys.argv

def usage():
 sys.stderr.write("Usage: %s n file\n" % sys_argv[0])
 sys.stderr.write("where n is the number of most frequently\n")
 sys.stderr.write("used words you want to find, and \n")
 sys.stderr.write("file is the name of the file in which to look.\n")

if len(sys_argv) < 3:
 usage()
 sys.exit(1)

try:
 n = int(sys_argv[1])
except ValueError:
 sys.stderr.write("%s: Error: %s is not a decimal numeric value" % (sys_argv[0], 
  sys_argv[1]))
 sys.exit(1)

print "n =", n
if n < 1:
 sys.stderr.write("%s: Error: %s is not a positive value" % 
  (sys_argv[0], sys_argv[1]))

in_filename = sys.argv[2]
print "%s: Finding %d most frequent words in file %s" % \
 (sys_argv[0], n, in_filename)

try:
 fil_in = open(in_filename)
except IOError:
 sys.stderr.write("%s: ERROR: Could not open in_filename %s\n" % \
  (sys_argv[0], in_filename))
 sys.exit(1)

word_freq_dict = {}

for lin in fil_in:
 words_in_line = lin.split()
 for word in words_in_line:
  if word_freq_dict.has_key(word):
   word_freq_dict[word] += 1
  else:
   word_freq_dict[word] = 1

word_freq_list = []
for item in word_freq_dict.items():
 word_freq_list.append(item)

wfl = sorted(word_freq_list, 
 key=lambda word_freq_list: word_freq_list[1], reverse=True)
#wfl.reverse()
print "The %d most frequent words sorted by decreasing frequency:" % n
len_wfl = len(wfl)
if n > len_wfl:
 print "n = %d, file has only %d unique words," % (n, len_wfl)
 print "so printing %d words" % len_wfl
print "Word: Frequency"
m = min(n, len_wfl)
for i in range(m):
 print wfl[i][0], ": ", wfl[i][1]

fil_in.close()

This problem was posed to Donald Knuth by Jon Bentley, apparently:

http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

4 comments

Mark Lawrence 7 years, 8 months ago  # | flag

I suggest looking at the collections Counter object and its most_common method, which have been available since Python 2.7. I'll also point out that the year is 2014 and that Python 3.4 is due for full release any day now, so why write Python 2 code?

Vasudev Ram (author) 7 years, 8 months ago  # | flag

Thanks for the points.

See the 1st comment on this post, meant to include it above earlier:

http://jugad2.blogspot.in/2012/07/the-bentley-knuth-problem-and-solutions.html

As for 2 vs. 3, opinions vary on that, with reasons ...

Vasudev Ram (author) 7 years, 8 months ago  # | flag
James Mills 7 years, 8 months ago  # | flag

I'm a little bit disappointed to see imperative solutions to problems in today's wonderful world of multi-paradigm languages such as Python :/