A classic algorithm which can produce entertaining output, given a sufficiently large input
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | import random;
import sys;
nonword = "\n" # Since we split on whitespace, this can never be a word
w1 = nonword
w2 = nonword
# GENERATE TABLE
table = {}
for line in sys.stdin:
for word in line.split():
table.setdefault( (w1, w2), [] ).append(word)
w1, w2 = w2, word
table.setdefault( (w1, w2), [] ).append(nonword) # Mark the end of the file
# GENERATE OUTPUT
w1 = nonword
w2 = nonword
maxwords = 10000
for i in xrange(maxwords):
newword = random.choice(table[(w1, w2)])
if newword == nonword: sys.exit()
print newword;
w1, w2 = w2, newword
|
The Markov Chain algorithm is an entertaining way of taking existing texts, and sort of mixing them up. The basic premise is that for every pair of words in your text, there are some set of words that follow those words. What we effectively do is for every pair of words in the text, record the word that comes after it into a list in a dictionary. Then we "regenerate" the file by taking the two current words, and randomly selecting a word to follow from the appropriate list.
The algorithm as it stands works quite well for large texts, although you're more likely to hit up against the 10000 maximum word counter. I've tested this myself on the text of Les Miserables (from Project Gutenberg), and it took about 30 seconds to read and generate an output file.
One problem with this particular algorithm is it prints out each word on one line. It's simple enough to write a program to print out the text more prettily, but it's uninportant for the main algorithm.
The basic structure of the algorithm was taken from an example in the book "The Practice of Programming" by Brian W. Kernighan and Rob Pike, but was originally written in Perl. I wrote this implementation partially because I liked the algorithm, but also to prove to myself that the algorithm could be written easily in Python too. The ability to use tuples as keys into a dict really make the code look cleaner than the Perl equivalent.
an error in code? I believe there is an error in code, it should be::
Not so minor changes to have sentence structured output. Some quick hacks to output sentences. Could have used a regular expression split (from re) instead of the 'if' but this was easy, worked, and ran fast enough.
Problems and todo
things like "Dr." are not the end of sentence
quotes, brackets and other punctuation
paragraphs, chapters
Seems like using a parser would be the right way to proceed if trying to do all of the above.
Examples from Gutenbergs "Tao de Ching"