The Markov Chain Algorithm « Python recipes

A classic algorithm which can produce entertaining output, given a sufficiently large input

      import random;
import sys;

nonword = "\n" # Since we split on whitespace, this can never be a word
w1 = nonword
w2 = nonword

# GENERATE TABLE
table = {}

for line in sys.stdin:
    for word in line.split():
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word

table.setdefault( (w1, w2), [] ).append(nonword) # Mark the end of the file

# GENERATE OUTPUT
w1 = nonword
w2 = nonword

maxwords = 10000

for i in xrange(maxwords):
    newword = random.choice(table[(w1, w2)])
    if newword == nonword: sys.exit()
    print newword;
    w1, w2 = w2, newword

      

The Markov Chain algorithm is an entertaining way of taking existing texts, and sort of mixing them up. The basic premise is that for every pair of words in your text, there are some set of words that follow those words. What we effectively do is for every pair of words in the text, record the word that comes after it into a list in a dictionary. Then we "regenerate" the file by taking the two current words, and randomly selecting a word to follow from the appropriate list.

The algorithm as it stands works quite well for large texts, although you're more likely to hit up against the 10000 maximum word counter. I've tested this myself on the text of Les Miserables (from Project Gutenberg), and it took about 30 seconds to read and generate an output file.

One problem with this particular algorithm is it prints out each word on one line. It's simple enough to write a program to print out the text more prettily, but it's uninportant for the main algorithm.

The basic structure of the algorithm was taken from an example in the book "The Practice of Programming" by Brian W. Kernighan and Rob Pike, but was originally written in Perl. I wrote this implementation partially because I liked the algorithm, but also to prove to myself that the algorithm could be written easily in Python too. The ability to use tuples as keys into a dict really make the code look cleaner than the Perl equivalent.

Tags: algorithms

2 comments

Phil f. 21 years ago # | flag

an error in code? I believe there is an error in code, it should be::

        w1, w2 = w2, word

table.setdefault( (w1, w2), [] ).append(nonword) # Mark the end of the  file

# GENERATE OUTPUT

norm harman 21 years ago # | flag

Not so minor changes to have sentence structured output. Some quick hacks to output sentences. Could have used a regular expression split (from re) instead of the 'if' but this was easy, worked, and ran fast enough.

Problems and todo

things like "Dr." are not the end of sentence
quotes, brackets and other punctuation
paragraphs, chapters

Seems like using a parser would be the right way to proceed if trying to do all of the above.

Examples from Gutenbergs "Tao de Ching"

Project Gutenberg to get the kingdom by force of arms.

Only he who overcomes himself is intelligent.

He who knows these two things finds in them ensued (in the same way) to accomplish the greatest things.

Trees and plants, in their ordinary life; let them not thoughtlessly indulge themselves in their early growth, are soft and weak than water, and yet these are the designations which kings and princes use for themselves.


import random;
import sys;

stopword = "\n" # Since we split on whitespace, this can never be a word
stopsentence = (".", "!", "?",) # Cause a "new sentence" if found at the end of a word
sentencesep  = "\n" #String used to seperate sentences


# GENERATE TABLE
w1 = stopword
w2 = stopword
table = {}

for line in sys.stdin:
    for word in line.split():
        if word[-1] in stopsentence:
            table.setdefault( (w1, w2), [] ).append(word[0:-1])
            w1, w2 = w2, word[0:-1]
            word = word[-1]
        table.setdefault( (w1, w2), [] ).append(word)
        w1, w2 = w2, word
# Mark the end of the file
table.setdefault( (w1, w2), [] ).append(stopword)

# GENERATE SENTENCE OUTPUT
maxsentences  = 5

w1 = stopword
w2 = stopword
sentencecount = 0
sentence = []

#note replace lessthan with the symbol
# I was having trouble with aspn commets
while sentencecount lessthan maxsentences:
    newword = random.choice(table[(w1, w2)])
    if newword == stopword: sys.exit()
    if newword in stopsentence:
        print "%s%s%s" % (" ".join(sentence), newword, sentencesep)
        sentence = []
        sentencecount += 1
    else:
        sentence.append(newword)
    w1, w2 = w2, newword

◄	Python recipes (4591)	►
◄	Brian Chin's recipes (1)	►

The Markov Chain Algorithm (Python recipe) by Brian Chin
ActiveState Code (http://code.activestate.com/recipes/194364/)

2 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

The Markov Chain Algorithm (Python recipe) by Brian Chin ActiveState Code (http://code.activestate.com/recipes/194364/)

2 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

The Markov Chain Algorithm (Python recipe) by Brian Chin
ActiveState Code (http://code.activestate.com/recipes/194364/)