
You've found a bug in an application that reads XML files. You'd like to help the vendor fix it, but you'd prefer not to share your data. This program preserves the document's structure but converts each word into random gibberish, replacing vowels with vowels, consonants with consonants, and digits with digits. It also interns each element and attribute name, so every instance of "foo" as an element or attribute name shows up as, say, "xeu", while each instance of "foo" in character data is scrambled afresh and will usually come out as a different string.

Python
import sys
from xml.sax import saxutils, handler, make_parser
import random

class ContentGenerator(handler.ContentHandler):
    vowels = "aeiouAEIOU"
    len_vowels = len(vowels)
    consonants = "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
    len_consonants = len(consonants)
    digits = "1234567890"
    len_digits = len(digits)
        
    def __init__(self, out = sys.stdout):
        handler.ContentHandler.__init__(self)
        self._out = out
        self.convertedEltNames = {}
    
    def _pick(self, source, len_source):
        return source[random.randrange(len_source)]
        
    def _convert(self, c):
        if c in self.vowels:
            return self._pick(self.vowels, self.len_vowels)
        elif c in self.consonants:
            return self._pick(self.consonants, self.len_consonants)
        elif c in self.digits:
            return self._pick(self.digits, self.len_digits)
        else:
            return c
        
    def _replace(self, text):
        return "".join([self._convert(c) for c in text])
        
    def _internName(self, name):
        if name not in self.convertedEltNames:
            self.convertedEltNames[name] = self._replace(name)
        return self.convertedEltNames[name]

    # ContentHandler methods
        
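    def startDocument(self):
        # Declare the encoding the output stream actually uses; fall back to
        # UTF-8 when the stream does not report one.
        encoding = getattr(self._out, 'encoding', None) or 'utf-8'
        self._out.write('<?xml version="1.0" encoding="%s"?>\n' % encoding)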

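    def startElement(self, name, attrs):
        self._out.write('<' + self._internName(name))
        for (attr_name, value) in attrs.items():
            # quoteattr() escapes the value and adds the surrounding quotes,
            # so values containing '"' still produce well-formed output.
            self._out.write(' %s=%s' % (self._internName(attr_name), saxutils.quoteattr(self._replace(value))))
        self._out.write('>')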

    def endElement(self, name):
        self._out.write('</%s>' % self._internName(name))

    def characters(self, content):
        self._out.write(saxutils.escape(self._replace(content)))

    def ignorableWhitespace(self, content):
        self._out.write(content)
        
    def processingInstruction(self, target, data):
        self._out.write('<?%s %s?>' % (self._internName(target), self._replace(data)))

parser = make_parser()
parser.setContentHandler(ContentGenerator())
parser.parse(sys.argv[1])

The SAX code is based on a venerable example posted to the python-dev mailing list: http://mail.python.org/pipermail/python-dev/2000-October/009946.html
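
One caveat with interning by random replacement: two different names can be scrambled to the same string (for example, "a" and "e" each become one of only ten vowels). If that happens to two attributes of the same element, the output is no longer well-formed, and if it happens to two element names, the structure changes. Below is a minimal sketch of one way to avoid this, assuming a simple retry-until-unused strategy. The names `scramble` and `NameInterner` are hypothetical and not part of the recipe.

```python
import random

VOWELS = "aeiouAEIOU"
CONSONANTS = "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
DIGITS = "1234567890"


def scramble(text, rng=random):
    """Replace each character with a random one of the same class."""
    out = []
    for c in text:
        for alphabet in (VOWELS, CONSONANTS, DIGITS):
            if c in alphabet:
                out.append(rng.choice(alphabet))
                break
        else:
            out.append(c)  # punctuation, whitespace, non-ASCII: unchanged
    return "".join(out)


class NameInterner:
    """Hypothetical helper: gives each name a stable, unique scrambled form."""

    def __init__(self, rng=random, max_tries=1000):
        self._rng = rng
        self._max_tries = max_tries
        self._by_name = {}   # original name -> scrambled name
        self._used = set()   # scrambled names already handed out

    def intern(self, name):
        if name in self._by_name:
            return self._by_name[name]
        for _ in range(self._max_tries):
            candidate = scramble(name, self._rng)
            if candidate not in self._used:
                break
        else:
            # Assumption: giving up is better than silently reusing a name.
            raise ValueError("no unused scrambling found for %r" % name)
        self._by_name[name] = candidate
        self._used.add(candidate)
        return candidate


if __name__ == "__main__":
    interner = NameInterner()
    for original in ("foo", "bar", "foo"):
        print(original, "->", interner.intern(original))
```

In the recipe, `_internName` could delegate to a helper like this. The retry loop is a design choice made for brevity; it relies on the scrambled form keeping the same vowel/consonant/digit pattern as the original, so there are always at least as many candidates as there are names with that pattern.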

2 comments

Éric Araujo 12 years, 12 months ago

“Vowels” and “consonants” are not as distinct as we usually think, but I guess the simple distinction works well enough for the purposes of this script :)

Eric Promislow (author) 12 years, 12 months ago

If I had more time, I would've used a Markov process to generate more plausible-looking text. But the main purpose of this recipe is to let you report a bug in an XML processor without revealing your data, only its structure.
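
The Markov-process idea mentioned above is not part of the recipe. A rough sketch of what it might look like follows, assuming a letter-level chain trained on a small public sample text (never on the confidential data itself, since the model would then leak its letter statistics). `SAMPLE_TEXT`, `build_model`, and `fake_word` are hypothetical names, and the sketch preserves only word length, not case or digits.

```python
import random
from collections import defaultdict

# Assumption: a small, public sample text stands in for a real corpus.
SAMPLE_TEXT = (
    "the quick brown fox jumps over the lazy dog while another "
    "sleepy animal wanders through the quiet garden before sunrise"
)


def build_model(words, order=2):
    """Map each `order`-letter context to the letters seen after it."""
    model = defaultdict(list)
    for word in words:
        word = word.lower()
        padded = "^" * order + word  # "^" marks the start of a word
        for i in range(len(word)):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def fake_word(model, length, order=2, rng=random):
    """Generate `length` letters by walking the chain from a word start."""
    start = "^" * order
    if length and not model.get(start):
        raise ValueError("model has no word-start transitions")
    state = start
    out = []
    while len(out) < length:
        followers = model.get(state)
        if not followers:
            state = start  # dead end: continue as if a new word began
            continue
        c = rng.choice(followers)
        out.append(c)
        state = state[1:] + c
    return "".join(out)


if __name__ == "__main__":
    model = build_model(SAMPLE_TEXT.split())
    print(" ".join(fake_word(model, n) for n in (5, 3, 8)))
```

A replacement for `_replace` along these lines would call `fake_word` once per word of character data; the character-class replacement in the recipe would still be needed for digits and for element and attribute names.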