ActiveState Code

Recipe 576544: Generate random user names from local dictionary file


Sometimes for testing purposes you need to fill a database with randomly generated user names. Or maybe you're offering distinguishable anonymity to users for whatever reason. Or maybe your product needs a codename! This recipe describes a very simple way to get a bunch of "names".

Python
import random

names_file_path = '/etc/dictionaries-common/words'
num_dict_lines = 9900            # A-Z, no apostrophes, approximate!
# lines * avg word len * 8; generous, since ASCII text is 1 byte/char
size_hint = num_dict_lines * 10 * 8

# readlines(hint) stops reading once about size_hint bytes are consumed.
with open(names_file_path) as names_file:
    rand_words = [ln for ln in names_file.readlines(size_hint) if "'" not in ln]

# Keep the original index range 2..num_dict_lines; slicing is safe on short files.
candidates = rand_words[2:num_dict_lines + 1]

def gen_name():
    return random.choice(candidates).strip()

# Generate a few samples.
for i in range(3):
    print(gen_name(), end=' ')

# Printed: Sister Frankfort Babbitt

Discussion

This works because a dictionary file such as Ubuntu's lists its capitalized entries, which are mostly proper names, at the top of the file. The script simply grabs the topmost lines and assumes that they're names (which is generally true, but you'll see exceptions).
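
As a rough sketch of the "grab the topmost lines" idea, the snippet below reads an exact number of lines from the start of the file instead of estimating a byte count. The helper name `read_top_lines` and its `max_lines` parameter are made up for illustration and are not part of the recipe; it assumes the word list has one entry per line.

```python
import itertools

def read_top_lines(path, max_lines):
    """Hypothetical helper: return entries from the first max_lines lines.

    Lines containing an apostrophe (possessives such as "Aaron's") are
    skipped, mirroring the filter used in the recipe.
    """
    with open(path) as fh:
        # islice stops after max_lines lines, so the rest of the file
        # is never read into a list.
        head = itertools.islice(fh, max_lines)
        return [ln.strip() for ln in head if "'" not in ln]

# Example call (path taken from the recipe; adjust for your system):
# names = read_top_lines('/etc/dictionaries-common/words', 18000)
```

Counting lines this way avoids guessing the average word length, at the cost of having to pick a line count that suits your particular dictionary.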

Assumptions and limitations:

  • you want it to be fast and local and don't care much about relevance/accuracy of names
  • hard-coded for Ubuntu word dictionary (adjust for location of yours)
  • your word dictionary may be shorter or longer (but it's of little consequence)
  • not all generated words are proper names
  • you could be more accurate by slurping the whole file and grabbing only capitalized words, but it would be slower
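
The last point, reading the whole file and keeping only capitalized words, might look something like the sketch below. The function name `load_capitalized` is invented for this example, and "capitalized" is interpreted here as "purely alphabetic with an uppercase first letter", which is an assumption rather than anything the recipe specifies.

```python
import random

def load_capitalized(path):
    """Hypothetical helper: read the whole word list, keep capitalized entries."""
    with open(path) as fh:
        words = (ln.strip() for ln in fh)
        # isalpha() rejects blank lines and entries with apostrophes;
        # the first-letter check keeps only capitalized entries.
        return [w for w in words if w.isalpha() and w[0].isupper()]

def gen_name(names):
    """Pick one entry at random from a non-empty list of names."""
    return random.choice(names)

# Example call (path taken from the recipe; adjust for your system):
# names = load_capitalized('/etc/dictionaries-common/words')
# print(gen_name(names))
```

This version does not depend on where the capitalized entries sit in the file, but it does scan every line once at startup.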

Comments

  1. At 4:23 a.m. on 30 Oct 2008, sebastien.renard said:

    Hello,

    Why do you bother with the number of lines and word length?

    names_file.readlines() is fine; it reads the whole file and returns it as a list of str.

  2. At 7:39 p.m. on 3 Nov 2008, Micah Elliott (the author) said:

    I don't want to read the whole file; just the first 20%, which contains the names. The other 80,000 lines are just words. It would probably be cleaner to just call readline 18,000 times.
