
Just a quick recipe I developed a few years ago that I thought might be useful to others. It takes as input a data file with comma-separated values (CSV) and translates it into a series of SQL "INSERT" statements, which you can then feed into MySQL, SQLite, or any other database.

Example Usage:

$ cat cars.csv
Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38

$ sqlite3 cars.db "CREATE TABLE cars (Year, Make, Model, Length)"

$ ./csv2sql.py cars.csv | sqlite3 cars.db

$ sqlite3 cars.db "SELECT * FROM cars" 1997|Ford|E350|2.34 2000|Mercury|Cougar|2.38

Enjoy! Feedback welcome!

cheers James Mills / prologic

Python, 97 lines
#!/usr/bin/env python

"""csv2sql

Tool to convert CSV data files into SQL statements that
can be used to populate SQL tables. Each line of text in
the file is read, parsed and converted to SQL and output
to stdout (which can be piped).

A table to populate is given by the -t/--table option or
by the basename of the input file (if not standard input).

Fields are either given by the -f/--fields option (comma
separated) or determined from the first row of data.
"""

__version__ = "0.4"
__author__ = "James Mills"
__date__ = "3rd February 2011"

import os
import csv
import sys
import optparse

USAGE = "%prog [options] <file>"
VERSION = "%prog v" + __version__

def parse_options():
    parser = optparse.OptionParser(usage=USAGE, version=VERSION)

    parser.add_option("-t", "--table",
            action="store", type="string",
            default=None, dest="table",
            help="Specify table name (defaults to filename)")

    parser.add_option("-f", "--fields",
            action="store", type="string",
            default=None, dest="fields",
            help="Specify a list of fields (comma-separated)")

    parser.add_option("-s", "--skip",
            action="append", type="int",
            default=[], dest="skip",
            help="Specify records to skip (multiple allowed)")

    opts, args = parser.parse_args()

    if len(args) < 1:
        parser.print_help()
        raise SystemExit, 1

    return opts, args

def generate_rows(f):
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(f.readline())
    f.seek(0)

    reader = csv.reader(f, dialect)
    for line in reader:
        yield line

def main():
    opts, args = parse_options()

    filename = args[0]

    if filename == "-":
        if opts.table is None:
            print "ERROR: No table specified and stdin used."
            raise SystemExit, 1
        fd = sys.stdin
        table = opts.table
    else:
        fd = open(filename, "rU")
        if opts.table is None:
            table = os.path.splitext(filename)[0]
        else:
            table = opts.table

    rows = generate_rows(fd)

    if opts.fields:
        fields = ", ".join([x.strip() for x in opts.fields.split(",")])
    else:
        fields = ", ".join(rows.next())

    for i, row in enumerate(rows):
        if i in opts.skip:
            continue

        values = ", ".join(["\"%s\"" % x for x in row])
        print "INSERT INTO %s (%s) VALUES (%s);" % (table, fields, values)

if __name__ == "__main__":
    main()

6 comments

Charlie Clark 13 years, 2 months ago

Most RDBMS already have optimised CSV importers that use the COPY or similar approach for dumping large amounts of data quickly by disabling transactional checks. That said, there may well be a situation where you don't have access to the database directly so this kind of solution is required. In that case you should definitely use the DB-API approach of executemany(STMT, [rows]).
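
A minimal sketch of the DB-API executemany() approach Charlie describes, assuming the sqlite3 module and the cars.csv/cars.db layout from the example above (with the table already created):

import csv
import sqlite3

conn = sqlite3.connect("cars.db")

with open("cars.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    # Placeholders let the driver handle quoting and escaping,
    # and executemany() inserts every remaining row in one call.
    conn.executemany(
        "INSERT INTO cars (Year, Make, Model, Length) VALUES (?, ?, ?, ?)",
        reader)

conn.commit()
conn.close()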

James Mills (author) 13 years, 2 months ago

Thanks for your comment.

This tool was designed originally to just be a filter - piping into MySQL for example (or SQLite). You could add the kind of support you're talking about by utilizing the SQLAlchemy library :)

--James
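
A minimal sketch of the SQLAlchemy route mentioned above, using the modern Core API rather than the 2011-era one; the table definition mirrors the cars example and the names and types are only illustrative:

import csv
from sqlalchemy import create_engine, MetaData, Table, Column, Text, insert

engine = create_engine("sqlite:///cars.db")
metadata = MetaData()
cars = Table("cars", metadata,
             Column("Year", Text), Column("Make", Text),
             Column("Model", Text), Column("Length", Text))
metadata.create_all(engine)  # create the table if it doesn't exist

with open("cars.csv") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = [dict(zip(header, row)) for row in reader]

with engine.begin() as conn:          # single transaction for the whole load
    conn.execute(insert(cars), rows)  # executemany-style bulk insert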

Michael Branson 11 years, 6 months ago

This is a very useful and efficient piece of code for when you need to upload huge amounts of data and other ways of loading data into your SQL server aren't available (e.g. when the LOAD DATA command is blocked on MySQL databases on shared web hosting). Though I am new to Python, I managed to improve it so that, for example, it now allows bulk inserts, which are much faster than single inserts. Good stuff!
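
Michael's changes aren't shown here, but a minimal sketch of that kind of bulk insert applied to the recipe's main loop might look like this (the helper name and the batch size of 500 are just illustrative):

def write_bulk_inserts(rows, table, fields, batch_size=500):
    # Emit one multi-row INSERT per batch instead of one INSERT per row.
    batch = []
    for row in rows:
        batch.append("(%s)" % ", ".join("\"%s\"" % x for x in row))
        if len(batch) >= batch_size:
            print("INSERT INTO %s (%s) VALUES %s;" % (table, fields, ", ".join(batch)))
            batch = []
    if batch:
        print("INSERT INTO %s (%s) VALUES %s;" % (table, fields, ", ".join(batch)))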

James Mills (author) 11 years, 6 months ago

@Michael: I'm very happy you found it useful! This recipe is actually part of my personal set of tools I carry around with me (http://bitbucket.org/prologic/tools). Would you mind sharing your improved code so I can re-integrate your changes? :) --JamesMills / prologic

Dimitry Voytenko 11 years, 3 months ago

James,

values = ", ".join(["\"%s\"" % x for x in row])

Wouldn't you have issues with values that may have double quotes in the text or other illegal characters from DB's point of view?

James Mills (author) 11 years, 3 months ago

@Dimitry: You should probably sanitize your data before running it through this simple tool then. Or you could fork this at will and add that feature. --JamesMills / prologic
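
For anyone hitting the issue Dimitry raises, a minimal sketch of a quoting helper that could replace the values line quoted above (standard SQL escapes an embedded single quote by doubling it; this still doesn't cover every case, so for untrusted data the parameterized DB-API route shown in the first comment is the safer choice):

def sql_quote(value):
    # Quote a field as a SQL string literal, doubling any embedded single quotes.
    return "'%s'" % str(value).replace("'", "''")

values = ", ".join(sql_quote(x) for x in row)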