Just a quick recipe I developed a few years ago that may be useful to others. It takes a data file of comma-separated values (CSV) as input and translates it into a series of SQL "INSERT" statements, which you can then feed into MySQL, SQLite, or any other database.
Example Usage:
$ cat cars.csv
Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38
$ sqlite3 cars.db "CREATE TABLE cars (Year, Make, Model, Length)"
$ ./csv2sql.py cars.csv | sqlite3 cars.db
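For reference, the intermediate SQL that csv2sql.py emits (shown here without the pipe) looks like this:

$ ./csv2sql.py cars.csv
INSERT INTO cars (Year, Make, Model, Length) VALUES ("1997", "Ford", "E350", "2.34");
INSERT INTO cars (Year, Make, Model, Length) VALUES ("2000", "Mercury", "Cougar", "2.38");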
$ sqlite3 cars.db "SELECT * FROM cars"
1997|Ford|E350|2.34
2000|Mercury|Cougar|2.38
Enjoy! Feedback welcome!
cheers James Mills / prologic
#!/usr/bin/env python
"""csv2sql
Tool to convert CSV data files into SQL statements that
can be used to populate SQL tables. Each line of text in
the file is read, parsed and converted to SQL and output
to stdout (which can be piped).
A table to populate is given by the -t/--table option or
by the basename of the input file (if not standard input).
Fields are either given by the -f/--fields option (comma
separated) or determinted from the first row of data.
"""
__version__ = "0.4"
__author__ = "James Mills"
__date__ = "3rd February 2011"
import os
import csv
import sys
import optparse
USAGE = "%prog [options] <file>"
VERSION = "%prog v" + __version__
def parse_options():
parser = optparse.OptionParser(usage=USAGE, version=VERSION)
parser.add_option("-t", "--table",
action="store", type="string",
default=None, dest="table",
help="Specify table name (defaults to filename)")
parser.add_option("-f", "--fields",
action="store", type="string",
default=None, dest="fields",
help="Specify a list of fields (comma-separated)")
parser.add_option("-s", "--skip",
action="append", type="int",
default=[], dest="skip",
help="Specify records to skip (multiple allowed)")
opts, args = parser.parse_args()
if len(args) < 1:
parser.print_help()
raise SystemExit, 1
return opts, args
def generate_rows(f):
sniffer = csv.Sniffer()
dialect = sniffer.sniff(f.readline())
f.seek(0)
reader = csv.reader(f, dialect)
for line in reader:
yield line
def main():
opts, args = parse_options()
filename = args[0]
if filename == "-":
if opts.table is None:
print "ERROR: No table specified and stdin used."
raise SystemExit, 1
fd = sys.stdin
table = opts.table
else:
fd = open(filename, "rU")
if opts.table is None:
table = os.path.splitext(filename)[0]
else:
table = opts.table
rows = generate_rows(fd)
if opts.fields:
fields = ", ".join([x.strip() for x in opts.fields.split(",")])
else:
fields = ", ".join(rows.next())
for i, row in enumerate(rows):
if i in opts.skip:
continue
values = ", ".join(["\"%s\"" % x for x in row])
print "INSERT INTO %s (%s) VALUES (%s);" % (table, fields, values)
if __name__ == "__main__":
main()
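One subtlety worth noting: when -f/--fields is given, the header row is not consumed as field names, so you should skip it explicitly with -s 0. For example (the table and field names here are just illustrative):

$ ./csv2sql.py -t vehicles -f year,make,model,length -s 0 cars.csv
INSERT INTO vehicles (year, make, model, length) VALUES ("1997", "Ford", "E350", "2.34");
INSERT INTO vehicles (year, make, model, length) VALUES ("2000", "Mercury", "Cougar", "2.38");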
Most RDBMSs already ship with optimised CSV importers that use COPY or a similar mechanism to load large amounts of data quickly by disabling transactional checks. That said, there may well be situations where you don't have direct access to the database, and then this kind of solution is required. In that case you should definitely use the DB-API approach of executemany(STMT, [rows]).
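For instance, a minimal sketch of the executemany() approach using the sqlite3 module, assuming the cars.csv/cars.db layout from the example above:

import csv
import sqlite3

with open("cars.csv") as f:
    reader = csv.reader(f)
    fields = next(reader)   # header row: Year, Make, Model, Length
    rows = list(reader)     # remaining data rows

conn = sqlite3.connect("cars.db")
placeholders = ", ".join(["?"] * len(fields))
stmt = "INSERT INTO cars (%s) VALUES (%s)" % (", ".join(fields), placeholders)

# executemany() binds each row as parameters, so quoting and escaping
# are handled by the driver rather than by string formatting.
conn.executemany(stmt, rows)
conn.commit()
conn.close()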
Thanks for your comment.
This tool was originally designed to be just a filter - piping into MySQL, for example (or SQLite). You could add the kind of support you're talking about by using the SQLAlchemy library :)
--James
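For the curious, a hedged sketch of that SQLAlchemy route, assuming SQLAlchemy 1.4+ and the cars.db from the example above (the row values are just the sample data, not part of the recipe):

from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("sqlite:///cars.db")

# Reflect the existing cars table rather than declaring it by hand.
cars = Table("cars", MetaData(), autoload_with=engine)

with engine.begin() as conn:    # begin() commits on success
    conn.execute(cars.insert(), [
        {"Year": "1997", "Make": "Ford", "Model": "E350", "Length": "2.34"},
        {"Year": "2000", "Make": "Mercury", "Model": "Cougar", "Length": "2.38"},
    ])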
This is a very useful and efficient piece of code for when you need to upload huge amounts of data and other ways of getting data into your SQL server won't work (e.g. when the LOAD DATA command is blocked on MySQL databases on shared web hosting). Though I am new to Python, I managed to improve it so that, for example, it now allows bulk inserts, which are much faster than single inserts. Good stuff!
@Michael: I'm very happy you found it useful! This recipe is actually part of my personal set of tools I carry around with me (http://bitbucket.org/prologic/tools). Would you mind sharing your improved code so I can re-integrate your changes? :) --JamesMills / prologic
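For readers who want the bulk-insert variant Michael describes, here is a minimal sketch (the helper name and batch_size are illustrative, not Michael's actual code): batch rows into a single multi-row INSERT instead of one statement per row.

def bulk_statements(table, fields, rows, batch_size=500):
    # Collect rows into batches and emit one multi-row INSERT per batch.
    batch = []
    for row in rows:
        batch.append("(%s)" % ", ".join('"%s"' % x for x in row))
        if len(batch) >= batch_size:
            yield "INSERT INTO %s (%s) VALUES %s;" % (table, fields, ", ".join(batch))
            batch = []
    if batch:
        yield "INSERT INTO %s (%s) VALUES %s;" % (table, fields, ", ".join(batch))

Multi-row VALUES lists are supported by MySQL and by SQLite 3.7.11 and later.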
James,
Wouldn't you run into issues with values that contain double quotes or other characters that are illegal from the database's point of view?
@Dimitry: You should probably sanitize your data before running it through this simple tool then. Or you could fork this at will and add that feature. --JamesMills / prologic
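For illustration, a minimal sketch of that sanitising step (the quote() helper is hypothetical, not part of the recipe): per standard SQL, an embedded quote character is escaped by doubling it.

def quote(value):
    # Double any embedded quote so the value can't break out of the literal.
    return '"%s"' % str(value).replace('"', '""')

print(quote('Ford "Heavy Duty" E350'))   # -> "Ford ""Heavy Duty"" E350"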