
When using the Python DB API, it's tempting to always use a cursor's fetchall() method so that you can easily iterate through a result set. For very large result sets, though, this can be expensive in terms of memory (and time to wait for the entire result set to come back). You can use fetchmany() instead, but then you have to manage looping through the intermediate result sets. Here's a generator that simplifies that for you.

Python, 11 lines
# This code requires Python 2.2.1 or later
from __future__ import generators    # needs to be at the top of your module

def ResultIter(cursor, arraysize=1000):
    'An iterator that uses fetchmany to keep memory usage down'
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result

To iterate through the result of a query, you often see code like this:

# where con is a DB API 2.0 database connection object
cursor = con.cursor()
cursor.execute('select * from HUGE_TABLE')
for result in cursor.fetchall():
    doSomethingWith(result)

This is fine if fetchall() returns a small result set, but not so great if the query result is very large, or takes a long time to return. 'Very large' and 'long time' are relative, of course, but in any case it's easy to see that cursor.fetchall() is going to need to allocate enough memory to store the entire result set at once. In addition, the doSomethingWith function isn't going to get called until that entire query finishes either.

Doing it one row at a time with cursor.fetchone() is an option, but it doesn't take advantage of the database's efficiency in returning multiple records for a single fetch (as opposed to many separate fetches).
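For illustration, here's what the one-row-at-a-time pattern looks like. This sketch uses the standard library's sqlite3 module (which follows DB API 2.0) with a small made-up table, so it can be run as-is; in practice HUGE_TABLE and the processing step would be your own:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cursor = con.cursor()
cursor.execute('create table HUGE_TABLE (n integer)')
cursor.executemany('insert into HUGE_TABLE values (?)',
                   [(i,) for i in range(5)])
cursor.execute('select n from HUGE_TABLE')

rows = []
while True:
    row = cursor.fetchone()   # one fetch call per row
    if row is None:           # fetchone() returns None when the set is exhausted
        break
    rows.append(row)          # stand-in for doSomethingWith(row)

print(rows)                   # each row comes back as a tuple
```

Only one row is in memory at a time, but every row costs a separate fetch call.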

To address this, there's a cursor.fetchmany() method that returns the next 'n' rows of the query, allowing you to strike a time/space compromise between the other two options. The ResultIter function shown here provides a generator-based implementation that lets you take advantage of fetchmany() while still using the simple notation of fetchall().

ResultIter would be used like so:

...
# where con is a DB-API 2.0 database connection object
cursor = con.cursor()
cursor.execute('select * from HUGE_TABLE')

for result in ResultIter(cursor):
    doSomethingWith(result)

This looks similar to the code above, but internally the ResultIter generator chunks the database calls into a series of fetchmany() calls. The default is that 1,000 records at a time are fetched, but you can change that according to your own requirements (either by changing the default, or just by using the second parameter to ResultIter()). As always, trying different values with the profiler is probably a good idea... performance can vary based on schema, database type, and/or choice of Python DB API 2.0 module.
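Here is a minimal end-to-end sketch using the standard library's sqlite3 module (any DB API 2.0 module would do; the table, column, and row count are made up for the demonstration):

```python
import sqlite3

def ResultIter(cursor, arraysize=1000):
    'An iterator that uses fetchmany to keep memory usage down'
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result

con = sqlite3.connect(':memory:')
cursor = con.cursor()
cursor.execute('create table HUGE_TABLE (n integer)')
cursor.executemany('insert into HUGE_TABLE values (?)',
                   [(i,) for i in range(2500)])
cursor.execute('select n from HUGE_TABLE')

# With arraysize=1000, at most 1000 rows are held in memory per
# fetchmany() call: here that means three fetches (1000, 1000, 500).
total = sum(row[0] for row in ResultIter(cursor, arraysize=1000))
print(total)
```

The caller still sees a flat, fetchall()-style iteration; the chunking is invisible.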

6 comments

Hamish Lawson 19 years, 5 months ago

Use of True and False requires Python 2.2.1. The code is stated to require Python 2.2 or later, but the use of True and False means that Python 2.2.1 will actually be required.

Christopher Prinos (author) 19 years, 5 months ago

True/False and 2.2.1. Right you are, I updated the comment.

2.2 users can of course just replace True/False with 1/0

anthony baxter 16 years, 5 months ago

code cleanup. Why not instead this? It's a bit cleaner, and certainly easier on the eyes...

def ResultIter(cursor, arraysize=1000):
    'An iterator that uses fetchmany to keep memory usage down'
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result

joseph sheedy 16 years, 4 months ago

Previous comment fixes a problem. MySQLdb returns results as tuples, so the comparison ( results == [] ) fails. The

while True:
  if not results:
    break

approach works correctly.
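A quick illustration of why truth testing is the safer check (exact return types vary by driver):

```python
# An empty tuple is not equal to an empty list, so an equality test
# against [] misses the end-of-results case for drivers that return
# tuples. Truth testing handles both: empty containers are falsy.
print(() == [])   # False
print(not ())     # True
print(not [])     # True
```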

Dharol Tankersley 10 years, 1 month ago

I am using python 2.7 with pyodbc and freeTDS (and pymongo) to query an mssql database.

I can successfully execute a query that returns 9,400 results, both with and without the python generator recipe. I continue to get the following error for queries returning > 10,000 rows:

Traceback (most recent call last):
  File "/Users/dmoney/eclipse/workspace/mongodb/pygeneratorattempt.py", line 77, in <module>
    for tup in ResultIter(cursor):
  File "/Users/dmoney/eclipse/workspace/mongodb/pygeneratorattempt.py", line 21, in ResultIter
   results = cursor.fetchmany(arraysize)
MemoryError

My script:

from __future__ import generators    # needs to be at the top of your module

def ResultIter(cursor, arraysize=1000):
    'An iterator that uses fetchmany to keep memory usage down'
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for tup in results:
            yield tup

import pyodbc
url = 'DSN=myserver;UID=myuserid;PWD=mypswrd;PORT=1433;DATABASE=mydb'
pyodbcconn = pyodbc.connect(url); cursor = pyodbcconn.cursor()
from pymongo import Connection; mongoconn = Connection('localhost')
from pymongo.collection import Collection
mongodb = mongoconn.t4xquery

for tup in ResultIter(cursor):
    dosomethingwith(tup)

I am a newbie to everything - python, pymongo, sql, etc. I am guessing that the generator is failing to chunk the data, since it works with smaller sets but still gives memory errors with the larger queries?

Thanks for help, -d

Grijesh Chauhan 5 years, 6 months ago

In Django, I once had to join two unrelated models' tables to select valid rows. For this I wrote raw PostgreSQL and used Django's connection object, but I did not have to write ResultIter-style code: Django ships a cursor_iter helper that works well. My code was something like this:

from django.db import connection
from django.db.models.sql.compiler import cursor_iter

def get_table_name(model):
    return model._meta.db_table

def select_pan_invalidname_validname(...):
    query = """
    SELECT ...
    FROM (
      SELECT...
        FROM {table1}
        WHERE ...
      UNION...
      SELECT...
        FROM {table1}
        WHERE ...
    ) viewtable, {table3}
    WHERE ...
    ....
    """.format(
        table1=get_table_name(model1),
        table2=get_table_name(model2),
        table3=get_table_name(model3),        
        arg1=value1,
        .
        .
    )
    cursor = connection.cursor()
    try:
        cursor.execute(query)
    except Exception:
        cursor.close()
        raise
    for rows in cursor_iter(cursor, connection.features.empty_fetchmany_value):
        for row in rows:
            yield row
Created by Christopher Prinos on Tue, 2 Jul 2002 (PSF)