Reading large files from zip archive

The standard zipfile module (in Python 2.4 and 2.5) only provides a method to extract the entire content of a file from within a zip-file at once. This extension adds a generator method that iterates over the lines of such a file, so the decompressed data never has to be held in memory as a whole.

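from zipfile import ZipFile, BadZipfile, ZIP_STORED, ZIP_DEFLATED
try:
    import zlib
except ImportError:
    zlib = None

class MyZipFile(ZipFile):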
    def __init__(self, file, mode="r", compression=ZIP_STORED):
        ZipFile.__init__(self, file, mode, compression)

    def lines(self, name, split="\n", bs=100*1024*1024):
        """ Generator function to allow iteration over content of a file.

        The content of the file is decompressed in chunks (maximum size = <bs>),
        split by the character <split>, and provided for iteration.
        The intention is to avoid storing the entire decompressed data
        in memory (which does not work for bigger zip-files).

        Choose <bs> as high as possible without risking a MemoryError,
        as larger chunks give better performance.
        The default value of 100 MB does a good job for me.
        """

        if self.mode not in ("r", "a"):
            raise RuntimeError, 'read() requires mode "r" or "a"'
        if not self.fp:
            raise RuntimeError, \
                  "Attempt to read ZIP archive that was already closed"
        zinfo = self.getinfo(name)
        filepos = self.fp.tell()
        
        self.fp.seek(zinfo.file_offset, 0)
        bytes = self.fp.read(zinfo.compress_size)
        self.fp.seek(filepos, 0)
        if zinfo.compress_type == ZIP_STORED:
            for line in bytes.split(split): yield line
        elif zinfo.compress_type == ZIP_DEFLATED:
            if not zlib:
                raise RuntimeError, \
                      "De-compression requires the (missing) zlib module"
            dc = zlib.decompressobj(-15)

            # While most of this routine is copied from the read() method of
            # the original ZipFile class definition, the following code is
            # specific to the new functionality. We decompress chunks,
            # split them, and "yield" the pieces as long as there is either
            # one more left or no more compressed data available. Then we "yield"
            # the rest.
            # The "decompress('Z')" trick is again taken from the original code.
            rest = ""
            while True:
                # += was faster than + was faster than "%s%s" % (a,b)
                rest += dc.decompress(bytes, bs)
                rs = rest.split(split)
                bytes = dc.unconsumed_tail
                rl = len(rs)
                if rl == 1:
                    rest = rs[0]
                else:
                    for i in xrange(rl - 1): yield rs[i]
                    rest = rs[-1]
                if len(bytes) == 0: break
            ex = dc.decompress('Z') + dc.flush()
            if ex: rest = rest + ex
            if len(rest) > 0:
                for r in rest.split(split): yield r
        else:
            raise BadZipfile, \
                  "Unsupported compression method %d for file %s" % \
                  (zinfo.compress_type, name)


def main():
    # to test this, change the file names to something you have
    zfn = "results_0067.zip"
    fn = "properties.csv"

    z = MyZipFile(zfn, "r", ZIP_DEFLATED)
    for line in z.lines(fn):
        print "+",
    z.close()

if __name__ == "__main__": main()

      

I need to deal with large data files whose entire content does not fit into memory as a single string.

I derived MyZipFile (well, we can discuss the name) from ZipFile and added a method

MyZipFile.lines(name, split="\n", bs=100 * 1024 * 1024)

which is a generator that iterates over the lines (or pieces separated by split) of the file without decompressing more than bs bytes at a time.

The main() function shows how it can be used.
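The core of lines() is the bounded decompress loop. A minimal sketch of that loop, ported to Python 3 as an assumption (the recipe itself targets 2.x) and detached from ZipFile so it runs on its own, might look like this. The function name iter_pieces and the sample data are made up for illustration.

```python
import zlib


def iter_pieces(compressed, sep=b"\n", bs=64 * 1024):
    """Yield sep-separated pieces of a raw deflate stream held in memory.

    At most bs bytes are decompressed per round; the incomplete trailing
    piece is carried over into the next round.
    """
    dc = zlib.decompressobj(-15)  # -15: raw deflate, as used inside zip files
    rest = b""
    data = compressed
    while data:
        rest += dc.decompress(data, bs)
        data = dc.unconsumed_tail  # input not yet consumed because of the bs limit
        pieces = rest.split(sep)
        rest = pieces.pop()  # the last piece may still be incomplete
        for piece in pieces:
            yield piece
    rest += dc.flush()  # collect any output still pending in the decompressor
    if rest:
        for piece in rest.split(sep):
            yield piece


# Build a raw deflate stream to try it out (hypothetical sample data).
co = zlib.compressobj(6, zlib.DEFLATED, -15)
payload = b"\n".join(b"row %d" % i for i in range(1000))
compressed = co.compress(payload) + co.flush()

# A deliberately tiny bs forces many rounds through the loop.
pieces = list(iter_pieces(compressed, bs=100))
```

With a small bs the loop runs many times, yet the pieces come out the same as splitting the whole payload at once, apart from a trailing empty piece that this variant drops when the data ends with the separator.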

The compressed data still needs to fit into memory (somebody might like to eliminate that), but I suspect that for most purposes it is sufficient to chunk the decompressed data.
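One way this remaining limitation could be lifted, as a rough sketch only, is to feed the compressed bytes to the decompressor piece by piece instead of reading them all up front. The helper iter_decompressed, its read_size parameter, and the in-memory test stream are assumptions for illustration. In the recipe, fp would be positioned at the start of the member's data and compress_size would come from zinfo.compress_size. The sketch is written for Python 3.

```python
import io
import zlib


def iter_decompressed(fp, compress_size, read_size=64 * 1024, bs=1024 * 1024):
    """Yield decompressed chunks from a raw deflate stream.

    The stream is compress_size bytes long and starts at fp's current
    position. Input is read read_size bytes at a time, and each
    decompress() call returns at most bs bytes.
    """
    dc = zlib.decompressobj(-15)  # raw deflate, no zlib header
    remaining = compress_size
    while remaining > 0:
        data = fp.read(min(read_size, remaining))
        if not data:
            break  # truncated input: stop instead of looping forever
        remaining -= len(data)
        while data:
            chunk = dc.decompress(data, bs)
            data = dc.unconsumed_tail  # left over because of the bs limit
            if chunk:
                yield chunk
    tail = dc.flush()  # whatever output is still pending
    if tail:
        yield tail


# Hypothetical test stream: highly repetitive data compresses very well,
# so each small read expands into several output chunks.
co = zlib.compressobj(6, zlib.DEFLATED, -15)
original = b"0123456789" * 50000
raw = co.compress(original) + co.flush()

chunks = list(iter_decompressed(io.BytesIO(raw), len(raw), read_size=256, bs=4096))
restored = b"".join(chunks)
```

The chunks could then be split and carried over exactly as in lines(); only the source of the compressed bytes changes.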

Btw: Sorry, I haven't jumped on the Python 3.0 train yet, so my code runs as is on 2.x only.

Tags: large_files, memory, zip

2 comments

Chris Jones 14 years, 8 months ago

For the record, zipfile in 2.6 has an open() method that returns a file-like object with a readline method. I was surprised to discover 2.4/2.5 does not have this functionality.

Volker S. (author) 14 years, 8 months ago

Yes, actually I use 2.4 (job) and 2.5 (home) so far. A file-like object is a nice solution. Good that it has become standard.
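For reference, the open() method mentioned in the comments could be used roughly like this. The sketch assumes Python 3 and builds a tiny archive in memory so that it runs on its own; the member name and its contents are made up.

```python
import io
import zipfile

# Build a small archive in memory (hypothetical member name and contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("properties.csv", "a,b\n1,2\n3,4\n")

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    # open() returns a file-like object that decompresses as it is read,
    # so iterating over it yields one line (as bytes) at a time.
    with zf.open("properties.csv") as member:
        rows = [line.rstrip(b"\n") for line in member]
```

For a really large member, the list comprehension would be replaced by a plain for loop that processes each line and discards it.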


Reading large files from zip archive (Python recipe) by Volker S.
ActiveState Code (http://code.activestate.com/recipes/576882/)
