High memory usage when using Python’s tarfile module

The Python tarfile module is a handy way to access files within tar archives without needing to unpack them first. You can iterate over files using the following pattern:

import tarfile

tar = tarfile.open(filename, 'r:gz')
for tar_info in tar:                  # tar_info is the metadata for a
                                      #   file in the archive.
    file = tar.extractfile(tar_info)  # file is a file-like object, or
    if file is None:                  #   None for members that aren't
        continue                      #   regular files (e.g. directories).
    for line in file:                 # We can do standard file-like
        print(line.decode(), end='')  #   things; lines come back as bytes.

Behind the scenes, each TarFile object maintains a list of the archive's members, and keeps it up to date whenever you read or write members. This is fine for small archives, particularly if you want to access the metadata without having to re-read the archive. (TarFile objects have getmember, getmembers, and getnames methods for this kind of access.)
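For instance, something along these lines will pull metadata straight out of that cached list (the archive and member names here are just placeholders):

import tarfile

tar = tarfile.open('example.tar.gz', 'r:gz')  # placeholder archive name
print(tar.getnames())                     # every member name, served from
                                          #   the cached list
info = tar.getmember('some/file.txt')     # placeholder member name
print(info.size, info.mtime)              # TarInfo carries size, mtime, etc.
tar.close()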

This list contains a TarInfo object for every file in the archive. When you've got an archive with 18 million members (as I have), the list has no hope of fitting in memory. It's not documented (as far as I can tell), but the workaround is to periodically reset the members attribute on the TarFile object to an empty list:

import tarfile

tar = tarfile.open(filename, 'r:gz')
for tar_info in tar:
    file = tar.extractfile(tar_info)
    do_something_with(file)
    tar.members = []              # Throw away the cached TarInfo objects
                                  #   so the list can't keep growing.

Obviously you lose the metadata-lookup functionality described above (getmember and friends), but hopefully my scripts will now finish in a reasonable amount of time!
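If you'd rather keep the trick in one place, you could wrap it in a small generator along these lines (iter_archive is just an illustrative name, and do_something_with stands in for whatever processing you need):

import tarfile

def iter_archive(filename):
    """Yield (TarInfo, file object) pairs, dropping the member cache as we go."""
    tar = tarfile.open(filename, 'r:gz')
    try:
        for tar_info in tar:
            file = tar.extractfile(tar_info)   # None for non-regular members.
            yield tar_info, file
            tar.members = []                   # Keep the cache empty.
    finally:
        tar.close()

for tar_info, file in iter_archive(filename):
    if file is not None:
        do_something_with(file)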


One Response to “High memory usage when using Python’s tarfile module”

  1. I ran into this problem too. Thanks for sharing. It fixed my problem.
