Python's tarfile module is a handy way to access files within tar archives without needing to unpack them first. You can iterate over files using the following pattern:
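    import tarfile

    tar = tarfile.open(filename, 'r:gz')
    for tar_info in tar:
        # tar_info is the metadata for a file in the archive.
        file = tar.extractfile(tar_info)
        if file is None:
            # Directories and other non-file members have no contents.
            continue
        # file is a file-like object, so we can do standard
        # file-like things with it.
        for line in file:
            print(line)  # each line is a bytes object in Python 3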
Behind the scenes, each TarFile object maintains a list of members of the archive, and keeps this updated whenever you read or write members. This is fine for small archives, particularly if you want to access the metadata without having to re-read the archive. (TarFile objects have getmembers and getnames methods for this kind of access.)
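To see this bookkeeping in action, here is a minimal sketch (Python 3) that builds a tiny gzipped archive in memory and records how long the cached list is after each step of the iteration. The helper names build_archive and count_cached_members are made up for illustration, and the members list is an implementation detail of the standard library's tarfile rather than a documented guarantee, so treat this as an experiment, not a specification.

```python
import io
import tarfile


def build_archive(n_files):
    """Return the bytes of a gzipped tar holding n_files small text files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w:gz') as tar:
        for i in range(n_files):
            data = ('line %d\n' % i).encode('ascii')
            info = tarfile.TarInfo(name='file%d.txt' % i)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def count_cached_members(archive_bytes):
    """Iterate over the archive, noting len(tar.members) after each member."""
    counts = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode='r:gz') as tar:
        for tar_info in tar:
            counts.append(len(tar.members))
        # Once iteration has finished, the names come from the cached list.
        names = tar.getnames()
    return counts, names


counts, names = count_cached_members(build_archive(3))
print(counts, names)
```

The recorded counts should never shrink, and by the end the list holds one TarInfo per member, which is exactly what makes getnames cheap on a small archive.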
This list of members contains the TarInfo objects for every file in the archive. When you’ve got an archive with 18 million members (as I have), this list will no longer conceivably fit in memory. It’s not documented (as far as I can tell), but the solution is to periodically set the members attribute on the TarFile object to the empty list:
    import tarfile

    tar = tarfile.open(filename, 'r:gz')
    for tar_info in tar:
        file = tar.extractfile(tar_info)
        do_something_with(file)
        # Drop the TarInfo objects cached so far.
        tar.members = []
Obviously one loses the cached metadata access described above, but hopefully now my scripts will terminate in reasonable time!
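For anyone who wants to try the trick without an 18-million-member archive to hand, here is a self-contained sketch (Python 3) that runs it against a small in-memory archive and tracks how large the cached list ever gets. The names make_demo_archive and read_clearing_members are hypothetical helpers, not part of tarfile, and the whole thing leans on the same undocumented behaviour, so it may not hold on every Python version.

```python
import io
import tarfile


def make_demo_archive(n_files):
    """Build a small gzipped tar in memory (a stand-in for a real archive)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w:gz') as tar:
        for i in range(n_files):
            data = ('contents of file %d\n' % i).encode('ascii')
            info = tarfile.TarInfo(name='file%d.txt' % i)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def read_clearing_members(fileobj):
    """Read every regular file, emptying tar.members after each member.

    Returns the file contents plus the largest member-list length observed.
    """
    contents = []
    max_cached = 0
    with tarfile.open(fileobj=fileobj, mode='r:gz') as tar:
        for tar_info in tar:
            if tar_info.isfile():
                contents.append(tar.extractfile(tar_info).read())
            max_cached = max(max_cached, len(tar.members))
            # Undocumented: drop the TarInfo objects cached so far.
            tar.members = []
    return contents, max_cached


contents, max_cached = read_clearing_members(make_demo_archive(5))
print(len(contents), max_cached)
```

Every file is still read, while the cached list stays tiny instead of growing with the archive. One caveat worth hedging: members that are links have their targets looked up in this list, so I would expect extracting them to fail once the list has been cleared.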