One of the challenges of being a bioinformatician is dealing with large files efficiently. Imagine parsing through a text file that is a couple of terabytes--not as simple a feat as one would think. Usually, even though we receive massive amounts of data as text files, we can avoid worrying about file size since we have powerful computing clusters. I know, we are all so spoiled. However, every now and then we have to bite the bullet and not rely on ample resources to make up for our blissful ignorance.

I wrote a Python script that does some plotting for me. However, it takes about a week to run, since it has to parse through and process about 500 gigabytes of data.

There are two small programming changes that are easy to make and very helpful for keeping a low memory footprint (and also for speeding up the program):

(1) range
In Python, calling the range function actually builds the entire list and stores it in memory. The xrange function, by contrast, returns a lazy object that hands out the numbers one at a time without ever building the list.

range(0,5000) takes up more memory than xrange(0,5000)

Here is a quick example of what I mean:

range.py:
import sys

# range() builds the full list of numbers in memory before looping
for i in range(0, int(sys.argv[1])):
    pass

print " "

xrange.py:
import sys

# xrange() yields each number lazily; no list is ever built
for i in xrange(0, int(sys.argv[1])):
    pass

print " "

[$] time python range.py 10000000

real    0m1.139s
user    0m0.911s
sys     0m0.228s

[$] time python xrange.py 10000000

real    0m0.506s
user    0m0.502s
sys     0m0.004s
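The timing difference is nice, but memory is the real point. A quick way to see it is sys.getsizeof (available from Python 2.6), which reports the size of the container object itself. This is just a rough sketch--the exact numbers will vary by platform, and for the list it does not even count the integer objects being held:

import sys

# range() materialises a full list; getsizeof() reports only the list's
# internal pointer array, so the true footprint is even larger
print sys.getsizeof(range(0, 10000000))    # tens of megabytes

# xrange() is a small constant-size object, no matter how big the range
print sys.getsizeof(xrange(0, 10000000))   # a few dozen bytes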

(2) readlines
When reading in a file, I usually write something along the lines of:

infile = open('somefile.txt', 'r')   # 'infile' avoids shadowing Python 2's built-in 'file'
for line in infile.readlines():
    # operation on line
    pass
infile.close()

This is great. It is clear, concise, and elegant. However, readlines() loads the entire file into memory at once. If the text file is gigantic, this is not a very good idea. Luckily, there are several ways around this:

Method 1:
with open('somefile.txt', 'r') as FILE:
    for line in FILE:
        # operation on line
        pass

Please note that the "with" statement is only available by default from Python 2.6 onwards (in Python 2.5 it can be enabled with "from __future__ import with_statement").

Method 2:
import fileinput
for line in fileinput.input('somefile.txt'):
    # operation on line
    pass

The fileinput module is great for folks who like to lean on pre-built modules rather than re-invent the wheel.
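One extra trick, in case it is useful: fileinput can also chain several files together, or fall back to stdin when given no names. A quick sketch (the chunk file names here are made up for illustration):

import fileinput

# loop over several files as if they were one continuous stream
for line in fileinput.input(['chunk1.txt', 'chunk2.txt']):
    if fileinput.isfirstline():
        print "now reading %s" % fileinput.filename()
    # operation on line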

Method 3:
BUFFER = int(10e6)  # ask readlines() for roughly 10 megabytes of lines at a time
infile = open('somefile.txt', 'r')
lines = infile.readlines(BUFFER)
while lines:
    for line in lines:
        # operation on line
        pass
    lines = infile.readlines(BUFFER)
infile.close()

Although this method is the messiest of the three, it also provides the most control over how much memory the program can suck up.

Even though it might be more memory efficient not to load the entire text file into memory, the program may end up slower if one is not careful. It is a balancing act, and there is no single correct way to go about something like this. One just needs to weigh the options and do what seems best. In my experience, the three methods show minimal variance in execution time (no matter how large the file), so it really comes down to a matter of personal preference. I've used all three of the suggested methods at one time or another; my personal preference is for method one or method three.
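If you would rather measure than guess on your own data, a small harness like the sketch below can report wall time and peak memory for any of the methods. This assumes a Unix-like system; note that ru_maxrss is in kilobytes on Linux but bytes on OS X, and since it is a high-water mark for the whole process, each method should be run in a fresh process for a fair comparison:

import resource
import time

def method_one(path):
    # method 1 from above, wrapped in a function for benchmarking
    with open(path, 'r') as FILE:
        for line in FILE:
            pass

def benchmark(reader, path):
    # report wall-clock time and the process's peak resident memory
    start = time.time()
    reader(path)
    elapsed = time.time() - start
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print "%s: %.1f seconds, peak RSS ~%d KB" % (reader.__name__, elapsed, peak)

benchmark(method_one, 'somefile.txt')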