One of the challenges of being a bioinformatician is dealing with large files efficiently. Imagine parsing through a text file that is a couple of terabytes: not as simple a feat as one might think. Usually, even though we get massive amounts of data in text files, we can avoid worrying about file size since we have powerful computing clusters. I know, we are all so spoiled. However, every now and then we have to bite the bullet and not rely on ample resources to make up for our blissful ignorance.

I wrote a Python script that does some plotting for me. However, it takes about a week to run, since it has to parse through about 500 gigabytes of data and process it.

There are two simple programming changes that are easy to implement and very helpful in keeping a low memory footprint (and also in speeding up the program):

(1) range
In Python 2, calling range() actually builds the entire list and stores it in memory, whereas xrange() returns an object that yields the numbers one at a time as you iterate. (In Python 3, range() itself behaves lazily, like xrange() did.)

range(0,5000) takes up more memory than xrange(0,5000)

Here is a quick example of what I mean:

range.py:
import sys

for i in range(0, int(sys.argv[1])):
    pass

print " "

xrange.py:
import sys

for i in xrange(0, int(sys.argv[1])):
    pass

print " "

[$] time python range.py 10000000

real    0m1.139s
user    0m0.911s
sys     0m0.228s

[$] time python xrange.py 10000000

real    0m0.506s
user    0m0.502s
sys     0m0.004s
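Another quick way to see the difference is to measure the objects directly with sys.getsizeof. The sketch below uses Python 3 syntax (where print is a function and range is already lazy), so the eager-versus-lazy contrast is shown with a list comprehension versus a generator expression, which is the same idea:

```python
import sys

# In Python 3, range() is already lazy, so the eager/lazy contrast is
# shown with a list comprehension versus a generator expression.
eager = [i for i in range(5000)]   # builds all 5000 elements in memory
lazy = (i for i in range(5000))    # generator: yields one value at a time

print(sys.getsizeof(eager))  # tens of kilobytes
print(sys.getsizeof(lazy))   # a small, constant-size object
```

The generator's size stays constant no matter how many values it will eventually produce, which is exactly what you want when looping over millions of indices.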

(2) reading in files
When reading in a file, I usually write something along the lines of:

file = open('somefile.txt', 'r')
for i in file.readlines():
    # operation
file.close()

This is great: it is clear, concise, and elegant. However, readlines() reads the entire file into a list in memory. If the text file is gigantic, this may not be a very good idea. Luckily, there are several ways around this:

Method 1:
with open('somefile.txt', 'r') as FILE:
    for i in FILE:
        # operation

Please note that the "with" construct is only available from Python 2.6 onwards. Iterating over the file object directly like this reads one line at a time instead of loading the whole file.

Method 2:
import fileinput
for i in fileinput.input('somefile.txt'):
    # operation

The fileinput module is great for folks who like to use as many pre-built modules as possible rather than re-inventing the wheel.
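As a quick illustration (Python 3 syntax, with file names made up for the example), fileinput.input() can also take a list of paths and stream through all of them lazily, line by line, as one continuous input:

```python
import fileinput

# Write two small sample files so the example is self-contained
# (the names are illustrative).
with open('part1.txt', 'w') as f:
    f.write('line one\nline two\n')
with open('part2.txt', 'w') as f:
    f.write('line three\n')

# fileinput.input() reads the files lazily, one line at a time,
# presenting them as a single continuous stream.
lines = []
for line in fileinput.input(['part1.txt', 'part2.txt']):
    # fileinput.filename() reports which file the current line came from
    lines.append((fileinput.filename(), line.rstrip()))

print(lines)
```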

Method 3:
BUFFER = int(10E6)  # 10 megabyte buffer
file = open('somefile.txt', 'r')
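The buffer idea can be sketched out like this (my own sketch in Python 3 syntax, with a tiny sample file and a stand-in for the real processing): read() with a size argument returns at most that many characters, so only one buffer's worth of the file sits in memory at a time:

```python
BUFFER = int(10E6)  # 10 megabyte buffer

# Write a small sample file so the sketch is self-contained.
with open('somefile.txt', 'w') as f:
    f.write('some data\n' * 1000)

total = 0
with open('somefile.txt', 'r') as f:
    while True:
        chunk = f.read(BUFFER)  # at most BUFFER characters per read
        if not chunk:           # empty string means end of file
            break
        total += len(chunk)     # stand-in for real per-chunk processing

print(total)
```

This is handy when the "records" in the file are not line-oriented, or when even single lines can be enormous.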