Today you will all learn to be dorks. Perhaps not as dorky as I am, perhaps not even as dorky as the other instructors, but dorks nonetheless. And why's that? Because you're going to start to think about how to make programs fast.
Our goal here is not to show you, comprehensively, what makes things fast and what makes things slow. There are entire courses devoted to that-- you would start with the famously difficult one called 'Algorithms'-- and even they can't tell you how to speed up every problem. But we do want to have you learn that there are often several ways to do things, and often enough one way is better than the others. We'd also like you to have tools to figure out what's slowing you down so that you can figure out how to speed it up.
This is not to say that faster is always better. Indeed, with the speed of today's computers, good algorithms are arguably less important than they have ever been. After all, what's the difference between a program taking .001 seconds to run and five minutes to write vs one that takes .0001 seconds to run but an hour to write? The latter is tenfold faster but takes ten times as much time out of your day. Most of the time, it's not worth it.
However, needless to say, sometimes it is. As an example, today we're going to write a program that maps sequencing reads to the genome. And then we're going to make it better.
Let's map those reads
Let's first define the problem. For each position in the genome, we want to know how many reads start at that position. In order to do this roughly the way that they did it, we will have to look at the dark and foreboding supplementary materials of the paper, from which few who enter ever return. But provided that we do, we find that they did a few things to process these data:
1. Filter out reads that look like garbage
2. Filter out reads that look like rRNA
3. Align remaining reads to ORF positions
Of course, this isn't quite that simple. We need to add a few steps around the margins:
1. Parse and load reads file, filtering out 'garbage' reads
2. Parse and load fasta files for ORFs and rRNAs
3. Filter out reads that look like rRNA
4. Align remaining reads to ORF positions
5. Print everything out sensibly
So what does our program look like?
mapper.py
#!/usr/bin/env python
import sys
import seqTools
import project

readFile = sys.argv[1]  # this file needs to include orfs + a hundred bases upstream of each
orfFile = sys.argv[2]
rnaFile = sys.argv[3]

# step 1
reads = project.parseAndFilterReads(readFile)

# step 2
orfs = seqTools.parseFasta(orfFile)
for id in orfs:
    orfs[id] = orfs[id][900:]
rnas = seqTools.parseFasta(rnaFile)

results = [0] * 400
for r in reads:
    # step 3
    if project.matches(r, rnas): continue
    # step 4 (alters the 'results' list within the function)
    project.findRead(r, reads[r], orfs, results)

# step 5
for p, c in enumerate(results):
    # print the position and the count of reads at that position
    print p, c
Here we've stubbed it out as per yesterday's lecture. Now let's take a break for folks to write some of these functions.
Exercises
1. Write project.parseAndFilterReads(readFile). It should take the file of reads that you downloaded on Monday as an argument and return a dictionary. This dictionary should have the read sequences as keys and the number of times that read sequence appears as values. However, there are some modifications that need to be made to the reads. As you discovered yesterday, it looks like there's a lot of junk in there. We're going to use a filtration process similar to Ingolia et al.'s:
a) Discard reads where 18 of the first 22 nucleotides are As.
b) Treat the reads as 21-mers, starting with the first nucleotide. That is, your dictionary should have as keys 21-mers formed from the first 21 nucleotides of each read. The values should be the counts of those 21-mers. If more than one read starts with the same 21-mer, then the value should be the sum of the counts of all of those reads.
2. Write project.matches(r,rnas). This function should check whether a given read r is found within any of the sequences in the rnas dictionary, returning 1 if it is and 0 if it is not.
3. Write project.findRead(r,count,orfs,results). This function is more complicated. Your goal here is to take in a read (r), the number of times that read appeared in the sequencing data (count), a dictionary of orfs (orfs), and a list into which you will place your results (results). The results should take this form: for each position from -100 to 300 with respect to the ORF start, there should be a count of the number of reads that start at that position (since the trimmed ORF sequences begin 100 bases upstream of the start codon, index 0 of the results list corresponds to position -100). Feel free to consider doing this other ways, but for the sake of expediency, I'll give you a hint and tell you how I first did it: I looped through all of the ORFs in my ORF dictionary. For each ORF, I used the string 'find' method to find where in the sequence the read matched, if at all. If it matched, I incremented the results list at that position's index by the value of count. For now, do not worry about the possibility that a read will match more than one ORF and thus be counted more than once. This will do for a first pass. (A minimal sketch along these lines appears after the exercise list.)
If you have extra time, further extend your program so that it finds and accounts for multiple matches for a read to a given ORF.
4. Use the 'head' command to make a test data set that only includes 500 reads. Run your program using this test data set. When that has completed, start running it on the full data set.
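If you get stuck, here is one minimal sketch of exercises 1-3, written the straightforward way described above. This is not the official solution: the 21-mer length, the 18-of-22 'A' filter, and the find-based search come from the exercise text, but the parsing assumes one read sequence per line of the reads file (adjust it to whatever your downloaded file actually looks like), and all variable names are just one possible choice.

# project.py -- a first-pass sketch of the three exercise functions

def parseAndFilterReads(readFile):
    # Return a dict mapping 21-mers to the number of reads that start with them.
    # Assumes one read sequence per line; adapt to the real file format.
    reads = {}
    for line in open(readFile):
        seq = line.rstrip().upper()
        if len(seq) < 22:
            continue
        # discard 'garbage' reads: 18 (or more) of the first 22 nucleotides are As
        if seq[:22].count('A') >= 18:
            continue
        kmer = seq[:21]                      # treat the read as a 21-mer
        reads[kmer] = reads.get(kmer, 0) + 1
    return reads

def matches(r, rnas):
    # Return 1 if read r occurs within any rRNA sequence, 0 otherwise.
    for seq in rnas.values():
        if seq.find(r) != -1:
            return 1
    return 0

def findRead(r, count, orfs, results):
    # For each ORF in which r is found, add count at the matching position.
    for seq in orfs.values():
        p = seq.find(r)
        if p != -1 and p < len(results):
            results[p] += count

Note that, as the exercise says, this counts a read only at the first position where find sees it in each ORF; handling multiple matches within an ORF is the extension.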
Part II: making things faster
Those of you who completed all of the exercises know by now that the program that you have written is slow. On my workstation, the test run of 500 reads took more than twenty seconds-- extrapolating that out to more than a million reads, that's over sixteen hours of runtime! While we might imagine being able to wait that long for these data, you're likely to run into a case where a 'slow' program looks like it's going to take days, weeks, months, or even years to finish-- these waits are impractical. So, it's worth exploring how to make programs faster.
The first step in making a program faster is finding out what is slow. Your program isn't, after all, just doing one task-- it's doing several, and it only takes one of them bogging down to bog down the whole thing. Often enough, we can use our intuition to guess at what the slow bit is. Unfortunately, at least my intuition is wrong a good deal of the time, leading me to sometimes spend hours speeding up parts of my code that, at the end of the day, were only taking a few seconds to run to begin with. So, finding the slow part can be a bit of a hassle. Since this is such a common problem, other, better programmers have made software that makes it very, very easy. What we're going to do is use what's called a profiler-- a helper program that tells us which parts of our program are taking the most time and which parts are taking the least.
This is how you run it, using Python's built-in cProfile module (substitute the name of your own 500-read test file):

python -m cProfile ./mapper.py <your 500-read test file> orf_coding_all.fasta rna_coding.fasta

Let's see what we get as output:

         2980344 function calls in 33.104 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   33.104   33.104 <string>:1(<module>)
        1    0.021    0.021   33.104   33.104 primitiveScriptFin.py:2(<module>)
        1    0.000    0.000    0.000    0.000 project.py:1(<module>)
        1    0.006    0.006    0.010    0.010 project.py:1(parseAndFilterReads)
      326    0.363    0.001    0.819    0.003 project.py:17(matches)
      326    7.485    0.023   29.401    0.090 project.py:23(findRead)
        1    0.000    0.000    0.000    0.000 seq_tools.py:1(<module>)
        2    1.461    0.731    2.853    1.426 seq_tools.py:3(parse_fasta)
        1    0.000    0.000   33.104   33.104 {execfile}
      332    0.001    0.000    0.001    0.000 {len}
   156117    0.277    0.000    0.277    0.000 {method 'append' of 'list' objects}
   134483    0.444    0.000    0.444    0.000 {method 'count' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  2189742   21.791    0.000   21.791    0.000 {method 'find' of 'str' objects}
     7128    0.045    0.000    0.045    0.000 {method 'join' of 'str' objects}
        2    0.001    0.000    0.001    0.000 {method 'keys' of 'dict' objects}
   163243    0.304    0.000    0.304    0.000 {method 'rstrip' of 'str' objects}
     7625    0.079    0.000    0.079    0.000 {method 'split' of 'str' objects}
   164241    0.360    0.000    0.360    0.000 {method 'startswith' of 'str' objects}
   156115    0.329    0.000    0.329    0.000 {method 'upper' of 'str' objects}
      652    0.136    0.000    0.136    0.000 {method 'values' of 'dict' objects}
        3    0.000    0.000    0.000    0.000 {open}

You should get output that looks a little like this. The first thing that we notice is that the program took longer to run-- instead of twenty seconds, this took just over thirty. That's the profiler running in the background. As it has to keep track of everything that the program is doing, it naturally slows things down a little.
The main part of the output is a table with six columns. On the far right-hand-side column, we see functions and methods. Some of these are functions that we have defined, such as project.py:1(parseAndFilterReads). That just means that the function parseAndFilterReads is defined on the first line of project.py. Others are functions and methods that python has defined for us-- such as the 'open' and 'len' functions and the string 'find' method. Other columns, from left to right, are the number of times that function or method was called, the total time spent within that function (not including the time spent in functions and methods called by that function), that time divided by the number of calls, the cumulative time spent within that function (the difference from the second column being that this column includes the time spent in subfunctions and methods), and that cumulative time divided by the number of calls.
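One aside about that 'Ordered by: standard name' line: by default the table is sorted alphabetically rather than by time, which is why the expensive rows are scattered through it. If you would rather have Python sort it for you, you can dump the profile to a file and order it with the standard pstats module. This is a generic sketch of that workflow, not something specific to our mapper (the file name profile.out is just a placeholder):

# From the shell, save the raw profiling data instead of printing it:
#   python -m cProfile -o profile.out ./mapper.py <reads file> <orf fasta> <rna fasta>
# Then load and sort the saved profile:
import pstats

stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative')   # order by the cumulative-time column
stats.print_stats(10)            # print only the ten most expensive entries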
I find the cumulative time column most useful. Let's focus on the line corresponding to project.findRead:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   ...
      326    7.485    0.023   29.401    0.090 project.py:23(findRead)

First of all, this is a line of interest: of the thirty-three seconds spent in the program total, twenty-nine were spent either in findRead or in some function or method called by findRead-- the cumulative time. Looking at the total time, we see that the time spent in findRead itself was only about seven and a half seconds, leaving about twenty-two seconds of time that must have been spent in functions and methods called by findRead. Looking a little further down the list, we see this line:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   ...
      326    7.485    0.023   29.401    0.090 project.py:23(findRead)
   ...
  2189742   21.791    0.000   21.791    0.000 {method 'find' of 'str' objects}

Here's the real problem: find. Clearly, we need to figure out how to make findRead faster, and if we're going to do that, we're going to have to figure out a way to get rid of 'find.' But how do we do this?
There's nothing wrong in particular with find; indeed, it's a very, very fast way to find subsequences in sequences. However, consider what we're doing here: we're comparing each of ~one million subsequences to each position in each of ~6000 open reading frames. That strikes me as fishy-- should we really have to look at every position in the yeast genome a million times? There should be some way to remember what we've seen.
I decided to pre-process the yeast genome information. Ultimately, the information that we want to get out of the genome is a translation from a 21-mer (the read) to the positions where that read occurs. We're currently using find to do that: put in a 21-mer, and then run find a few thousand times on the orfs to pull out the position information. However, we can get that information more directly. In particular, I decided to make a dictionary that contained that information: put in a 21-mer, and then just look it up in a dictionary to find out where it is.
To do this, we need to create a dictionary that has as keys every 21-mer in the genome and, as values, the positions of those 21-mers. This isn't all that hard to do:
mapper.py
...
# step 4
if project.matches(r,rnas): continue

# new step 5, takes the place of findRead
try:
    hits = readPosns[r]
except KeyError:
    continue
portion = float(reads[r]) / len(hits)
for p in hits:
    results[p] += portion
...
project.py
def getAllSubSeqPosns(seqs):
    rv = {}
    for seq in seqs:
        for i in range(min(len(seq) - 21 + 1, 400)):
            subseq = seq[i:i+21]
            if subseq in rv:
                rv[subseq].append(i)
            else:
                rv[subseq] = [i]
    return rv

# or, somewhat faster but more complicated-looking
def getAllSubSeqPosns(seqs):
    rv = {}
    for seq in seqs:
        limit = min(400, len(seq))
        for i, subseq in enumerate([seq[i:i+21] for i in range(limit)]):
            if subseq in rv:
                rv[subseq].append(i)
            else:
                rv[subseq] = [i]
    return rv
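To make the lookup concrete, here's a toy demonstration with two invented sequences (not from the course data); in the real script the dictionary is built once from the ORF sequences and then queried with each read:

# assumes project.py (above) is importable from the working directory
from project import getAllSubSeqPosns

seqs = ["ACGT" * 7,              # invented 'ORF' one: 28 bases
        "TTTTT" + "ACGT" * 6]    # invented 'ORF' two: 29 bases
posns = getAllSubSeqPosns(seqs)

read = "ACGT" * 5 + "A"          # a 21-mer present in both sequences
print posns.get(read, [])        # prints [0, 4, 5]: every start position where this 21-mer was indexed

In the new mapper loop above, each read's count (reads[r]) is then split evenly across all of the positions returned, which is what the portion = float(reads[r]) / len(hits) line is doing.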
Now, let's run that again with a smaller test set. This time, since I'm confident that it should be a little faster, I'm going to run with all of the orfs against 100000 reads. Or, rather, I already have run it. It still takes a little time, but this is what came out of my profiler:

lusk@capablanca:DataProc$ python -m cProfile ./mapper.py test.gsm orf_coding_all.fasta rna_coding.fasta
         26872990 function calls in 169.820 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  169.820  169.820 <string>:1(<module>)
        1    0.508    0.508  169.820  169.820 fastScriptFin.py:2(<module>)
        1    0.000    0.000    0.000    0.000 project.py:1(<module>)
        1    1.147    1.147    2.020    2.020 project.py:1(parseAndFilterReads)
    62221   58.795    0.001  138.846    0.002 project.py:17(matches)
        1   25.028   25.028   25.504   25.504 project.py:36(getAllSubSeqPosns)
        1    0.000    0.000    0.000    0.000 seq_tools.py:1(<module>)
        2    1.481    0.740    2.927    1.463 seq_tools.py:3(parse_fasta)
        1    0.001    0.001  169.820  169.820 {execfile}
    14846    0.052    0.000    0.052    0.000 {len}
   267980    0.624    0.000    0.624    0.000 {method 'append' of 'list' objects}
 25655420   79.482    0.000   79.482    0.000 {method 'count' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     7128    0.047    0.000    0.047    0.000 {method 'join' of 'str' objects}
        2    0.001    0.000    0.001    0.000 {method 'keys' of 'dict' objects}
   163243    0.313    0.000    0.313    0.000 {method 'rstrip' of 'str' objects}
   107125    0.351    0.000    0.351    0.000 {method 'split' of 'str' objects}
   363241    0.765    0.000    0.765    0.000 {method 'startswith' of 'str' objects}
   156115    0.343    0.000    0.343    0.000 {method 'upper' of 'str' objects}
    62222    0.784    0.000    0.784    0.000 {method 'values' of 'dict' objects}
     6717    0.021    0.000    0.021    0.000 {min}
        3    0.000    0.000    0.000    0.000 {open}
     6717    0.079    0.000    0.079    0.000 {range}

Better, but still could use work-- this would take more than an hour on the full data set. In particular, what's with:

    62221   58.795    0.001  138.846    0.002 project.py:17(matches)
 25655420   79.482    0.000   79.482    0.000 {method 'count' of 'str' objects}

Along with the one-time 25.5s cost of pre-processing, that's over 164 of the 170 total seconds used-- almost 97%! It turns out that we can basically pre-process that away, too. Instead of comparing every read to all of the RNAs, we can just not look at the reads that we know are RNA sequences:
mapper.py
...
orfs = seq_tools.parse_fasta(orfFile)
rnas = seq_tools.parse_fasta(rnaFile)
rnaPosns = project.getAllSubSeqPosns(rnas.values())
for seq in rnaPosns:
    if seq in reads:
        del reads[seq]
...
# remove previous step 4
And now to run it:
lusk@capablanca:DataProc$ time ./fastScriptFin.py gsmRich.txt orf_coding_all.fasta rna_coding.fasta
...
real 0m41.514s
user 0m40.715s
sys 0m0.612s
Better! That's the whole shebang in just over forty seconds. The complete code for mapper.py is below.
#!/usr/bin/env python
import seq_tools
import project
import sys

readFile = sys.argv[1]  # this file needs to include orfs + a few hundred bases upstream of each
orfFile = sys.argv[2]
rnaFile = sys.argv[3]

# step 1
reads = project.parseAndFilterReads(readFile)

# step 2
orfs = seq_tools.parse_fasta(orfFile)
for id in orfs:
    orfs[id] = orfs[id][900:]
rnas = seq_tools.parse_fasta(rnaFile)

rnaPosns = project.getAllSubSeqPosns(rnas.values())
for seq in rnaPosns:
    if seq in reads:
        del reads[seq]

# new pre-processing step
readPosns = project.getAllSubSeqPosns(orfs.values())

results = [0] * 400
for r in reads:
    # removed the old matches function
    # new step 5, takes the place of findRead
    try:
        hits = readPosns[r]
    except KeyError:
        continue
    portion = float(reads[r]) / len(hits)
    for p in hits:
        results[p] += portion

# step six
for p, c in enumerate(results):
    # print the position and the count of reads at that position
    print p, c