Beyond learning programming: how to write programs


Introduction

Each year that we've run this course, the second week has been driven by the completion of a project. Not only does this let us naturally talk about more science-specialized aspects of programming, but it also gives people in the class a chance to see how larger programs are structured. In the past, this project has been devised by the course instructors specifically for the course. This has the advantage of being tailor-made: relatively straightforward to push through in a week, with plenty of teachable moments along the way. However, it has the disadvantage of being boring: we all do a bunch of work, and then we wind up with a bunch of basically unremarkable and unsurprising results.

This year, we're trying something different. Instead of having everyone complete a self-contained project, we're going to replicate the data analysis of a recent high-profile paper-- which is, hopefully, what you'll all be doing with your own high-profile data after you complete this course. By now, you should all have read it: Ingolia et al.'s 2009 Science paper, which aimed to track down the positions of all of the ribosomes in yeast. We anticipate that this paper will provide the same density of teachable moments as our previous toy projects did, while proving that real data is more interesting than toy data.

My lecture this morning will be in two parts: first, I want to get everyone on the same page with respect to the motivation, terms, and technology behind the paper, and second, I want to step through a broad overview of what we'll need to do to analyze the study's raw data. This will include both a general strategy for approaching large programming projects and a specific example of how we're going to approach this one.

So first-- why?

Why is this interesting?
We've got a pile of different backgrounds in the class this year-- people doing neuroscience, biochemistry, molecular evolution, genetics, just about everything. I've been informed that not everyone will have a natural understanding of why making a bar chart of the number of ribosomes in different parts of the genome amounts to a landmark Science paper.

Most of this comes down to the paper's first sentence: "The ability to monitor the identity and quantity of proteins that a cell produces would inform nearly all aspects of biology." In some sense, we're already able to do this. A lot of past work from the Weissman lab has focused on using mass spectrometry to quantify protein levels in yeast. However, mass spec is not cheap, easy, or even particularly effective-- it has a hard time detecting proteins that exist at low concentrations. Microarrays have also often been used toward this goal, but they have plenty of problems of their own. Because they measure mRNA, they are a couple of steps removed from actual protein concentrations: different mRNAs are translated at different rates, and the resulting proteins are degraded at different rates.

This paper is the first example of a new method of measuring translation by quantifying ribosome activity. The approach promised to be easier, cheaper, and more sensitive than mass spec while being more accurate than microarrays, and the paper largely delivers on those claims. The authors are able to use this new method to confirm several long-standing hypotheses about translation and to generate a number of new ones. The paper potentially marks the beginning of a new era in quantifying what's happening inside a cell.

What did they do?
When we asked people in the class what they wanted to do for a project, many people brought up problems involving next-generation sequencing. This is natural: people are starting to do really interesting stuff with it, including the work that led to this paper.

New sequencing technologies, in particular the Illumina/Solexa platform, have certainly made sequencing whole genomes easier. However, they've found plenty of applications outside of genome sequencing-- for example, sequencing looks like it may be more accurate than microarrays both for quantifying gene expression from mRNA and for reading out chromatin immunoprecipitation experiments. This paper is one such application.

To fully understand how the authors organized their experiments and then analyzed their data, it helps to understand how the sequencing platform their strategy is built on actually works.

An Illumina sequencing run starts, predictably enough, with DNA that you want to sequence. This DNA is reduced, through fragmentation or some other means, to short segments. Adapters are ligated to these fragments, which are then loaded into the sequencing machine and fixed to a glass slide. The fragments are amplified in place, creating millions of dense clusters of identical sequence on the slide. Sequencing is performed by synthesis: each nucleotide incorporation is associated with a different color of light, which is captured with a high-resolution camera. Efficiency and accuracy decrease with length, ultimately limiting the length of the reaction. Each resulting short sequence is called a sequencing 'read.'

This all adds up to the result that Illumina sequencing produces very, very many, very short reads (although they're getting longer). In contrast, 454 sequencing produces very many, somewhat short reads, and old-school Sanger sequencing produces very, very, very, very, very few, somewhat long reads. However, the total amount of sequence-- the number of reads times the length of each-- is much greater with Illumina sequencing than with any other currently available sequencing technology.

This amount of data can be leveraged in a number of ways. Here, the authors wanted to use it to quantify the relative abundance of different sequences. They reasoned that if a given DNA molecule is overrepresented by some factor within their library, then the reads corresponding to that molecule will be overrepresented by the same factor when the library is sequenced. This made their experimental protocol relatively straightforward. As summarized in their paper:

"To establish ribosome profiling as a quantitative tool for monitoring translation, we needed to implement three steps: (i) robustly generate ribosome-protected mRNA fragments ("footprints") whose sequences indicate the position of active ribosomes; (ii) convert these RNA footprints into a library of DNA molecules in a form that is suitable for deep sequencing with minimal distortion; and (iii) measure the abundance of different footprints in this library by means of deep sequencing."

In this case, nucleases were used to digest away all mRNA except the small fragments protected by translating ribosomes. This collection of fragments was then converted into a DNA library-- the sequences already being small, no further fragmentation was needed. Then the library was sequenced, bringing us to the data that we will be working with this week. (The counting logic behind step iii is sketched below.)
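
To make the quantification idea concrete, here's the counting logic in miniature. This is a toy sketch with made-up six-base sequences of our own invention, not the paper's actual pipeline: we tally how often each distinct read appears and report its fraction of the library.

toy_abundance.py
#!/usr/bin/env python
# Toy sketch (made-up data): read counts as a proxy for relative abundance.
reads = ['AATGCC', 'AATGCT', 'AATGCC', 'AATGCG', 'AATGCC']

# tally how many times each distinct sequence was seen
counts = {}
for read in reads:
    counts[read] = counts.get(read, 0) + 1

# a sequence's share of the reads estimates its share of the library
total = float(len(reads))
for sequence, n in counts.items():
    print '%s\t%i\t%.2f' % (sequence, n, n / total)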

What are we going to do?
With short reads, assigning each read to a position on the genome can be a challenge. This is the first step that we'll be undertaking, and we'll see that it may require some extra thought from you and some extra sweat from the computer (see the toy illustration below). You can call this the computer-science-interesting part. After that, you come to the biology-interesting part: analyzing the processed data. In particular, you'll be attempting to replicate the periodicity of ribosome binding and certain observations with respect to its processivity. Of course, you're also welcome to do your own investigations-- all the data will be at your fingertips.
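
To see why mapping takes thought, consider the most naive possible mapper, sketched here with a made-up genome and made-up reads (ours, not the paper's): scan the entire genome for an exact match to each read.

naive_mapper.py
#!/usr/bin/env python
# Toy illustration: brute-force exact-match mapping.
genome = 'ACGTGGTCTTAAGGCACGGGACTTCTGA'  # a stand-in; yeast is ~12 million bases
reads = ['GGTCTT', 'CACGGG', 'TTTTTT']

for read in reads:
    position = genome.find(read)  # -1 means no exact match anywhere
    print read, position

With millions of reads against twelve million bases-- and with sequencing errors meaning that exact matching isn't enough-- this approach quickly becomes infeasible, which is where the extra thought comes in.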

How are we going to do it?
This is a larger project, and putting it together will require more thought than putting together a simple set of scripts. We've already talked about the basic tools that you'll need here: functions and modules. However, we haven't talked much about how to use them. While what I'm going to talk about now is not necessary to make your programs work, it's a strategy adopted by many programmers to make tackling larger tasks more manageable. What is this strategy?

Stubbing and the 'pass' statement
Suppose we're writing a complicated piece of code-- we'd like to break it down into simpler parts. This is an intuitive concept, and one that we've touched on before. Say that we want to make a program that gambles online and makes money for you, so that you're free to pursue the standard academic career path of postdoctoral positions ad infinitum.

Stubbing is nothing more than writing out what your program should be doing, without actually getting around to filling in the details. It's like writing an outline of a paper. In this case:

gambler.py
#!/usr/bin/env python
import sys
import internet
import gambling

# read the account credentials from the command line
accountID = sys.argv[1]
password = sys.argv[2]

# log in, then keep playing until we go broke or get rich
balance, sessionInfo = internet.loginToIllegalGamblingServer(accountID, password)
while balance > 0:
    balance = gambling.playGame('poker', sessionInfo, balance)
    if balance > 1000000000:
        print 'Congratulations: you are a billionaire.'
        internet.logoffFromIllegalGamblingServer()
        sys.exit()
print 'Darn!'
internet.logoffFromIllegalGamblingServer()
 
internet.py
def loginToIllegalGamblingServer(accountID, password):
    # stub: will eventually return our balance and a session object
    pass

def logoffFromIllegalGamblingServer():
    # stub: will eventually close the connection cleanly
    pass

gambling.py
def playGame(gameType, sessionInfo, balance):
    # play a single game and return the resulting balance
    if gameType == 'poker':
        hand = requestHand(sessionInfo)
        if handIsGood(hand):
            amountWon = goAllIn(hand, balance, sessionInfo)
            if amountWon:
                return amountWon + balance
            else:
                return 0
        else:
            return balance
    else:
        print "I don't know how to play that game yet"
        return balance

def requestHand(sessionInfo):
    # stub: will eventually ask the server to deal us a hand
    pass

def handIsGood(hand):
    # stub: will eventually evaluate the hand's strength
    pass

def goAllIn(hand, balance, sessionInfo):
    # stub: will eventually bet everything and return the amount won
    pass
 
We're free to write the easy parts first, saving the hard parts for later-- we've already created the logical flow of the program, and by laying that flow out early, we can keep the whole thing organized.
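
One optional refinement-- our suggestion, not a requirement-- is to have stubs return placeholder values so that the whole program runs end to end before any hard part is written. As it stands, loginToIllegalGamblingServer returns None (the result of a bare pass), so gambler.py's unpacking of the login result would fail the moment you ran it. A hypothetical placeholder version of internet.py:

internet.py (placeholder version)
def loginToIllegalGamblingServer(accountID, password):
    # pretend the login worked: hand back a dummy balance and session
    return [100, 'fake-session']

def logoffFromIllegalGamblingServer():
    # pretend to close the connection
    print 'logged off (pretend)'

You'd want to do the same for the stubs in gambling.py, or the main loop will spin forever on a balance that never changes.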

The only new thing we've covered here is the statement 'pass.' It's pretty simple: it does nothing. Although this sounds somewhat pointless, in this case it allows you to write little function stubs without Python (or your text editor) complaining. However, it pops up in other places as well, usually as a shortcut where you mean to write more code later-- inside an if or else clause, for example, or when handling an exception. We won't cover those applications in depth here, but keep them in mind while you're programming-- we encourage you to try them out.
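
For instance, here's a small self-contained sketch (toy values of our own invention) of pass showing up outside of function stubs:

pass_examples.py
#!/usr/bin/env python
balance = 50

# pass as a placeholder inside an if branch
if balance > 1000000000:
    pass  # TODO: decide what to do with our riches
else:
    print 'keep playing'

# pass to deliberately ignore an exception, at least for now
try:
    count = int('not a number')
except ValueError:
    pass  # skip malformed input; revisit this later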

Exercises

1. Stub out the basics of this project. For the purposes of this exercise, the scope of our project is to create a dataset that links each position relative to an open reading frame's start to the number of reads that land at that position. From this dataset, we will look for periodicity and for decay of the signal.

After you're done, discuss your outline with a TA or an instructor.

2. Retrieve the footprinted sequences from the Gene Expression Omnibus (GEO). The accession number is in the paper-- you'll want both replicates of the footprints taken in rich media and both replicates taken under starvation conditions. These files should have the basic format:
AATGCC...AGA  5
AATGCT...AGA  6
AATGCG...AGA  1
The first column is the thirty-six-base read sequence and the second column is the number of times that read sequence was recovered.
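
If the format seems opaque, here's the parsing in miniature, using a toy line of our own rather than one from the real files:

parse_example.py
#!/usr/bin/env python
# split a (sequence, count) line into its two fields
line = 'AATGCCAGA\t5'
spl = line.split()
read = spl[0]        # the read sequence
count = int(spl[1])  # how many times it was recovered
print read, count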

3. Let's do some preliminary data processing to examine some basic properties of the data that may need to be accounted for. First, write a program that lists, for each position from zero to thirty-five and for each nucleotide A, T, C, or G, the number of reads that have that base at that position. For example, for the small data set listed above, the output should be:

0   12   0   0   0
1   12   0   0   0
2    0  12   0   0
3    0   0   0  12
4    0   0  12   0
5    0   6   5   1
...
33  12   0   0   0
34   0   0   0  12
35  12   0   0   0
In this example, the first column is the position, the second is the count of As at that position, and the remaining columns are the counts of Ts, Cs, and Gs. When you run this on a full data set, do you notice a trend for certain nucleotides at some or all positions? What could be happening here, and how might we deal with it? (This is discussed in the paper's supplemental material.)

This may take a long time to run on any of the files that you downloaded. For testing purposes, it may be useful to create a smaller file using the Linux 'head' command-- or in pure Python, as sketched below.
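
If you'd rather stay in Python than reach for head, something like this works; the file names here are placeholders, not the real GEO file names:

make_test_file.py
#!/usr/bin/env python
import itertools

# copy just the first 1000 lines of a big file into a small test file
inFile = open('footprints_full.txt')
outFile = open('footprints_small.txt', 'w')
for line in itertools.islice(inFile, 1000):
    outFile.write(line)
outFile.close()
inFile.close()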

4. What's the basic problem with sequencing RNAs in the cell? How could we compensate for this?

Solutions
3.
#!/usr/bin/env python
import sys

# get the input file name from the command line
gsmFile = sys.argv[1]

# initialize the data structure: one count dictionary per position
dat = []
for i in range(36):
    dat.append({'A':0, 'T':0, 'C':0, 'G':0})

# open the file and loop through it
for line in open(gsmFile):
    # skip header lines
    if line.startswith('#'):
        continue
    if line.startswith('ID'):
        continue

    # split the line into the read sequence and its count
    spl = line.split()
    read = spl[0]
    count = int(spl[1])

    # tally each base of the read, weighted by the read's count
    for i, letter in enumerate(read):
        # only count well-assigned bases
        if letter == 'N':
            continue
        dat[i][letter] += count

# print one line per position: position, then counts of A, T, C, and G
for i, counts in enumerate(dat):
    print '%i\t%i\t%i\t%i\t%i' % (i, counts['A'], counts['T'], counts['C'], counts['G'])
 
4. Most of the RNA in a cell isn't mRNA-- other species, such as ribosomal and transfer RNA, make up the bulk of it-- so many reads won't come from the transcripts we care about. However, we should be able to filter those reads out by discarding anything that matches known sequences of those other RNAs.
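
A minimal sketch of that filtering idea, with toy sequences of our own invention rather than real rRNA:

filter_example.py
#!/usr/bin/env python
# drop reads that match a set of known contaminant sequences
contaminants = set(['AATGCG'])  # stand-in for known rRNA/tRNA sequences

reads = ['AATGCC', 'AATGCG', 'AATGCT']
kept = [read for read in reads if read not in contaminants]
print kept  # only the non-contaminant reads survive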