Manipulate the data the old-fashioned way, using lists, boolean tests, and some basic mathematical functions we code up on our own. Tomorrow, we'll see Numerical Python (NumPy) and all the
new-fangled ways to do these things, and more, faster.
List Comprehension and Boolean Tests for Counting
One of the findings of our paper is that more of the reads map to the first reading frame than the other
two frames. What's more, the authors report a three-nucleotide periodicity in the footprinting. Here, we'll
use the data processed in Rich's section this morning to see if we can recover these relationships with some
more advanced Python techniques. We'll be doing this in the IPython interpreter, so let's go ahead and fire that up:
$ ipython
First off, we need to load our data into memory. Remember to use the %cpaste magic command if you want
to just copy and paste these lines of code into IPython (and '--' on its own line to end the entry).
# Look at periodicity and reading frame bias in our processed data

# a list to reload our data into
results = []

# load Rich's output from this morning, casting the entries as integer
# and float as appropriate
dataFile = 'res.orfU1000.tab'
fh = open(dataFile, 'r')
for line in fh:
    line = line.strip().split()
    results.append([int(line[0]), float(line[1])])

# And we see that the first five lines of results look like we wanted (hopefully).
results[0:5]
Out[6]:
[[0, 86.006993007000005], [1, 30.666666666699999], [2, 288.212121212],
 [3, 170.76923076899999], [4, 141.54545454500001]]
results = results[0:379]
So, in the list results we have a series of positions, relative to the translation start site (at 100), and
their averaged read scores. Because of the order we wrote the output file in, it is currently ordered by the
nucleotide position.
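As an aside (and not the way we did it above), the same parse can be written with a with block, which closes the file for us. Here's a minimal sketch that substitutes an in-memory io.StringIO with made-up values in place of the real res.orfU1000.tab file:

```python
import io

# Sketch: parsing the same two-column format (integer position, float score).
# The sample values below are made up; with the real file you would use
# open('res.orfU1000.tab') instead of io.StringIO(...).
sample = io.StringIO("0 86.007\n1 30.667\n2 288.212\n")
results = []
with sample as fh:
    for line in fh:
        pos, score = line.split()
        results.append([int(pos), float(score)])
# results is now [[0, 86.007], [1, 30.667], [2, 288.212]]
```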
Let's start off looking for the periodicity that the authors see in figure 2A. We'll do this by playing a little
game with each set of three positions. For each set, we'll see which position has the highest read score
(0, 1, or 2), and we'll add that position to the list of winner positions. At the end, we'll tally up the
values and see who won.
# to store our list of winning positions; and we'll need a counter that
# starts at the start site
fcount = []
i = 100
for each in results:
    tmp = results[i:i+3]
    # now a nasty little elif chain to find the winners:
    if tmp[0][1] > tmp[1][1] and tmp[0][1] > tmp[2][1]:
        fcount.append(0)
    elif tmp[1][1] > tmp[0][1] and tmp[1][1] > tmp[2][1]:
        fcount.append(1)
    else:
        fcount.append(2)
    # and an incrementer that checks for not going over the end of the results list
    if (i + 3) < (len(results) - 1):
        i += 3
    else:
        break

# And to check for the correct outcome:
len(fcount)
Out[10]: 93
Now we have a list of the positions that had the highest score, so we need to tally them up. We could do this
with a sophomoric for loop, but instead we'll use list comprehension, since that was sorta the point of these
little calisthenics.
sum([fcount[i] == 0 for i in range(len(fcount))])
Out[15]: 21
sum([fcount[i] == 1 for i in range(len(fcount))])
Out[16]: 93
sum([fcount[i] == 2 for i in range(len(fcount))])
Out[17]: 19

# Alright. I, for one, am pleasantly surprised that this really crappy analysis
# seems to agree with the authors' observation: way more times than not, the
# first reading frame had the highest read score.
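The standard library can also do this tally for us: collections.Counter counts every distinct value in one pass, instead of one sum() per value. A sketch, using a small made-up fcount in place of our real list:

```python
from collections import Counter

# Sketch: tallying winner positions with Counter instead of three sums.
# fcount here is a tiny made-up example, not the real 93-entry list.
fcount = [1, 0, 1, 2, 1, 1, 0]
tally = Counter(fcount)
# tally maps each position to its count: tally[0] is 2, tally[1] is 4, tally[2] is 1
```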
Sorting to see decay
Now that I feel good about the universe, let's make sure we keep feeling good about the authors' statements.
We're supposed to see something like twice as much total score in the first reading frame. To investigate, we
should be able to just sum the scores at each position.
i = 100
frame = [0] * 3
for each in results:
    tmp = results[i:i+3]
    frame[0] += tmp[0][1]
    frame[1] += tmp[1][1]
    frame[2] += tmp[2][1]
    if (i + 3) < (len(results) - 1):
        i += 3
    else:
        break

frame
Out[107]: [170867.57206423001, 68513.261835259997, 128092.131771944]
And... weeeelllll, the first frame definitely has more read score, but not twice that of frame three, and frames two
and three are definitely not even, as figure 2C suggests they should be. This could be because of our processing
differences: perhaps because we're using a subset of their data, or because we're effectively using 21-mers instead
of 28-mers, etc. All in all, we still see a convincing conclusion that the first reading frame is more bound
than the others on average.
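The loop above can also be written without the slice-and-break incrementer: since each position's frame is just its offset from the start site modulo 3, we can accumulate the totals in a single pass. A sketch, with a small made-up results list in place of the real data:

```python
# Sketch: summing scores into three frames by (position - 100) % 3.
# The results list here is made up for illustration; positions start at the
# translation start site (index 100), as in the real data.
results = [[100, 10.0], [101, 2.0], [102, 5.0],
           [103, 8.0], [104, 1.0], [105, 4.0]]
frame = [0.0] * 3
for pos, score in results:
    frame[(pos - 100) % 3] += score
# frame -> [18.0, 3.0, 9.0]
```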
So, just out of curiosity, let's see how else we could have used a (perhaps too confusing) tool to query our data.
It will make you appreciate everything you learn tomorrow morning :-)
Instead of counting the scores, let's just look at the ordering of the list a bit. For starters, this should allow
us to see if the authors' claim in figure 2F holds up: reads fall off as the ribosome gets farther from the translation start site.
# let's start off by just looking at the results list and eye-balling it:
results
Out[109]:
[[0, 86.006993007000005], [1, 30.666666666699999]
...
[375, 1135.0], [376, 1485.0], [377, 881.16666666699996], [378, 1250.78571429]]

# we can see as we scan up and down that, for certain, the 100nt ahead of the
# start site is much lower than the 100nt after that. But what if we could just
# sort the list to see where the most bound read positions are? The problem, of
# course, is that results.sort() will operate on the first entry in each list
# of the results list. Annnnd, the list is already sorted that way. So how do
# we get to sort the list by the second column? We have to define a function to
# feed the .sort() method. All the .sort() method does is compare two variables
# with cmp(), and depending on whether it returns a -1, 0, or 1, switches the
# variables in the list order or not.

# we're still going to be in essence comparing values in a pair-wise fashion,
# so we'll need two variables passed to our new function.
def sort2(a, b):
    # but instead of comparing the first element, we're going to compare the
    # second of each list in the results list
    return cmp(a[1], b[1])

# all we have to do to use our new trick is tell sort to use it instead of the
# normal cmp() function:
results.sort(sort2)

# A quick look at the results list will show that it's been sorted in place,
# from lowest score to highest. Of course we could use the reverse=1 argument
# to .sort() if we wanted it the other way.
results.sort(sort2, reverse=1)

# So, if we were to break the data into thirds, the position of the top-scoring
# third should be higher than the other two thirds, right? First we'll need a
# function to calculate the mean.
def mean(values):
    return float(sum(values)) / float(len(values))

idx1 = len(results) / 3
idx2 = (len(results) / 3) * 2
idx3 = len(results)
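Worth noting: passing a comparison function to .sort() is Python 2 style, and cmp() is gone entirely in Python 3. The modern idiom is a key function that extracts the value to sort on. A sketch on a small made-up list:

```python
# Sketch: sorting by the second column with key= instead of a cmp function.
# The data list is made up for illustration.
data = [[0, 86.0], [1, 30.7], [2, 288.2]]

data.sort(key=lambda entry: entry[1])                # lowest score first
# data -> [[1, 30.7], [0, 86.0], [2, 288.2]]

data.sort(key=lambda entry: entry[1], reverse=True)  # highest score first
# data -> [[2, 288.2], [0, 86.0], [1, 30.7]]
```

The key function runs once per element rather than once per comparison, so it is also faster than the comparator approach.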
mean([results[i][0] for i in range(0, idx1)])
Out[51]: 193.68253968253967
mean([results[i][0] for i in range(idx1, idx2)])
Out[52]: 251.22222222222223
mean([results[i][0] for i in range(idx2, idx3)])
Out[53]: 122.62204724409449
Okay, let's discuss: we didn't get the right answer because we left in the first 100 positions before we sorted,
instead of trimming out those 100. We'll leave it as an exercise to get the right answer for this part.
Coding our basic statistics
Tomorrow we're going to see all the wonderfully powerful functions that someone else has already coded with
which we can do numerical analysis. But for today, we're going to practice once more coding our own functions
by calculating the variance for this dataset.
def variance(values):
    var = 0
    avg = mean(values)
    for value in values:
        var += (value - avg)**2
    var = var / len(values)
    return var

variance([results[i][1] for i in range(len(results))])
Out[99]: 903655.72535965173
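Before trusting hand-rolled statistics on real data, it's worth sanity-checking them on a tiny list whose answer you can verify by hand. Here are the same two functions applied to a made-up list where the mean is 5 and the population variance is 4:

```python
def mean(values):
    # average of the values, forced to float for Python 2 compatibility
    return float(sum(values)) / float(len(values))

def variance(values):
    # population variance: average squared deviation from the mean
    avg = mean(values)
    return sum((v - avg) ** 2 for v in values) / len(values)

# For the made-up list [2, 4, 4, 4, 5, 5, 7, 9]:
# mean is 40/8 = 5, squared deviations sum to 32, so variance is 32/8 = 4
```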
Exercises
1. Re-read the data from the morning session, and this time, cast all data as integers. What is the difference in
variance between the two datasets that is accounted for by the round-off error?
2. Modify the variance function to also return the standard deviation (the square-root of the variance), but only
if it is requested in the function call. If it is requested, return only the standard deviation, not the variance.
3. Combine the function from Exercise 2 with the mean() function such that when called, it can return any of
the statistics calculated.
First Pass at Analysis
Topics
This morning's results
Today You Will: