Graphing


Today, we are going to learn how to make pretty pictures out of our data. In addition to creating figures for papers, graphs are often really useful in interpreting complex data visually. You were already introduced to a really basic graph this morning, the scatter plot. This afternoon, we'll start by redoing the scatter plot, this time using a script instead of iPython. As we go through, you will see that iPython was actually doing quite a bit of work behind the scenes that you will need to be more explicit about in the script. For this script to work, you will need the output from Rich's Monday morning lecture, which you can download from here if you don't have it:
#read in datafile
results = []
dataFile = 'all400.tab'
fh = open(dataFile, 'r')
for line in fh:
    line = line.strip().split()
    results.append( [int(line[0]), float(line[1])] )
 
#convert list into an array
import numpy as np
a = np.array(results)
 
#add a uniform and exponential distribution function
import pylab as pL
b = a[:,1] + pL.uniform(0,1000,len(a[:,1]))
c = a[:,1] + pL.exponential(1000,len(a[:,1]))
 
#plot things
import matplotlib.pyplot as plot
plot.scatter(a[:,1],b)
plot.show()
First, how can I figure this stuff out on my own? There is some decent (not amazing) documentation on all of the modules on-line. In particular, for this afternoon's lecture, we are going to talk about things documented for the module 'matplotlib', which integrated into pylab and scipy. You can find additional information about 'matplotlib' at http://matplotlib.sourceforge.net/index.html. In particular, you want to check out the gallery as each image has the code used to generate it.

OK. Back to the scatter plot. There are many ways to make this graph prettier and easier to read. For example, you might want to change the marker to be x's instead of circles:
plot.scatter(a[:,1],b, marker='x')
Or change the color of the circles to be red instead of blue:
plot.scatter(a[:,1],b, marker='o', c='r')
Next, we might want to label the axes and add a title:
plot.xlabel('Hi, I\'m the X axis')
plot.ylabel('Hi, I\'m the Y axis')
plot.title('Hello, I\'m the title!')
Alright, what about a different type of data? We used numerical analysis to discovery the three nucleotide periodicity in the data. Now, let's use graphs to make the same discovery. Here, however, a scatter plot would not be overly informative. Instead, let's make a bar graph (NOTE: The goal of this example is to create a graph that looks approximately like Figure 2.A from the paper):
#read in data
results = []
dataFile = 'all400.tab'
fh = open(dataFile, 'r')
for line in fh:
    line = line.strip().split()
    results.append( [int(line[0]), float(line[1])] )
 
# convert list into array:
import numpy as np
a = np.array(results)
 
#find mean for entire dataset
b = a[:,1][np.arange(0,len(a))] / len(np.arange(0,len(a)))
 
import matplotlib.pyplot as plot
ind = np.arange(0,len(b))
 
plot.bar(ind,b)
plot.show()
Yikes! That's too much data! Let's cut down our test set and make it prettier:
#read in data
results = []
dataFile = 'all400.tab'
fh = open(dataFile, 'r')
for line in fh:
    line = line.strip().split()
    results.append( [int(line[0]), float(line[1])] )
results = results[75:125]
 
# convert list into array:
import numpy as np
a = np.array(results)
 
#find mean for entire dataset
b = a[:,1][np.arange(0,len(a))] / len(np.arange(0,len(a)))
 
import matplotlib.pyplot as plot
ind = np.arange(-25,len(b)-25)
plot.bar(ind,b,width=1,color='r')
plot.show()
 
Getting closer. Let's figure out how to get those vertical lines on the graph to emphasize the periodicity. To do this, we need to have more control over the layout of the plot itself using the subplot function:
ind = np.arange(-25,len(b)-25)
p1 = plot.subplot(111)
p1.bar(ind,b,width=1,color='r')
plot.show()
 
Visually, this looks exactly the same as before. But now we have a lot more control over the figure that is containing the plots. While we are here, let's play around a bit with the new subplot functionality. First, we can use it to add multiple graphs to the same figure:
ind = np.arange(-25,len(b)-25)
p1 = plot.subplot(211)
p1.bar(ind,b,width=1,color='r')
 
import pylab as pL
c = a[:,1] + pL.uniform(0,1000,len(a[:,1]))
p2 = plot.subplot(212)
p2.scatter(a[:,1],c,marker='o',c='g')
plot.show()
We can also do a plot-in-plot. However, for this example, we need to declare a figure so that python know we are adding multiple things to the same figure and not overwriting old subplots with new subplots:
fig = plot.figure()
ind = np.arange(-25,len(b)-25)
p1 = fig.add_subplot(111)
p1.bar(ind,b,width=1,color='r')
import pylab as pL
c = a[:,1] + pL.uniform(0,1000,len(a[:,1]))
p2 = fig.add_subplot(222)
p2.scatter(a[:,1],c,marker='o',c='g')
 
plot.show()
Cool! What if we want to make in a circle? For this, we need to get rid of the subplot and declare a special type of axis:
fig = plot.figure()
ind = np.arange(-25,len(b)-25)
#p1 = fig.add_subplot(111)
p1 = fig.add_axes([0.1,0.1,0.8,0.8], polar=True)
import numpy as np
width = 2*np.pi/len(ind)
p1.bar(ind,b,width=width,color='r')
Alright, enough fooling around, let's make the background white and add the dotted lines:
fig = plot.figure(facecolor='w')
ind = np.arange(-25,len(b)-25)
p1 = plot.subplot(111)
p1.bar(ind,b,width=1,color='r')
p1.xaxis.grid()
plot.show()
And fix up the x-axis to make the grid lines more frequent, get rid of the y axis tick labels, and add axis labels:
p1.bar(ind,b,width=1,color='r')
from matplotlib.ticker import MultipleLocator
p1.xaxis.set_major_locator(MultipleLocator(12))
p1.xaxis.set_minor_locator(MultipleLocator(3))
p1.set_yticklabels([])
p1.xaxis.grid(which='minor')
p1.set_xlabel('CDS Position [nt from start]')
p1.set_ylabel('Read 5\' ends [A.U.]')
plot.show()
Now we have something that looks REALLY similar to Figure 2.A from the paper!

Moving on, let's check out how they made Figure 2.E...First, though, we need a lot more data. Rich has kindly analyzed the entire 1500 reads for ribosome replicates 1 and 2 and for the mRNA the same way you guys analyzed the first 400 reads. You can download the tab-delimited files from here: , , , and .

First, we will rewrite the read loop to be a function to make it easier to call on each file:
#read in data
def read_file(f1):
 
    results = []
    fh = open(f1, 'r')
    for line in fh:
            line = line.strip().split()
            results.append( [int(line[0]), float(line[1])] )
 
    return results
 
#############################################################
 
rib1 = read_file('gsmRich1.txt.out')
rib2 = read_file('gsmRich2.txt.out')
rna1 = read_file('mRNARich1.txt.out')
rna2 = read_file('mRNARich2.txt.out')
 
Next, we'll combine the data and convert it into arrays for analysis:
import numpy as np
bothRib = rib1 + rib2
bothRNA = rna1 + rna2
 
rib1 = np.array(rib1)
rib2 = np.array(rib2)
bothRib = np.array(bothRib)
bothRNA = np.array(bothRNA)

Now, we'll divide the ribosome data by the rna data (needed for histogram) and ribosome replicate 1 by ribosome replicate 2 (needed for the error measurement).
a = bothRib[:,1]/bothRNA[:,1]
b = rib1[:,1]/rib2[:,1]
 
Now we'll create the actual histogram from the ratio of the ribosome to the rna:
import matplotlib.pyplot as plot
p1 = plot.subplot(111)
p1.hist(a)
And now, we'll increase the number of bins to better reflect the complexity of the data:
p1.hist(a,75)
Next, we'll layer on the ratio of the ribosomes as a measure of error:
p1.hist(b,100,histtype='step',edgecolor='r')
If you look, you can see in the figure that the x-axis is actually in log format. What happens when we convert the x-axis to be log format?
p1.set_xscale('log', basex=2)
Ack! Why is this happening? What can we do to fix it?
a = bothRib[:,1]/bothRNA[:,1]
b = rib1[:,1]/rib2[:,1]
a = np.log2(a)
b = np.log2(b)
Phew! Next, lets modify the x-axis to be the range in the figure, add a label and remove the y-axis completely:
p1.set_xlabel('log2 translational efficiency')
p1.xaxis.set_ticks(np.arange(-6,7))
p1.xaxis.tick_bottom()
p1.yaxis.set_visible(False)
Finally, after all this work, we might want to save the figure! But first, we have to tell matplotlib that it is a figure to be saved.
fig = plot.figure(facecolor='w')
p1 = fig.add_subplot(111)
p1.hist(a,100)
p1.hist(b,100,histtype='step',edgecolor='r')
#p1.set_xscale('log', basex=2)
p1.set_xlabel('log2 translational efficiency')
p1.xaxis.set_ticks(np.arange(-6,7))
p1.xaxis.tick_bottom()
p1.yaxis.set_visible(False)
fig.savefig('histogram.pdf',dpi=500,format='pdf')
plot.show()
 

Exercises:
1) Go back to the bar graph you generated toward the beginning of lecture(Figure 2.A). You now have the mRNA data. Add the appropriate line to make it look like the published figure. (HINT: Check out the function--NOT the module--plot(x,y).)

2) We actually only recreated the left side of Figure 2.A. Recreate the right side of the graph by plotting the terminal data from the arrays. (HINT: You MAY need to reset the scale on the y-axis to match the range for the left side of the graph).

3) The authors did a bunch of smoothing of the data and then plotted it to show that the ribosome reads were very highest at the beginning of the codon. Rich has done a quick and dirty smoothing of the data. You can download the files from here: , , , and . Using the smoothed data, recreate Figure 2.F as closely as possible. (HINT: Because Rich's smoothing algorithm worked differently, you don't have to deal with the axis break in the figure. Also, check out the legend function. Can you figure out how to call it?).

4) BONUS: Calculate a beta distribution using numpy. Plot the beta distribution over the histogram graph. How is it different? What other types of distributions might you want to compare?