System Calls, Scripting External Programs, and Project Overview


Topics:

  • Manipulating the file system and UNIX environment from Python
  • Executing other external user space programs with os and sys modules
  • Executing external processes with the subprocess module
  • Running BLAST from a Python script
  • Introduction to Course Projects

Today You Will:

Learn to use the os and sys modules to manipulate the UNIX environment by running external programs to manipulate the filesystem and memory space. Then we will move on to the more sophisticated subprocess modules. Finally, we will see an introduction to the course projects, which we will begin next week.


Useful tools from the os module


The UNIX environment that you have access to when you type in the terminal is also available to Python. That means that you don't have to open a separate terminal window just to change directories, create a new file, or see what operating system you program is running on. This also means that you have the capacity to script what would otherwise be exceedingly mundane tasks, such as renaming lots of files or creating various subdirectories for your output from different data sets.

You've already seen one key component of the os module, the function used to open filehandles, open(). Let's import the module and look at some of the others.

import os
# Let's make sure we know which working directory we are in
cwd = os.getcwd()
print cwd, 'is the current working directory.'
 
# And which files are present in the cwd?  Note that the '.' represents the current working directory
cwdfiles = os.listdir('.')
# The output of listdir is stored as a list, so let's print it item by item.
for file in cwdfiles:
    print file
 
# And we can look in other directories too.  Note that you'll have to change the dir below to one that exists on your filesystem.
homefiles = os.listdir('/Users/matt')
for file in homefiles:
    print file
 
# We could also first change our cwd to a different directory, and then get the file list.
os.chdir('/')
print os.getcwd(), 'is the now the current working directory'
cwdfiles = os.listdir('.')
for file in cwdfiles:
    print file
 

We can also use the os module to manipulate the filesystem.

#import os
# Let's make some new directories and files
numbers = [1, 2, 3]
letters = ['a', 'b', 'c']
for num in numbers:
    for let in letters:
        os.mkdir(str(num) + let)
 
# Let's get all our new directories in a list
diritems = os.listdir('.')
 
# And let's make some files in them
teachers = ['Matt', 'Rich', 'Terry']
 
for diritem in diritems:
#using what Terry taught us about exceptions and testing
    try:
        os.chdir(diritem)
        for teach in teachers:
            for num in numbers:
                try:
                    open(teach+'_'+str(num), 'w')
                except IOError:
                    print "caught an error attempting to open the new file", teach+'_'+str(num)
                    os.chdir('../')
                except:
                    "Caught a general exception attempting to open the new file", teach+'_'+str(num)
        os.chdir('../')
 
    except OSError:
        print "caught an error from module OS, possible that", diritem, "isn't a directory after all."
    except:
        print "caught a general exception trying to change to directory", diritem

Whew, what a mess. Let's clean up all those unimportant files, leaving only the important ones.
import os
 
diritems = os.listdir('.')
for diritem in diritems:
    try:
        os.chdir(diritem)
        subdiritems = os.listdir('.')
        for subitem in subdiritems:
            if ( subitem.startswith('Matt') ):
                continue
            else:
                os.remove(subitem)
        os.chdir('../')
    except OSError:
        print "caught an error from module OS, possible that", diritem, "isn't a directory after all."
    except:
        print "caught a general exception trying to change to directory", diritem

Okay, enough with files for now. Let's look at how we can execute UNIX commands from Python. We'll start with the simplest, dullest, most boringest way: using os.system()

#import os
os.system('ls -l')

$./scratch.py
total 8
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1c
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2c
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3c
-rw-r--r-- 1 matt matt 0 Jun 3 20:17 Matt_2
-rw-r--r--@ 1 matt matt 3586 Jun 3 20:07 S5.2
-rw-r--r-- 1 matt matt 0 Jun 3 20:02 test

Okay, so we can see the output of the UNIX command 'ls -l' from the script, and compare it to the regular terminal:

$ls -l
total 8
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1a/
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1b/
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1c/
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2a/
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2b/
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2c/
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3a/
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3b/
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3c/
-rw-r--r-- 1 matt matt 0 Jun 3 20:17 Matt_2
-rw-r--r--@ 1 matt matt 3586 Jun 3 20:07 S5.2
-rw-r--r-- 1 matt matt 0 Jun 3 20:02 test

A subtle difference of trailing slashes is the only difference in my case. But this is not the same as using os.listdir('.'). The two differences are that os.listdir('.') does not have access to the detailed information that regular command line 'ls -l' does, and that os.listdir('.') returns a list of the directory items, not merely a string of the command line output. But this was a truly trivial example, so let's move on to, well, another truly trivial example.

import os
#remember our script from the first of the week?
os.system('../S1.2/hello.py')

$./scratch.py
hello "world", if that is your real name.
That's World, to you

Yep, it's still there. And we can use another program to run it. We could run it hundreds of times this way, if we were truly trivial people. Hrmmm...

Just like opening files, it's a good idea to look for exceptions in the process of executing system calls. This is implemented very simply with os.system(), as the function will return a non-zero value if the call fails to execute.

import os
 
dirs = ['.', 'asdjfaklsdfjlksadjf']
for d in dirs:
    if ( os.system('ls '+d) != 0 ):
        print "Whoa, that didn't work out, which ls was happy to tell us about."
    else:
        print "Well, okay, great.  Thanks ls."

$./scratch.py
1a 1b 1c 2a 2b 2c 3a 3b 3c Matt_2 S5.2 test
Well, okay, great. Thanks ls.
ls: asdjfaklsdfjlksadjf: No such file or directory
Whoa, that didn't work out, which ls was happy to tell us about.

But this really is the simple way. Often when we our using Python to run other programs, the programs we are executing may finish at different times. When that's the case, we may want to control the order in which they are executed, or merely the number of subprocesses running at once. These days, even laptops have multiple CPUs in them, but we still might not want to dump hundreds of subprocesses on our computer at once. The subprocess module allows us to a but more intelligent about how we execute system calls by using a function called subprocess.Popen() This is somewhat different, however, in that the subprocess we want to execute must be created as an object, which we can then monitor with the methods of this object.

# let's chop the module name down to sp for the purposes of typing fewer characters
import subprocess as sp
 
# let's create a subprocess without assigning it to a variable, and see what happens
sp.Popen( ['ls', '-l'] )

We're going to use a little UNIX tool called time, which will tell you how long it took to run your script.

$ time python scratch.py
total 24
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1c
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2c
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3c
-rw-r--r--@ 1 matt matt 3586 Jun 3 20:07 S5.2
-rwxr-xr-x 1 matt matt 219 Jun 3 23:01 scratch.py
-rw-r--r-- 1 matt matt 529 Jun 3 22:57 test

real 0m0.149s
user 0m0.114s
sys 0m0.028s

And we see that script outputs the same as os.system() did, but with the added "time" info on the last line.

import subprocess as sp
 
for each in range(0,100):
    a = sp.Popen( ['ls', '-l'] )

When we run this code, we see that it takes about about half a second of real-life clock time to run, and about the same for the "system" clock. The number in the middle, the "user" time, is about half, because my laptop has two CPU cores, over which the system time has been roughly divided.

$ time python scratch.py

real 0m0.510s
user 0m0.269s
sys 0m0.490s

for each in range(0,100):
    a = sp.Popen( ['ls', '-l'] )
    a.wait()

The .wait() method of the subprocess object tells the system to wait until this subprocess has completed before doing anything else in the program. When we look at the time output for this code, we see that it takes much more time on the clock (almost three quarters of a second), but the same amount of user and system time. This is because while the program waits for the process, the second CPU core cannot be crunching away on another process. While this may seem inefficient (it certainly is), it is also courteous to the other processes on your computer. In other words, if you're using your laptop for anything other than running your program, you may want to not grind everything to a halt for the duration of your program's runtime (which for me is often hours and hours). And if you're using a server to run your program on, your coworkers may appreciate it if you do not make use of every CPU on the server while they're trying to get their work done. In truth, the operating system will make its own attempt at balancing the needs of processes, but especially in the case where you're trying to use your laptop or PC as a workstation, this can save you some frustration.

$ time python scratch.py

real 0m0.735s
user 0m0.272s
sys 0m0.459s

And just to show that it really is the CPU sharing that's driving the effect:

import subprocess as sp
 
# cut the number of subprocesses in half, then run two sets of them
for each in range(0,50):
    a = sp.Popen( ['ls', '-l'] )
    b = sp.Popen( ['ls', '-l'] )
    a.wait()

And we see that our runtime is almost exactly what it was in the first case.

$ time python scratch.py

real 0m0.506s
user 0m0.270s
sys 0m0.491s

Okay, now that we know how to run the subprocess, and we know that we have to create a subprocess object if we want to do things like wait(), let's look at other advantages of using this type of object.

import subprocess as sp
 
#let's say we want to store our output to a file instead of seeing it on the screen
 
fh = open('outfile', 'a')
proc = sp.Popen(['ls', '-l'], stdout=fh)

And if we run the script and then look at the contents of the file:

$ python scratch.py
$ cat outfiletotal 16
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 1c
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 2c
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3a
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3b
drwxr-xr-x 5 matt matt 170 Jun 3 20:31 3c
-rw-r--r--@ 1 matt matt 3586 Jun 3 20:07 S5.2
-rw-r--r-- 1 matt matt 0 Jun 3 23:45 outfile
-rwxr-xr-x 1 matt matt 250 Jun 3 23:45 scratch.py

There's all our output, which can be much more convenient than having it scroll down the screen.

However, what can be even *more* convenient is having it stored in a data structure in memory.

import subprocess as sp
 
proc = sp.Popen(['ls', '-l'], stdout=sp.PIPE)
# The sp.PIPE is a mechanism that allows the output to be redirected to wherever you point it.  In this case, to our subprocess object named proc.
# The next step is to get the data out of the object and into a data structure that we can organize however we see fit.
 
# we use the method below to get a tuple of two things: the stdout and the stderr
proc_output = proc.communicate()
print "------------------------"
print "We got a tuple\nWith the first part being:\n", proc_output[0], "\nand the second being:\n", proc_output[1]
print "------------------------"
 
#Once we use the .communicate() method, the stdout and stderr are gone from the proc object.
print "------------------------"
print "and if we don't format anything, we see the tuple like this:\n", proc_output
print "------------------------"
 
#But we could easily convert this into a friendlier structure
stdoutlist = proc_output[0].split('\n')
print "------------------------"
print "And now we see the first half of the tuple as a list of lines of output from our subprocess:\n", stdoutlist
print "------------------------"

I've intentionally kept this "ls -l" example set very simple, in terms of what command we are actually executing, because it's important to see how this object type works. However, typically this tool would be used to run things that are going to generate very large sets of data. For you, that might be a systematic search of a database like the PDB, or large-scale BLAST searches, multiple sequence alignments, searches for transcription factor binding sites in genomes, or a host of other common but computationally expensive bioinformatic problems. Or, for the foreseeable future, it may just be the most reliable way to organize small sets of output data into a data structure or log file.

Project Overview


Starting next week, we're going to move through two class projects. Monday, we will prepare some final skills that will help you write and debug your own complicated code, as well as introduce you to an entirely different way to use Python with an interactive interpreter. On Tuesday, we'll spend all day downloading, parsing, organizing, and doing a preliminary analysis on a recently publsihed dataset. On Wednesday, we'll compare your own re-analyses of the dataset the actual published dataset, and in doing so, we'll learn how to explore large datasets with Python as an analytical scientific toolkit for basic math, statistics, and graphing. The paper for these two days is a recently published Science paper from Jonathan Weissman's Lab (Ingolia, et al 2009). The authors use high-throughout short-read seqeuncing of mRNA molecues in yeast to identify precisely where ribosomes are bound to the transcripts in live cells. There are a couple of interesting findings, and we hope you'll find the paper generally appreciable both for it's curiosity and it's fancy-pants methods.

On Thursday, Terry is going to lead us through a second project oriented more towards protein biophysics. We will be using the program Chimera, which we will script using methods similar to the ones we have discussed this afternoon, to analyze crystal structure data.

Friday morning, we will be covering some other tools that we find useful, but were not of central importance for this class. And throughout the course of next week, we'd like you all to be thinking of another project that you might leverage your new skills against, so that we can take a look at some of them on Friday afternoon.

BLAST


Good ol'e Basic Local Alignment Search Tool, now that's bioinformatics even Grandma can get down with. In its most basic form, BLAST takes a short sequence of either amino acids or nucleotides, searches a database of reference sequences, and returns the most likely matches for the query sequence. It's brilliant, and the only trouble is that we're often interested in BLASTing lotsa query sequences, and much less interested in the mundanity of parsing the output that comes back to us. Fortunately, we can easily script the execution of BLAST such that we can run queries many times with different sequences and settings, and also parse the output back into dictionaries to manipulate and even store longterm on the disk.

The first part of this section will show us how to install a new program in our UNIX environment, and the second part will let us practice using Popen() to script the execution of BLAST.

Installing BLAST


Open your web browser to the NCBI downloads page: http://www.ncbi.nlm.nih.gov/Ftp/

Follow the BLAST link to the NCBI ftp server, then through these folders: downloads -> LATEST

For your respective operating systems, click and download the appropriate .tar.gz file (e.g. for OSX, the correct file is blast-2.2.20-universal-macosx.tar.gz).

Move the file to your home directory, and then go back to the terminal window.

The .tar.gz file extension tells us that the file has been archived (the .tar part) and compressed (the .gz part). The UNIX command to uncompress and expand the file is:

$ tar zxvf [filename]

$ tar zxvf blast-2.2.20-universal-macosx.tar.gz
blast-2.2.20/
blast-2.2.20/bin/
blast-2.2.20/bin/._bl2seq
blast-2.2.20/bin/bl2seq
blast-2.2.20/bin/._blastall
blast-2.2.20/bin/blastall
blast-2.2.20/bin/._blastclust
blast-2.2.20/bin/blastclust
blast-2.2.20/bin/._blastpgp
blast-2.2.20/bin/blastpgp
blast-2.2.20/bin/._copymat
blast-2.2.20/bin/copymat
blast-2.2.20/bin/._fastacmd
blast-2.2.20/bin/fastacmd
blast-2.2.20/bin/._formatdb
blast-2.2.20/bin/formatdb
blast-2.2.20/bin/._formatrpsdb
blast-2.2.20/bin/formatrpsdb
blast-2.2.20/bin/._impala
blast-2.2.20/bin/impala
blast-2.2.20/bin/._makemat
blast-2.2.20/bin/makemat
blast-2.2.20/bin/._megablast
blast-2.2.20/bin/megablast
blast-2.2.20/bin/._rpsblast
blast-2.2.20/bin/rpsblast
blast-2.2.20/bin/._seedtop
blast-2.2.20/bin/seedtop
blast-2.2.20/data/
blast-2.2.20/data/asn2ff.prt
blast-2.2.20/data/BLOSUM45
blast-2.2.20/data/BLOSUM62
blast-2.2.20/data/BLOSUM80
blast-2.2.20/data/bstdt.val
blast-2.2.20/data/ecnum_ambiguous.txt
blast-2.2.20/data/ecnum_specific.txt
blast-2.2.20/data/featdef.val
blast-2.2.20/data/gc.val
blast-2.2.20/data/humrep.fsa
blast-2.2.20/data/KSat.flt
blast-2.2.20/data/KSchoth.flt
blast-2.2.20/data/KSgc.flt
blast-2.2.20/data/KShopp.flt
blast-2.2.20/data/KSkyte.flt
blast-2.2.20/data/KSpcc.mat
blast-2.2.20/data/KSpur.flt
blast-2.2.20/data/KSpyr.flt
blast-2.2.20/data/lineages.txt
blast-2.2.20/data/makerpt.prt
blast-2.2.20/data/objprt.prt
blast-2.2.20/data/PAM30
blast-2.2.20/data/PAM70
blast-2.2.20/data/pubkey.enc
blast-2.2.20/data/seqcode.val
blast-2.2.20/data/sequin.hlp
blast-2.2.20/data/sgmlbb.ent
blast-2.2.20/data/taxlist.txt
blast-2.2.20/data/UniVec.nhr
blast-2.2.20/data/UniVec.nin
blast-2.2.20/data/UniVec.nsq
blast-2.2.20/data/UniVec_Core.nhr
blast-2.2.20/data/UniVec_Core.nin
blast-2.2.20/data/UniVec_Core.nsq
blast-2.2.20/doc/
blast-2.2.20/doc/bl2seq.html
blast-2.2.20/doc/blast.html
blast-2.2.20/doc/blastall.html
blast-2.2.20/doc/blastclust.html
blast-2.2.20/doc/blastdb.html
blast-2.2.20/doc/blastftp.html
blast-2.2.20/doc/blastpgp.html
blast-2.2.20/doc/fastacmd.html
blast-2.2.20/doc/filter.html
blast-2.2.20/doc/formatdb.html
blast-2.2.20/doc/formatrpsdb.html
blast-2.2.20/doc/history.html
blast-2.2.20/doc/impala.html
blast-2.2.20/doc/index.html
blast-2.2.20/doc/megablast.html
blast-2.2.20/doc/netblast.html
blast-2.2.20/doc/rpsblast.html
blast-2.2.20/doc/scoring.pdf
blast-2.2.20/doc/web_blast.pl
blast-2.2.20/VERSION
 
We now have a new directory with all those uncompressed files in it, so let's change to it:

$ cd blast-2.2.20
$ ls
VERSION  bin/     data/    doc/
 

We've got a VERSION file that will tell us what version of BLAST we've got, a doc directory with documentation, a data directory with the various matrices to use for esimating substituion rates, and then the bin directory that contains the actual binary executable for the program. If we change to the bin directory and take a look, we'll see several files:

$ cd bin
$ ls
bl2seq*      blastclust*  copymat*     formatdb*    impala*      megablast*   seedtop*
blastall*    blastpgp*    fastacmd*    formatrpsdb* makemat*     rpsblast*
 

The executables we're interested in for now are formatdb and blastall, the latter being the main executable that we'll be sending arguments to with our script in a moment. But first, we need to add sequence files for BLAST to search against. Today, we're going to use two of them: the nucleotide and amino acid sequence files for S. cerevisiae, which we can download from the command line using a utility called wget.

$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/yeast.nt.gz
--10:55:58--  ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/yeast.nt.gz
           => `yeast.nt.gz'
Resolving ftp.ncbi.nlm.nih.gov... 165.112.7.10
Connecting to ftp.ncbi.nlm.nih.gov|165.112.7.10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD /blast/db/FASTA ... done.
==> PASV ... done.    ==> RETR yeast.nt.gz ... done.
Length: 3,732,371 (3.6M) (unauthoritative)
 
100%[=====================================================================>] 3,732,371    448.73K/s    ETA 00:00
 
10:56:13 (434.34 KB/s) - `yeast.nt.gz' saved [3732371]
 

So, that got us the yeast nucleotide database, and we'll get the amino acid one with the same command (different ftp location):

$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/yeast.aa.gz
--10:59:04--  ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/yeast.aa.gz
           => `yeast.aa.gz'
Resolving ftp.ncbi.nlm.nih.gov... 130.14.29.30
Connecting to ftp.ncbi.nlm.nih.gov|130.14.29.30|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD /blast/db/FASTA ... done.
==> PASV ... done.    ==> RETR yeast.aa.gz ... done.
Length: 1,951,194 (1.9M) (unauthoritative)
 
100%[=====================================================================>] 1,951,194    423.61K/s    ETA 00:00
 
10:59:11 (374.71 KB/s) - `yeast.aa.gz' saved [1951194]
 

Okay, so note that these two files have a .gz extension, which lets us know that they are compressed. To uncompress them, we issue this command (different from the .tar.gz files above):

Yes, you could do this with a web browser, but imagine that you want to download bunches of files, and then imagine how you might be able to script this using wget (or as we'll see next week, built-in python libraries that don't require system calls).

$ gunzip yeast.nt.gz
$ gunzip yeast.aa.gz
 

And one more step before we can use our new BLAST installation: we have to format the sequences we just downloaded to used by BLAST using the formatdb command.

$ bin/formatdb -i yeast.nt -p F
$ bin/formatdb -i yeast.aa -p T
 

Okay, we now have a fully working version of BLAST, installed into our home directories (of course, if you want to do this for a lab server, you'd want it available to all the users, so you wouldn't install it in your home directory, but that's a different lesson).

Let's make a file to query the database with by opening a new text file and pasting in these lines:

>test seq
ATGGTTCATTTAGGTCCAAAGAAACCACAGGCTAGAAAGGGTTCCATGGCTGATGTGCCC
AAGGAATTGATGGATGAAATTCATCAGTTGGAAGATATGTTTACAGTTGACAGCGAGACC
TTGAGAAAGGTTGTTAAGCACTTTATCGACGAATTGAATAAAGGTTTGACAAAGAAGGGA
GGTAACATTCCAATGATTCCCGGTTGGGTCATGGAATTCCCAACAGGTAAAGAATCTGGT
AACTATTTGGCCATTGATTTGGGTGGTACTAACTTAAGAGTCGTGTTGGTCAAGTTGAGC
GGTAACCATACCTTTGACACCACTCAATCCAAGTATAAACTACCACATGACATGAG

And then save the file as query.fasta

And.... here we go a scriptin'.

import subprocess as SP
 
# some arguments for running BLAST
program = 'blastn'
queryseq = 'query.fasta'
database = 's_cerevisiae.nt.niceheader.fasta'
 
#our construction of the executable and the variable arguments for the command (notice the ./ ahead of blastall, which we wouldn't need if the executable was set +x from the command prompt).  Also notice that we're directing the output to the PIPE so that we can parse it in memory.  We could just as easily send this to a filehandle for storage on the disk.
proc = SP.Popen(['/User/matt/blast-2.2.20/bin/blastall', '-p', program, '-i', queryseq, '-d', database], stdout=SP.PIPE)
 
output = proc.communicate()
 
# note that this is returned to us a tuple, even though we only requested to communicate with the stdout when we called Popen().
# to see the output in a reasonable format, try:
 
output[0].split('\n')

You can imaging how you would set a list of query sequences, or a series of databases to query against, or even vary some of the BLAST settings (for example, the minimum E-value to report), and run this again. You also might care which species your hit came from (if you were searching against a larger database), or which chromosome hits were on (to determine synteny, e.g.)

So, use this output to start a couple of exercises:

Exercises


  1. Modify your Popen() call to set a minimum E-value of 1e-08. Parse the output of your BLAST hits into a dictionary keyed by chromosome, containing the length of the hit, the E-value, and the percent identity of the hit.
  2. Right a script using the genetic code (provided here in parse-able text format) to translate your nucleotide sequence into proteins. Then use blastp instead of blastn to search against the amino acid sequence file we downloaded earlier.

Ala/A GCU, GCC, GCA, GCG
Leu/L UUA, UUG, CUU, CUC, CUA, CUG
Arg/R CGU, CGC, CGA, CGG, AGA, AGG
Lys/K AAA, AAG
Asn/N AAU, AAC
Met/M AUG
Asp/D GAU, GAC
Phe/F UUU, UUC
Cys/C UGU, UGC
Pro/P CCU, CCC, CCA, CCG
Gln/Q CAA, CAG
Ser/S UCU, UCC, UCA, UCG, AGU, AGC
Glu/E GAA, GAG
Thr/T ACU, ACC, ACA, ACG
Gly/G GGU, GGC, GGA, GGG
Trp/W UGG
His/H CAU, CAC
Tyr/Y UAU, UAC
Ile/I AUU, AUC, AUA
Val/V GUU, GUC, GUA, GUG
STOP UAG, UGA, UAA