What else is there?
Introduction
We're wrapping up. However, before we go, I'm going to highlight some of the other modules that, while not important enough to really get their own lecture, you might find come in handy. As I mentioned in the beginning of the course, one of the reasons that we teach python is that it's very popular-- and not just among biologists, but among computer scientists as well. This means that if some functionality is needed, there's a good chance that someone's already written it. To name a few things that I've run into in the last few months, as well as the python modules that addressed them:
Image processing/creation -> the python imaging library & the gd module
Super-speedy, super comprehensive mathematics & statistics -> the gnu scientific library
Merging a bunch of computers into one -> mpi4py
Transitioning to faster code -> cython
Checking your email so that you don't have to -> libgmail
We're not going to get into any of those-- some of them are really for more advanced programming classes (mpi4py), and some of them are basically silly, at least in the ways that I can think how to use them (libgmail). However, we will talk about a few things that you will probably find useful at some point in your career.
The Python Path
This is less of a 'probably find useful' than a 'certainly find useful', at least if you continue to use python (as you all will, of course). Remember when we talked about functions and modules? One of my central points was that modules can help you centralize your code, allowing you to share functions between many different programs. My trusty 'parseFasta(file)' function has been used in at least a hundred of my programs in graduate school. However-- do I keep all of these hundred-plus programs in a single directory? Of course not-- that wouldn't be organized at all. But then how do I let all of those programs access the same (again, trusty) seqTools.py module?
The answer comes in telling python a directory or set of directories to look in when it's tasked with finding a module. By default it looks in the directory that the script is in (and inside python's internals) for modules-- it turns out it's easy to tell it to look somewhere else, too. This involves setting the environment variable PYTHONPATH.
$ export PYTHONPATH=/home/lusk/PythonModules # make the directory appropriate, of course
You can add this line to your .bashrc and .bash_profile scripts in your home directory to make sure that python always looks in that directory for modules. Now I can and do keep all (as in, no exceptions) of my modules in one place.
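The same search list is visible from inside python as sys.path, which is seeded from PYTHONPATH (among other places) when the interpreter starts. Here is a minimal sketch of extending it from within a script; the helper name add_module_dir is made up for illustration, and the directory is just the one from the export line above:

```python
import os
import sys

def add_module_dir(path):
    """Append a directory to python's module search path (hypothetical helper).

    Returns True if the directory was added, False if it was already listed.
    """
    path = os.path.abspath(os.path.expanduser(path))
    if path in sys.path:
        return False
    sys.path.append(path)
    return True

# Same directory as in the PYTHONPATH example; adjust to your own setup.
add_module_dir('/home/lusk/PythonModules')
print(sys.path[-1])
```

Setting PYTHONPATH once in your shell startup files is usually tidier; editing sys.path by hand is mostly useful for one-off scripts.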
Regular expressions
Regular expressions are, in the most basic sense, an extension of the string methods find, count, startswith, etc. that we've seen before. However, instead of dealing only in perfectly matching strings, they have a lot more flexibility, and with that, a lot more power.
Here's a basic overview of the syntax:
import re

searchString = r'word'
queryString = 'Is there something that says "word" in this sentence?'
if re.search(searchString, queryString):
    print('Yes, there is.')

# there is no 'word' in this one, so nothing is printed
queryString2 = 'This exercise is obnoxiously recursive'
if re.search(searchString, queryString2):
    print('Certainly so.')
You have a search string, which specifies what you're looking for, and you have a query string, which specifies where you're looking. re.search then looks for you, and if it finds a match, returns a match object holding some information about it. If it doesn't, it returns None. In most cases, as now, we just want to see whether or not the query contains the search string, so we're just using re.search as part of an if statement.
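To make the two possible return values concrete, here is a small sketch that looks at what re.search hands back instead of only using it in an if statement. The variable names hit and miss are just for illustration:

```python
import re

query = 'Is there something that says "word" in this sentence?'
hit = re.search(r'word', query)
miss = re.search(r'word', 'This exercise is obnoxiously recursive')

print(hit is not None)   # a match object comes back when the pattern is found
print(miss is None)      # None comes back when it is not
print(hit.group(0))      # the text that matched
print(hit.start())       # index in the query string where the match begins
```

Since None counts as false and a match object counts as true, dropping re.search straight into an if statement works as shown earlier.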
However, if we just wanted to match letter-by-letter, we could use find or count. We want more. Much of the power of regular expressions comes from the set of symbols that they can contain that match not just letters, but parts of the word, or sets of letters, or quantities of letters, or quantities of certain sets of letters at a given part of a word-- etc, etc, and, etc. Here is a small subset of what search strings can contain:
'Wheres'
^ -> beginning of the string
$ -> end of the string
\b -> the beginning or end of a word
'Whats'
. -> any character
[] -> any character inside brackets, e.g. '[abfg]'
\s -> any whitespace character
\d -> any digit
'How manys'
+ -> one or more
? -> zero or one
* -> zero or more
{m} -> exactly m
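Each of the symbols listed above can be tried out in isolation. The following sketch pairs every symbol with a tiny query string that it should match; all of the patterns and queries are invented for illustration:

```python
import re

# (pattern, query) pairs; each pattern exercises one symbol from the list
examples = [
    (r'^ATG', 'ATGCCC'),         # ^   : 'ATG' at the beginning of the string
    (r'TAA$', 'CCCTAA'),         # $   : 'TAA' at the end of the string
    (r'\bcat\b', 'a cat sat'),   # \b  : 'cat' as a whole word
    (r'c.t', 'cut'),             # .   : any one character between 'c' and 't'
    (r'[abfg]', 'xyzf'),         # []  : any one of a, b, f, g
    (r'\d\s\d', '4 2'),          # \d, \s : digit, whitespace, digit
    (r'A+', 'GAAAT'),            # +   : one or more 'A'
    (r'colou?r', 'color'),       # ?   : zero or one 'u'
    (r'AT*G', 'AG'),             # *   : zero or more 'T'
    (r'^[ACGT]{3}$', 'ATG'),     # {m} : exactly three nucleotides
]

for pattern, query in examples:
    found = bool(re.search(pattern, query))
    print(pattern + ' in ' + repr(query) + ': ' + str(found))
```

Changing a query so that it no longer fits (say, 'concatenate' for the \b example) is a quick way to get a feel for what each symbol really demands.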
This makes for things that are, at first, somewhat difficult to read. For example, if we wanted a string that started with a six-letter word beginning with 'A', followed, at some point, by a word with four characters from the set ('y','b','c','6','a'), and ending with at least one 'q' but maybe more, we would write:
searchString = r'^A.{5}\b.*\b[ybc6a]{4}\b.*q+$' # remember that 'r' to start!
queryString = 'Arthur was a baby in Iraq'
if re.search(searchString, queryString):
    print('pass')
If you're a perl programmer, then you can probably read that like you read your native tongue-- perl programmers reach for regular expressions all the time, even for jobs that python handles with find, startswith, etc. However, thankfully, things are rarely so complicated.
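For those of us who don't read regular expressions natively, the re module offers the re.VERBOSE flag, which ignores whitespace inside the pattern and allows comments. Here is the 'Arthur' pattern from above written out that way, as a sketch; the second query string is made up to show a failure:

```python
import re

pattern = re.compile(r"""
    ^A.{5}\b          # an 'A' and five more characters, ending at a word boundary
    .*                # anything at all
    \b[ybc6a]{4}\b    # a whole word of four characters drawn from y, b, c, 6, a
    .*                # anything at all
    q+$               # one or more 'q', right at the end of the string
    """, re.VERBOSE)

for query in ['Arthur was a baby in Iraq', 'Arthur was a baby in Iran']:
    print(query + ' -> ' + str(bool(pattern.search(query))))
```

Note that '.' matches any character, not only letters, so 'six-letter word' is enforced rather loosely here.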
For example, say that after taking this python course, you get so fascinated by computing and computers (I mean, we're gifted pedagogues, right?) that you leave science to take a job administering MCB's network. You start with email addresses, hoping to impose some order on a chaotic landscape. According to your strict worldview, email addresses should contain some combination of the person's:
(a) last name (always)
(b) first initial of name (but only at the first position, and only if the last name is also there)
(c) possibly some other letters, such as the middle initial, but only if (a) and (b) are satisfied
However, there are hundreds of addresses in the system! Clearly, you need a script to figure out who to harass. With regular expressions, this turns out to be fairly easy:
#!/usr/bin/env python
import re

emails = {}
emails['Rich Lusk'] = 'lusk@berkeley.edu'
emails['Terry Lang'] = 'terry@lego.berkeley.edu'
emails['Matt Davis'] = 'matthewdavis@berkeley.edu'
emails['Rich Price'] = 'rich_price@berkeley.edu'
emails['Roseanne Wincek'] = 'rwincek@gmail.com'
emails['Aaron Hardin'] = 'aihardin@berkeley.edu'
emails['Angela Brooks'] = 'angelabrooks@berkeley.edu'

# imagine the damage that could be done with libgmail here
for name in emails:
    spl = name.split()
    firstName = spl[0]
    lastName = spl[1]
    # optional first initial at the start, then anything, then the last name
    searchString = r'^' + firstName[0].lower() + '?.*' + lastName.lower()
    if re.search(searchString, emails[name]):
        print(name + ': OK')
    else:
        print(name + ': change your email, hippie!')
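If you wanted to reuse this check, one option is to wrap it in a function. The sketch below builds the same pattern as the script above, with two assumptions of mine: the helper name email_ok is made up, and the name pieces are passed through re.escape so that any character with a special meaning (a '.' in a name, say) is matched literally:

```python
import re

def email_ok(name, address):
    """Return True if address fits the pattern built from a 'First Last' name.

    Hypothetical helper; builds the same pattern as the script above.
    """
    parts = name.lower().split()
    first, last = parts[0], parts[-1]
    pattern = r'^' + re.escape(first[0]) + r'?.*' + re.escape(last)
    return re.search(pattern, address) is not None

print(email_ok('Rich Lusk', 'lusk@berkeley.edu'))
print(email_ok('Terry Lang', 'terry@lego.berkeley.edu'))
```

One thing worth noticing when you play with it: because the initial is optional and '.*' can absorb anything, this pattern in practice only checks that the last name appears somewhere in the address. Tightening it up is left to the strict network administrator.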
You can also capture the output of a match.
sentence = 'I am a good sentence, and I dependably have punctuation.'
nearComma = re.search('[a-z]+,', sentence)
print(nearComma.group(0))

# you can also specify subsets using parentheses
nearComma = re.search('([a-z]+) ([a-z]+)(,)', sentence)
print(nearComma.group(0)) # the complete match
print(nearComma.group(1)) # what's in the first set of parens
print(nearComma.group(2)) # second set
print(nearComma.group(3)) # third set
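Counting parentheses gets tedious once a pattern has more than a couple of groups. The re module also lets you name a group with the (?P<name>...) syntax and ask for it by that name. A small sketch, using a coordinate string that I invented for the example:

```python
import re

# Hypothetical genome coordinate, made up for this example
location = 'chr2:1500-2300'
m = re.search(r'(?P<chrom>\w+):(?P<start>\d+)-(?P<end>\d+)', location)

print(m.group('chrom'))   # look a group up by name
# captured groups are strings, so convert them before doing arithmetic
print(int(m.group('end')) - int(m.group('start')))
print(m.groups())         # every group at once, as a tuple
```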
And that's the basics of regular expressions! In previous versions of the course, when we were teaching in perl, we spent hours and hours and hours of class time on them, because they were such a fundamental part of the language-- again, imagine not having 'find' or 'startswith.' Frankly, I'm happy to only have to use them once every few months.
SQL
Imagine that you need to store a really, really huge piece of data. Not only that, but perhaps you and others need to update and modify this piece of data on a regular basis. This raises a number of complications-- How do you store large data sets efficiently? How can you manage who can change which parts of these sets? With many users making so many modifications, how can you set up protections such that the quality of the data doesn't degrade over time? And finally, and perhaps most importantly-- parsing and loading gigabytes of data can take a whole pile of time and effort. How can we access it without going through that trouble?
These questions have been around for a long time, and as you can imagine, the solution (or at least a solution) has been around for almost as long. Large pieces of data that large numbers of people need access to are stored in databases. Because databases have been around for so long, the means of accessing and interfacing with them have largely been standardized in a primitive programming language called Structured Query Language, or SQL for short. While I'm not going to go into any detail about it here-- SQL is, after all, a completely different programming language-- something that many people find neat about python is its ability to very, very easily interact with databases.
This easy interaction comes from a module called sqlite3, python's interface to SQLite, a small and tidy database engine that understands SQL. With it, you can create small-scale databases on your own computer. This can come in handy if you find yourself generating large pieces of data that become unwieldy to store in basic 'flat' text files, and a number of python modules and programs use sqlite3 to store all of their data.
If you find yourself generating lots of data, it's worth exploring-- SQL, while another language, is small and primitive enough that you can learn the basics quickly.
Here's an example of how it might be used:
import sqlite3

# make a database file and connect
conn = sqlite3.connect('/tmp/example')

# create a tool 'c' to interface with the database
c = conn.cursor()

# Create table
c.execute('''create table sequences
(id text, organism text, sequence text)''')

# Insert a row of data
c.execute("""insert into sequences
values ('FuzzynessGene','Mouse','ATAGGTACGA')""")

# Save (commit) the changes
conn.commit()

# We can also close the cursor if we are done with it
c.close()
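Storing data is only half the story, so here is a sketch of getting it back out with a select statement. It assumes the same three-column 'sequences' table as above, but uses the special name ':memory:' so that nothing is written to disk, and it passes python values through '?' placeholders instead of pasting them into the SQL text. The second row is made up:

```python
import sqlite3

# ':memory:' keeps the database in RAM, so this sketch leaves no file behind
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('create table sequences (id text, organism text, sequence text)')

# '?' placeholders let sqlite3 fill in python values for us
rows = [('FuzzynessGene', 'Mouse', 'ATAGGTACGA'),
        ('HypotheticalGene', 'Fly', 'ATGCCCTAA')]   # invented second row
c.executemany('insert into sequences values (?, ?, ?)', rows)
conn.commit()

# ask for only the mouse sequences; fetchall returns a list of tuples
c.execute('select id, sequence from sequences where organism = ?', ('Mouse',))
mouse = c.fetchall()
print(mouse)

conn.close()
```

The point of the exercise is in the select line: the database does the searching, so the script never has to parse and load the whole data set itself.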
I won't assign any exercises about this (if you're desperate, see last year's equivalent lecture at http://intro-prog-bioinfo-2008.wikispaces.com/Session8.2), as, well, it's not something that I use myself. However, this will likely come in handy for some of you, and you should check out Lenny Teytelman's excellent 'BioSQL' tutorial at http://biosql.wikispaces.com/.
Exercises:
0. This is the fourth iteration of this course, and I believe that we've made it better each time (pity those in the first). However, I bet we can make it better the fifth time around, too-- and we'd like your help. Could you fill out this course evaluation?
In the spirit of the course, just open it up in your favorite text editor and save it from there.
In order to preserve anonymity, I've set up a separate gmail account. The username is 'ipb.evaluations' and the password is 'evaluation'. Please complete the form above, log in to that account, and send an email to intro.prog.bioinformatics@gmail.com with the form attached. Then log out-- I'm not sure how high gmail's tolerance for multiple-users-logged-in is.
After you've done this, your first priority should be completing the project. If you are finished, you can explore the following exercises.
1. Let's do some regular expression drills.
a) Create a regular expression that matches any line that begins with the character '>'.
b) Modify your regular expression from part (a) so that it also matches the first word in the ID line. Print the first word of the ID line, only using regular expressions-- don't use slices.
c) Modify your regular expression from part (b) so that it expects ID lines to be in the following format.
That is, ID lines should begin with a '>', then have an identification string, then have a '|', then have the name of an organism. Create a regular expression that captures the identification string in a variable 'id' and the organism in a variable 'org.'
2. Make a primitive ORF finder using regular expressions. That is, you should use the template:
import re
import sys

sequence = sys.argv[1]
regex = <your code here>
x = re.match(regex, sequence)
print(x.group(0))
The match should begin at 'ATG' and end with 'TAA.' The number of nucleotides in between should be a multiple of three.
3. Use python's online documentation to find out how to use the re.sub function. Modify your script from (2) so that it deletes the ORF sequence.
4. Make a 'hello, <name>' function, like we defined in the lecture introducing functions. The name should be on the command line. Put it in a separate module from the script that calls it, and put that module in a different directory. Run the script.
5. Pat yourself on the back for a job well done.