Welcome to UC Berkeley's fourth sort-of annual programming bootcamp! If this isn't what you signed up for, navigate away from this web site.
This will be a busy course, and we have a few goals for this morning-- first, we'd like to give you some context about what to expect to learn from this course, and what not to. We're going to talk a little about how the class is organized, who the instructors are, and finally, we will need to get you set up on your own computers, which will involve teaching you a little about how to function in a Linux environment. So, without further ado:
Why program?
I'm going to be brief here, because most of you in this room signed up for the class because you had some motivation to do so. In short, by leveraging your computer to work for you, programming saves you time. It has the potential to make difficult tasks easy and impossible tasks... maybe also easy.
What are you going to learn?
You are going to learn Python. From one of the advertisements for the course:
Python is a simple and powerful programming language that is used for
many applications from simple tasks to large software development
projects. It has become popular as both a first language for
beginning students and an everyday one for advanced programmers.
The course is broadly divided into two parts. In the first week, we will focus on the basic core of the language: how to store information, how to let your programs receive, interpret, and output information, and, finally, ways to build greater and greater complexity into your programs. In the second week, we will continue to teach you the principles of the language, but we will also discuss powerful extensions built for scientific programming. During this week, we will also have you apply your new-found skills to a project: replicating the results of a recent paper that analyzed high-throughput sequencing data.
Our focus throughout all of this is to allow you to apply programming to the problems that you face in the lab. Although we will only directly cover a couple of applications of programming to biology, we expect you to leave this course with a sufficiently generalized knowledge of programming that you will be able to apply your skills to whatever you happen to be working on-- even if biology has nothing to do with it! With that in mind, the last lecture of the class is an open study hall where you can bring your problems to us, and we will help you develop a computational strategy to address them.
However, while we do hope to show you the beginnings of programming, this course alone won't land you that Google senior programmer job. So, it might be worthwhile to tell you what the limitations of this course are.
What are you not going to learn?
Python is big. Thousands of computer scientists and programmers have used it, contributed to it, and extended it to their own sub-fields. We couldn't get to all of it even if we had much longer than the two weeks we have. To make a tired analogy to spoken language, learning French isn't just a matter of a year or two of coursework; the most important thing is experience. Likewise, you won't come out of this class an advanced programmer, 'fluent' in Python, but students in the past have come out of it with enough knowledge to get by, and more importantly, enough knowledge to be able to extend their skills and become advanced programmers on their own.
We won't teach you about object-oriented programming, writing parallel programs, integrating your code with faster code written in C or C++, or a host of other powerful-but-difficult methods and topics, but we will teach you enough that you should be able to learn about them if and when you want to. And that's pretty good-- although I've used all of those things, I don't believe that I use anything on an everyday or even every-week basis that we won't be covering in this course.
How are you going to learn it?
You may already be familiar with the basic setup-- each day will start with a lecture, followed by a lab until a break for lunch, followed by another lecture and another lab. You have two incredibly useful resources at your disposal during the labs: first, you have us, the TAs and instructors, who are all familiar with the language and here to help you out. Second, you have the extensive documentation about Python and programming available on the internet and in your book.
Importantly, and especially during the first week, we don't want people to get 'stuck.' You will have a number of exercises each day covering the breadth of the lectures, and it's important for you to be able to demonstrate what you've just learned-- one best learns programming by doing, not listening, and so if time runs out and you've only completed half the problems, you've only really learned half the material. So, please don't hesitate to raise a hand and ask for help.
But for now, let's get started:
Getting started
There are people here using both Microsoft and Apple computers, and there are some using Linux as well. Our first task is to get everyone on the same page: this involves opening a computer terminal from which we can edit and run computer programs, and getting some familiarity with the interface to that terminal. The Apple people will have an easier time getting set up, but it's not too hard to do this in Windows as well.
We're going to have a short break here for perhaps, hopefully, the most chaotic few minutes of the course, while we make sure that everyone is started up and set up correctly. So, everyone: open a terminal application and follow along.
Using the shell
You will spend nearly all of your time in one of two places: the shell or the text editor. The shell allows you to move and copy files, run programs, and more, while the text editor is where you will write your programs. We will focus mostly on the shell this morning, although we will touch on the basic usage of a popular text editor, emacs. We will begin programming this afternoon.
Here are some common shell commands:
apropos 'search text' [what's the command to do 'xxx'?]
There are many more commands in Linux than we can cover in a single lecture or in 10 days. Use the 'apropos' command to search for ways to do something particular.
e.g. apropos 'remove file'
man command_name [look up information on a particular command]
Most commands have many useful options. For information on a particular command, look at its manual pages with the "man" command.
e.g. man rename
pwd [where am I?]
(Print Working Directory) Prints your current location, such as "/home/matt/Docs". This is the directory in which you are at the current moment. If you create any files, they will appear in this spot. When you first open the terminal shell, you will be in your "home"/base directory ("/home/matt").
cd directory_path [move to the named directory]
(Change Directory) Given a complete path, this command moves your "current
location" to the specified directory.
e.g. cd /home/matt/ [will take me to my home directory]
If you simply want to descend into a subdirectory from your current position,
you can omit the full path and just specify the directory name. This is called giving
a relative path; it's a frequent source of error in programming.
e.g. cd Docs [if I execute this from /home/matt/, I'll end up in /home/matt/Docs]
To move back to the parent directory, just do "cd .."
e.g. cd .. [If I execute this from /home/matt/Docs/, I will go to
/home/matt/]
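To make the difference between absolute and relative paths concrete, here is a minimal sketch. It builds a throwaway directory tree with "mktemp -d" rather than touching a real home directory; the "matt/Docs" names are just stand-ins for the paths used above.

```shell
# Scratch tree standing in for /home/matt (the paths are made up for illustration).
workdir=$(mktemp -d)
mkdir -p "$workdir/matt/Docs"

cd "$workdir/matt"      # absolute path: works no matter where you start
cd Docs                 # relative path: only works because we are already in .../matt
here=$(pwd)
echo "$here"            # a path ending in /matt/Docs

cd ..                   # move up to the parent directory
parent=$(pwd)
echo "$parent"          # a path ending in /matt
```

If you ran "cd Docs" from some other directory, the shell would complain that there is no such directory, which is the usual way relative paths go wrong.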
ls [lists contents of a directory]
(LiSt) Shows the files and directories in the current location.
Options:
ls -l [lists security permissions, owners of files, sizes,
date last modified]
ls -F [directories will be shown with a "/" after the name,
so it's easier to tell them apart from files]
ls path [lists contents of the specified directory]
ls .. [list contents of the parent directory]
Options can be combined as: ls -lF /home/lenny/Docs/
Now that we can move around directories and look inside them,
a few words about the Linux directory structure. The topmost level
is "/" and everything else resides somewhere below in the hierarchy of
directory branches.
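Here is a small sketch of the "ls" variants above, run in a scratch directory. The file and directory names are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir Docs
touch notes.txt Docs/draft.txt

listing=$(ls -F)        # one directory (marked with "/") and one plain file
echo "$listing"
ls -l                   # long listing: permissions, owner, size, date
inside=$(ls Docs)       # look inside a directory without moving into it
echo "$inside"
ls -lF "$workdir/Docs"  # options combined, full path given
```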
mkdir directory_name [Create a given directory]
(MaKe DIRectory) Exactly what it says - lets you create new directories.
cp original_name copy_name [copy file or directory]
(CoPy) It is possible to copy all subdirectories and their files
under a given directory recursively with the command:
cp -r source_directory destination_directory
mv source destination [move files or directories]
(MoVe) Similar to the copy command, but this actually moves the desired
file or directories instead of copying them.
This is the command that is used to rename files or directories.
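The three commands above fit together as in the following sketch. It assumes a scratch directory and made-up names ("project", "backup", "seq.txt").

```shell
workdir=$(mktemp -d)
cd "$workdir"

mkdir project
echo "ACGT" > project/seq.txt

cp -r project backup                       # copy the directory and everything in it
mv project/seq.txt project/renamed.txt     # "moving" within a directory is a rename

ls project backup                          # project has renamed.txt, backup still has seq.txt
```

Note that "mv" leaves no copy behind, which is why the original name disappears from "project" while the earlier copy in "backup" is untouched.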
more file_name [view contents of a file]
Shows contents of a file. Allows scrolling, jumping, searching.
Can be used only for text files (won't show you MS Word or PowerPoint -
need special programs for that)
Most useful options (once inside the viewer):
-"space" to scroll down a page
-"G" (Shift-G) to go to the end of the file (this works in "less", which is what "more" actually runs on some systems, e.g. Mac OS X)
-"/" to type in text to search for
-"q" to quit
-"h" for a complete list of functions
head filename [print first 10 lines of the file]
By default, prints the top 10 lines of the input file.
To print a different number of lines, execute:
head -n number filename
So, "head -n 100 unix_ref.txt" will print the
top 100 lines of this file.
tail filename [print the last ten lines of the file]
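A quick way to see "head" and "tail" in action is to make a file whose lines are numbered. This is only a sketch; "numbers.txt" is a made-up file created in a scratch directory.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Write a 20-line file: "line 1" through "line 20".
i=1
while [ "$i" -le 20 ]; do
    echo "line $i"
    i=$((i + 1))
done > numbers.txt

first=$(head -n 3 numbers.txt)             # line 1, line 2, line 3
last=$(tail -n 2 numbers.txt)              # line 19, line 20
default_count=$(head numbers.txt | wc -l)  # with no option, head prints 10 lines
echo "$first"
echo "$last"
```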
cat file1 file2 ... [print named files to the screen]
(conCATenate) If given just one file, cat will simply spit out the
contents of the file to the screen. However,
given multiple files, this will concatenate all of
them, printing one after the other.
grep and egrep 'search_string' file
(Global Regular Expression Print)
Searches for the "search string" in a text file and
prints out all lines where it finds the desired text.
e.g.: grep '>' fasta_file.fa [Will print to screen the headers
of the fasta file, since each header
line begins with ">".]
-v will invert the search.
e.g: grep -v '>' fasta_file.fa
[Prints out all non-header lines, that is,
the sequences only.]
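The following sketch tries these searches on a tiny invented fasta file ("toy.fa"). One small departure from the examples above: the pattern '^>' anchors the match to the start of the line, so a ">" appearing later in a line would not count as a header. The "-c" option, which counts matching lines instead of printing them, is also shown.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two made-up sequences, just to have something to search.
printf '>chr1 toy\nACGT\nGGCC\n>chr2 toy\nTTAA\n' > toy.fa

headers=$(grep '^>' toy.fa)        # the two header lines
sequences=$(grep -v '^>' toy.fa)   # everything that is not a header
n_headers=$(grep -c '^>' toy.fa)   # count matching lines: 2

echo "$headers"
echo "$sequences"
echo "$n_headers"
```

On most systems "egrep" behaves like "grep -E" (extended regular expressions); for a simple pattern like this one the two give the same result.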
wc file_name
(Word Count) Reports the number of:
lines, words, characters (in that order)
of a given file.
star "*"
The star of unix is simple and incredibly useful. This symbol
can be used with all of the commands listed above, and means that you
want it to MATCH ANYTHING. This is the "wild-card"
e.g.: ls *.fa [list all fasta files - files with ".fa" on the end]
wc * [Do the word-count on all files in the current directory.]
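As a rough illustration of the wild-card together with "wc", the sketch below creates three small made-up files and lets "*" pick out only the fasta ones.

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf '>a\nACGT\n' > one.fa
printf '>b\nGG\n' > two.fa
echo "not a sequence" > notes.txt

fasta_files=$(ls *.fa)        # matches one.fa and two.fa, but not notes.txt
echo "$fasta_files"

wc -l *.fa                    # line count for each fasta file, plus a total
n_lines=$(wc -l < one.fa)     # 2 lines: the header and one sequence line
echo "$n_lines"
```

The shell expands "*.fa" into the list of matching names before the command runs, so "ls" and "wc" never see the star itself.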
pipe '|'
Piping with "|" connects unix commands, allowing the output
of one command to "flow through the pipe" to another.
e.g. grep '>' file.fasta | wc [get all header lines from the
fasta file, but instead of printing
them to the screen, send the output
to the "wc" command, to count the
number of headers. Effectively,
this will count number of sequences
in the file.]
Redirection ">"
In addition to redirecting output to another command, the results can
be sent into a file with ">".
e.g. cat file1 file2 file3 > file4 [Combine files 1-3 into file4.]
The ">" will create a new file or overwrite an existing one. If you
simply want to add to a file, use ">>".
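Pipes and redirection are easiest to understand by watching what ends up in the files. This sketch uses made-up file names in a scratch directory, and uses "wc -l" so that only the line count is reported.

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf '>s1\nAC\n' > file1
printf '>s2\nGT\n' > file2

cat file1 file2 > combined            # ">" creates (or overwrites) combined
n=$(grep '>' combined | wc -l)        # pipe: header lines flow into wc, giving 2
echo "$n"

echo "extra" >> combined              # ">>" appends: combined now has 5 lines
echo "fresh" > file1                  # ">" overwrites: the old contents of file1 are gone
cat file1                             # prints: fresh
```

Because ">" silently replaces an existing file, it is worth double-checking the target name before pressing enter.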
cut -f NUMBER file_name [Extract one or more columns from a file]
Prints only the specified column/field from a text file. By
default, expects the fields to be tab-separated.
Options:
-d ' ' [specifies the character separating the columns.
If fields are separated by spaces, just use
"cut -d ' ' -f..." if they are separated by
commas, "cut -d ',' -f...", and so on.]
Examples:
cut -f 1 some_file.txt [get the first column of the file]
cut -d ' ' -f 3-5 some_file.txt [get columns 3,4,5 from
a space-separated file]
cut -f 2,6,7 some_file.txt [get columns 2,6,7 from the file]
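Here is a sketch of "cut" on two tiny invented tables, one tab-separated and one comma-separated. The contents are made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Tab-separated table (tabs are the default delimiter for cut).
printf 'gene1\tORF\tVerified\ngene2\tORF\tDubious\n' > table.tsv
col1=$(cut -f 1 table.tsv)          # gene1, gene2
cols=$(cut -f 1,3 table.tsv)        # first and third columns, still tab-separated

# Comma-separated table: name the delimiter with -d.
printf 'a,b,c\n1,2,3\n' > table.csv
csv_cols=$(cut -d ',' -f 2-3 table.csv)   # "b,c" and "2,3"

echo "$col1"
echo "$cols"
echo "$csv_cols"
```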
sort file_name [Order lines in a file alphabetically/numerically]
Will sort lines in a text file. There are many useful
options:
sort -r file_name [will do a bottom-up reverse sort]
sort -k # file_name [will sort on the specified column; by default columns are separated by whitespace,
so this works for a tab-separated file whose fields contain no spaces]
sort -k # -t ' ' file [also sort on given column, but the columns
are now space-separated]
sort -n file_name [do a numeric rather than alphabetical sort]
Of course, all the above options can be combined.
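The difference between alphabetical and numeric sorting trips up almost everyone once, so here is a small sketch. The files and their contents are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf '10\n9\n100\n' > numbers.txt
alpha=$(sort numbers.txt)            # alphabetical: 10, 100, 9
numeric=$(sort -n numbers.txt)       # numeric: 9, 10, 100
reversed=$(sort -n -r numbers.txt)   # numeric, reversed: 100, 10, 9

# Space-separated file: sort numerically on the second column.
printf 'pear 10\napple 9\nfig 100\n' > fruit.txt
by_count=$(sort -t ' ' -k 2 -n fruit.txt)   # apple 9, pear 10, fig 100

echo "$alpha"
echo "$numeric"
echo "$by_count"
```

Without "-n", "9" sorts after "100" because the comparison is done character by character, and "9" comes after "1".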
uniq file_name [print distinct lines from a sorted file]
This will run through the whole file, comparing every two
adjacent lines, and will remove the duplicate lines.
Unless you have a good specific reason not to, you should
always sort the file first.
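To see why sorting first matters, compare "uniq" on an unsorted and a sorted version of the same made-up list:

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf 'Verified\nDubious\nVerified\nVerified\nDubious\n' > types.txt

unsorted=$(uniq types.txt)          # only ADJACENT repeats collapse: 4 lines remain
sorted=$(sort types.txt | uniq)     # sorted first: Dubious, Verified

echo "$unsorted"
echo "$sorted"
```

In the unsorted case "Verified" and "Dubious" each still appear twice, because their repeats were not next to each other.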
Permissions and chmod
Files in Linux are not all created equal. Some files are read-only, some can be written, and some are executable, meaning that Linux knows how to run them as programs. Script files can be executed in one of two ways:
If the file is created (or modified) to be executable, then you can execute the file by giving its "full local path," with the "./" characters before the name (for a script, the first line must also name the interpreter, e.g. "#!/usr/bin/env python"):
./hello_world.py
But if the file is not executable, you have to explicitly tell Linux what to do with the file:
python hello_world.py
chmod is the command to modify permissions. This can get fairly complicated, but the command to make a file executable is simple:
chmod +x filename
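The two ways of running a script can be tried out as follows. To keep the sketch independent of any Python installation, it uses a tiny shell script ("hello_world.sh", a made-up name) in place of hello_world.py; the idea is the same, with "sh" playing the role of "python".

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A two-line script; the first line tells Linux which interpreter to use.
printf '#!/bin/sh\necho "hello world"\n' > hello_world.sh

explicit=$(sh hello_world.sh)   # not executable yet, so name the interpreter yourself
echo "$explicit"                # prints: hello world

chmod +x hello_world.sh         # make it executable
direct=$(./hello_world.sh)      # now it can be run by its path alone
echo "$direct"                  # prints: hello world
```

You can confirm the change with "ls -l": after "chmod +x" the permissions column gains "x" entries.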
Using Emacs
Lastly, we're going to need a text editor so that we can create and edit our own files. A popular text editor, and the one that I personally use, is called emacs. While it has a lot of features, we're going to stick with the very, very basics for now. That is, we're going to learn how to open a file, edit it, save it, and close it.
< demonstration on screen >
open a file: <prompt> emacs <file> &
save a file: CTRL-X CTRL-S
close: CTRL-X CTRL-C
Exercises
1. Make a directory called "fasta_files" and change into it
Go to http://www.yeastgenome.org, then click the "FTP" link on the right hand sidebar (under Data Download), then "sequence", "genomic_sequence", "chromosomes", "fasta"
Download one-by-one all cerevisiae chromosomes (depending on your web browser, you may need to right-click and save the linked file).
Make a single whole genome file called "cerevisiae_genome.fasta"
Count the chromosomes in the whole genome file using commands from the lecture
Get size of genome, excluding the header lines
2. Get the list of cerevisiae chromosome features: ftp://genome-ftp.stanford.edu/pub/yeast/chromosomal_feature/SGD_features.tab
Count total genes
Count only verified genes
Count only uncharacterized genes
What other types of genes are in this file?
3. Use apropos to find a command that might tell you how much 'disk space' you have left on your system. Use the 'man' command to see how it works. How much space is left on your system? Make the command report its output in gigabytes and megabytes ('human-readable' form).
4. Make a temporary directory under "fasta" and "cd" into it.
Connect to the SGD ftp server with the 'ftp' program:
"ftp genome-ftp.stanford.edu"
You will be asked to present a name. Type 'anonymous'. Just press enter when prompted for a password.
Use "ls" and "cd" to navigate to the chromosomes directory.
Look at the man pages for the "ftp" program to figure out how to download, with a single command, all of the chromosomes for cerevisiae.
Bonus:
Figure out how to count the different types of genes in #2 without the "wc" command.
You should be able to get the breakdown of all the different gene types and their counts with a single statement (with pipes of course). If you can do this, then you should feel no small measure of pride in your newly-minted linux abilities.
Solutions (these solutions are old, so tell me if they seem wrong)
Problem1
Number of chromosomes (once the chromosomes are all in one directory)
cat *.fsa > cerevisiae_genome.fa
egrep '>' cerevisiae_genome.fa | wc
or
egrep -c '>' cerevisiae_genome.fa
The total number of chromosomes is 17. This includes the mitochondrial chromosome, and if you did
"egrep -c 'chr' cerevisiae_genome.fa" you would have missed it. In general, fasta file format
always has a header line starting with ">" before the sequence, whether DNA or protein.
Genome size
egrep -v '>' cerevisiae_genome.fa | wc
Being picky here, notice that "wc" includes line breaks in the total character count ("wc" on a file with "Hello" will give 6).
So to get the real genome size, subtract the number of lines (202620) from the character count (12359298): 12359298 - 202620 = 12156678.
Cerevisiae thus has 12,156,678 bp.
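The same bookkeeping can be checked on a toy file small enough to count by hand. This is a sketch with an invented two-sequence fasta file, not the real genome; it also shows an alternative that is not part of the original solution, namely stripping the line breaks with "tr -d" before counting.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# 4 + 2 + 3 = 9 bases spread over three sequence lines (made-up data).
printf '>chr1 toy\nACGT\nGG\n>chr2 toy\nTTA\n' > toy_genome.fa

with_newlines=$(grep -v '>' toy_genome.fa | wc -c)   # 12: 9 bases + 3 line breaks
n_lines=$(grep -v '>' toy_genome.fa | wc -l)         # 3
size=$((with_newlines - n_lines))                    # 9
echo "$size"

# Alternative: delete the line breaks first, then count characters directly.
direct=$(grep -v '>' toy_genome.fa | tr -d '\n' | wc -c)   # 9
echo "$direct"
```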
Problem2
Chromosome count using the features file.
cut -f 7 SGD_features.tab | egrep 'chr' | sort | uniq | wc
Count total genes: 6605.
cut -f 2 SGD_features.tab | egrep 'ORF' | wc
Verified genes: 4648.
egrep ORF SGD_features.tab | cut -f 1-3 | egrep 'Verified' | wc
Uncharacterized: 1142.
egrep ORF SGD_features.tab | cut -f 1-3 | egrep 'Unchar' | wc
All gene types: Dubious, Uncharacterized, Verified, Verified|silenced_gene.
cut -f 2-3 SGD_features.tab | egrep 'ORF' | cut -f 2 | sort | uniq
Bonus
"wget"
wget --retr-symlinks ftp://genome-ftp.stanford.edu/pub/yeast/sequence/genomic_sequence/chromosomes/fasta/*.fsa
Counting different gene types
cut -f 2-3 SGD_features.tab | egrep 'ORF' | cut -f 2 | sort | uniq -c
815 Dubious
1142 Uncharacterized
4644 Verified
4 Verified|silenced_gene
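If you don't have the features file handy, the same pipeline can be tried on a stand-in. The five rows below are invented and only mimic the layout assumed by the solution (feature type in column 2, qualifier in column 3); "grep" is used in place of "egrep", which behaves the same for this pattern.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Made-up rows: id, feature type, qualifier (tab-separated).
printf 'S1\tORF\tVerified\nS2\tORF\tDubious\nS3\tORF\tVerified\nS4\ttRNA\t\nS5\tORF\tUncharacterized\n' > toy_features.tab

# Keep columns 2-3, keep only ORF rows, keep the qualifier, then count each distinct value.
counts=$(cut -f 2-3 toy_features.tab | grep 'ORF' | cut -f 2 | sort | uniq -c)
echo "$counts"    # 1 Dubious, 1 Uncharacterized, 2 Verified
```

The "-c" option is what replaces "wc" here: "uniq" prefixes each distinct line with the number of times it occurred.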