Manipulating Files and Processing Text


Topics:

  • Basic text processing with split, join, and partition
  • Text testing with endswith(), startswith(), find()
  • Text conversion with swapcase(), replace(), upper(), and lower()
  • Opening and closing filehandles
  • Reading from the filehandle with read(), readline(), and readlines()
  • Reading from the filehandle iterable
  • Writing or appending to a file with write() and writelines()
  • Writing to a file with a loop

Today You Will:

Learn to read, write, and append to files. You will also learn basic text processing with Python string methods.


Basic Text Processing


Systematically manipulating large text files is one of the most common tasks you will encounter. The most basic tools for this task are the built-in Python string methods. These allow us to convert between strings and lists, test the properties of strings, and modify strings.

Split()


Let's consider the task of breaking a sentence, stored as a character string, into a list of pieces at its spaces or punctuation marks:

#!/usr/bin/python
 
delimiter = ","
string_to_split = "I am a well-written sentence, and so I dependably have punctuation"
list_from_string = string_to_split.split(delimiter)
print "clause one %s" % list_from_string[0]
print "clause two %s" % list_from_string[1]

Note that as we've split with a comma, the comma doesn't appear in our list. We can try out what happens with different arguments to split.

# we don't need to specify the delimiter in a different variable
 
list_from_string = string_to_split.split(' ')
for word in list_from_string:
     print word
 
list_from_string = string_to_split.split('a')
for vowel_handicapped_lump in list_from_string:
     print vowel_handicapped_lump

You might also want to take a string and turn it letter-by-letter into a list. Although this isn't done by split, it fits nicely here:

list_from_string = list(string_to_split)
for letter in list_from_string:
     print letter

Split also can take a second argument (see, as always, the string method documentation): you can specify how many times you want to chop.

list_from_string = string_to_split.split(' ', 3)
for item in list_from_string:
     print item

Now let's see what happens when two delimiters are next to each other:

list_from_string = string_to_split.split('t')
for consonant_crippled_lump in list_from_string:
     print consonant_crippled_lump

We can see that we have a blank item in our list-- "written," in particular, was split into three parts: ["...wri","","en..."]. If delimiters are adjacent to each other, split will find the empty string between them and give it to you at the appropriate spot. It's a very one-hand-clapping-in-a-forest sort of thing.
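To make the empty-string behavior concrete, here is a minimal sketch using a made-up comma-separated record (the data and variable names are illustrative, not part of the lesson files). It uses print with parentheses so that it runs under both Python 2 and Python 3.

```python
# a made-up record with one empty field between two adjacent commas
record = "gene1,,chr2,1500"

fields = record.split(',')
# the empty field is kept as an empty string, so the positions stay aligned
print(fields)

# if the empty items are not wanted, filter them out afterwards
non_empty = [field for field in fields if field != '']
print(non_empty)
```

Keeping the empty item is usually what you want for tables: the fourth column stays the fourth column even when the second one is blank.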

However, there is an exception to this. If you glanced at the split documentation, you might have noticed that all of its arguments are, in fact, in brackets. That means that it doesn't need arguments to run: it has a default behavior.

# this should look the same as splitting by spaces
list_from_string = string_to_split.split()
for item in list_from_string:
     print item
 
# this is not the same as splitting by spaces-- no empty items!
string_to_split = "   this      is    a   different                         string"
list_from_string = string_to_split.split()
for item in list_from_string:
     print item
 
string_to_split = '''   complete
\t\t whitespace                      chaos
             !!!!!!!!!!!         '''
list_from_string = string_to_split.split()
for item in list_from_string:
     print item

We see that the default behavior of split is to:
  1. Remove all kinds of whitespace from the beginning and end of the string.
  2. Condense all adjacent whitespaces to single space characters.
  3. Split on those spaces.
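The difference between the default behavior and an explicit space delimiter is easiest to see side by side. This is a small sketch on a made-up messy line (the string and names are only for illustration), with print in parentheses so it runs under Python 2 and 3 alike.

```python
messy = "  chr1 \t 100   200\n"

# explicit single-space delimiter: every space counts, so the result has
# 8 pieces, several of them empty strings, plus a stray tab and newline
by_space = messy.split(' ')

# no argument: surrounding whitespace is dropped and runs are condensed
by_default = messy.split()

print(by_space)
print(by_default)
```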

This turns out to be really handy. For instance, if you're using someone else's table, and, as happens more often than you might like to think, they've done a poor job delimiting their fields systematically with whitespace, this cleans things up quickly and easily in just one line.

You'll learn to extend this power of whitespace to other characters, sets of characters, and all sorts of exotic delimiters when we get to regular expressions.

The split method being popular, it has a few hangers-on:

toes = '''went to the market
stayed home
had roast beef
had none
cried wee wee wee all the way home'''
 
# splitlines splits on linebreaks
list_from_string = toes.splitlines()
for toe in list_from_string:
     print "this little piggy %s" % toe
 
# rsplit works from the end of the string
last_toe = "and _this_ little piggy went wee wee wee all the way home"
list_from_string = last_toe.rsplit(' ',7)  # when given a second argument, rsplit counts its splits from the right
 
for item in list_from_string:
     print item

And lastly, I would like to introduce the string method partition. This works a lot like split(delimiter,1) -- it takes a delimiter and chops at the first instance. However, while split(delimiter,1) will return either a list of length two (if it split successfully) or a list of length one (if it didn't), partition will always return a tuple of length three: the text before the delimiter, the delimiter itself, and the text after it. Let's look at the output.

rhyme = '''There was a crooked man
Who walked a crooked mile.
He found a crooked sixpence
Against a crooked stile.
He bought a crooked cat
Which caught a crooked mouse,
And they all lived together
In a crooked little house.'''
 
# you can split on words as well as single letters and symbols
split_list = rhyme.split('crooked',1)
 
print "List output:"
for item in split_list:
     print item
 
partition_list = rhyme.partition('crooked')
print "Partition output:"
for item in partition_list:
     print item

What if the delimiter isn't there?

split_list = rhyme.split('happiness',1)
# I mean, this is like the nursery-rhyme
# equivalent of hangin' under the bart tracks in
# west Oakland.
 
print "List output:"
for item in split_list:
     print item
 
partition_list = rhyme.partition('happiness')
print "Partition output:"
for item in partition_list:
     print item

This can be useful if you are looking for that second item, but you're not sure if it's going to be there-- the string could be user generated or read in from a file, and you want to gracefully do one thing if it's there and another if it's not. Split can be less than graceful about this:

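# split: the [1] lookup raises an IndexError when the delimiter is missing
if rhyme.split('happiness')[1]:
     print "found it"   # if it's there you're all good
else:
     pass   # if it isn't there, you never get here: the [1] above has already crashed your program
 
# vs
# partition: the [2] lookup is always safe; it is just an empty string when the delimiter is missing
if rhyme.partition('happiness')[2]:
     print "found it"   # parse the wanted information out of it
else:
     pass   # wait until the next line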
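As one possible way to use this in practice, here is a sketch of a small helper that pulls a key and a value out of 'key=value' lines. The function name parse_setting and the sample lines are hypothetical, made up for this illustration. It unpacks the three items partition returns and checks the middle one, which is empty only when the delimiter was not found.

```python
def parse_setting(line):
    """Return (key, value) for 'key=value' lines, or None if there is no '='."""
    key, separator, value = line.partition('=')
    if separator == '':
        # partition found no delimiter: key holds the whole line
        return None
    return key.strip(), value.strip()

print(parse_setting("evalue = 1e-5"))
print(parse_setting("this line has no setting"))
```

Testing the middle item rather than the last one also handles a line such as 'evalue=' correctly: the delimiter is there, even though nothing follows it.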

Join()


So now we're pretty good at splitting things up, but how do we put things together again? Join takes care of that: it turns lists into strings. Surprisingly enough, it's not a method of lists. It's a string method, and it relies on the delimiter to know how to put lists together.

broken = ['hu','m','pty',' du','mpty']
all_the_kings_horses = 'n~n*^'
all_the_kings_men = '>+O'
first_try = all_the_kings_horses.join(broken)
second_try = all_the_kings_men.join(broken)
if (first_try == 'humpty dumpty') or (second_try =='humpty dumpty'):
     print 'hooray!'
else:
     print '''All the king's horses and all the king's men
  couldn't put Humpty together again'''

Join can also usefully be called on the empty string-- it glues the components of the list directly together. (Split, by contrast, refuses an empty delimiter and raises an error.)

third_try = ''.join(broken)
print third_try
# 'nothing' can put poor Humpty together again

This is in fact the usual way to use join-- you don't need to declare a separate variable to act as the glue.

fairy_tale_characters = ['witch','rapunzel','prince']
plot = 'hair'.join(fairy_tale_characters)
print plot
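Split and join are often used as a pair, for example to change the delimiter of a line. The sketch below converts a made-up comma-separated line to a tab-separated one, and also shows that join only accepts strings, so numbers have to be converted first. All names and data here are illustrative.

```python
csv_line = "witch,rapunzel,prince"

# split on the old delimiter, then join with the new one
names = csv_line.split(',')
tab_line = '\t'.join(names)

# join raises a TypeError on non-strings, so convert numbers with str() first
counts = [3, 1, 4]
count_line = ','.join([str(count) for count in counts])

print(tab_line)
print(count_line)
```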

Testing Text


We just saw how you can use an if statement to test for the presence of a delimiter with partition(). There are other tests you will often be interested in, for example asking if a string begins with, ends with, or contains a substring of interest.

id_number = '1131431a'
 
# let's see if the id_number string starts with the number one
 
if (id_number[0] == '1'):
    print "this id starts with a 1!"
else:
    pass
 
# now let's use the string method startswith()
 
if ( id_number.startswith('1') ):
    print "this id starts with a 1!"
else:
    pass
 
# and here's the endswith() method
if ( id_number.endswith('1') ):
    print "This id number ends with a 1!"
else:
    print "This id number doesn't end with a 1 at all!"
 
# and these methods can get a little fancier: give them a tuple of strings and they test for any one of them
if ( id_number.endswith( ('1', 'a') ) ):
    print "this id number ended with either an 'a' or a '1' "
else:
    pass

Or maybe we don't care what the string starts or ends with as long as it contains a substring of interest. For this, we can use the find() method, which returns the index at which the substring starts. But be careful when you write if tests using find(): when the substring is missing it returns the integer -1, which is not zero and so passes the if test as True, while a match at the very beginning of the string returns 0, which fails it.

beatles = "johnpaulgeorgeandringo"
 
# the wrong way
if ( beatles.find('george')):
    print "At least we've got a bassist."
else:
    print "Anyone gotta bass guitar handy?"
 
# let's do a comparison for -1 instead
if beatles.find('george') != -1:
    print "At least we've got a bassist"
else:
    print "Well, I guess we're a three piece."
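To see why the bare find() test misleads, it helps to look at the numbers it returns. The sketch below prints three cases and then shows the in operator, which is a simpler choice when you only need to know whether the substring is present and not where it is. The name 'pete' is just an example of a string that is not in the list; print is written with parentheses so the snippet runs under Python 2 and 3.

```python
beatles = "johnpaulgeorgeandringo"

print(beatles.find('john'))    # 0: found at the very start, but 0 tests as False
print(beatles.find('george'))  # 8: found, and tests as True
print(beatles.find('pete'))    # -1: not found, yet -1 tests as True

# when only presence matters, the 'in' operator gives a plain True or False
has_george = 'george' in beatles
has_pete = 'pete' in beatles
print(has_george)
print(has_pete)
```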

Text Conversions


Systematically replacing the instances of a substring with a replacement substring may be a familiar task of tedium. Python has several methods for systematically converting characters in strings. The most general is the method replace().

beatles = 'johnpaulgeorgeandringo'
beatles = beatles.replace('george', 'MATT')
 
#YES! I'm in!
 
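beatles = beatles + "MOREMATT!"
# replace() returns a new string and leaves the original untouched, so print (or store) the result
print beatles.replace("MATT", "RICH!")
 
# and we can tell replace how many replacements to make, starting at the beginning
 
print beatles.replace("MATT", "RICH!", 1)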
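Because strings cannot be changed in place, replace() always hands back a new string and the original stays as it was. Here is a short sketch of that behavior on a made-up DNA sequence (the sequence and variable names are illustrative only), including the optional count argument.

```python
dna = "ATGTTTTAA"

# every T becomes a U; the result is a brand new string
rna = dna.replace("T", "U")

# with a count of 1, only the first T is replaced
first_only = dna.replace("T", "U", 1)

print(dna)         # the original is unchanged
print(rna)
print(first_only)
```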

Since Python is case sensitive, as are most UNIX-based bioinformatics programs you'll be interested in using, you may also find yourself wishing that all the text in your data was the same case. There are methods for both testing and converting cases.

# why not use something a touch relevant for a change
blast_hit = 'ACTGTCAGTACGTAGCATCGAaaatCGATCGACTGAatacgatCG'
 
if ( blast_hit.isupper() ):
    pass
else:
    blast_hit = blast_hit.upper()
 
# or if you prefer lower case
 
blast_hit = blast_hit.lower()
 
# or if you are (or the program you're writing is) indecisive
 
blast_hit = blast_hit.swapcase()
 
# and we might also be interested in these methods
 
if ( blast_hit.isalpha() ):
    print "we got all letters here"
else:
    print "whoa, something doesn't look like nucleotides!"
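Note that isalpha() only checks that every character is a letter, so a string like 'HELLO' passes it too. If you want to check for nucleotides specifically, one possible approach is to compare the set of characters against the allowed letters. The helper name looks_like_dna is hypothetical, and the sketch assumes only A, C, G, and T count as valid (no N or other ambiguity codes).

```python
def looks_like_dna(sequence):
    # True when the sequence is non-empty and uses only the letters A, C, G, T
    # (upper() makes the test ignore case)
    return len(sequence) > 0 and set(sequence.upper()) <= set("ACGT")

blast_hit = 'ACTGTCAGTACGTAGCATCGAaaatCGATCGACTGAatacgatCG'

print(looks_like_dna(blast_hit))
print(looks_like_dna('HELLO'))
print('HELLO'.isalpha())
```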

Files and Filehandles


Now that we can process text, all we need is... more text. And odds are, that text is going to come in the form of a file, so it's high time that we start using files.

Opening filehandles and read()


A filehandle is an object that controls the stream of information between your script and a file stored somewhere on the computer. Filehandles are not filenames, and they are not the files themselves. They are a tool that your scripts use to interact with files, nothing more (for instance, deleting a filehandle in your script using the del statement does nothing to the file that handle refers to).

In the simplest case, we create a filehandle with the open() function:

fh = open('some_file')

where some_file is the path to a file on your filesystem. In general, it is good practice to use absolute path nomenclature (e.g. /Users/matt/some_file or /home/matt/some_file), but you can be lazy if you know the file you want is going to be in the same directory as the script.

#!/usr/bin/env python
 
fh = open('hello.py')
contents = fh.read()
print contents
fh.close()

$ ./hello.py
#!/usr/bin/env python

fh = open('hello.py')
contents = fh.read()
print contents
fh.close()

As you can see, the read() method of the filehandle just sucks in the whole file in a single string, newlines and all! This is quick and easy, for sure, but it's not necessarily the easiest way to deal with the contents of a file in an orderly fashion.

readline() and readlines()


Copy the contents of the following snippet to a text file in your directory for this session, and call the file pdb_head.

HEADER OXIDOREDUCTASE 08-JUL-97 1AOP
TITLE SULFITE REDUCTASE STRUCTURE AT 1.6 ANGSTROM RESOLUTION
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: SULFITE REDUCTASE HEMOPROTEIN;
COMPND 3 CHAIN: A;

Save it, and then try the following:

#!/usr/bin/env python
 
filename = 'pdb_head'
fh = open(filename, 'r')
#the 'r' is for 'read-only', which will keep us from being able to alter this file with the filehandle we just created
 
print fh.readline()
print fh.readline()
 
lines = fh.readlines()
 
fh.close()
 
print lines

$ ./hello.py
HEADER OXIDOREDUCTASE 08-JUL-97 1AOP

TITLE SULFITE REDUCTASE STRUCTURE AT 1.6 ANGSTROM RESOLUTION

['COMPND MOL_ID: 1; \n', 'COMPND 2 MOLECULE: SULFITE REDUCTASE HEMOPROTEIN; \n', 'COMPND 3 CHAIN: A; \n']

While this is a bit of a mess, a few things should become apparent:
  1. fh.readline() takes in one line, newline character and all! (And since print also supplies a newline, we've got an extra linebreak after each of the first two print statements.)
  2. fh.readlines() (plural!) takes the rest of the file, from the current read position all the way to the end, giving back a list of lines (again, with newlines intact).
  3. This file has a bunch of whitespace cluttering things up at the end of each line.

All of these complications are easily resolved with the use of the strip() method whenever we actually make use of the lines thus read:

#!/usr/bin/env python
 
filename = 'pdb_head'
fh = open(filename, 'r')
 
print fh.readline().strip()
print fh.readline().strip()
 
lines = fh.readlines()
 
fh.close()
 
lines[0] = lines[0].strip()
 
print lines

$ ./hello.py
HEADER OXIDOREDUCTASE 08-JUL-97 1AOP
TITLE SULFITE REDUCTASE STRUCTURE AT 1.6 ANGSTROM RESOLUTION
['COMPND MOL_ID: 1;', 'COMPND 2 MOLECULE: SULFITE REDUCTASE HEMOPROTEIN; \n', 'COMPND 3 CHAIN: A; \n']

Now the spaces and newlines are gone from the first two, and from the 0th element of the list I printed in the last print statement (since I only bothered to strip and put back the 0th element).

One crucially important concept of file input in Python is that each time you read something by any of the three methods I've described, you advance the position of the filehandle in the file, which means that you never get the same character or characters twice (unless of course they're in the file twice!).

This is why reading from the filehandle with fh.readline() twice in a row gave two different values; as soon as the line is read, the filehandle has moved to the next line, awaiting another read request. This is an example of an iterable type, meaning that the filehandle is a type of object that knows how to advance itself in anticipation of the next request. That means that to get back to the beginning of the file, you must either close the file with the close() method and reopen it, or use the seek() method of the filehandle (which we don't have time to go into -- Google is your friend!).

While potentially a bit odd now, this behavior will be essential when we discuss reading file contents with loops.
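For the curious, here is a minimal sketch of the read position in action. To keep it self-contained it first writes a small throwaway file in a temporary directory (the filename two_lines.txt and its contents are made up for this example), then reads it back with readline() and read(), and finally rewinds with seek(0).

```python
import os
import tempfile

# write a small throwaway file so the example does not depend on other files
temp_dir = tempfile.mkdtemp()
path = os.path.join(temp_dir, 'two_lines.txt')
fh = open(path, 'w')
fh.write('first line\nsecond line\n')
fh.close()

fh = open(path, 'r')
first_read = fh.readline()   # reads 'first line\n' and advances
rest = fh.read()             # picks up where readline() stopped
at_end = fh.read()           # nothing left: returns the empty string
fh.seek(0)                   # move the read position back to the start
after_seek = fh.readline()   # the first line again
fh.close()

print(repr(first_read))
print(repr(rest))
print(repr(at_end))
print(repr(after_seek))
```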

Reading files in a loop


Certainly one of the most common contexts in which you'll encounter for loops is in working your way through a file. You can just put together two things I've already shown you to get to where you need to be:

#!/usr/bin/env python
 
fh = open('pdb_head')
lines = fh.readlines()
for line in lines:
    fields = []
    fields.append(line[0:6].strip())
    fields.append(line[6:10].strip())
    print '0th field: %s, 1st field %s' % (fields[0],fields[1])

$ ./hello.py
0th field: HEADER, 1st field
0th field: TITLE, 1st field
0th field: COMPND, 1st field
0th field: COMPND, 1st field 2
0th field: COMPND, 1st field 3

This is starting to get a little fancier, but we're only doing things you've seen before: read all the lines in a file into a list, then iterate over the list, slicing a couple of different parts out of each line, stripping off leading and trailing whitespace, then printing the first and second elements of the resulting list. Note that the slices depend on the fixed-width column layout of the file, so the spacing in pdb_head has to be preserved exactly for your output to match.

We can simplify this one more step using the fact that filehandles are iterables, and know what's being asked of them. So we can replace this:

lines = fh.readlines()
for line in lines:

with

for line in fh:

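to the same effect, with the added benefit that the file is read one line at a time rather than loaded into memory all at once.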
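Put together, the loop over the filehandle might look like the sketch below. So that it runs on its own, it writes a two-line file with fixed-width columns into a temporary directory first; the filename pdb_like and the exact spacing of its two lines are assumptions made for this illustration, not a copy of pdb_head.

```python
import os
import tempfile

# build a small fixed-width file to loop over
temp_dir = tempfile.mkdtemp()
path = os.path.join(temp_dir, 'pdb_like')
fh = open(path, 'w')
fh.write('HEADER    OXIDOREDUCTASE\n')
fh.write('COMPND   2 MOLECULE: SULFITE REDUCTASE HEMOPROTEIN;\n')
fh.close()

records = []
fh = open(path)
for line in fh:                        # one line per pass, no readlines() needed
    record_name = line[0:6].strip()    # first six characters
    continuation = line[6:10].strip()  # next four characters, often blank
    records.append((record_name, continuation))
fh.close()

print(records)
```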

Writing to Files


Writing output is sorta like doing the dishes. You just did all this work to cook up a fancy program and analyze some data, and the last thing you want to do is put all your answers away into clean little output files. Fortunately, we'll learn about pickling files later, but for now, we'd best make sure you know how to write output to a file.

The default behavior of open() is to open the file supplied in read mode. However, by giving an additional argument, you can either overwrite the specified file entirely ('w') or add lines to the bottom of it ('a'):

#!/usr/bin/env python
 
filename = 'test_out'
fh = open(filename, 'w')
# 'w' flag means "write": if the file already exists, it is emptied first
 
fh.write('Last year at this time, I had just lacerated my cornea, and Brant used this exercise to make fun of me.\n')
# note that we have to add the '\n' if we want it at the end of the line; this is in contrast to the print command's behavior.
 
fh.close()
 
filename = 'test_out2'
fh = open(filename, 'a')
# 'a' flag means "append"
 
fh.write("At least I stole this section from him and updated it for this year's class.\n")
 
fh.close()

While this script doesn't print anything to the screen, if you run it a few times and look at the contents of test_out vs test_out2, the distinction between the 'w' and 'a' arguments to open() should become quite clear.
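If you would rather see the difference in one go, the sketch below imitates running such a script three times in a row. It writes to two files in a temporary directory (so nothing in your working directory is touched; the paths and the 'run N' text are made up for the example) and then reads both back.

```python
import os
import tempfile

temp_dir = tempfile.mkdtemp()
w_path = os.path.join(temp_dir, 'test_out')
a_path = os.path.join(temp_dir, 'test_out2')

# pretend the script is run three times in a row
for run in range(3):
    fh = open(w_path, 'w')      # empties the file each time
    fh.write('run %d\n' % run)
    fh.close()

    fh = open(a_path, 'a')      # keeps what is there and adds to the end
    fh.write('run %d\n' % run)
    fh.close()

fh = open(w_path)
w_contents = fh.read()
fh.close()

fh = open(a_path)
a_contents = fh.read()
fh.close()

print(w_contents)   # only the last run survives
print(a_contents)   # all three runs are there
```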

When reading files, the close() method is a good thing to keep in mind, but if you forget it, Python will close the file at the end of the script's execution. With writing files, however, Python may not make the changes you stipulate right away, so if you plan to evaluate the contents of the file you're writing in the same script (or for instance use that file for something else during the run of that script) it is wise to close the filehandle to ensure that all the write operations you've requested are performed.
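One way to avoid forgetting close() altogether is the with statement, which closes the filehandle for you when the indented block ends. This is a sketch that assumes Python 2.6 or later (it also works in Python 3); the temporary path and the text written are made up for the example.

```python
import os
import tempfile

temp_dir = tempfile.mkdtemp()
path = os.path.join(temp_dir, 'test_out')

with open(path, 'w') as fh:
    fh.write('written inside the with block\n')
# leaving the indented block closed the filehandle, so the write is on disk

closed_after_block = fh.closed

with open(path) as fh:
    contents = fh.read()

print(closed_after_block)
print(contents)
```

The explicit open() and close() calls used elsewhere in this lesson do the same job; the with form just makes it harder to leave a file open by accident.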

While Python has no writeline() method, the other two read methods are mirrored for writing to files. The first, write(), you've already seen. It takes a string and puts it in a file. The only difference between this and writelines() is that writelines() takes a list of strings and writes them all. (But beware! If you want those strings to appear on separate lines, they had best all end with a \n!)

#!/usr/bin/env python
 
filename = 'test_out'
fh = open(filename, 'w')  # 'w' flag means "write": an existing file is emptied first
 
lines = ["Last year Brant's line said\n", "Poor Matt.\n","Poor Matt? Poor me!\n", "Well, at least I can still see.\n"]
 
fh.writelines(lines)
 
fh.close()

And check out the contents of test_out to see your many-line-writing machine in action!
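When each line needs a little formatting before it goes out, a for loop around write() is often handier than building the whole list for writelines() first. Here is a minimal sketch, writing a numbered, tab-separated list of names to a file in a temporary directory; the filename names_out and the choice of columns are assumptions made for this example.

```python
import os
import tempfile

temp_dir = tempfile.mkdtemp()
path = os.path.join(temp_dir, 'names_out')

names = ['Terry', 'Rose', 'Aaron']

fh = open(path, 'w')
for position, name in enumerate(names):
    # write() adds nothing on its own, so the tab and newline are spelled out
    fh.write('%d\t%s\n' % (position, name))
fh.close()

# read the file back to check what was written
fh = open(path)
written = fh.read()
fh.close()

print(written)
```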


Exercises


1. Pile of basic split drills:

  • Turn 'Humpty Dumpty sat on a wall' into ['Humpty','Dumpty','sat','on','a','wall']
  • Turn 'Humpty Dumpty had a great fall' into ['Humpty Dumpty had a ',' fall']
  • Turn "All the King's horses" into ["All the King's hor",'e','']
  • Turn "and all the King's men" into ['and a',''," the King's men"] (note: there is a space at the beginning of ' the King's men')
  • Turn "couldn't put Humpty together again" into 'again' (using one line)

2. Pile of basic split, join, and replacement drills:

  • Turn ' Brant Rich Matt\n' into 'Terry\tRich\tMatt'
  • Turn 'Terry,Rose,Aaron' into 'TERRY\tROSE\tAARON\t'

3. Using the names of all seven instructors and TAs, write each possible pair of names to a file, separated by a line of hyphens (i.e. '-----------------')

4. Reopen the last output file, and read in the file, then write the lines back out in reverse order, in all capital letters.

5. For each file you created, use a loop to copy the content of the files to new files with the extension ".out". Don't worry about deleting the old files, but do empty all the data from them such that they take no disk space.