Numerical Python (NumPy)
Topics
Conversion between standard Python data types and NumPy data types
Introduction to the numpy, matplotlib, scipy, and pylab packages
Revisiting basic statistical analysis with NumPy tools
Comparing the performance of vector math and for loops
Today You Will
Learn a new set of data types and syntax designed to solve
mathematical problems. We will see how to create and convert these
new data types, as well as reanalyze our data from yesterday
afternoon. The last section will demonstrate the ease and speed of
vectorized data types from NumPy.
NumPy Basics
Numerical Python is a powerful library of functions, methods, and data
types we can use to analyze our data. Unfortunately for those of us
whose heads continue to spin in a crash course of syntax, it also uses
a different set of rules. I hope you'll understand why when you see
the power and speed NumPy's data types afford us.
Let's start off by creating some empty arrays, which look sorta like
lists but are in fact vectors.
They differ in a few fundamental ways from lists:
1) Arrays cannot be of mixed types. They can be all integers, all
floats, or even all characters, but they cannot be some characters,
some numbers, etc.
2) Arrays can be multidimensional, but they are not sparse structures.
3) We can perform vector operations on them, which can be algebraic
functions (like a dot product) or simple replacements of values in a
slice of the array (see the quick sketch after this list).
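Before we dive in, here's a quick sketch of all three differences in action (a little demo of my own, using throwaway variables):
import numpy as np
# 1) No mixed types: ints and floats get coerced to a common type.
np.array([1, 2, 3.5])        # -> array([ 1. ,  2. ,  3.5])
# 2) Arrays can be multidimensional (here, 2 rows by 3 columns).
m = np.array([[1, 2, 3], [4, 5, 6]])
m.shape                      # -> (2, 3)
# 3) Vector operations touch every element at once.
v = np.array([1.0, 2.0, 3.0])
v * 2                        # -> array([ 2.,  4.,  6.])
np.dot(v, v)                 # -> 14.0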
Arrays
Here's one way: start with a list and convert it:
import numpy as np
a = [0] * 40
a = np.array(a)
Or this can be shortened:
a = np.array([0] * 40)
But there's a better way to get a vector of zeros:
a = np.zeros(40)
Notice that the default type when declaring an array is float64:
type(a[0])
Out[35]: <type 'numpy.float64'>
And here's the simplest way to change that:
a = np.zeros(40, int)
type(a[0])
Out[37]: <type 'numpy.int32'>
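You can also convert an existing array after the fact with astype, which returns a new copy (a quick aside, not strictly needed for today):
b = a.astype(float)    # a stays int; b is a new float array
type(b[0])             # -> <type 'numpy.float64'>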
And here's how to declare something that's not all zeros:
a = np.arange(40)
Notice the int type:
type(a[0])
How can we change that?
a = np.arange(40, dtype='f8')
Some additional functionality of arange(): note that when it is given
a float increment, it defaults to a float type:
a = np.arange(40, 50, .25)
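Just to double-check what that gave us (a quick verification of the claim above):
a.dtype     # -> dtype('float64')
a[0:5]      # -> array([ 40.  ,  40.25,  40.5 ,  40.75,  41.  ])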
Okay, now for a vector with more than one dimension:
a = np.array([np.arange(40), np.arange(40), np.arange(40)])
a
Out[111]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])
This array has three sub-arrays, which we can address like so:
a[1]
Out[115]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39])
Or like so, in row:column form:
a[1,:]
Out[120]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39])
a[1,0:10]
Out[121]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Now, let's modify one of our arrays:
a[1] = a[1] * 2
a
Out[124]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78],
       [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])
Note the second array is changed. This is the first great power of
vector operations: we just changed every element of a data
structure without a for loop.
Now let's look at a slice of our sub-arrays: all rows, columns 0 up to
(but not including) 10:
a[:,0:10]
Out[126]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18],
       [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9]])
And we can change them all at once, if we are so inclined:
a[:,0:10] = 5
a
Out[127]:
array([[ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78],
       [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]])
Let's stop for a moment and contemplate how we would do this with a list of lists and a for loop.
[pause to think]
The vector math, i.e. array, syntax is much simpler and, as it turns out, much, much faster for these operations.
Onward!
We can do tons of multiplication too; for instance, multiplying the all-fives block by itself:
a[:,0:3] * a[:,0:3]
Out[145]:
array([[25, 25, 25],
       [25, 25, 25],
       [25, 25, 25]])
And let's look at these data type specifications a little more closely.
Take a second to move your data file (the all400.tab.txt file from yesterday morning) to your current working directory.
And we'll reload our data from yesterday:
results = []
dataFile = 'all400.tab.txt'
fh = open(dataFile, 'r')
for line in fh:
    line = line.strip().split()
    results.append( [int(line[0]), float(line[1])] )
fh.close()
results = results[0:399]
We now have a list of lists called results, so let's make it an array:
a = np.array(results)
Note that both vectors in the array are formatted to dtype float64. Remember, in an array, all data are the same type.
Remember yesterday when we wanted to find the mean for each third of the dataset to compare the values for the first, second, and third frames?
Let's do that again here. The arange gets us every third index of the array slice a[:,1] (all rows, second column):
np.mean(a[np.arange(0, len(a), 3), 1])
And just to be fancy and combine things we're learning, here's a list comprehension to do all three:
[np.mean(a[np.arange(i, len(a), 3), 1]) for i in range(3)]
[1522.0474829752695, 647.63184770651594, 1219.8208930149849]
Record Arrays
Just for a minute, we're going to break the sacred rules and talk
about the way things used to be. Before you took this class. Before,
let's say, your enlightenment. Back when you used MS Excel to look at
data. Unlike Python arrays, Excel lets you mix data types and sort
things by column name. Well, if you get into a dataset that really
makes you wanna use names and feel like you have a spreadsheet or
database, then record arrays are for you. In this first example,
you'll see that they're sloppy to set up, but in a moment, we'll make
that easier for you.
The following line of code establishes a record array. It says to
make a tuple out of each entry in our results list, type them as
int and float, and then insert them into two named columns, one for
position and one for score.
x = np.array([tuple(results[i]) for i in range(len(results))], dtype=[('pos', int), ('score', float)])
Now, we can access the data with this notation:
x['pos'][0:5]
# or
x['score'][0:5]
Record arrays allow us to mix data types and to keep track of things
by name, which can be especially helpful when you have lots and lots
of data columns, or, say, lots of columns with nearly identical
names.
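As for the easier setup promised above, one option (a minimal sketch, not necessarily the exact method we'll use in class; it assumes the whitespace-delimited, two-column all400.tab.txt file from before) is to let np.loadtxt build the record array straight from the file:
import numpy as np
# a structured dtype makes loadtxt return one record per row
x = np.loadtxt('all400.tab.txt', dtype=[('pos', int), ('score', float)])
x['pos'][0:5]
x['score'][0:5]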
Different Packages for NumPy and Matplotlib
There are several packagings of the libraries we're using today. The
two originals are NumPy and (as we'll see all afternoon)
Matplotlib. There's no reason why you cannot simply import either of
these libraries on its own, but some thought has gone into which
pieces of the two projects integrate together cogently. The result is
SciPy and Pylab, both of which attempt to organize the former two
libraries in a user-friendly way.
SciPy is a package that most of you installed in order to have access
to NumPy and Matplotlib. Pylab is a special packaging of these
libraries made for iPython. It imports different parts of the
libraries into different levels of the namespace in a way that the
authors think makes sense.
There's a certain amount of individual preference in choosing how you
want to import things, so the only real rule/suggestion I'll lay down
is this: if you're going to use iPython, then it's a great idea to
have pylab imported. Such a good idea, in fact, that iPython lets you
issue a command line argument to import everything before you even
start:
$ ipython -pylab
In addition to loading the libraries, this also changes some other
default settings to allow you to graph things more easily and run
simultaneous subprocesses from iPython.
Basic Statistics with NumPy
Okay, we've already revisited part of our work yesterday, using the
vector indices to slice our data and calculate means. But now, we are
free. We don't need to make functions for things like mean and
variance anymore, because we have NumPy to guide us. Note that in
these examples, iPython has been launched with the -pylab flag, which
allows us to access many more functions with no module prefix.
# reload our data
results = []
dataFile = 'res.orfU1000.tab'
fh = open(dataFile, 'r')
for line in fh:
    line = line.strip().split()
    results.append( [int(line[0]), float(line[1])] )
fh.close()
results = results[0:379]
a = array(results)
Now let's revisit calculating the statistics:
# calculate the mean
mean(a[:,1])
# or the median
median(a[:,1])
# or the variance
var(a[:,1])
# or the standard deviation, two ways
sqrt(var(a[:,1]))
std(a[:,1])
In addition to basic statistics (and linear algebra, numerical
integration, and time-series analysis), we also have access to
probability distributions. We can use these to generate random
numbers, which can be used to check our data against random models.
Let's use these tools now as an excuse to learn how to use
correlation as well.
#check out the help for the uniform distribution function
uniform?
We see that uniform() takes three arguments: the low end of the range,
the high end, and the number of samples to draw.
b = a[:,1] + uniform(0,1000,len(a[:,1]))
Now the help for corrcoef, followed by the call itself:
corrcoef?
corrcoef(a[:,1], b)
The corrcoef() function returns a 2x2 correlation matrix, and its
off-diagonal entries show a 0.95+ correlation, so we know we didn't
perturb our data much, at least not in a non-linear way. So, let's
try a different distribution:
exponential?
c = a[:,1] + exponential(1000,len(a[:,1]))
corrcoef(a[:,1], c)
Now, that's pretty different. Let's get ready for our first graph:
# uniform correlation
scatter(a[:,1], b)
#exponential correlation
scatter(a[:,1], c)
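One caveat (an added note): if you didn't launch iPython with -pylab, the plot window won't pop up on its own, and you'll need one extra call:
scatter(a[:,1], c)
show()    # displays the figure outside pylab's interactive mode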
Performance with Vector Math
I kept saying that arrays were much faster at math than other Python
data structures, but I haven't really demonstrated this yet. Even
with the simple example below, we will not need the Python profiler
to see the effect. Let's go ahead and do this example in a script, so
that we can time things.
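Here's a plain-list version to put in test.py (a minimal sketch; your script may differ, but the arithmetic mirrors the array version below):
import random
# square ten million random floats using a list and a for loop
a = [random.uniform(0, 1) for i in range(int(10e6))]
b = []
for x in a:
    b.append(x ** 2)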
$ time python test.py
And now the same exact math, but with arrays:
[[code format="python"]]from pylab import *
a = uniform(0,1,10e6)
b = a ** 2
$ time python test2.py
Exercises
1) Take the two datasets from yesterday afternoon (the one from the
morning exercise, and the one with the integer values instead) and make
a scatter plot of the two. Make sure you are converting your data
to arrays, not just using lists.
2) Using the same data, create a record array with the column "position"
and the values of the read scores stored at each position.