Numpy arrays are more efficient than lists because all elements of a numpy array are of the same pre-determined type. Numpy also provides efficient ufuncs (universal functions): vectorized functions whose loops over array elements are pre-compiled in C. Numpy also exhibits some syntactic features that a mathematician may consider a nuisance, but knowing how to work with numpy is key for almost all scientific computation in python. It takes some practice to be comfortable with numpy, and practice is what we are aiming for in this activity. This activity is structured as a list of questions. You will be in a position to appreciate these questions (and their answers) after going through this lecture's prerequisites on numpy yourself.
import numpy as np
import math
A = [0.1, 1.3, 0.4, 0.5] # list
a = np.array(A) # numpy array
type(a), type(A)
Here is how you find out the common data type of elements of a numpy array (and there is no such analogue for list, since list elements can be of different types).
a.dtype # a's common element type (A.dtype is undefined!)
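As a quick hedged aside (not one of the original questions): if a list mixes integers and floats, numpy picks one common dtype for the whole array, typically by upcasting. The example below just illustrates this.
np.array([1, 2, 3]).dtype, np.array([1, 2.5, 3]).dtype  # typically (int64, float64); the integer size can vary by platform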
What is the difference between 2*a and 2*A?
2*a
2*A
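If you want to check your guess right away, here is a minimal sketch (with a and A exactly as defined above); the expected outputs are indicated in the comments.
print(2*a)  # [0.2 2.6 0.8 1. ] -- elementwise multiplication
print(2*A)  # [0.1, 1.3, 0.4, 0.5, 0.1, 1.3, 0.4, 0.5] -- list repetition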
Suppose we want to compute $\sin(x_i)\, e^{-x_i}$ for every entry $x_i$ of an array x with n entries. Here is one option, using a list comprehension:
[math.sin(x[i]) * math.exp(-x[i]) for i in range(n)]
And here is another:
np.sin(x) * np.exp(-x)
Which is better?
n = 100000
x = np.linspace(0, 2*np.pi, n)
# list comprehension
%timeit y = [math.sin(x[i]) * math.exp(-x[i]) for i in range(n)]
# use numpy ufuncs
%timeit y = np.sin(x) * np.exp(-x)
The functions np.sin and np.exp are examples of numpy's universal functions (ufuncs) that act directly on arrays. While np.sin is a unary ufunc, there are many binary ufuncs like np.add: when you write x+y, you are actually calling the binary ufunc np.add(x, y). Ufuncs are vectorized functions.
Vectorization refers to one or both of the following, depending on context:
A convenience feature: Apply an operation to all elements of a collection at once.
A performance feature: Use hardware instruction sets to execute a single instruction on multiple data (SIMD).
A numpy ufunc like np.sin is vectorized in both the above senses: (1) you can apply it directly to an array, thus avoiding python loops (unlike math.sin, which can be applied to just a single value), and (2) numpy's C implementation of np.sin uses some SIMD instruction sets, allowing you to get an automatic speedup when running on hardware that supports the instructions. If the latter still sounds mysterious, here is more explanation than you probably need: most chips today come with a SIMD instruction to process 4 numbers (float64) at once (and fancier chips can do more), so a loop over an array of N floats can finish in N/4 iterations if you utilize that instruction.
To examine the difference between the convenience feature and the performance feature, consider the following function, which is written to apply to just one number:
def f(v): # apply f to one scalar value v
return math.sin(v) * math.exp(-v)
To gain (1), the convenience feature, there are at least two options other than using ufuncs:
a) Use map
A function f acting on a scalar value can be made into a function that acts on a vector of values using the functional programming tool map(f, x), which returns an iterator that applies f to every element of x.
vectorizedf = map(f, x) # apply same f to a vector of values x
b) Use numpy's vectorize
F = np.vectorize(f) # F can be applied to an array x
Both options (a) and (b) provide the convenience feature (1), letting you avoid python loops and write short, expressive code.
However, neither option (a) nor (b) gives you the full performance of numpy's ufuncs.
# use map
%timeit y = list(map(f, x))
# use numpy's vectorize
%timeit y = F(x)
# use numpy's ufunc
%timeit y = np.sin(x) * np.exp(-x)
Is range as efficient as np.arange?
%timeit for x in range(1000000): x**3
%timeit for x in np.arange(1000000): x**3
There was a time (in Python 2) when range was not as efficient, but those times have passed.
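Note that both loops above still visit one element at a time in python, so neither exploits numpy's vectorization. As a hedged aside (not one of the original timing comparisons), the fully vectorized version below lets the ** ufunc loop in C and is typically much faster than either loop:
# cube every element with a single vectorized operation
%timeit y = np.arange(1000000)**3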
A slice a[b:e:s] of a refers to the array of elements from the beginning index b (included) till the ending index e (excluded), in steps of s elements.
The defaults b=0, e=len(a), and s=1 may be omitted in a slice specification.
Negative indices count from the end of the array: a[-1] is the last element of a and a[-k] = a[len(a)-k].
Positive indices count from the beginning as usual.
a = np.random.randint(0,9,5)
a
If you have understood these, then you should be able to say what the expected results are from each of the following statements.
a[::]
a[-1]
a[len(a)-1]
a[-3:]
a[-4:-1:2]
slice = range(-4,-1,2)
a[-4:-1:2], a[slice]
Verify your answers:
a[::]
a[-3:]
a[-1], a[len(a)-1]
a[-4:-1:2]
slice = range(-4,-1,2) # Think of b:e:s specification as a range.
a[-4:-1:2], a[slice] # In older versions, a[slice] may not work
# but will work with slice=arange(-4,-1,2).
Do you know what = does?
a = np.array([1,2,3])
b = np.array([3,4,5,6])
After the assignment a = b, what happens when you change an element of a?
a = b
a[0] = 1
a
We certainly expected the 3 to become 1 in a. Did you also expect the following?
b
If this surprises you, pay close attention to the next question.
In most languages, each variable has its own memory address. For example, consider this simple C++ code (ignore it if you don't know C++).
#include <vector>
std::vector<int> a{1,2,3}, b{3,4,5,6};
// Objects a and b each have their own memory addresses.
// Assignment a=b copies contents of b's memory into a.
a = b;
// a's memory address has not changed, but its contents have.
If you have programmed in C or C++, you might have gotten used to variables being permanently linked to their memory locations.
In contrast, python variables are just names: variables like a and b are names that are not associated to fixed memory addresses. Names can be bound to one object in memory, and later to another. Multiple names can be bound to the same object (sometimes known as aliasing in other languages). The upshot of this is that in python, the assignment "=" changes names, but need not copy memory contents.
a = np.array([1,2,3]) # This is Object1 and "a" is a name for it.
b = np.array([3,4,5,6]) # This is Object2 and "b" is a name for it.
We can double check that these objects occupy different memory locations using python's id
function.
id(a), id(b)
Consider what happens when you say a=b:
a = b # a is no longer a name for Object1, it is now a name for Object2.
id(a), id(b)
Names a and b are now both bound to the same "Object2". (And we no longer have a way to access "Object1"!) Now, when we change a, or when we change b, we are really changing the same object. If what you want instead is a separate copy of the data, use the copy method:
a = np.array([1,2,3]) # Object1
b = np.array([3,4,5,6]) # Object2
a = b.copy() # Copies Object2, and binds a to the copy
a[0] = 2 # Only the copied (new) object is changed
a, b
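As a small extra check (reusing python's id function from earlier in this section), you can confirm that after copy the two names refer to distinct objects:
id(a) == id(b), a is b # both False: a is now bound to a separate copy of the data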
Of course, numpy is all about vectors and matrices (and even higher-order tensors). Two-dimensional data, or tabular data, or matrix data of the form $$ \left\lbrack\begin{array}{ccc} A_{0,0} & \cdots & A_{0,n-1}\\ \vdots & \ddots & \vdots\\ A_{m-1,0} & \cdots & A_{m-1,n-1} \end{array}\right\rbrack $$ can be represented in python either as a list of lists or as a two-dimensional numpy array.
The numpy array is more efficient than a list of lists and has constructs for some matrix operations. (Note that you might find a numpy.matrix class, distinct from the array class, in some older codes, but be warned that it is deprecated. Due to problems arising from mixing matrix and array objects in python codes, we will not use the deprecated matrix class in this course. You should not use it in work you turn in.)
Amat = [[1,2],
[3,4]]
Amat
amat = np.array(Amat)
amat
type(amat), type(a)
Note that 2D and 1D numpy arrays are of the same type called numpy.ndarray.
What is the difference between 2*Amat and 2*amat, for the objects Amat (list of lists) and amat (numpy array) just made above?
2*Amat
2*amat
amat
amat * amat
Look at the output: is this really matrix multiplication?! This is one thing that drives mathematicians crazy when they look at numpy for the first time. Mathematicians want the multiplication operator * to mean matrix multiplication when it is applied to numpy arrays. Unfortunately python's default * does element-by-element multiplication, not matrix multiplication. Since the first proposal for numpy, decades ago, many battles have been waged to resolve this embarrassment.
Finally, a few years ago, there came some good news. Since Python 3.5, the @ symbol was dedicated to mean the matrix multiplication operator. You can read more about it at PEP 465.
import sys
print(sys.version) # check if you have version >= 3.5 before trying @
amat @ amat
Naturally, many of us needed matrix multiplication before the @ came along, so as you must have guessed, there is another way to do matrix multiplication:
np.dot(amat, amat) # dot(A,B) = matrix A multiplied by matrix B
amat.dot(amat)
I think you will agree with me that this is not as neat as @.
You should know that the embarrassment continues in matrix powers. If you thought amat ** 2 should give you a matrix power equaling the product of amat with itself, think again.
amat**2 # not equal to matrix power !!
Numpy provides a matrix_power routine to compute $M^n$ for matrices $M$ (and integers $n$).
np.linalg.matrix_power(amat, 2)
It does the job, but leaves elegance by the wayside.
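As a quick sanity check (not in the original notes), you can verify that matrix_power agrees with repeated @ multiplication:
np.array_equal(np.linalg.matrix_power(amat, 2), amat @ amat) # True: both compute the matrix square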
Slicing in two-dimensional arrays is similar to slicing one-dimensional arrays.
If rslice and cslice are 1D slices (or ranges) like the ones we used for one-dimensional arrays, then when applied to a 2D array A, the expression
A[rslice, cslice]
gives a submatrix of A, using row indices in rslice and column indices in cslice.
A = np.array([[7, 8, 5, 1], [2, 5, 5, 2], [9, 6, 8, 9]])
A
A[1, :], A[:, 2]
A[:3:2, :3]
Like many other programming environments, numpy stores 2D array data internally as a 1D array, in order to get a contiguous memory block for convenient storage. For 2D arrays, Fortran and Matlab use column-major ordering, while C uses row-major ordering. For example, the 2D array
[[7, 8, 5, 1],
[2, 5, 5, 2],
[9, 6, 8, 9]]
in row-major ordering looks like
7, 8, 5, 1, 2, 5, 5, 2, 9, 6, 8, 9
while in column-major ordering, it looks as follows.
7, 2, 9, 8, 5, 6, 5, 5, 8, 1, 2, 9
Numpy, by default, stores arrays in row-major ordering (like C). This thinking is reflected in some of numpy's methods: e.g., when you ask numpy to reshape or flatten an array, the result is what you would expect if the data were stored in row-major ordering.
M = np.array([[7, 8, 5, 1], [2, 5, 5, 2], [9, 6, 8, 9]])
M
M.reshape(2, 6) # Just a different view of the same data
M.ravel() # The 1D data of M in row-major ordering
But the actual situation is more complicated since numpy allows users to override the default storage ordering. You can decide to store an array like in C or like in Fortran. Here is how to store the same array in Fortran's column-major ordering.
A = np.array(M, order='F')
A
Obviously, it is the same matrix. How the data is stored internally is mostly immaterial (except for some performance optimizations). The behavior of most numpy methods does not change even if the user has opted to store the array in a different ordering. If you really need to see a matrix's internal ordering, you can do so by calling the ravel method with the keyword argument order='A'.
A.ravel(order='A') # A's internal ordering is Fortran style
M.ravel(order='A') # M's internal ordering is default C-style
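If you prefer, you can also inspect the ordering through the standard flags and strides array attributes (ordinary numpy attributes, not something introduced above); a minimal sketch:
A.flags['F_CONTIGUOUS'], M.flags['C_CONTIGUOUS'] # (True, True): A is Fortran-ordered, M is C-ordered
A.strides, M.strides # byte steps per dimension differ between the two orderings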
If a numpy array is given indices that are boolean (instead of integers), then rows or columns are selected based on True indices. This is called masking. It is very useful together with vectorized conditionals.
N = np.arange(25).reshape(5,5)
N
How will you isolate elements in N whose value is between 7 and 18?
mask = (N>7) & (N<18)
mask
These are the elements we needed:
N[mask]
And these are their locations:
i, j = np.where(mask) # Returns i and j indices where mask[i,j] is True.
i, j # 1st True value of mask is at i[0],j[0],
# 2nd True value of mask is at i[1],j[1], etc.
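Masks also combine nicely with numpy's vectorized conditional np.where (a standard numpy function; the replacement value 0 below is just an arbitrary choice for illustration):
np.where(mask, N, 0) # keep entries of N where mask is True, put 0 elsewhere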
Numpy can work with general-dimensional arrays, not just 1D or 2D arrays. For an n-dimensional array, the shape of a numpy array is a tuple of n integers giving the sizes in each dimension.
data = np.random.randint(low=0, high=10, size=30) # 1D array
T2 = np.reshape(data, (6, 5)) # 2D array
T2
T3 = np.reshape(data, (2, 3, 5)) # 3D array
T3
print('T3 is a ', T3.ndim, 'dimensional array of shape ', T3.shape)
print('T2 is a ', T2.ndim, 'dimensional array of shape ', T2.shape)
print('data is a ', data.ndim, 'dimensional array of shape ', data.shape)
Here are a few other features of numpy arrays to note (a quick code check follows this list):
Every numpy array has attributes ndim and shape.
A single number c, or np.array(c), is considered to have ndim=0 and shape=().
A vector of n numbers, when viewed as a row vector, has ndim=1 and shape=(n,).
A vector of n numbers, when viewed as a column vector, has ndim=2 and shape=(n, 1).
You can convert a 1D array a to a column vector by a[:, np.newaxis].
More generally, you can use np.newaxis to add a new dimension, e.g., T3[:, :, np.newaxis, :] has shape=(2, 3, 1, 5).
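Here is the quick code check promised above, a minimal sketch verifying some of these statements (the names c, v, and col are just illustrative, and T3 is the 3D array made earlier):
c = np.array(5) # a scalar: ndim=0, shape=()
v = np.array([1.0, 2.0, 3.0]) # ndim=1, shape=(3,)
col = v[:, np.newaxis] # a column vector: ndim=2, shape=(3, 1)
c.shape, v.shape, col.shape, T3[:, :, np.newaxis, :].shape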
In mathematics, it would be an illegal operation to add matrices of different shapes. But it is not surprising that we would want to: e.g., viewing the number 10 as a 1x1 matrix and considering a matrix A of any other size, wouldn't it be nice to say 10 + A to add 10 to all elements of A? Wouldn't it also be nice to be able to use + to add a vector to all columns of a matrix with more than one column? All this and more is made possible in numpy by broadcasting rules, which extend the possibilities of vectorized operations. A very clear explanation of broadcasting is in [JV-H].
To see if you can add up (or apply another binary ufunc to) two differently shaped arrays, follow this algorithm, which uses the ndim and shape attributes we just saw.
Step 1. If two arrays differ in their ndim, then revise the shape of the one with lower ndim by prepending 1 on the left until the ndims are equal.
Step 2. If shape[i] (after its possible revision from Step 1) of the two arrays are unequal and one of them equals 1, then increase the latter to match the other shape[i].
If the resulting revised shapes of the arrays are still unequal, then broadcasting fails and you can't add them. In Step 1, when we increase ndim by prepending a 1 to shape, we are not really changing the array: we are just imagining it with one extra dimension. In Step 2, when we increase shape[i] from 1 to something larger, we are imagining the array elements repeated along the i-th dimension (without actually performing an operation copying the elements). Here are a few examples to illustrate these rules.
Example 1:
a + b = [1, 8, 3] + [1]
[0, 6, 5] [8]
a.ndim=2 b.ndim=2 <= No need for Step 1 modification
a.shape=(2, 3) b.shape=(2, 1) <= Apply Step 2 to equalize shape
Step 2: a.shape=(2, 3) b.shape=(2, 3)
a + b = [1, 8, 3] + [1, 1, 1] = [ 2, 9, 4]
[0, 6, 5] [8, 8, 8] [ 8, 14, 13]
Example 2:
a + b = [1, 8, 3] + [1]
[0, 6, 5]
a.ndim=2 b.ndim=0 <= Apply Step 1 to equalize ndim
a.shape=(2, 3) b.shape=() (prepend 1 until ndim equalizes)
Step 1: a.shape=(2, 3) b.shape=(1, 1) <= Next apply Step 2 to equalize shape
Step 2: a.shape=(2, 3) b.shape=(2, 3)
a + b = [1, 8, 3] + [1, 1, 1] = [ 2, 9, 4]
[0, 6, 5] [1, 1, 1] [ 1, 7, 5]
Example 3:
a + b = [1, 8, 3] + [1, 3]
[0, 6, 5]
a.ndim=2 b.ndim=1 <= Apply Step 1 to equalize ndim
a.shape=(2, 3) b.shape=(2, )
Step 1: a.shape=(2, 3) b.shape=(1, 2) <= Next apply Step 2 to equalize shape
Step 2: a.shape=(2, 3) b.shape=(2, 2) <= Still unequal: broadcasting fails
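If you would like to confirm these hand computations in numpy itself, here is a minimal sketch reproducing the three examples (the array names are local to this snippet):
a = np.array([[1, 8, 3], [0, 6, 5]])
print(a + np.array([[1], [8]])) # Example 1: shapes (2,3) and (2,1) broadcast
print(a + 1)                    # Example 2: the scalar is broadcast to every entry
try:
    print(a + np.array([1, 3])) # Example 3: shapes (2,3) and (2,) cannot broadcast
except ValueError as err:
    print('broadcasting failed:', err)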
As a simple exercise to further fix ideas, follow the above procedure and see if you can explain whether broadcasting rules apply, or not, to the following (with T2 and T3 as set previously).
T2 + T3
T3 + 1
T2[:3,:] + T3
Author: Jay Gopalakrishnan
License: ©2020. CC-BY-SA