Chapter 3 Numpy and Pandas
import numpy as np
10) np.random.seed(
Base python does not include true vectorized data structures–vectors, matrices, and data frames. For small things one can use lists, lists of lists, and list comprehensions. However, such code will be bulky and slow.
This deficiency is addressed by additional libraries, in particular numpy and pandas. Numpy is the primary way to handle matrices and vectors in python. This is the way to model either a variable or a whole dataset so vector/matrix approach is very important when working with datasets. Even more, these objects also model the vectors/matrices as mathematical objects. Matrix computations are extremely important in statistics and hence also in machine learning.
3.1 Numpy
Numpy is the most popular python library for matrix/vector computations. Due to python’s popularity, it is also one of the leading libraries for numerical analysis, and a frequent target for computing benchmarks and optimization.
It is important to keep in mind that numpy is a separate library that is not part of the base python. Unlike R, base python is not vectorized, and one has to load numpy (or another vectorized library, such as pandas) in order to use vectorized operations. This also causes certain differences between the base python approach and the way to do vectorized operations.
3.1.1 Importing numpy
Numpy is typically imported as np
:
import numpy as np
np
is pretty much the standard acronym for the numpy and widely used
in online documentation. Below we assume numpy has been imported as
np
.
3.1.2 Array: The Fundamental Data Structure in Numpy
Numpy is fundamentally based on arrays, N-dimensional data structures. Here we mainly stay with one- and two-dimensional structures (vectors and matrices) but the arrays can also have higher dimension (called tensors). Besides arrays, numpy also provides a plethora of functions that operate on the arrays, including vectorized mathematics and logical operations.
Arrays can be created with np.array
. For instance, we can create a
1-D vector of numbers from 1 to 4 by feeding a list of desired numbers
to the np.array
:
= np.array([1,2,3,4])
a print("a:\n", a)
## a:
## [1 2 3 4]
Note that it is printed in brackets as list, but unlike a list, it does not have commas separating the components.
If we want to create a matrix (two-dimensional array), we can feed
np.array
with a list of lists, one sublist for each row of the
matrix:
= np.array([[1,2], [3,4]])
b print("b:\n", b)
## b:
## [[1 2]
## [3 4]]
The output does not have the best formatting but it is clear enough.
One of the fundamental property of arrays its dimension, called
shape in numpy. Shape is array’s size along
all of its dimensions. This can be queried by attribute .shape
which returns the sizes in a form of a tuple:
a.shape
## (4,)
b.shape
## (2, 2)
One can see that vector a
has a single dimension of size 4, and
matrix b
has two dimensions, both of size 2 (remember: (4,)
is a
tuple of length 1!).
One can also reshape arrays, i.e. change their shape into another
compatible shape. This can be achieved with .reshape()
method.
.reshape
takes one argument, the new shape (as a tuple) of the array.
For instance, we can reshape the length-4 vector
into a 2x2 matrix as
2,2)) a.reshape((
## array([[1, 2],
## [3, 4]])
and we can “straighten” matrix b
into a vector with
4,)) b.reshape((
## array([1, 2, 3, 4])
3.1.3 Creating Arrays
Sometimes it is practical to create arrays manually as we did above, but usually it is much more important to make those by computation. Below we list a few options.
np.arange
creates sequences, quite a bit like range
, but the
result will be a numpy vector. If needed, we can reshape the vector
into a desired format:
10) # vector of length 10 np.arange(
## array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
10).reshape((2,5)) # 2x5 matrix np.arange(
## array([[0, 1, 2, 3, 4],
## [5, 6, 7, 8, 9]])
np.zeros
and np.ones
create arrays filled with zeros and ones
respectively:
5,)) np.zeros((
## array([0., 0., 0., 0., 0.])
2,4)) np.ones((
## array([[1., 1., 1., 1.],
## [1., 1., 1., 1.]])
Arrays can be combined in different ways, e.g. np.column_stack
combines
them as columns (next to each other), and np.row_stack
combines these
as rows (underneath each other). For instance, we can combine a
column of ones and two columns of zeros as follows:
= np.ones((5,)) # a single vector of ones
oneCol = np.zeros((5,2)) # two columns of zeros
zeroCols # 5x3 columns np.column_stack((oneCol, zeroCols))
## array([[1., 0., 0.],
## [1., 0., 0.],
## [1., 0., 0.],
## [1., 0., 0.],
## [1., 0., 0.]])
Note that column_stack
expects all arrays to be passed as a single
tuple (or list).
Exercise 3.1 Use np.zeros
, np.ones
, mathematical operations and concatenation
to create the following array:
## array([[-1., -1., -1., -1.],
## [ 0., 0., 0., 0.],
## [ 2., 2., 2., 2.]])
See the solution
3.1.4 Vectorized Functions (Universal Functions)
It is possible to use loops to do computation with numpy objects exactly in the same way when working with lists. However, one should use vectorized operations instead whenever possible. Vectorized operations are easier to code, easier to read, and result in faster code.
Numpy offers a plethora of vectorized functions and operators, called universal functions. Many of these work as expected. For instance, mathematical operations. We create a matrix, and then add “100” to it, and then rise “2” to the power of the values:
= np.arange(12).reshape((3,4))
a print(a)
## [[ 0 1 2 3]
## [ 4 5 6 7]
## [ 8 9 10 11]]
print(100 + a, "\n")
## [[100 101 102 103]
## [104 105 106 107]
## [108 109 110 111]]
print(2**a, "\n") # remember: exponent with **, not with ^
## [[ 1 2 4 8]
## [ 16 32 64 128]
## [ 256 512 1024 2048]]
Both of these mathematical operations, +
and **
are performed
elementwise2
for every single element of the matrix.
Exercise 3.2 Create the following array:
## array([[ 2, 4, 6, 8, 10],
## [12, 14, 16, 18, 20],
## [22, 24, 26, 28, 30],
## [32, 34, 36, 38, 40]])
See the solution
Comparison operators are vectorized too:
> 6 a
## array([[False, False, False, False],
## [False, False, False, True],
## [ True, True, True, True]])
== 7 a
## array([[False, False, False, False],
## [False, False, False, True],
## [False, False, False, False]])
As comparison operators are vectorized, one might expect that the other
logical operators, and, or and not, are also vectorized. But
this is not the case. There are vectorized logical operators, but
they differ from the base python version. These are more similar to
corresponding operators in
R or C, namely &
for logical and, |
for
logical or, and ~
for logical not:
< 3) | (a > 8) # logical or (a
## array([[ True, True, True, False],
## [False, False, False, False],
## [False, True, True, True]])
> 4) & (a < 7) # logical and (a
## array([[False, False, False, False],
## [False, True, True, False],
## [False, False, False, False]])
~(a > 6) # logical not
## array([[ True, True, True, True],
## [ True, True, True, False],
## [False, False, False, False]])
There is no vectorized multi-way comparison like 1 < x < 2
.
3.1.5 Array Indexing and Slicing
Indexing refer to extracting elements based on their position or certain criteria. This is one of the fundamental operations with arrays. There are two ways to extract elements: based on position, and based on logical criteria. Unfortunately, this also makes indexing somewhat confusing, and it needs some time to become familiar with.
3.1.5.1 Extracting elements based on position
Array indexing is very similar to list indexing. As matrices have two dimensions, we need two indices.
= np.arange(12)
a print(a[::2]) # every second element
## [ 0 2 4 6 8 10]
However, unlike lists, one can do vectorized assignments in numpy:
5:11] = -1 # assign multiple elements
a[ a
## array([ 0, 1, 2, 3, 4, -1, -1, -1, -1, -1, -1, 11])
One can also extract multiple elements from a vector:
4,5,7]] # extract 3 elements in one go a[[
## array([ 4, -1, -1])
When working with matrices (2-D arrays), we need two indices, separated by comma. Comma separates two slices
= np.arange(12).reshape((3,4))
c c
## array([[ 0, 1, 2, 3],
## [ 4, 5, 6, 7],
## [ 8, 9, 10, 11]])
1,2] # 2nd row, 3rd column c[
## 6
1] # 2nd row c[
## array([4, 5, 6, 7])
Comma can separate not just two indices but two slices, so we can write
2] # all rows, 3rd column c[:,
## array([ 2, 6, 10])
2] # 1st, 2nd row c[:
## array([[0, 1, 2, 3],
## [4, 5, 6, 7]])
2, :3] # 1s, 2nd row, first three columns c[:
## array([[0, 1, 2],
## [4, 5, 6]])
Exercise 3.3 Create matrix and access rows and columns
- create a 4x5 array of even numbers: 10, 12, 14, …
- extract third column
- set the fourth row to 1,2,3,4,5
Note: there are many ways to achieve this.
See the solution
3.1.5.2 Boolean indexing
An extremely widely used approach is to extract elements of an array
based on a logical criteria. Fundamentally, it is just using a
logical vector for indexing. The vector must be of the same lengts as
the array in question, and the results contains only those elements
the correspond to True
in the indexing vector. Here is an example
how we can do this manually:
= np.array([1,2,7,8])
a = np.array([True, False, True, False])
i # 1, 7 a[i]
## array([1, 7])
It is important you understand what is going on here: arrays a
and
i
will be “matched”, so each element of a
will have its “match” in
i
. Next, only those elements of a
that are matched with True
are extracted, in this case just 1 and 7.
The previous example–manually creating a logical index vectors of
trues and falses is hardly ever useful. Almost always we use logical
operations instead. For instance, we can extract all elements of a
that are greater than 5:
= a > 5
i i
## array([False, False, True, True])
a[i]
## array([7, 8])
This is often written in a more compact manner by skipping explicit
logical vector i
:
> 5] a[a
## array([7, 8])
New users of numpy (and other languages that support logical indexing) sometimes forget that the logical condition does not have to be related to the same array that we are attempting to extract. For instance, we can extract all results for a certain person:
= np.array(["Cyrus", "Darius", "Xerxes", "Artaxerxes", "Cyrus", "Darius"])
names = np.array([17, 14, 20, 18, 13, 15])
results == "Darius"] results[names
## array([14, 15])
Here index vector
is based on the variable name
only and is not directly
related to results
. However, we use it to extract values from the
latter.
Finally, we also can extract rows (or columns) from a 2-D array in a fairly similar fashion:
= np.array(["Cyrus", "Darius", "Xerxes"])
names = np.array([[17, 14], [20, 18], [13, 15]])
results results
## array([[17, 14],
## [20, 18],
## [13, 15]])
== "Darius",:] results[names
## array([[20, 18]])
The results is the second row of the 2-D array results
,
corresponding to the name “Darius”.
Logical indexing can also be used on the left-hand-side of the
expression, in order to replace elements.
Below is an example where we replace
all the negative elements of a
with zero.
= np.random.randn(2,3)
a a
## array([[ 1.3315865 , 0.71527897, -1.54540029],
## [-0.00838385, 0.62133597, -0.72008556]])
< 0] = 0
a[a a
## array([[1.3315865 , 0.71527897, 0. ],
## [0. , 0.62133597, 0. ]])
When replacing elements in such fashion then we need to supply the
replacement vector that is either length 1 (all elements are replaced
by “0” in the example above), or alternatively we should supply a
replacement vector of correct length. For instance, we can replace
the positive numbers left in a
with 1, 2, 3:
> 0] = np.array([1, 2, 3])
a[a a
## array([[1., 2., 0.],
## [0., 3., 0.]])
Exercise 3.4 consider two vectors
= np.array(["Roxana", "Statira", "Roxana", "Statira", "Roxana"])
names = np.array([126, 115, 130, 141, 132]) score
Do the following using a single one-line vectorized operation.
- Extract all test scores that are smaller than 130
- Extract all test scores by Statira
- Add 10 points to Roxana’s scores. (You need to extract it first.)
See the solution
3.1.6 Random numbers
TBD: make a separate section with some simulations and related examples.
Numpy offer a large set of random number generators. These can be invoked asnp.random.
generator(
params, size)
. For instance,
np.random.choice(N)
can be used to create random numbers from 0 to
\(N-1\). size
determines the shape of the resulting object.
The argument is size, not shape, although it determines the output shape!
Here is an example to simulate roll of a die for 5 times:
= np.random.choice(6, size=5)
x x
## array([0, 2, 0, 4, 3])
But maybe we prefer not to label the results as 0..5 but 1..6. So we can just add one to the result. Here is an example that creates 2-D array of die rolls:
1 + np.random.choice(6, size=(2,4))
## array([[1, 5, 4, 1],
## [4, 3, 2, 1]])
Numpy offers a large set of various random values. Here we list a few more:
3.1.6.1 Random elements from list
random.choice
can also extract random elements from a list:
= ["A", "G", "C", "T"]
nucleotides = np.random.choice(nucleotides, 20)
dna "".join(dna)
## 'ACGTCGGGTGCGACCCGAGT'
As the example demonstrates, random.choice
picks random elements
with replacement (use replace
option to change this behavior).
3.1.6.2 Random normals
random.normal(loc, scale, size)
generates normally distributed
random numbers. The distribution is centered at loc and its
variance is scale:
1000, 100, size=10) np.random.normal(
## array([ 969.13995143, 984.62645147, 1067.89311948, 808.64457868,
## 905.75419444, 825.4457021 , 897.90547332, 983.79309738,
## 934.20291005, 1042.21130617])
3.1.6.3 Binomial random numbers
random.binomial(n, p, size)
creates random binomials where
probability of success is p and sample size is n:
2, 0.5, size=(2,4)) np.random.binomial(
## array([[2, 2, 1, 1],
## [2, 2, 1, 2]])
Exercise 3.5 We can describe a coin toss as Binomial(1, 0.5) where 1 refers to the fact that we toss a single coin, and 0.5 means it has probability 0.5 to come heads up. So such random variables are sequences of zeros and ones. But how can we get a sequence of -1 and 1 instead? Demonstrate it on computer!
See the solution
3.1.6.4 Uniform random numbers
random.uniform(low, high, size)
creates uniformly distributed random
numbers in the interval [low, high]:
-1, 1, size=(3,4)) # random numbers in [-1, 1] np.random.uniform(
## array([[-0.4078626 , -0.73741789, 0.68563587, 0.31807261],
## [ 0.19087921, -0.1272926 , -0.28749935, 0.17426185],
## [-0.70105733, -0.6575228 , -0.20567095, 0.27590313]])
3.1.6.5 Repeating the exact same random sequence
The random numbers are often called pseudorandom as they are not truly random–they are computed based on a well-defined algorithm, so when feeding the same initial values to the algorithm, one always gets the same random numbers. However, normally the initial values are taken from certain hart-to-control parameters outside of the program control, such as time in microseconds and hard disk serial number, so in practice it is impossible to replicate the same sequence.
However, if you need to replicate your results exactly, you have to
set the initial values explicitly using random.seed(value)
. This
re-initializes RNG-s to the given initial state:
1)
np.random.seed(=5) # 1st batch of numbers np.random.uniform(size
## array([4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01,
## 1.46755891e-01])
=5) # 2nd batch is different np.random.uniform(size
## array([0.09233859, 0.18626021, 0.34556073, 0.39676747, 0.53881673])
1)
np.random.seed(=5) # repeat the 1st batch np.random.uniform(size
## array([4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01,
## 1.46755891e-01])
3.1.7 Statistical functions
Numpy offers a set of basic statistical functions, including sum,
mean, and standard deviations std. These can be applied to the
array as a whole, or separately to rows or columns. In the latter
case one has to specify the argument axis
, where the value 0 means
to apply the operation row-wise (and preserve columns) and axis=1
means to apply the operation column-wise (and preserve rows). Here is
an example:
= np.arange(12).reshape((3,4))
a # 3 rows, 4 columns a
## array([[ 0, 1, 2, 3],
## [ 4, 5, 6, 7],
## [ 8, 9, 10, 11]])
sum() # total sum a.
## 66
sum(axis=0) # add rows, preserve columns a.
## array([12, 15, 18, 21])
sum(axis=1) # add columns, preserve rows a.
## array([ 6, 22, 38])
The functions come in two forms: as a method x.sum()
, and as a
separate function np.sum(x)
. These two ways are pretty much
equivalent.
By default, a missing value of an array causes the function to return missing:
= a.astype(float) # as np.nan is float, need a float array
a 1,2] = np.nan
a[ a
## array([[ 0., 1., 2., 3.],
## [ 4., 5., nan, 7.],
## [ 8., 9., 10., 11.]])
sum(a) np.
## nan
This differs from the corresponding functionality in pandas where missings are ignored by default!
The other statistical functions include
mean
for averagemedian
for medianvar
for variancestd
for standard deviationnp.percentile
andnp.quantile
for quantiles
3.2 Pandas
Pandas is the standard python library to work with dataframes.
Unlike in R, this is not a part of base python and must be imported
separately. It is typically imported as pd
:
import pandas as pd
Pandas relies heavily on numpy but is a separate package. Unfortunately, it also uses a somewhat different syntax and somewhat different defaults. However, as it is “made of” numpy, it works very well together with the latter.
Pandas contains two central data types: Series and DataFrame.
Series is often used as a second-class citizen, just as a single
variable (column) in data frame. But it can also be used as a
vectorized dict that links keys (indices) to values.
DataFrame is broadly similar to other dataframes as implemented in R
or spark. When you extract its
individual columns and rows you normally get those in the form of Series.
So it is extremely useful to know the basics of Series when working
with data frames.
Both DataFrame and Series include index, a glorified row name,
which is very useful for extracting information based on names, or
for merging different variables into a data frame
(See Section Concatenating data with pd.concat
).
We start by introducing Series as this is a simpler data structure than DataFrame, and allows us to introduce index.
3.2.1 Series
Series is a one-dimensional positional column (or row) of values. It is in some sense similar to list, but from another point of view it is more like a dict, as it contains index, and you can look up values based on index as a key. So it allows not only positional access but also index-based (key-based) access. In terms of internal structure, it is implemented with vectorized operations in mind, so it supports vectorized arithmetic, and vectorized logical, string, and other operations. Unlike dicts, it also supports multi-element extraction.
Let’s create a simple series:
= pd.Series([1,2,5,6])
s s
## 0 1
## 1 2
## 2 5
## 3 6
## dtype: int64
Series is printed in two columns. The first one is the index, the second one is the value. In this example, index is essentially just the row number and it is not very useful. This is because we did not provide any specific index and hence pandas picked just the row number. Underneath the two columns, you can also see the data type, in this case it is 64-bit integer, the default data type for integers in python.
Now let’s make another example with a more informative index:
= pd.Series( [ 38, 26, 19, 19],
pop = ['ca', 'tx', 'ny', 'fl'])
index # population, in millions
pop
## ca 38
## tx 26
## ny 19
## fl 19
## dtype: int64
Now the index is helpful: we are looking at state populations, and index tells us which state is in which row. Another advantage of possessing index is that even when we filter and manipulate the series, it’s index will still retain the original row label. So we know that index “fl” will always correspond to Florida. But if we have removed a few cases, or re-ordered the series, then Florida may not be on the fourth position any more.
Exercise 3.6 Create a series of 4 capital cities where the index is the name of corresponding country.
See the solution
We can extract values and index using the corresponding attributes:
pop.values
## array([38, 26, 19, 19])
pop.index
## Index(['ca', 'tx', 'ny', 'fl'], dtype='object')
Note that values are returned as np array, and index is a special index object. If desired, this can be converted to a list:
list(pop.index)
## ['ca', 'tx', 'ny', 'fl']
Series also supports ordinary mathematics, e.g. we can do operations like
> 20 pop
## ca True
## tx True
## ny False
## fl False
## dtype: bool
the result will be another series, here of logical values, as indicated by the “bool” data type.
3.2.2 DataFrame
DataFrame is the central data structure for holding 2-dimensional rectangular data. It is in many ways similar to R dataframes. However, it also shares a number of features with Series, in particular the index, so you can imagine a data frame is just a number of series stacked next to each other. Also, extracting single rows or columns from DataFrames typically results in a series.
3.2.2.1 Creating data frames
DataFrame can be created manually as a dict of lists (or series). The keys of the list are the variable names and values are the variable values, normally these are lists or series. As an example, let’s create a data frame with three variables, ca, tx and md, and three rows:
= {'ca': [35, 37, 38], 'tx': [23, 24, 26], 'md': [5,5,6]}
df = pd.DataFrame(df)
pop print('population:\n', pop, '\n')
## population:
## ca tx md
## 0 35 23 5
## 1 37 24 5
## 2 38 26 6
The data frame is printed as four columns. Exactly as in case of series, the first column is index. In the example above we did not specify the index and hence pandas picked just row numbers. But we can provide an explicit index, for instance the year of observation:
= pd.DataFrame(df, index = [2010,2012,2014])
pop print('population:\n', pop, '\n')
## population:
## ca tx md
## 2010 35 23 5
## 2012 37 24 5
## 2014 38 26 6
In this case the index is rather useful.
Exercise 3.7 Create a dataframe of (at least 4) countries, with 2 variables: population and capital. Country name should be the index.
Hint: feel free to invent populations!
See the solution
3.2.2.2 Read data from file
To create data frames manually is useful for testing and
debugging, in real applications we typically read data from disk.
This can be done with pd.read_csv
that takes the file
name as the first argument, and also supports many other options. In
the example below, we read
data about G.W.Bush approval rate
in fall 2001. pd.read_csv
assumes files are comma-separated by
default, but as this example file
is tab-separated we have to declare it using sep="\t"
as
an extra argument.
We also read the first 10 rows only for demonstration:
= pd.read_csv("../data/gwbush-approval.csv", sep="\t", nrows=10)
approval approval
## date approve disapprove dontknow
## 0 2001 Dec 14-16 86 11 3
## 1 2001 Dec 6-9 86 10 4
## 2 2001 Nov 26-27 87 8 5
## 3 2001 Nov 8-11 87 9 4
## 4 2001 Nov 2-4 87 9 4
## 5 2001 Oct 19-21 88 9 3
## 6 2001 Oct 11-14 89 8 3
## 7 2001 Oct 5-6 87 10 3
## 8 2001 Sep 21-22 90 6 4
## 9 2001 Sep 14-15 86 10 4
Exercise 3.8 In the example above: how many columns are printed? How many variables does the dataframe contain?
See the solution
What happens if we use a wrong separator? This can be easily checked with printing the number of columns, and printing a few lines of data. Here is an example:
= pd.read_csv("../data/gwbush-approval.csv") # wrong separator
a a.shape
## (31, 1)
2) a.head(
## date\tapprove\tdisapprove\tdontknow
## 0 2001 Dec 14-16\t86\t11\t3
## 1 2001 Dec 6-9\t86\t10\t4
Two problems are immediately visible: first, the file contains a single column only (because it does not consider tab symbols as separators), and the two lines we printed look weird. If you ask for variable names, you can also see that all variable names are combined together into a single weird name:
a.columns
## Index(['date\tapprove\tdisapprove\tdontknow'], dtype='object')
The tab markers \t
in printout give strong hints that the correct
separator is tab.
It may initially be quite confusing to understand how to specify the
file name. If you load data in a jupyter notebook, then the working
directory is normally the same directory where the notebook is
located3. Notebook also let’s you to complete file
names with TAB key. But in any case, the working directory
can be found with os.getcwd
(get current working directory):
import os
os.getcwd()
## '/home/siim/tyyq/lecturenotes/machinelearning-py'
This helps to specify the relative path if your data file is not
located in the same place as your code. You can also find which files does python
find in a given folder, e.g. in ../data/
:
= os.listdir("../data/")
files 5] files[:
## ['house-votes-84.csv.bz2', 'marathon.csv.bz2', 'growth-unemployment-2016.csv.bz2', 'hadcrut-5.0.1.0-annual.csv.bz2', 'trump-approval.csv']
As we see, this function returns a list of file names it finds in the given location.
Exercise 3.9 Refresh your knowledge of relative paths!
- What is your current working directory?
- List all files in
- your current folder
- in the parent folder of it.
See the solution
As another complication, notebooks are often run on a separate server or in a docker container. These may have no access to files in your computer (as the server), or only have a limited access (like docker container). In such a case, you should upload the file to notebook, even if it is running on your computer!
3.3 Indexing data frames and series
Indexing refers to selecting data from data frames and series based on variable names, logical conditions, and position. It is a complex task with many different methods, and unfortunately also with many caveats. Below, the topic is split into several subsections:
- Select variables explains how to select desired variables from a data frame
- Filter observations with logical operations describes how to filter rows
- Positional indexing of Series introduces positional indexing, indexing based on row number, and how to do it with series
- Positional indexing of data frames explains positional indexing, indexing based on both row and column numbers, for data frames
- Modifying data frames: there are slight differences when modifying data instead of extracting, these are discussed here.
- Indexing: summary and comparison provides a summary of all methods.
Fortunately, Series and data frames behave in a broadly similar way, e.g. selecting cases by logical conditions, based on index, and location are rather similar. As series do not have columns, we cannot access elements by column name or by column position though.
These notes do not provide a comprehensive overview, consult e.g. McKinney "Python for Data Analysis for more details.
3.3.1 Select variables in data frames
We use the G.W.Bush approval data we loaded above to demonstrate variable access. For a refresher, the first lines of the data frame look like
4) approval.head(
## date approve disapprove dontknow
## 0 2001 Dec 14-16 86 11 3
## 1 2001 Dec 6-9 86 10 4
## 2 2001 Nov 26-27 87 8 5
## 3 2001 Nov 8-11 87 9 4
To begin with, data frames have variable names. We can extract a
single variable either with ["varname"]
or a shorthand as attribute
.varname
(note: replace varname with the name of the relevant
variable):
"approve"] # approval, as series approval[
## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86
## Name: approve, dtype: int64
# the same, as series approval.approve
## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86
## Name: approve, dtype: int64
These constructs return the column as a series. If we prefer to get a single-column data frame, we can wrap the variable name into a list:
"approve"]] # approval, as data frame approval[[
## approve
## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86
The attribute shorthand is usually the easier way, but it does not work if you need to use indirect variable name (variable name that is stored in another variable) or if the variable name contains spaces or other special characters. It also does not work for creating new variables in the data frame. See more in Section 3.3.5.
The previous example where we extracted a single column as a data frame instead of Series also hints how to extract more than one variable: just wrap all the required variable names into a list:
vars = ["date", "approve"]
vars] approval[
## date approve
## 0 2001 Dec 14-16 86
## 1 2001 Dec 6-9 86
## 2 2001 Nov 26-27 87
## 3 2001 Nov 8-11 87
## 4 2001 Nov 2-4 87
## 5 2001 Oct 19-21 88
## 6 2001 Oct 11-14 89
## 7 2001 Oct 5-6 87
## 8 2001 Sep 21-22 90
## 9 2001 Sep 14-15 86
There are no attribute shortcuts to extract multiple columns.
3.3.2 Filter observations with logical operations
Filtering refers to extracting only a subset of rows from the dataframe based on certain conditions. The conditions are logical operations that can be either true or false, depending on the values in each row. Filtering produces a sub-dataframe where only those observations that meet the selection criteria are present: Here is an example:
> 88] approval[approval.approve
## date approve disapprove dontknow
## 6 2001 Oct 11-14 89 8 3
## 8 2001 Sep 21-22 90 6 4
Note that we have to refer to data variables as approval.approve
,
not just approve
, unlike in R dplyr where one can just write
approve
.
This is somewhat harder to write but it is less ambiguous and produces
fewer hard-to-find bugs.
Obviously we can use more complex selection conditions, for instance we can look for very low or very high approval rates as follows:
< 86) | (approval.approve > 89)] approval[(approval.approve
## date approve disapprove dontknow
## 8 2001 Sep 21-22 90 6 4
Note that we are using the vectorized “or” operator |
, not the base
python or
. We also need to wrap both the “less than” and “greater
than” parts in parenthesis.
See more in Section 3.1.4.
Exercise 3.10 How many polls in the data show the president’s approval rate at least 88%? At which dates are those polls conducted?
See the solution
The filtered object is not a new data frame but a view of the
original data frame. This may give you warnings and errors later when
you attempt to modify the filtered data. If
you intend to do that, perform a deep copy of data using the .copy
method. See more in Section
3.3.5.
3.3.3 Positional indexing of Series
Besides selecting variables and filtering by logical conditions, we occasionally need to access elements by index, or by position (location). Here we demonstrate the positional indexing using a series object, positional indexing of data frames is discussed in Section 3.3.4 below:
= pd.Series([32.7, 267.7, 15.3], # in millions
pop =["MY", "ID", "KH"])
index pop
## MY 32.7
## ID 267.7
## KH 15.3
## dtype: float64
We can access series’ values in two ways: by position, and by index.
In order to access elements by position, we have to use attribute
.iloc[]
where i loc refers to “integer”. Unlike most other
methods, .iloc
expects arguments in brackets. A single number
in brackets returns the element as an element (e.g. a single number),
if brackets contain a list (this looks like double brackets),
it returns a series, potentially
containing only a single element.
So in order to extract
2nd and 3rd element in the population series, we can write:
1] # extract 2nd element as a number pop.iloc[
## 267.7
1,2]] # extract 2nd, 3th as a series pop.iloc[[
## ID 267.7
## KH 15.3
## dtype: float64
Alternatively, we can also extract the elements by index. This works
in a similar fashion, except we have to use .loc[]
instead of
.iloc[]
. The rules for single and double brackets apply in the similar
fashion as in case of positional access.
"ID"] # extract Indonesian population as a number pop.loc[
## 267.7
"ID", "MY"]] # extract Indonesian and Malaysian population pop.loc[[
## ID 267.7
## MY 32.7
## dtype: float64
# as a series
Exercise 3.11 Use your series of capital cities (see the exercise above). Extract:
- 1st, 3rd element by position as single elements (city names)
- 2nd element by country name as a 1-element series.
See the solution
One can also drop the .loc[]
syntax and just use square brackets, so
instead of writing pop.loc[["ID", "MY"]]
, one can just write
pop[["ID", "MY"]]
.
The fact that there are several ways to extract positional
data causes a lot of confusion for beginners. It
is not helped by the common habit of not using indices and just
relying on the automatic row-numbers. In this case
positional access by .iloc[]
produces exactly the same results
as the index access
by .loc[]
, and one can conveniently forget about the index and use
whatever feels easier. But sometimes
the index changes as a result of certain operations and
that may lead to errors or unexpected results.
For instance, we can create an alternative population series without
explicit index:
= pd.Series([np.nan, 26, 19, 13]) # index is 0, 1, ...
pop1 pop1
## 0 NaN
## 1 26.0
## 2 19.0
## 3 13.0
## dtype: float64
In this example, position and index are equivalent and hence it is
easy to forget that .loc[]
is index-based access, not positional access!
So one may freely mix both methods (and remember, .loc
is not needed):
2] pop1.loc[
## 19.0
2] pop1.iloc[
## 19.0
2] pop1[
## 19.0
This becomes a problem if a numeric index is not equivalent to row number any more, for instance after we drop missings:
= pop1.dropna() # remove missings
pop2 # note: the first row has index 1 pop2
## 1 26.0
## 2 19.0
## 3 13.0
## dtype: float64
2] # this is by position pop2.iloc[
## 13.0
2] # this is by index pop2.loc[
## 19.0
2] # also by index pop2[
## 19.0
Additionally, if pop2
for some reason turns into a
numpy array, then pop2[2]
is is
based on position as arrays do not have index!
3.3.4 Positional indexing of data frames
We use a small data frame of capital cities to demonstrate how indexing on data frames works. The data frame contains two variables, name of the capital city and population as variables, index is the country name:
= pd.DataFrame({"capital":["Kuala Lumpur", "Jakarta", "Phnom Penh"],
countries "population":[32.7, 267.7, 15.3]}, # in millions
=["MY", "ID", "KH"])
index countries
## capital population
## MY Kuala Lumpur 32.7
## ID Jakarta 267.7
## KH Phnom Penh 15.3
(MY is Malaysia, ID Indonesia and KH is Cambodia).
Exactly as series, data frames allow positional access by
.iloc[]
. However, as data frames are two-dimensional objects, .iloc
accepts two arguments (in brackets, separated by comma), the first one
for rows, the second one for columns. So we can write
2] # 3rd row, as series countries.iloc[
## capital Phnom Penh
## population 15.3
## Name: KH, dtype: object
2]] # 3rd row, as data frame countries.iloc[[
## capital population
## KH Phnom Penh 15.3
2,1] # 3rd row, 2nd column, as a number countries.iloc[
## 15.3
There is also an index-based extractor .loc[]
that
accepts one (for rows) or two (for rows and columns) indices. In case
of data frames, the default row index is just the row number; but the
column index is the variable names. So we can write
"MY","capital"] # Malaisian capital countries.loc[
## 'Kuala Lumpur'
"KH", "ID"], ["population", "capital"]] countries.loc[[
## population capital
## KH 15.3 Phnom Penh
## ID 267.7 Jakarta
# Extract a sub-dataframe
Unfortunately, data frames add their confusing constructs. When
accessing data frames with .loc[]
then we have to specify rows first,
and possibly columns second. If we drop .loc
then we cannot
specify rows. That is, unless we extract one variable with brackets,
get a series and extract the desired row in the second set of brackets…
"capital"] countries[
## MY Kuala Lumpur
## ID Jakarta
## KH Phnom Penh
## Name: capital, dtype: object
"capital"]["MY"] countries[
## 'Kuala Lumpur'
Finally, remember that 2-D numpy arrays will use similar
integer-positional syntax as .iloc[]
, just without .iloc
.
In conclusion, it is very important to know what is your data type when using numpy and pandas. Indexing is all around us when working with data, there are many somewhat similar ways to extract elements, and which way is correct depends on the exact data type.
3.3.5 Modifying data frames
Modifying data frames can be done in a broadly similar way as extracting elements, you just need to put the expression on the left-hand side.
We can create a new column by assigning a list to a new column name:
"temperature"] = [27.7, 26.1, 26.6] # daily mean, January
countries[ countries
## capital population temperature
## MY Kuala Lumpur 32.7 27.7
## ID Jakarta 267.7 26.1
## KH Phnom Penh 15.3 26.6
One can also overwrite an existing column in a similar fashion.
However, there are several exceptions and caveats. Let’s demonstrate this by modifying the data frame of three countries we created above.
3.3.5.1 One cannot create variables with dot-attribute
We can extract a single series as countries.capital
, but we cannot
create a new variable with dot-notation. For instance:
= [1, 2, 3] # does not work countries.foo
## <string>:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
But after we have created a column with bracket-notation, we can access it using dot-notation.
3.3.5.2 Explicitly make copy when working with filtered data
A typical data science workflow consists of a) filtering data to relevant cases only, and b) modifying the resulting subset. The first step often involves removing missing values, or limiting the analysis to a certain subset of interest. It is important to realize that Pandas’ filtering does not copy the interesting cases in memory, it may instead just create a view, i.e. re-use the same location in computer memory but just limit access to certain part of it.4 This is a very good idea in terms of conserving memory and avoiding unnecessary copy operations. However, this may cause warnings and errors when modifying the filtered data later. We demonstrate this on the same dataset.
Select only large countries (population over 20M):
= countries[countries.population > 20]
large large
## capital population temperature
## MY Kuala Lumpur 32.7 27.7
## ID Jakarta 267.7 26.1
We got a subset of Malaysia and Indonesia. Now let’s add another variable to these large countries:
"language"] = ["Malay", "Indonesian"] large[
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
large
## capital population temperature language
## MY Kuala Lumpur 32.7 27.7 Malay
## ID Jakarta 267.7 26.1 Indonesian
Note the warning: A value is trying to be set on a copy of a
slice…. This tells you that filtering countries[countries.population > 20]
did not create a new data frame but a view of the existing one in
memory, and Pandas is unhappy with the code modifying just a part of
the original data frame.
Although the result appears correct here, do not rely on this approach! It may or may not work, depending on the exact memory layout of the dataset!
Fortunately, the solution is very simple. We need to make an explicit
copy with .copy
method before we start any modifications:
= countries[countries.population > 20].copy() # explicit copy
large "language"] = ["Malay", "Indonesian"]
large[ large
## capital population temperature language
## MY Kuala Lumpur 32.7 27.7 Malay
## ID Jakarta 267.7 26.1 Indonesian
Now the modification works without a warning.
Explicit copy is not needed before you start modifying data, you can
do various filtering steps without .copy
as long as you make the
copy before modifications.
3.3.5.3 Modifying index
The index that is attached to series’ and data frames is potentially a
useful and iformative tool. But sometimes it is not very useful. For
instance, when you load data from disk, then the index defaults to be
the row number, and this is rarely what we are interested in. In such
cases one may want to change the index. If you want to create a
new index then you can just assign it to df.index
. For instance, we
can just assign country names as index to our data frame of large
countries:
= ["Malaysia", "Indonesia"]
large.index large
## capital population temperature language
## Malaysia Kuala Lumpur 32.7 27.7 Malay
## Indonesia Jakarta 267.7 26.1 Indonesian
Alternatively, we can convert a column to index with .set_index()
method:
"capital") large.set_index(
## population temperature language
## capital
## Kuala Lumpur 32.7 27.7 Malay
## Jakarta 267.7 26.1 Indonesian
This will remove the column “capital” from data frame as its values
will be in index instead. Note that by default, .set_index()
returns a new data frame instead of modifying it in place, so if you
want to preserve it, you have to store it in a new variable.
The opposite–converting the
index into a column can be done with .reset_index()
.
Exercise 3.12 Take the data frame of capital-population data frame from Section 3.3.4.
- Replace the index by country names
- Convert the index into a variable “country”
Ensure that you store and print the final data frame!
See the solution
3.3.6 Indexing: summary and comparison
Indexing data is complex. Here we repeat and summarize the main methods we have discussed so far. First create three objects, a numpy matrix, a data frame, and a series. The first two are 2-dimensional but the last one 1-dimensional.
= np.array([[1507, 12478],
M -500, 11034],
[1537, 8443],
[1591, 6810]])
[ M
## array([[ 1507, 12478],
## [ -500, 11034],
## [ 1537, 8443],
## [ 1591, 6810]])
= pd.DataFrame(M, columns=["established", "population"],
df =["Mumbai", "Delhi", "Bangalore", "Hyderabad"])
index df
## established population
## Mumbai 1507 12478
## Delhi -500 11034
## Bangalore 1537 8443
## Hyderabad 1591 6810
= pd.Series(M[:,0], index=df.index)
s s
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## dtype: int64
(This is data about four cities, the year when those were established, and population in thousands).
Exercise 3.13 Create another numpy matrix and a data frame about cities in a similar fashion: create a
matrix of data, and create a data frame from it using pd.DataFrame
.
Specify index (row names) and
columns (variable names). Include at least 3 cities and 3 variables (e.g.
population in millions, size in km2, and population density
people per km2).
Hint: you may invent both city names and the figures!
See the solution
Extract rows/columns by number (integer):
- Numpy array: just use the numbers in brackets:
1,0] # second row, first column M[
## -500
2,:] # third row M[
## array([1537, 8443])
- Data frames: use
iloc
and brackets:
1,0] # second row, first column df.iloc[
## -500
2,:] # third row df.iloc[
## established 1537
## population 8443
## Name: Bangalore, dtype: int64
- Series: use
iloc
and brackets (but these are just 1-dimensional):
1] # second row s.iloc[
## -500
Extract using index (city names/column names):
- numpy array: not possible
- Data frames: use
loc
and brackets:
"Delhi","established"] # second row, first column df.loc[
## -500
"Bangalore",:] # third row df.loc[
## established 1537
## population 8443
## Name: Bangalore, dtype: int64
- Series: use
loc
(but not columns here):
"Delhi"] s.loc[
## -500
If we want to extract individual columns, we can do the following:
- Numpy arrays: use brackets and use a colon
:
in row indicators place:
0] M[:,
## array([1507, -500, 1537, 1591])
- Data frames: you can use
iloc
and brackets, exactly as in case of numpy arrays. You can also use brackets and column names (column index) withoutiloc
, or dot-column name:
0] df.iloc[:,
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64
"established"] df[
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64
df.established
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64
If you want to extract rows and columns in a mixed, e.g. rows by number, and columns by column names (index), you can use double extraction (two sets of brackets) and chain your extractions into a single line:
3,:]["population"] df.iloc[:
## Mumbai 12478
## Delhi 11034
## Bangalore 8443
## Name: population, dtype: int64
Exercise 3.14 Take your own city matrix and city data frame. From both of these extract:
- population density (for all cities)
- data for the third city. For the data frame do it in two ways: using index, and using row number!
- area of the second city. For the data frame, do it in two ways: using column name (column index), and column number!
See the solution
Finally, if asking for a single entry (singleton), pandas simplifies the result into a lower-ranked object (series instead of data frame, or a number instead of series). If you want to retain a similar data structure as the original one, wrap your selector in a list. For instance, the previous example that returns a data frame: single line:
3,:][["population"]] df.iloc[:
## population
## Mumbai 12478
## Delhi 11034
## Bangalore 8443
All these methods can create rather confusing situations sometimes.
For instance, if we do not specify index, it will be automatically
created as row numbers (but starting from 0, not 1). In that case
df.iloc[i]
and df.loc[i]
give the same result (assuming i
is a
list of row numbers). Even worse, if the
index skips some numbers, then df.loc[i]
may or may not work, and
even where it works, it may give wrong results! In a similar fashion,
M[i,j]
works but df[i,j]
does not work, df.loc[i,j]
works but
M.loc[i,j]
does not work. In order to tell if the syntax is correct
it is necessary to know what is the data structure.
There are also operations that are not performed elementwise when using array, in particular matrix product↩︎
If you run your code from command line, the working directory is the directory where you run the command, not the directory where the program is located.↩︎
Pandas decides whether to make a copy or a view in each case separately, depending on what is the more efficient approach.↩︎