A Solutions

A.1 Python

A.1.1 Operators

A.1.1.1 Going out with friends

friends = 7
budget = 150
print("I am going out with", friends, "friends")
## I am going out with 7 friends
mealprice = 14
price = mealprice*(friends + 1)
total = price*1.15
print("total price:", total)
## total price: 128.79999999999998
if total <= budget:
    print("can afford")
else:
    print("cannot afford")
## can afford

A.1.2 Strings

A.1.2.1 Combine strings

h = "5'" + '3"'
print("height is", h)
## height is 5'3"

A.1.3 Collections

A.1.3.1 Extract sublists

l = list(range(1, 13))
for i in range(3):
    print(l[i::3])        
## [1, 4, 7, 10]
## [2, 5, 8, 11]
## [3, 6, 9, 12]

A.1.3.2 Combining lists

friends = ['Paulina', 'Severo']
others = []
others.append("Lai Ming")
others.append("Lynn")
people = friends + others
print("all people", people)
## all people ['Paulina', 'Severo', 'Lai Ming', 'Lynn']

A.1.3.3 Assign people to seats

Consider two lists:

names = ["Adam", "Ashin", "Inukai", "Tanaka", "Ikki"]
seats = [33, 12, 45, 2, 17]
assigneds = []
for i in range(len(names)):
    a = names[i] + ": " + str(seats[i])
    assigneds.append(a)
assigneds    
## ['Adam: 33', 'Ashin: 12', 'Inukai: 45', 'Tanaka: 2', 'Ikki: 17']

A.1.3.4 List comprehension

  • Make a list not of i but of i + 1:
[1 + i for i in range(10)]
## [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  • To make pizzas with different toppings, you can just use string concatenation with +:
toppings = ['mushrooms', 'mozzarella', 'ananas']
['pizza with ' + t for t in toppings]
## ['pizza with mushrooms', 'pizza with mozzarella', 'pizza with ananas']
  • Here you can just loop over the lists, and each time pick out element [1] from the list:
triples = [[1,2,3], ["a", "b", "c"], [len, print, type]]
[l[1] for l in triples]
## [2, 'b', <built-in function print>]

A.1.3.5 Dict mapping coordinates to cities

placenames = {(31.228611, 121.474722): "Shanghai",
              (23.763889, 90.388889): "Dhaka",
              (13.7525, 100.494167): "Bangkok"
          }
print(placenames)
## {(31.228611, 121.474722): 'Shanghai', (23.763889, 90.388889): 'Dhaka', (13.7525, 100.494167): 'Bangkok'}

Note that such lookup is usually not what we want because it only returns the correct city name if given the exact coordinates. If coordinates are just a tiny bit off, the dict cannot find the city. If you consider cities to be continuous areas, you should do geographic lookup, not dictionary lookup.
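
The exact-match limitation can be seen in a small sketch. The helper nearest_city below is hypothetical (it is not part of the original notes) and uses plain Euclidean distance on latitude and longitude as a crude stand-in for a real geographic lookup:

```python
import math

placenames = {(31.228611, 121.474722): "Shanghai",
              (23.763889, 90.388889): "Dhaka",
              (13.7525, 100.494167): "Bangkok"}

# exact coordinates are found, slightly different ones are not
exact = placenames.get((31.228611, 121.474722))
off = placenames.get((31.23, 121.47))  # no such key, so .get returns None

def nearest_city(lat, lon):
    """Hypothetical helper: city whose stored coordinates are closest."""
    key = min(placenames,
              key=lambda c: math.hypot(c[0] - lat, c[1] - lon))
    return placenames[key]

near = nearest_city(31.23, 121.47)
print(exact, off, near)
```

Treating degrees as flat coordinates is only an assumption that keeps the sketch short; a serious lookup would use a proper geographic distance.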

A.1.3.6 Dict of dicts

tsinghua = {"house":30, "street":"Shuangqing Rd",
            "district":"Haidian", "city":"Beijing",
            "country":"CN"}
kenyatta = {"pobox":"43844-00100", "city":"Nairobi",
            "country":"KE"}
places = {"Tsinghua":tsinghua, "Kenyatta":kenyatta}
places["Teski Refugio"] = {"house":256,
                   "street":"Volcan Osorno",
                   "city":"La Ensenada",
                   "region":"Los Lagos",
                   "country":"CL"}
print(places)
## {'Tsinghua': {'house': 30, 'street': 'Shuangqing Rd', 'district': 'Haidian', 'city': 'Beijing', 'country': 'CN'}, 'Kenyatta': {'pobox': '43844-00100', 'city': 'Nairobi', 'country': 'KE'}, 'Teski Refugio': {'house': 256, 'street': 'Volcan Osorno', 'city': 'La Ensenada', 'region': 'Los Lagos', 'country': 'CL'}}

A.1.3.7 Find the total bill

payments = {"jan":1200, "feb":1200, "mar":1400, "apr":1200}
total = 0
for month in payments.keys():
    total += payments[month]
print("Total rent:", total)    
## Total rent: 5000

A.1.3.8 Sets: unique names

kings = ["Jun", "Gang", "An", "jun", "HYE", "JUN", "hyo", "yang", "WON", "WON", "Yang"]
Kings = [n.title() for n in kings]
s = set(Kings)
print(s)
## {'Yang', 'Won', 'Hyo', 'Hye', 'An', 'Jun', 'Gang'}
print(len(s), "unique kings")
## 7 unique kings

A.1.4 Language Constructs

A.1.4.1 Solution: odd or even

for n in range(1,11):
    parity = n % 2
    if parity == 1:
        print(n, "odd")
    else:
        print(n, "even")
## 1 odd
## 2 even
## 3 odd
## 4 even
## 5 odd
## 6 even
## 7 odd
## 8 even
## 9 odd
## 10 even

A.1.4.2 Solution: join names

.join() is a string method and must be called on the separator string:

names = ["viola glabella", "monothropa hypopithys", "lomatium utriculatum"]
", ".join(names)
## 'viola glabella, monothropa hypopithys, lomatium utriculatum'

Note that the syntax is separator.join(names), not names.join(separator).
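
A short sketch of the same point, using a shortened version of the list above: lists have no .join method, and the items being joined must be strings, so other values have to be converted first.

```python
names = ["viola glabella", "lomatium utriculatum"]

# the separator string is the object whose method we call
joined = "; ".join(names)

# lists do not have a .join method, so names.join(...) would fail
has_join = hasattr(names, "join")

# non-string items must be converted to strings before joining
numbers = ", ".join(str(n) for n in [1, 2, 3])
print(joined, has_join, numbers)
```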

A.1.5 Modules

A.1.5.1 Solution: access your file system

The solution below imports the whole module. You can also use, e.g., from os import getcwd, listdir, or other approaches.

import os
os.getcwd()
## '/home/otoomet/tyyq/lecturenotes/machinelearning-py'
os.listdir()
## ['index.md', 'images.rmd', 'Makefile', 'neural-nets.rmd', 'ml-techniques.rmd~', 'python.rmd', 'neural-nets.md', 'web-scraping.rmd', 'ml-workflow.md', 'text.rmd', 'descriptive-statistics.rmd', 'python.md', 'titanic-tree.png', 'predictions.md', 'solutions.rmd', '_bookdown.yml', 'datasets.md', 'descriptive-statistics.md', 'ml-techniques.md', 'index.rmd', 'predictions.rmd', 'build', '.cache', 'regularization.rmd', 'trees-forests.rmd', 'images.md', 'linear-regression.rmd', 'overfitting-validation.rmd', 'linear-algebra.md', 'linear-algebra.rmd', 'keras-cats-vs-dogs.py', 'svm.rmd', 'overfitting-validation.md', 'cleaning-data.md', 'numpy-pandas.rmd', 'plotting.rmd', '_output.yml', 'figs', 'logistic-regression.rmd', 'keras-color-spiral.py', 'descriptive-analysis.rmd', 'ml-techniques.rmd', 'linear-regression.md', 'cleaning-data.rmd', 'ml-workflow.rmd', 'unsupervised-learning.rmd', 'numpy-pandas.md', 'logistic-regression.md', 'files', 'descriptive-analysis.md', 'text.md', 'datasets.rmd', 'machinelearning-py.rds', '.fig', 'regularization.md', 'plotting.md']
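
The alternative import style mentioned above might look like the following sketch. The output depends on your own file system, so none is shown.

```python
# import only the two functions we need, not the whole module
from os import getcwd, listdir

cwd = getcwd()        # current working directory as a string
files = listdir(cwd)  # names of files and folders in that directory
print(cwd)
print(len(files), "entries")
```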

A.2 Numpy and Pandas

A.2.1 Numpy

A.2.1.1 Solution: concatenate arrays

## <string>:1: DeprecationWarning: `row_stack` alias is deprecated. Use `np.vstack` directly.
## array([[-1., -1., -1., -1.],
##        [ 0.,  0.,  0.,  0.],
##        [ 2.,  2.,  2.,  2.]])

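Obviously, you can use np.ones and np.zeros directly in np.row_stack and skip creating the temporary variables. As the warning above indicates, recent NumPy versions deprecate the row_stack alias, so np.vstack is the preferred name.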
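
The code for this solution is not reproduced above, so the following is only a sketch of one way to arrive at the printed array. The variable names a, b and c are assumptions, and np.vstack is used in place of the deprecated np.row_stack.

```python
import numpy as np

# three rows of length 4: all -1, all 0, all 2
a = -np.ones(4)
b = np.zeros(4)
c = 2*np.ones(4)
stacked = np.vstack((a, b, c))

# the same without temporary variables
direct = np.vstack((-np.ones(4), np.zeros(4), 2*np.ones(4)))
print(stacked)
```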

A.2.1.2 Solution: create matrix of even numbers

import numpy as np
2 + 2*np.arange(20).reshape(4,5)
## array([[ 2,  4,  6,  8, 10],
##        [12, 14, 16, 18, 20],
##        [22, 24, 26, 28, 30],
##        [32, 34, 36, 38, 40]])

A.2.1.3 Solution: create matrix, play with columns

a = 10 + np.arange(20).reshape(4,5)*2
print(a[:,2], '\n')
## [14 24 34 44]
a[3] = 1 + np.arange(5)
a
## array([[10, 12, 14, 16, 18],
##        [20, 22, 24, 26, 28],
##        [30, 32, 34, 36, 38],
##        [ 1,  2,  3,  4,  5]])

A.2.1.4 Solution: Logical indexing

names = np.array(["Roxana", "Statira", "Roxana", "Statira", "Roxana"])
scores = np.array([126, 115, 130, 141, 132])

scores[scores < 130]  # extract < 130
## array([126, 115])
scores[names == "Statira"]  # extract by Statira
## array([115, 141])
scores[names == "Roxana"] = scores[names == "Roxana"] + 10  # add 10
scores
## array([136, 115, 140, 141, 142])

A.2.1.5 Solution: create a sequence of -1, 1

We can just multiply the (0, 1) sequence by 2 and subtract 1:

2*np.random.binomial(1, 0.5, size=10) - 1
## array([ 1, -1, -1,  1, -1, -1,  1, -1, -1, -1])
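
As a sketch of the same idea with NumPy's newer random generator interface (the seed 0 is an arbitrary choice for reproducibility), the transformation can be compared with drawing the values -1 and 1 directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# 0/1 draws, rescaled: 0 -> -1, 1 -> 1
zero_one = rng.binomial(1, 0.5, size=10)
signs = 2*zero_one - 1

# alternative: sample directly from the two allowed values
direct = rng.choice([-1, 1], size=10)
print(signs)
print(direct)
```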

A.2.2 Pandas

A.2.2.1 Solution: series of capital cities

import pandas as pd
cities = pd.Series(["Brazzaville", "Libreville", "Malabo", "Yaoundé"],
                   index=["Congo", "Gabon", "Equatorial Guinea", "Cameroon"])
cities

A.2.2.2 Solution: extract capital cities

cities.iloc[0]
cities.iloc[2]
cities[["Gabon"]]

A.2.2.3 Solution: create city dataframe

To keep code clean, we create data as a separate dict, and index as a list. Thereafter we make a dataframe out of these two:

data = {"capital":["Brazzaville", "Libreville", "Malabo", "Yaoundé"],
        "population":[1696, 703, 297, 2765]}
countries = ["Congo", "Gabon", "Equatorial Guinea", "Cameroon"]
cityDF = pd.DataFrame(data, index=countries)
cityDF

A.2.2.4 Solution: Variables in G.W.Bush approval data frame

Pandas prints five columns: the index and four variables (date, approve, disapprove, dontknow).

A.2.2.5 Show files in current/parent folder

Remember that the parent folder is denoted by double dots (".."). The current folder is a single dot ("."), but it is usually not needed.

Current working directory:

import os
os.getcwd()
## '/home/otoomet/tyyq/lecturenotes/machinelearning-py'

Files in the current folder:

os.listdir()  # or os.listdir(".")
## ['index.md', 'images.rmd', 'Makefile', 'neural-nets.rmd', 'ml-techniques.rmd~', 'python.rmd', 'neural-nets.md', 'web-scraping.rmd', 'ml-workflow.md', 'text.rmd', 'descriptive-statistics.rmd', 'python.md', 'titanic-tree.png', 'predictions.md', 'solutions.rmd', '_bookdown.yml', 'datasets.md', 'descriptive-statistics.md', 'ml-techniques.md', 'index.rmd', 'predictions.rmd', 'build', 'trees-forests.md', '.cache', 'regularization.rmd', 'trees-forests.rmd', 'unsupervised-learning.md', 'images.md', 'linear-regression.rmd', 'overfitting-validation.rmd', 'web-scraping.md', 'linear-algebra.md', 'linear-algebra.rmd', 'keras-cats-vs-dogs.py', 'svm.md', 'svm.rmd', 'overfitting-validation.md', 'cleaning-data.md', 'numpy-pandas.rmd', 'plotting.rmd', '_output.yml', 'figs', 'logistic-regression.rmd', 'keras-color-spiral.py', 'descriptive-analysis.rmd', 'ml-techniques.rmd', 'linear-regression.md', 'cleaning-data.rmd', 'ml-workflow.rmd', 'unsupervised-learning.rmd', 'numpy-pandas.md', 'logistic-regression.md', 'files', 'descriptive-analysis.md', 'text.md', 'datasets.rmd', 'machinelearning-py.rds', '.fig', 'regularization.md', 'plotting.md']

Files in the parent folder:

os.listdir("..")
## ['machineLearning.mtc3', 'machineLearning.mtc17', 'machineLearning.mtc8', 'machineLearning.mtc2', 'data', 'Makefile', 'machineLearning.pdf', 'machineLearning.mtc14', 'machineLearning.mtc20', 'lecturenotes.bib.bak', 'machinelearning-R.html', 'machineLearning.mtc10', 'latexmkrc', 'tex', 'auto', '.RData', 'machineLearning.mtc9', 'preamble.tex', 'scripts', 'machineLearning.mtc18', 'machineLearning.exc', 'machineLearning.mtc13', 'intro-to-stats.rnw', 'machineLearning.mtc21', 'machineLearning.mtc19', 'machineLearning.mtc16', 'machineLearning.mtc5', '.cache', 'material', 'machineLearning.tex', 'solutions.rnw', 'machineLearning.maf', 'machineLearning.mtc0', 'machineLearning.mtc11', 'ml-models.rnw', 'machineLearning.mtc12', 'machinelearning-py', 'kbordermatrix.sty', 'machinelearning-R', 'solutions.tex', 'datascience-intro', 'machinelearning-common', 'machineLearning.chs', 'test', 'machineLearning.mtc1', 'machineLearning.exm', 'machineLearning.mtc23', 'machineLearning.mtc15', 'figs', 'machineLearning.mtc7', '.Rprofile', '.git', 'intro-to-stats.tex', 'machineLearning.xdv', 'machineLearning.mtc4', 'machineLearning.rnw', 'img', 'machineLearning.mtc', 'machineLearning.bbl', 'README.md', 'files', 'machineLearning.mtc22', '.Rhistory', 'videos', '.gitignore', 'machineLearning.rip', '.fig', 'ml-models.tex', 'literature', 'machineLearning.mtc6']

Obviously, everyone has different files in these folders.
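
To see which folders "." and ".." actually refer to, you can convert them to absolute paths. This is a small sketch using os.path.abspath; the paths it prints depend on where you run it.

```python
import os

current = os.path.abspath(".")   # absolute path of the current folder
parent = os.path.abspath("..")   # absolute path of the parent folder
print(current)
print(parent)

# the parent folder is the directory that contains the current one
is_parent = os.path.dirname(current) == parent
```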

A.2.2.6 Solution: presidents approval 88%

approval = pd.read_csv("../data/gwbush-approval.csv", sep="\t", nrows=10)
approval[approval.approve >= 88][["date", "approve"]]  # only print date and approval rate
approval[approval.approve >= 88].shape[0]  # number of cases with at least 88% approval

A.2.2.7 Solution: change index, convert to variable

The original data frame was created as

capitals = pd.DataFrame(
        {"capital":["Kuala Lumpur", "Jakarta", "Phnom Penh"],
         "population":[32.7, 267.7, 15.3]},  # in millions
        index=["MY", "ID", "KH"])

We can modify it as

capitals.index = ["Malaysia", "Indonesia", "Cambodia"]
capitals  # now the index is country names
# convert the index to a variable called "country"
# (DataFrame.reset_index takes names=, available in pandas 1.5 and later)
capitals = capitals.reset_index(names="country")

A.2.2.8 Solution: create city dataframe

cityM = np.array([[3.11, 5282, 19800],
              [18.9, 306, 46997],
              [4.497, 1886, 22000]])
names = ["Chittagong", "Dhaka", "Kolkata"]
vars = ["population", "area", "density"]
cityDF = pd.DataFrame(cityM, index=names,
                      columns=vars)
cityDF

A.2.2.9 Solution: extract city data

# density:
cityM[:,2]
## array([19800., 46997., 22000.])
cityDF.density
# third city
cityM[2,:]
## array([4.497e+00, 1.886e+03, 2.200e+04])
cityDF.loc["Kolkata",:]
cityDF.iloc[2,:]
# second city area
cityM[1,1]
## np.float64(306.0)
cityDF.loc["Dhaka","area"]
cityDF.iloc[1,1]

A.2.2.10 Solution: Titanic line 1000

Extract the 1000th row (note: index 999!) as a data frame, and thereafter extract name, survival status, and age. We extract in two steps here: .iloc cannot combine row numbers with column names, although .loc can combine index labels with column names.

titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic.loc[[999]][["name", "survived", "age"]]
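
The difference between selecting by label and by position can be tried out on a small stand-in. The toy data frame below uses invented values, not the Titanic data. With the default index, row labels coincide with row numbers, so .loc can also take the row label and the column names in a single call:

```python
import pandas as pd

# toy stand-in for the titanic data (invented values)
toy = pd.DataFrame({"name": ["Ann", "Bob", "Cem"],
                    "survived": [1, 0, 1],
                    "age": [29.0, 41.0, 3.0],
                    "fare": [7.25, 71.3, 8.05]})
cols = ["name", "survived", "age"]

two_step = toy.loc[[2]][cols]      # rows first, then columns
one_step = toy.loc[[2], cols]      # row label and column names at once
by_position = toy.iloc[[2]][cols]  # row by position, then columns by name
print(one_step)
```

If the index were not the default 0, 1, 2, ..., the label passed to .loc would no longer equal the row number, and the .iloc variant would be the safe choice.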

A.2.2.11 Solution: Titanic male/female age distribution

The two relevant variables are clearly sex and age. If you have already loaded the data, you may just keep these variables:

titanic[["sex", "age"]].sample(4)

Alternatively, you may read only these two columns:

pd.read_csv("../data/titanic.csv.bz2",
            usecols=["sex", "age"]).sample(4)

A.3 Descriptive analysis with Pandas

A.3.1 What are the values?

A.3.1.1 Solution: which methods can be applied to the whole data frame

There is no better way than just to try:

titanic.min()  # works
titanic.mean()  # works
titanic.unique()  # does not work
titanic.nunique()  # works again

The mathematical functions work, but may skip the non-numeric variables (in pandas 2.0 and later you may need to pass numeric_only=True, e.g. to .mean, for that). .min, however, finds the first value when ordered alphabetically. It is not immediately clear why .unique does not work while .nunique works. It may be because there is a different number of unique elements for each variable, but hey, you could just use a more complex data structure and still compute that.
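
As a sketch of what such a "more complex data structure" could be, a dict can hold the unique values of every column even though their numbers differ. The toy data frame below contains invented values and only stands in for the Titanic data:

```python
import pandas as pd

# toy stand-in with one text and one numeric column (invented values)
toy = pd.DataFrame({"sex": ["female", "male", "male", "female"],
                    "pclass": [1, 3, 3, 2]})

# unique values per column: arrays of different length, stored in a dict
uniques = {col: toy[col].unique() for col in toy.columns}

counts = toy.nunique()               # number of unique values per column
means = toy.mean(numeric_only=True)  # explicitly skip the text column
print(uniques)
```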

A.4 Cleaning and Manipulating Data

A.4.1 Missing Observations

A.4.1.1 Solution: Missings in fare in Titanic Data

We can count the NaNs as

titanic.fare.isna().sum()

There is just a single missing value. As for unreasonable values, we should first check the data type:

titanic.fare.dtype

We see it is coded as numeric (64-bit float), so we can query its range:

titanic.fare.min(), titanic.fare.max()

While 512 pounds seems a reasonable ticket price, value 0 for minimum is suspicious. We do not know if any passengers really did not pay a fare, or more likely, it just means that the data collectors did not have the information. So it is just a missing value, coded as 0.
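
If you decide to treat zero fares as missing, one option is to replace them with NaN. The sketch below uses a short invented series rather than the actual fare column:

```python
import numpy as np
import pandas as pd

# invented fares: two zeros and one explicit missing value
fare = pd.Series([7.25, 0.0, 512.33, np.nan, 0.0])

n_missing = fare.isna().sum()    # explicitly missing values
n_zero = (fare == 0).sum()       # suspicious zeros
cleaned = fare.replace(0, np.nan)  # recode zeros as missing
print(n_missing, n_zero, cleaned.isna().sum())
```

Whether zeros really mean "unknown" is a judgment call about the data, so this recoding is an assumption and not a fact about the dataset.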

A.4.2 Converting Variables

A.4.2.1 Solution: convert males’ dataset school to categories

Load the males dataset. It is instructive to test the code first on a sequence of years of schooling before converting the actual data. In particular, we want to ensure that we get the boundaries right: “12” should be “HS” while “13” should be “some college”, and so on.

We choose to specify the right boundary at integer values, e.g. “HS” is the interval \([12, 13)\). To ensure that “12” belongs to this interval while “13” does not, we set right=False, i.e. exclude the right boundary from the interval (and hence include it in the interval above):

# test on years of schooling 10-17
school = np.arange(10,18)
# convert to categories
# (np.inf: the np.Inf alias is not available in NumPy 2.0 and later)
categories = pd.cut(school,
                    bins=[-np.inf, 12, 13, 16, np.inf],
                    labels=["Less than HS", "HS", "Some college", "College"],
                    right=False)
# print in a way that years and categories are next to each other
pd.Series(categories, index=school)

Now we see it works correctly, e.g. 11 years of schooling is “Less than HS” while 12 years is “HS”. Now we can do the actual conversion:

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
pd.cut(males.school,
       bins=[-np.inf, 12, 13, 16, np.inf],
       labels=["Less than HS", "HS", "Some college", "College"],
       right=False)
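
The effect of right=False can be checked by comparing it with the default right=True on a few boundary values. This is a small sketch on made-up years of schooling, not on the males data:

```python
import numpy as np
import pandas as pd

school = np.arange(11, 15)  # 11, 12, 13, 14 years of schooling
bins = [-np.inf, 12, 13, 16, np.inf]
labels = ["Less than HS", "HS", "Some college", "College"]

# intervals closed on the left: [12, 13) is "HS"
left_closed = pd.cut(school, bins=bins, labels=labels, right=False)
# default, intervals closed on the right: (12, 13] is "HS"
right_closed = pd.cut(school, bins=bins, labels=labels)

print(list(left_closed))
print(list(right_closed))
```

With the default setting, 12 years would land in “Less than HS” and 13 in “HS”, which is not what we want here.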

A.4.2.2 Solution: Convert Males’ dataset residence to dummies

The rest of the task pretty much repeats the examples in Converting categorical variables to dummies; you just have to find the prefix_sep argument to remove the underscore between the prefix “R” and the category name. The code might look like

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
residence = pd.get_dummies(males.residence, prefix="R", prefix_sep="")
residence.drop("Rsouth", axis=1).sample(8)

A.4.2.3 Solution: Convert Titanic’s age categories, sex, pclass to dummies

titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic = titanic[["age", "sex", "pclass"]]
titanic["age"] = pd.cut(titanic.age,
                        bins=[0, 14, 50, np.inf],
                        labels=["0-13", "14-49", "50-"],
                        right=False)
d = pd.get_dummies(titanic, columns=["age", "sex", "pclass"])
d.sample(7)

One may also want to drop one of the dummy levels with the drop_first argument.
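
A sketch of what drop_first does, on a toy data frame with invented values rather than the Titanic data: for each converted variable, the first category is dropped and serves as the reference category.

```python
import pandas as pd

# toy data with invented values
toy = pd.DataFrame({"sex": ["female", "male", "male"],
                    "pclass": [1, 3, 2]})

# all dummy levels
full = pd.get_dummies(toy, columns=["sex", "pclass"])
# first level of each variable dropped
reduced = pd.get_dummies(toy, columns=["sex", "pclass"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```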

A.4.2.4 Solution: Convert Males’ residence, ethn to dummies and concatenate

We create the dummies separately for residence and ethn and give each a corresponding prefix to make the variable names more descriptive.

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
residence = pd.get_dummies(males.residence, prefix="residence")
# remove the reference category
residence = residence.drop("residence_north_east", axis=1)
residence.sample(4)
# convert and remove using chaining
ethn = pd.get_dummies(males.ethn, prefix="ethn")\
         .drop("ethn_other", axis=1)
ethn.sample(4)
# combine these variables next to each other
d = pd.concat((males.wage, residence, ethn), axis=1)
d.sample(7)

A.5 Descriptive Statistics

A.5.1 Inequality

A.5.1.1 Solution: 80-20 ratio of income

First load the data and compute total income:

treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
income = treatment.re78  # extract income for simplicity
total = income.sum()  # total income

Next, we can just start by trying the uppermost 1% (i.e. incomes above the 99th percentile):

pct = 99
threshold = np.percentile(treatment.re78, pct)

The income share of the richest 1% is

share = income[income > threshold].sum()/total
share
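
Trying other percentiles is easier if the computation is wrapped in a function. The sketch below uses simulated log-normal incomes instead of the treatment data, and the helper name top_share is an assumption made for illustration:

```python
import numpy as np

# simulated incomes (assumption: log-normal, not the treatment data)
rng = np.random.default_rng(1)
income = rng.lognormal(mean=9, sigma=1, size=10_000)
total = income.sum()

def top_share(pct):
    """Income share of those above the pct-th percentile."""
    threshold = np.percentile(income, pct)
    return income[income > threshold].sum()/total

# income share of the richest 1%, 5%, 10% and 20%
for pct in [99, 95, 90, 80]:
    print(pct, round(top_share(pct), 3))
```

With the real data you would replace the simulated income by treatment.re78 and adjust pct until the share matches the ratio you are looking for.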