A Solutions

A.1 Python

A.1.1 Operators

A.1.1.1 Going out with friends

friends = 7
budget = 150
print("I am going out with", friends, "friends")
## I am going out with 7 friends
mealprice = 14
price = mealprice*(friends + 1)
total = price*1.15
print("total price:", total)
## total price: 128.79999999999998
if total <= budget:
    print("can afford")
else:
    print("cannot afford")
## can afford

A.1.2 Strings

A.1.2.1 Combine strings

h = "5'" + '3"'
print("height is", h)
## height is 5'3"

A.1.3 Collections

A.1.3.1 Extract sublists

l = list(range(1, 13))
for i in range(3):
    print(l[i::3])        
## [1, 4, 7, 10]
## [2, 5, 8, 11]
## [3, 6, 9, 12]

A.1.3.2 Combining lists

friends = ['Paulina', 'Severo']
others = []
others.append("Lai Ming")
others.append("Lynn")
people = friends + others
print("all people", people)
## all people ['Paulina', 'Severo', 'Lai Ming', 'Lynn']

A.1.3.3 Assign people to seats

Consider two lists:

names = ["Adam", "Ashin", "Inukai", "Tanaka", "Ikki"]
seats = [33, 12, 45, 2, 17]
assigneds = []
for i in range(len(names)):
    a = names[i] + ": " + str(seats[i])
    assigneds.append(a)
assigneds    
## ['Adam: 33', 'Ashin: 12', 'Inukai: 45', 'Tanaka: 2', 'Ikki: 17']

A.1.3.4 List comprehension

Create pizzas with different toppings:

[1 + i for i in range(10)]
## [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toppings = ['mushrooms', 'mozarella', 'ananas']
['pizza with ' + t for t in toppings]
## ['pizza with mushrooms', 'pizza with mozarella', 'pizza with ananas']

A.1.3.5 Dict mapping coordinates to cities

placenames = {(31.228611, 121.474722): "Shanghai",
              (23.763889, 90.388889): "Dhaka",
              (13.7525, 100.494167): "Bangkok"
          }
print(placenames)
## {(31.228611, 121.474722): 'Shanghai', (23.763889, 90.388889): 'Dhaka', (13.7525, 100.494167): 'Bangkok'}

Note that such a lookup is usually not what we want, because it only returns the correct city name when given the exact coordinates. If the coordinates are just a tiny bit off, the dict cannot find the city. If you consider cities to be continuous areas, you should do a geographic lookup, not a dictionary lookup.
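For illustration, here is a minimal sketch of such a nearest-coordinate lookup (the helper nearest_city is not part of the exercise, and plain Euclidean distance on latitude/longitude is only a rough approximation):

import numpy as np

def nearest_city(lat, lon, places):
    # pick the stored coordinate pair closest to (lat, lon) and return its city
    key = min(places, key=lambda c: np.hypot(c[0] - lat, c[1] - lon))
    return places[key]

nearest_city(31.2, 121.5, placenames)  # slightly off Shanghai's exact coordinates
## 'Shanghai'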

A.1.3.6 Dict of dicts

tsinghua = {"house":30, "street":"Shuangqing Rd",
            "district":"Haidan", "city":"Beijing",
            "country":"CH"}
kenyatta = {"pobox":"43844-00100", "city":"Nairobi",
            "country":"KE"}
places = {"Tsinghua":tsinghua, "Kenyatta":kenyatta}
places["Teski Refugio"] = {"house":256,
                   "street":"Volcan Osorno",
                   "city":"La Ensenada",
                   "region":"Los Lagos",
                   "country":"CL"}
print(places)
## {'Tsinghua': {'house': 30, 'street': 'Shuangqing Rd', 'district': 'Haidan', 'city': 'Beijing', 'country': 'CH'}, 'Kenyatta': {'pobox': '43844-00100', 'city': 'Nairobi', 'country': 'KE'}, 'Teski Refugio': {'house': 256, 'street': 'Volcan Osorno', 'city': 'La Ensenada', 'region': 'Los Lagos', 'country': 'CL'}}

A.1.3.7 Find the total bill

payments = {"jan":1200, "feb":1200, "mar":1400, "apr":1200}
total = 0
for month in payments.keys():
    total += payments[month]
print("Total rent:", total)    
## Total rent: 5000
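
Alternatively, the same total can be computed more compactly by summing the dict values directly:

sum(payments.values())
## 5000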

A.1.3.8 Sets: unique names

kings = ["Jun", "Gang", "An", "jun", "HYE", "JUN", "hyo", "yang", "WON", "WON", "Yang"]
Kings = [n.title() for n in kings]
s = set(Kings)
print(s)
## {'Yang', 'Hyo', 'Hye', 'Gang', 'An', 'Won', 'Jun'}
print(len(s), "unique kings")
## 7 unique kings

A.1.4 Language Constructs

A.1.4.1 Solution: odd or even

for n in range(1,11):
    parity = n % 2
    if parity == 1:
        print(n, "odd")
    else:
        print(n, "even")
## 1 odd
## 2 even
## 3 odd
## 4 even
## 5 odd
## 6 even
## 7 odd
## 8 even
## 9 odd
## 10 even

A.1.4.2 Solution: join names

.join() is a string method and must be called on the separator string:

names = ["viola glabella", "monothropa hypopithys", "lomatium utriculatum"]
", ".join(names)
## 'viola glabella, monothropa hypopithys, lomatium utriculatum'

Note that the syntax is separator.join(names), not names.join(separator).

A.1.5 Modules

A.1.5.1 Solution: access your file system

Below is the solution that imports the whole library. You can also use, e.g., from os import getcwd, listdir, or other approaches.

import os
os.getcwd()
## '/home/siim/tyyq/lecturenotes/machinelearning-py'
os.listdir()
## ['text.rmd', 'logistic-regression.rmd', 'linear-regression.rmd', 'ml-workflow.md', 'index.rmd', 'images.rmd', 'regularization.md', 'trees-forests.rmd', 'svm.md', 'color-spiral-keras.png', 'svm.rmd', 'images.md', 'numpy-pandas.rmd', 'solutions.md', 'plotting.rmd', 'datasets.rmd', 'machinelearning-py.rds', '.fig', 'web-scraping.md', 'numpy-pandas.md', 'linear-algebra.rmd', 'descriptive-analysis.rmd', 'descriptive-analysis.md', 'descriptive-statistics.md', 'regularization.rmd', 'svm.rmd~', 'keras-color-spiral.py', 'Makefile', 'unsupervised-learning.md', '_bookdown.yml', 'linear-regression.md', 'figs', 'web-scraping.rmd', 'solutions.rmd', 'predictions.md', 'neural-nets.rmd', 'predictions.rmd', 'figure', 'linear-algebra.md', 'files', 'unsupervised-learning.rmd', 'cleaning-data.md', 'keras-cats-vs-dogs.py', 'titanic-tree.png', 'logistic-regression.md', 'neural-nets.md', 'python.md', 'index.md', 'ml-workflow.rmd', 'build', 'text.md', 'trees-forests.md', '.cache', 'svm.md~', 'python.rmd', 'cleaning-data.rmd', 'datasets.md', 'descriptive-statistics.rmd', '_output.yml', 'datasets.rmd~', 'plotting.md', '_bookdown_files', 'overfitting-validation.md', 'overfitting-validation.rmd']

A.2 Numpy and Pandas

A.2.1 Numpy

A.2.1.1 Solution: concatenate arrays

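A minimal sketch that produces the result below, assuming the task is to stack a row of -1-s, a row of 0-s, and a row of 2-s (the variable names are hypothetical):

import numpy as np
m1 = -np.ones(4)   # row of -1-s
z = np.zeros(4)    # row of 0-s
m2 = 2*np.ones(4)  # row of 2-s
np.row_stack((m1, z, m2))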
## array([[-1., -1., -1., -1.],
##        [ 0.,  0.,  0.,  0.],
##        [ 2.,  2.,  2.,  2.]])

Obviously, you can use np.ones and np.zeros directly in np.row_stack and skip creating the temporary variables.

A.2.1.2 Solution: create matrix of even numbers

2 + 2*np.arange(20).reshape(4,5)
## array([[ 2,  4,  6,  8, 10],
##        [12, 14, 16, 18, 20],
##        [22, 24, 26, 28, 30],
##        [32, 34, 36, 38, 40]])

A.2.1.3 Solution: create matrix, play with columns

a = 10 + np.arange(20).reshape(4,5)*2
print(a[:,2], '\n')
## [14 24 34 44]
a[3] = 1 + np.arange(5)
a
## array([[10, 12, 14, 16, 18],
##        [20, 22, 24, 26, 28],
##        [30, 32, 34, 36, 38],
##        [ 1,  2,  3,  4,  5]])

A.2.1.4 Solution: Logical indexing

names = np.array(["Roxana", "Statira", "Roxana", "Statira", "Roxana"])
scores = np.array([126, 115, 130, 141, 132])

scores[scores < 130]  # extract < 130
## array([126, 115])
scores[names == "Statira"]  # extract by Statira
## array([115, 141])
scores[names == "Roxana"] = scores[names == "Roxana"] + 10  # add 10
scores
## array([136, 115, 140, 141, 142])

A.2.1.5 Solution: create a sequence of -1, 1

We can just multiply the (0, 1) sequence by 2 and subtract 1:

2*np.random.binomial(1, 0.5, size=10) - 1
## array([ 1,  1, -1,  1,  1, -1,  1, -1,  1,  1])
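
Alternatively, one can sample directly from the two values (a sketch):

np.random.choice([-1, 1], size=10)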

A.2.2 Pandas

A.2.2.1 Solution: series of capital cities

cities = pd.Series(["Brazzaville", "Libreville", "Malabo", "Yaoundé"],
                   index=["Congo", "Gabon", "Equatorial Guinea", "Cameroon"])
cities
## Congo                Brazzaville
## Gabon                 Libreville
## Equatorial Guinea         Malabo
## Cameroon                 Yaoundé
## dtype: object

A.2.2.2 Solution: extract capital cities

cities.iloc[0]
## 'Brazzaville'
cities.iloc[2]
## 'Malabo'
cities[["Gabon"]]
## Gabon    Libreville
## dtype: object

A.2.2.3 Solution: create city dataframe

To keep code clean, we create data as a separate dict, and index as a list. Thereafter we make a dataframe out of these two:

data = {"capital":["Brazzaville", "Libreville", "Malabo", "Yaoundé"],
        "population":[1696, 703, 297, 2765]}
countries = ["Congo", "Gabon", "Equatorial Guinea", "Cameroon"]
cityDF = pd.DataFrame(data, index=countries)
cityDF
##                        capital  population
## Congo              Brazzaville        1696
## Gabon               Libreville         703
## Equatorial Guinea       Malabo         297
## Cameroon               Yaoundé        2765

A.2.2.4 Solution: Variables in G.W.Bush approval data frame

Pandas prints five columns: the index, plus the four variables date, approve, disapprove, and dontknow.

A.2.2.5 Show files in current/parent folder

Remember that the parent folder is denoted by two dots (..). The current folder is a single dot (.), but usually it is not needed.

Current working directory:

import os
os.getcwd()
## '/home/siim/tyyq/lecturenotes/machinelearning-py'

Files in the current folder

os.listdir()  # or os.listdir(".")
## ['text.rmd', 'logistic-regression.rmd', 'linear-regression.rmd', 'ml-workflow.md', 'index.rmd', 'images.rmd', 'regularization.md', 'trees-forests.rmd', 'svm.md', 'color-spiral-keras.png', 'svm.rmd', 'images.md', 'numpy-pandas.rmd', 'solutions.md', 'plotting.rmd', 'datasets.rmd', 'machinelearning-py.rds', '.fig', 'web-scraping.md', 'numpy-pandas.md', 'linear-algebra.rmd', 'descriptive-analysis.rmd', 'descriptive-analysis.md', 'descriptive-statistics.md', 'regularization.rmd', 'svm.rmd~', 'keras-color-spiral.py', 'Makefile', 'unsupervised-learning.md', '_bookdown.yml', 'linear-regression.md', 'figs', 'web-scraping.rmd', 'solutions.rmd', 'predictions.md', 'neural-nets.rmd', 'predictions.rmd', 'figure', 'linear-algebra.md', 'files', 'unsupervised-learning.rmd', 'cleaning-data.md', 'keras-cats-vs-dogs.py', 'titanic-tree.png', 'logistic-regression.md', 'neural-nets.md', 'python.md', 'index.md', 'ml-workflow.rmd', 'build', 'text.md', 'trees-forests.md', '.cache', 'svm.md~', 'python.rmd', 'cleaning-data.rmd', 'datasets.md', 'descriptive-statistics.rmd', '_output.yml', 'datasets.rmd~', 'plotting.md', '_bookdown_files', 'overfitting-validation.md', 'overfitting-validation.rmd']

Files in the parent folder

os.listdir("..")
## ['machineLearning.rip', 'literature', '.RData', 'machineLearning.mtc19', 'machineLearning.mtc8', 'machineLearning.mtc16', 'machineLearning.mtc9', 'machineLearning.exc', 'machineLearning.chs', 'machinelearning-common', '.gitignore', 'datascience-intro', 'machineLearning.mtc12', '.Rhistory', 'machineLearning.mtc18', 'machineLearning.mtc20', 'machineLearning.dvi', 'machineLearning.mtc2', 'README.md', 'machineLearning.mtc15', 'machineLearning.mtc13', 'scripts', 'machinelearning-R', '.fig', 'machinelearning-R.html', 'machineLearning.tex', 'kbordermatrix.sty', 'test', '.git', 'lecturenotes.bib.bak', 'machineLearning.exm', 'videos', 'machineLearning.rnw', 'auto', 'machineLearning.mtc11', 'machineLearning.mtc22', 'machineLearning.mtc5', '.Rprofile', 'machineLearning.mtc', 'machineLearning.mtc1', 'machinelearning-py', 'Makefile', 'machineLearning.mtc17', 'bitbucket-pipelines.yml', 'figs', 'machineLearning.mtc0', 'machineLearning.xdv', 'machineLearning.mtc21', 'files', 'machineLearning.mtc6', 'isomath.tex', 'machineLearning.mtc23', 'machineLearning.mtc14', 'machineLearning.bbl', 'machineLearning.mtc7', 'lecturenotes.bib', 'machineLearning.mtc3', '_fig', 'data', '.cache', 'latexmkrc', 'machineLearning.mtc10', 'machineLearning.maf', 'machineLearning.mtc4', 'img', 'machineLearning.pdf', 'material']

Obviously, everyone has different files in these folders.

A.2.2.6 Solution: presidents approval 88%

approval = pd.read_csv("../data/gwbush-approval.csv", sep="\t", nrows=10) 
approval[approval.approve >= 88][["date", "approve"]]  # only print date and approval rate
##              date  approve
## 5  2001 Oct 19-21       88
## 6  2001 Oct 11-14       89
## 8  2001 Sep 21-22       90
approval[approval.approve >= 88].shape[0]  # number of rows with at least 88% approval
## 3

A.2.2.7 Solution: change index, convert to variable

The original data frame was created as

capitals = pd.DataFrame(
        {"capital":["Kuala Lumpur", "Jakarta", "Phnom Penh"],
         "population":[32.7, 267.7, 15.3]},  # in millions
        index=["MY", "ID", "KH"])

We can modify it as

capitals.index = ["Malaysia", "Indonesia", "Cambodia"]
capitals  # now the index is country names
##                 capital  population
## Malaysia   Kuala Lumpur        32.7
## Indonesia       Jakarta       267.7
## Cambodia     Phnom Penh        15.3
## DataFrame.reset_index() has no 'name' argument (that is a Series method),
## so we rename the new column afterwards:
capitals = capitals.reset_index().rename(columns={"index": "country"})
capitals  # the former index is now an ordinary column 'country'
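
In newer pandas versions (1.5 and later), DataFrame.reset_index() also accepts a names argument, so capitals.reset_index(names="country") should achieve the same in a single call.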

A.2.2.8 Solution: create city dataframe

cityM = np.array([[3.11, 5282, 19800],
              [18.9, 306, 46997],
              [4.497, 1886, 22000]])
names = ["Chittagong", "Dhaka", "Kolkata"]
vars = ["population", "area", "density"]
cityDF = pd.DataFrame(cityM, index=names,
                      columns=vars)
cityDF
##             population    area  density
## Chittagong       3.110  5282.0  19800.0
## Dhaka           18.900   306.0  46997.0
## Kolkata          4.497  1886.0  22000.0

A.2.2.9 Solution: extract city data

# density:
cityM[:,2]
## array([19800., 46997., 22000.])
cityDF.density
## Chittagong    19800.0
## Dhaka         46997.0
## Kolkata       22000.0
## Name: density, dtype: float64
# third city
cityM[2,:]
## array([4.497e+00, 1.886e+03, 2.200e+04])
cityDF.loc["Kolkata",:]
## population        4.497
## area           1886.000
## density       22000.000
## Name: Kolkata, dtype: float64
cityDF.iloc[2,:]
## population        4.497
## area           1886.000
## density       22000.000
## Name: Kolkata, dtype: float64
# second city area
cityM[1,1]
## 306.0
cityDF.loc["Kolkata","area"]
## 1886.0
cityDF.iloc[1,1]
## 306.0

A.2.2.10 Solution: Titanic line 1000

Extract the 1000th row (note: index 999!) as a data frame, and thereafter extract name, survival status, and age. We extract in two steps, as plain bracket indexing cannot combine a row number with column names.

titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic.loc[[999]][["name", "survived", "age"]]
##                                   name  survived  age
## 999  McCarthy, Miss. Catherine "Katie"         1  NaN

A.2.2.11 Solution: Titanic male/female age distribution

The two relevant variables are clearly sex and age. If you have already loaded the data, you may just preserve these variables:

titanic[["sex", "age"]].sample(4)
##       sex   age
## 933  male  26.0
## 237  male   NaN
## 52   male  28.0
## 505  male  48.0

Alternatively, you may also just read these two columns only:

pd.read_csv("../data/titanic.csv.bz2",
            usecols=["sex", "age"]).sample(4)
##        sex   age
## 1253  male   NaN
## 1068  male  61.0
## 142   male  46.0
## 1245  male   NaN

A.3 Descriptive analysis with Pandas

A.3.1 What are the values?

A.3.1.1 Solution: which methods can be applied to the whole data frame

There is no better way than just to try:

titanic.min()  # works
## pclass                        1
## survived                      0
## name        Abbing, Mr. Anthony
## sex                      female
## age                      0.1667
## sibsp                         0
## parch                         0
## ticket                   110152
## fare                        0.0
## body                        1.0
## dtype: object
## 
## <string>:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
titanic.mean()  # works
## pclass        2.294882
## survived      0.381971
## age          29.881135
## sibsp         0.498854
## parch         0.385027
## fare         33.295479
## body        160.809917
## dtype: float64
titanic.unique()  # does not work
## Error: AttributeError: 'DataFrame' object has no attribute 'unique'
titanic.nunique()  # works again
## pclass          3
## survived        2
## name         1307
## sex             2
## age            98
## sibsp           7
## parch           8
## ticket        929
## fare          281
## cabin         186
## embarked        3
## boat           27
## body          121
## home.dest     369
## dtype: int64

The mathematical functions work, but may skip the non-numeric variables. .min, however, finds the alphabetically first value for string variables. It is not immediately clear why .unique does not work while .nunique does. It may be because each variable has a different number of unique values, so the result would not fit into a single data frame column; but one could still return the values in a more complex data structure.
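
For instance, if we do want the unique values themselves, we can collect them into a dict ourselves (a sketch, shown for the first three columns only):

{col: titanic[col].unique() for col in titanic.columns[:3]}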

A.4 Cleaning and Manipulating Data

A.4.1 Missing Observations

A.4.1.1 Solution: Missings in fare in Titanic Data

We can count the NaN-s as

titanic.fare.isna().sum()
## 1

There is just a single missing value. As for unreasonable values, we should first check the variable's data type:

titanic.fare.dtype
## dtype('float64')

We see it is coded as numeric (64-bit float), so we can query its range:

titanic.fare.min(), titanic.fare.max()
## (0.0, 512.3292)

While 512 pounds seems a reasonable ticket price, a minimum of 0 is suspicious. We do not know whether some passengers really did not pay a fare; more likely, it just means that the data collectors did not have the information. So it is just a missing value, coded as 0.
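
If we decide to treat the zero fares as missing, we can recode them and count again (a sketch):

fare = titanic.fare.replace(0, np.nan)  # recode 0 as missing
fare.isna().sum()  # counts both the original NaN and the former zeros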

A.4.2 Converting Variables

A.4.2.1 Solution: convert males’ dataset school to categories

Load the males dataset. It is instructive to test the code first on a sequence of years of schooling before converting the actual data. In particular, we want to ensure that we get the boundaries right: “12” should be “HS” while “13” should be “Some college”, and so on.

We choose to place the interval boundaries at integer values, e.g. “HS” is the interval \([12, 13)\). In order to ensure that “12” belongs to this interval while “13” does not, we pass right=False, i.e. exclude the right boundary from each interval (so it falls into the interval above):

# test on years of schooling 10-17
school = np.arange(10,18)
# convert to categories
categories = pd.cut(school,
                    bins=[-np.Inf, 12, 13, 16, np.Inf],
                    labels=["Less than HS", "HS", "Some college", "College"],
                    right=False)
# print in a way that years and categories are next to each other
pd.Series(categories, index=school)
## 10    Less than HS
## 11    Less than HS
## 12              HS
## 13    Some college
## 14    Some college
## 15    Some college
## 16         College
## 17         College
## dtype: category
## Categories (4, object): ['Less than HS' < 'HS' < 'Some college' < 'College']

Now we see it works correctly, e.g. 11 years of schooling is “Less than HS” while 12 years is “HS”. Now we can do the actual conversion:

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
pd.cut(males.school,
       bins=[-np.Inf, 12, 13, 16, np.Inf],
       labels=["Less than HS", "HS", "Some college", "College"],
       right=False)
## 0       Some college
## 1       Some college
## 2       Some college
## 3       Some college
## 4       Some college
##             ...     
## 4355    Less than HS
## 4356    Less than HS
## 4357    Less than HS
## 4358    Less than HS
## 4359    Less than HS
## Name: school, Length: 4360, dtype: category
## Categories (4, object): ['Less than HS' < 'HS' < 'Some college' < 'College']

A.4.2.2 Solution: Convert Males’ dataset residence to dummies

The rest of the task pretty much repeats the examples in Converting categorical variables to dummies, except that you have to use the prefix_sep argument to remove the underscore between the prefix “R” and the category name. The code might look like

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
residence = pd.get_dummies(males.residence, prefix="R", prefix_sep="")
residence.drop("Rsouth", axis=1).sample(8)
##       Rnorth_east  Rnothern_central  Rrural_area
## 3531            0                 0            0
## 4188            0                 1            0
## 3570            0                 0            0
## 1291            1                 0            0
## 1499            1                 0            0
## 1525            1                 0            0
## 3031            0                 0            0
## 2997            0                 0            0

A.4.2.3 Solution: Convert Titanic’s age categories, sex, pclass to dummies

titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic = titanic[["age", "sex", "pclass"]]
titanic["age"] = pd.cut(titanic.age,
                        bins=[0, 14, 50, np.inf],
                        labels=["0-13", "14-49", "50-"],
                        right=False)
d = pd.get_dummies(titanic, columns=["age", "sex", "pclass"])
d.sample(7)
##       age_0-13  age_14-49  age_50-  ...  pclass_1  pclass_2  pclass_3
## 423          0          1        0  ...         0         1         0
## 573          0          1        0  ...         0         1         0
## 257          0          1        0  ...         1         0         0
## 942          0          0        0  ...         0         0         1
## 1282         0          0        0  ...         0         0         1
## 572          0          1        0  ...         0         1         0
## 509          0          1        0  ...         0         1         0
## 
## [7 rows x 8 columns]

One may also want to drop one of the dummy levels with the drop_first argument.
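
For instance, the same conversion with the reference categories dropped might look like this:

pd.get_dummies(titanic, columns=["age", "sex", "pclass"],
               drop_first=True).sample(3)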

A.4.2.4 Solution: Convert Males’ residence, ethn to dummies and concatenate

We create the dummies separately for residence and ethn and give each a corresponding prefix to make the variable names more descriptive.

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
residence = pd.get_dummies(males.residence, prefix="residence")
## remove the reference category
residence = residence.drop("residence_north_east", axis=1)
residence.sample(4)
##       residence_nothern_central  residence_rural_area  residence_south
## 1546                          0                     0                0
## 3595                          0                     0                1
## 3596                          0                     0                1
## 777                           0                     0                0
## convert and remove using chaining
ethn = pd.get_dummies(males.ethn, prefix="ethn")\
         .drop("ethn_other", axis=1)
ethn.sample(4)
##       ethn_black  ethn_hisp
## 24             0          0
## 3271           0          1
## 3346           1          0
## 3527           1          0
## combine these variables next to each other
d = pd.concat((males.wage, residence, ethn), axis=1)
d.sample(7)
##           wage  residence_nothern_central  ...  ethn_black  ethn_hisp
## 1209  1.295189                          0  ...           0          0
## 2210  1.064160                          0  ...           0          0
## 3913  1.485029                          0  ...           0          1
## 4241  1.285396                          0  ...           0          1
## 1895  1.332716                          1  ...           0          0
## 2888  1.379507                          0  ...           1          0
## 201   1.661706                          1  ...           0          0
## 
## [7 rows x 6 columns]

A.5 Descriptive Statistics

A.5.1 Inequality

A.5.1.1 Solution: 80-20 ratio of income

First load the data and compute total income:

treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
income = treatment.re78  # extract income for simplicity
total = income.sum()  # total income

Next, we can just start trying with the uppermost 1% (lowermost 99):

pct = 99
threshold = np.percentile(treatment.re78, pct)

The income share of the richest 1% is

share = income[income > threshold].sum()/total
share
## 0.04234557467220405

This is much less than 99%, so we should try a smaller percentile. Note that the richest 1% earn only about 4% of total income. For the top 2% we have

pct = 98
threshold = np.percentile(treatment.re78, pct)
income[income > threshold].sum()/total
## 0.07508822500042818

Again, this is less than 98%. Instead of manually trying to find the correct value, we can do a loop:

for pct in range(99, 50, -1):
    threshold = np.percentile(treatment.re78, pct)
    share = income[income > threshold].sum()/total
    if 100*share > pct:
        print(pct, share)
        break
## 63 0.6474562618876845

So in this data, the richest 37% earn approximately 65% of total income. If we want a more precise ratio, we can write another loop that goes down to fractions of a percent.
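
For instance, a finer search over fractions of a percent might look like this (same logic, only the step size changes):

for pct in np.arange(99, 50, -0.1):
    threshold = np.percentile(income, pct)
    share = income[income > threshold].sum()/total
    if 100*share > pct:
        print(round(pct, 1), share)
        break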

A.6 Linear Algebra

A.6.1 Numpy Arrays as Vectors and Matrices

A.6.1.1 Solution: Berlin-Germany-France-Paris

berlin = np.array([-0.562, 0.630, -0.453, -0.299, -0.006])
germany = np.array([0.194, 0.507, 0.287,  0.132, -0.281])
france = np.array([0.605, -0.678, -0.436, -0.019, -0.291])
paris = np.array([-0.074, -0.855, -0.689, -0.057, -0.139])
v = berlin - germany + france
v
## array([-0.151, -0.555, -1.176, -0.45 , -0.016])
v - paris
## array([-0.077,  0.3  , -0.487, -0.393,  0.123])

Just eyeballing the result, it does not look very close to paris. However, when doing the exercise with all 100 components, Paris indeed turns out to be the most similar word to Berlin - Germany + France.
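
Instead of eyeballing, one can also compute the distance and the cosine similarity between the result and paris (a sketch; with only five components the numbers are not very meaningful):

np.sqrt(np.sum((v - paris)**2))  # Euclidean distance to paris
np.sum(v*paris)/(np.linalg.norm(v)*np.linalg.norm(paris))  # cosine similarity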

A.6.1.2 Solution: create and multiply row vectors

There are several ways to create the row vectors; here are a few different approaches:

x1 = (1 + np.arange(5)).reshape((5,1))  # use 1 + arange().reshape
x2 = np.array([[-1, 2, -3, 4, -5]]).T  # create manually
x3 = np.arange(5,0,-1).reshape((5,1))  # use arange(arguments).reshape
# Stack all three row vectors into a matrix
X = np.row_stack((x1.T, x2.T, x3.T))
beta = 0.1*np.ones((5,1))  # use np.ones()
# and now compute
x1.T @ beta
## array([[1.5]])
x2.T @ beta
## array([[-0.3]])
x3.T @ beta
## array([[1.5]])
X @ beta
## array([[ 1.5],
##        [-0.3],
##        [ 1.5]])

Note that when multiplying with \(\mathsf{X}\), which is all three row vectors stacked underneath each other, we get all three answers stacked underneath each other.

A.7 Linear Regression

A.7.0.1 Solution: Compute \(MSE\) and plot regression line

The solution is to combine the \(MSE\) computation with plotting the results. One can do it like this:

iris = pd.read_csv("../data/iris.csv.bz2", sep="\t")
versicolor = iris[iris.Species == "versicolor"].rename(
    columns={"Petal.Length": "pl", "Petal.Width":"pw"})
def msePlot(beta0, beta1):
    hat_w = beta0 + beta1*versicolor.pl  # note: vectorized operators!
    e = versicolor.pw - hat_w
    mse = np.mean(e**2)
    plt.scatter(versicolor.pl, versicolor.pw, edgecolor="k")
    length = np.linspace(versicolor.pl.min(), versicolor.pl.max(), 11)
    hatw = beta0 + beta1*length
    plt.plot(length, hatw, c='r')
    plt.xlabel("Petal length")
    plt.ylabel("Petal width")
    plt.show()
    return mse

Next, we can attempt to follow a coordinate-descent approach: keep one parameter constant while minimizing MSE over the other, and thereafter keep the other fixed while modifying the first:

msePlot(-0.5, 0.5)  # start with the old value as benchmark
## 0.11320000000000001

(figure: versicolor petal length vs. petal width with the candidate regression line)

msePlot(-0.5, 0.6)  # did we improve or not?
## 0.5631599999999998

(figure: versicolor petal length vs. petal width with the candidate regression line)

msePlot(-0.5, 0.4)  # 0.6 was worse, so try this instead
# now modify beta0
## 0.03051999999999999

(figure: versicolor petal length vs. petal width with the candidate regression line)

msePlot(-0.4, 0.4)  # feel like we should lift the line a bit
# Now we are pretty happy with the results
## 0.016119999999999995

(figure: versicolor petal length vs. petal width with the candidate regression line)

A.7.1 Linear Regression in python: statsmodels.formula.api and sklearn

A.7.1.1 Solution: Explain virginica sepal width using linear regression with sklearn

from sklearn.linear_model import LinearRegression
## /usr/lib/python3/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.1
##   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
iris = pd.read_csv("../data/iris.csv.bz2", sep="\t")
virginica = iris[iris.Species == "virginica"]
## extract explanatory variables as X
X = virginica[["Sepal.Length", "Petal.Width", "Petal.Length"]]
## call the dependent variable 'y' for consistency
y = virginica[["Sepal.Width"]]
m = LinearRegression()
_ = m.fit(X, y)
## compute R2
m.score(X, y)
## 0.39434712368959646

A.7.1.2 Solution: Male wages as function of schooling and ethnicity

from sklearn.linear_model import LinearRegression

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
## extract explanatory variables as X
X = males[["school", "ethn"]]
X.head(2)  # to see what it is
##    school   ethn
## 0      14  other
## 1      14  other

For a check, let’s see what the possible values of ethn are:

males.ethn.unique()
## array(['other', 'black', 'hisp'], dtype=object)

We have three different values. Hence we should have three coefficients as well: one for school and two for ethn, as we drop one category as the reference.

Now we convert \(\mathsf{X}\) to dummies and fit the model

X = pd.get_dummies(X, drop_first=True)
## For quick check
X.sample(2)
##       school  ethn_hisp  ethn_other
## 892       11          0           1
## 3257       8          1           0
## call the dependent variable 'y' for consistency
y = males[["wage"]]
m = LinearRegression()
_ = m.fit(X, y)

Finally, check how many coefficients we have:

m.coef_
## array([[0.07709427, 0.1471867 , 0.12256369]])

Indeed, we have three coefficients. Compute \(R^2\):

m.score(X, y)
## 0.06965044236285678

A.7.2 Model Goodness

A.7.2.1 Solution: downward sloping, and well-aligned point clouds

Create downward trending point cloud: larger \(x\) must correspond to smaller \(y\), hence \(\beta_1\) should be negative. Choose \(\beta_1 = -1\):

n = 100
beta0, beta1 = 1, -1  # choose beta1 negative
x = np.random.normal(size=n) 
eps = np.random.normal(size=n)
y = beta0 + beta1*x + eps 
_ = plt.scatter(x, y, edgecolor="k")
_ = plt.xlabel("x")
_ = plt.ylabel("y")
_ = plt.show()

(figure: downward-sloping point cloud)

The scale argument defines the standard deviation of the random values; the default value is 1. As we add \(\epsilon\) when computing \(y\), it determines how far the \(y\) values are off from perfect alignment:

n = 100
beta0, beta1 = 1, -1
x = np.random.normal(size=n) 
eps = np.random.normal(scale=0.1, size=n)  # make eps small
y = beta0 + beta1*x + eps 
_ = plt.scatter(x, y, edgecolor="k")
_ = plt.xlabel("x")
_ = plt.ylabel("y")
_ = plt.show()

(figure: downward-sloping, tightly aligned point cloud)

A.7.2.2 Solution: Manually compute \(R^2\) for regression

import statsmodels.formula.api as smf
# find TSS
TSS = np.sum((versicolor.pw - np.mean(versicolor.pw))**2)
# find predicted values
m = smf.ols("pw ~ pl", data=versicolor).fit()
beta0, beta1 = m.params[0], m.params[1]
pwhat = beta0 + beta1*versicolor.pl
# find ESS
ESS = np.sum((versicolor.pw - pwhat)**2)
R2 = 1 - ESS/TSS
R2
## 0.6188466815001422

A.7.2.3 Solution: Create a large and a small \(R^2\) value

Small variance (scale parameter) of \(\epsilon\) results in well-aligned datapoints and hence high \(R^2\). E.g.

n = 100
beta0, beta1 = 1, -1  # choose beta1 negative
x = np.random.normal(size=n) 
eps = np.random.normal(scale=0.05, size=n)
y = beta0 + beta1*x + eps 
plt.scatter(x, y, edgecolor="k")
## <matplotlib.collections.PathCollection object at 0x71fd32d1d2a0>
plt.xlabel("x")
## Text(0.5, 0, 'x')
plt.ylabel("y")
## Text(0, 0.5, 'y')
plt.show()

(figure: tightly aligned point cloud)

randomDF = pd.DataFrame({"x":x, "y":y})
m = smf.ols("y ~ x", data=randomDF).fit()  # fit the model
m.rsquared
## 0.9968237337819345

On the contrary, a large variance of the error term makes the points only weakly aligned with the regression line:

eps = np.random.normal(scale=5, size=n)
y = beta0 + beta1*x + eps 
plt.scatter(x, y, edgecolor="k")
## <matplotlib.collections.PathCollection object at 0x71fd32d476a0>
plt.xlabel("x")
## Text(0.5, 0, 'x')
plt.ylabel("y")
## Text(0, 0.5, 'y')
plt.show()

(figure: weakly aligned point cloud)

randomDF = pd.DataFrame({"x":x, "y":y})
m = smf.ols("y ~ x", data=randomDF).fit()  # fit the model
m.rsquared
## 0.0011606762128384407

A.8 Logistic Regression

A.8.1 Logistic Regression in python: statsmodels.formula.api and sklearn

A.8.1.1 Solution: Predict treatment using all variables

First we load data, and convert all non-numeric variables to dummies.

treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
treatment.head(2)
##    treat  age  educ   ethn  married  re74  re75      re78   u74   u75
## 0   True   37    11  black     True   0.0   0.0   9930.05  True  True
## 1   True   30    12  black    False   0.0   0.0  24909.50  True  True
y = treatment.treat
## There are categoricals: convert these to dummies
X = pd.get_dummies(treatment.drop("treat", axis=1), drop_first=True)
## get a glimpse of X--does it look good?
X.sample(3)
##      age  educ  married     re74  ...    u74    u75  ethn_hispanic  ethn_other
## 294   24    12     True  13714.9  ...  False  False              0           0
## 162   18     8     True      0.0  ...   True   True              1           0
## 605   50     8     True   8985.2  ...  False  False              0           0
## 
## [3 rows x 10 columns]

The design matrix looks reasonable. Note that logical values are kept as they are and not transformed to dummies, as logical values are essentially dummies already. Now fit the model:

from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
_ = m.fit(X, y)
m.score(X, y)
## 0.9652336448598131

Indeed, the score is now better than 93%.

A.9 Predictions and Model Goodness

A.9.1 Predicting using linear regression

A.9.1.1 Solution: manually predict distance to galaxies using modern data

hubble = pd.read_csv("../data/hubble.csv", sep="\t")

## fit with modern values
m = smf.ols("Rmodern ~ vModern", data=hubble).fit()
## assign estimated parameter values
beta0, beta1 = m.params

v7331 = hubble.vModern.iloc[17]
v7331  # velocity
## 816.0
beta0 + beta1*v7331  # distance
## 12.205922925181298
vs = np.array([500, 1000, 2000])
beta0 + beta1*vs  # 8 to 27 Mpc
## array([ 8.20010195, 14.53842628, 27.21507493])

A.9.1.2 Solution: predict petal width

from sklearn.linear_model import LinearRegression

iris = pd.read_csv("../data/iris.csv.bz2", sep="\t")
versicolor = iris[iris.Species == "versicolor"]
y = versicolor[["Petal.Width"]]
X = versicolor[["Sepal.Length", "Sepal.Width"]]
# note: length first, width second
## get a glimpse of X--does it look good?
X.sample(3)
##     Sepal.Length  Sepal.Width
## 60           5.0          2.0
## 55           5.7          2.8
## 56           6.3          3.3
## initialize and fit the model
m = LinearRegression()
_ = m.fit(X, y)
## predict
yhat = m.predict(X)
rmse = np.sqrt(np.mean((y - yhat)**2))
rmse
## Petal.Width    0.139161
## dtype: float64

RMSE is rather small, about 0.14 cm. This is because the range of petal width itself is small, only 0.8 cm.
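
We can quickly check the range of petal width:

versicolor["Petal.Width"].max() - versicolor["Petal.Width"].min()  # range of petal width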

In order to predict for a flower of the given size, we can create a matrix with the requested numbers:

newX = np.array([[5, 2]])  # length first, width second
m.predict(newX)
## array([[0.97560275]])

A.9.1.3 Solution: predict hispanic male wage in service sector

As the first step, we should load the data and check how industry is coded:

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
males.industry.unique()
## array(['Business_and_Repair_Service', 'Personal_Service', 'Trade',
##        'Construction', 'Manufacturing', 'Transportation',
##        'Professional_and_Related Service', 'Finance', 'Entertainment',
##        'Public_Administration', 'Agricultural', 'Mining'], dtype=object)

We see that finance is coded just as “Finance”. Now we can follow the example in the corresponding section:

As the second step, we fit the linear regression. We call the original data matrix with categoricals Xc in order to distinguish it from the final design matrix X below:

from sklearn.linear_model import LinearRegression

## Extract relevant variables in original categorical form
## we need it below
Xc = males[["school", "ethn", "industry"]]
y = males.wage
X = pd.get_dummies(Xc, drop_first=True)
m = LinearRegression()
_ = m.fit(X, y)

Now m is a fitted model that we can use for prediction. Next, we design the data frame for prediction using the categories and attach it on top of the original categorical data matrix Xc:

## design prediction
newDf = pd.DataFrame({"school":16, "ethn":"hisp",
                      "industry":"Finance"},
                     index=["c1"])
## merge new on top of original
newDf = pd.concat((newDf, Xc), axis=0, sort=False)
## Does it look good?
newDf.head(3)
##     school   ethn                     industry
## c1      16   hisp                      Finance
## 0       14  other  Business_and_Repair_Service
## 1       14  other             Personal_Service
## create dummies of everything
allX = pd.get_dummies(newDf, drop_first=True)
## do dummies look reasonable?
allX.head(3)
##     school  ethn_hisp  ...  industry_Trade  industry_Transportation
## c1      16          1  ...               0                        0
## 0       14          0  ...               0                        0
## 1       14          0  ...               0                        0
## 
## [3 rows x 14 columns]

Finally, we extract the relevant cases from the extended design matrix and predict:

## extract the topmost part from allX
newX = allX.loc[["c1"]]
newX  # looks good
##     school  ethn_hisp  ...  industry_Trade  industry_Transportation
## c1      16          1  ...               0                        0
## 
## [1 rows x 14 columns]
m.predict(newX)
## array([2.16154055])

Predicted hourly salary is $2.16.

A.9.2 Predicting with Logistic Regression

A.9.2.1 Solution: Predict Titanic survival for two persons using statsmodels

Load data and fit the model:

titanic = pd.read_csv("../data/titanic.csv.bz2")

m = smf.logit("survived ~ sex + C(pclass)", data=titanic).fit()
## Optimization terminated successfully.
##          Current function value: 0.480222
##          Iterations 6

Next, create data for the desired cases. A simple way is to use a dict of lists. We put the man in the first position and the woman in the second position:

## create new data (a dict)
## man in 1st class, woman in 3rd class
newx = {"sex":["male", "female"], "pclass":[1, 3]}

Now we can predict using these data:

phat = m.predict(newx)
phat  # first for the man, second for the woman
## 0    0.399903
## 1    0.595320
## dtype: float64
yhat = phat > 0.5
yhat
## 0    False
## 1     True
## dtype: bool

As the results indicate, a 1st class man has approximately a 40% chance to survive, while a 3rd class woman has 60%. Hence we predict that the man died while the woman survived.

A.9.2.2 Solution: Test different thresholds with statsmodels

Load data and fit the model:

titanic = pd.read_csv("../data/titanic.csv.bz2")

m = smf.logit("survived ~ sex + C(pclass)", data=titanic).fit()
## Optimization terminated successfully.
##          Current function value: 0.480222
##          Iterations 6

Predict the probabilities:

phat = m.predict()

Now try a few different thresholds:

yhat = phat > 0.5
np.mean(yhat == titanic.survived)
## 0.7799847211611918
yhat = phat > 0.8
np.mean(yhat == titanic.survived)
## 0.7203972498090145
yhat = phat > 0.3
np.mean(yhat == titanic.survived)
## 0.7364400305576776

So far, 0.5 seems to be the best option.

For the plot, we can just create an array of thresholds and compute accuracy for each of these in a loop:

thresholds = np.linspace(0, 1, 100)
accuracies = np.empty(100)
for i in range(100):
    yhat = phat > thresholds[i]
    accuracies[i] = np.mean(yhat == titanic.survived)
_ = plt.plot(thresholds, accuracies)
_ = plt.xlabel("Threshold")
_ = plt.ylabel("Accuracy")
_ = plt.show()

(figure: accuracy as a function of the classification threshold)

The plot shows that the best accuracy is achieved at threshold values between 0.6 and 0.7. This is, obviously, data dependent.
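
To find the best threshold exactly, we can query the accuracy array directly (a sketch):

thresholds[np.argmax(accuracies)]  # threshold with the highest accuracy
accuracies.max()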

A.9.2.3 Solution: Row of four zeros

The object is a pandas data frame, and hence the first column (the first zero) is its index. This means we are looking at the first row of data.

The second column corresponds to dummy sex_male. Hence the person in the first row is a female.

The 3rd and 4th columns are dummies, describing the passenger class. Hence together they indicate that the passenger was traveling in the 1st class.

To conclude: the zeros show that the passenger in the first line of data was a woman who was traveling in first class.

A.9.2.4 Solution: Predict treatment status using all variables

First load data and convert all non-numeric variables to dummies:

treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
y = treatment.treat
## drop outcome (treat) and convert all categoricals
## to dummies
X = pd.get_dummies(treatment.drop("treat", axis=1), drop_first=True)
## get a glimpse of X--does it look good?
X.sample(3)
##       age  educ  married     re74  ...    u74    u75  ethn_hispanic  ethn_other
## 2155   37    13     True  45063.1  ...  False  False              0           1
## 1302   23    15     True  16653.8  ...  False  False              0           1
## 1726   27    12     True  25470.5  ...  False  False              0           1
## 
## [3 rows x 10 columns]

The design matrix looks reasonable: in particular, we can see that the logical married is still logical, while ethn with three categories is converted to two dummies (ethn_hispanic and ethn_other). Now fit the model:

from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
_ = m.fit(X, y)

Now we can extract the desired cases and predict values for those only:

newx = X.loc[[0, 9, 99]]  # extract desired rows
m.predict_proba(newx)[:,1]  # note: only output treatment probability
## array([0.36787074, 0.92516082, 0.6355106 ])

As the probabilities suggest, the first case is predicted not to participate, while the two latter cases are predicted to participate.

A.9.3 Confusion Matrix–Based Model Goodness Measures

A.9.3.1 Solution: Titanic Survival Confusion Matrix with Child Variable

We can create child with a simple logical operation:

child = titanic.age < 14

Note that we do not have to include child in the data frame; a variable in the working environment will do. Now we model and predict:

m = smf.logit("survived ~ sex + C(pclass) + child", data=titanic).fit()
## Optimization terminated successfully.
##          Current function value: 0.471565
##          Iterations 6
yhat = m.predict(titanic) > 0.5

The confusion matrices, computed in two ways:

cm1 = pd.crosstab(titanic.survived, yhat)
cm1
## col_0     False  True
## survived             
## 0           682   127
## 1           156   344
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(titanic.survived, yhat)
cm2
## array([[682, 127],
##        [156, 344]])

Indeed, both ways to get the matrix yield the same result.

And finally we extract true negatives:

cm1.iloc[0,0]
## 682
cm2[0,0]
## 682

This is the same number as in the original model in Section 12.3.2. But now we have a slightly larger number of true positives.
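
For completeness, accuracy can be computed directly from the confusion matrix (a sketch):

(cm2[0,0] + cm2[1,1])/cm2.sum()  # (TN + TP) over all cases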

A.9.3.2 Solution: Titanic Survival with Naive Model

We can create a naive model with the formula "survived ~ 1", but let’s do it manually here. We just assign everyone to the more common class:

titanic.survived.value_counts()
## 0    809
## 1    500
## Name: survived, dtype: int64

As one can see, survival 0 is the more common value, hence we predict that everyone died:

yhat = pd.Series(0, index=titanic.index)

In order to better understand the results, let’s also output the confusion matrix:

pd.crosstab(titanic.survived, yhat)
## col_0       0
## survived     
## 0         809
## 1         500

As one can see, pd.crosstab only outputs a single column here, as we do not predict anyone to survive. Essentially the survivor column would be all zeros, but pd.crosstab does not show it (confusion_matrix will show this column).
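
A quick check with confusion_matrix confirms this (a sketch):

from sklearn.metrics import confusion_matrix
confusion_matrix(titanic.survived, yhat)  # the second column (predicted survivors) is all zeros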

We can compute the goodness measures using sklearn.metrics:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy_score(titanic.survived, yhat)
## 0.6180290297937356
precision_score(titanic.survived, yhat)
## 0.0
## 
## /usr/lib/python3/dist-packages/sklearn/metrics/_classification.py:1221: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
##   _warn_prf(average, modifier, msg_start, len(result))
recall_score(titanic.survived, yhat)
## 0.0
f1_score(titanic.survived, yhat)
## 0.0

These results come with warnings: as we do not predict any survivors, precision is undefined, since 0 out of 0 predicted survivors are correct. As the \(F\)-score is (partly) based on precision, we cannot compute that either. Additionally, recall \(R = 0\), as we do not capture any of the actual survivors.
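
As the warning suggests, the zero_division argument lets us define the 0/0 case explicitly (a sketch):

precision_score(titanic.survived, yhat, zero_division=0)  # returns 0.0 without a warning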

A.10 Trees and Forests

A.10.1 Regression Trees

A.10.1.1 Solution:

First load and prepare data:

from sklearn.tree import DecisionTreeRegressor

boston = pd.read_csv("../data/boston.csv.bz2", sep="\t")
## define outcome (medv) and design matrix (X)
y = boston.medv
X = boston[["rm", "age"]]

Now define and fit the model:

m = DecisionTreeRegressor(max_depth=2)
_ = m.fit(X, y)

Compute \(R^2\) on training data:

m.score(X, y)
## 0.6091083830815363

This is a slight improvement: the original \(R^2\), using only rm, was 0.58. Hence the tree found it useful to include age in the decision-making. As a depth-2 tree is a very simple model, we are not really afraid of overfitting here.

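A minimal sketch of how such a tree plot can be produced, assuming the graphviz package is installed (sklearn.tree.plot_tree is a matplotlib-only alternative):

from sklearn.tree import export_graphviz
import graphviz
dot = export_graphviz(m, out_file=None, feature_names=list(X.columns),
                      filled=True, rounded=True)
graphviz.Source(dot).render("boston-tree", format="png")  # writes boston-tree.png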

Depth-2 Boston regression tree using both rm and age.

Here is the tree as seen by graphviz.

Indeed, now the tree considers age in the left node, with newer neighborhoods being somewhat more expensive.

Include all features:

X = boston.drop("medv", axis=1)
_ = m.fit(X, y)
m.score(X, y)
## 0.695574477973027

\(R^2\) improved noticeably. This hints that the tree was able to pick at least one other feature that describes the price better than rm and age.
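
We can check which features the tree actually used via its feature importances (a sketch):

pd.Series(m.feature_importances_, index=X.columns).sort_values(ascending=False).head(3)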


Depth-2 Boston regression tree using all features.

Here is the tree when using all available features. rm is still the first feature to split the data, but the other splitting variable is now lstat, the percentage of lower-status population.

A.11 Natural Language Processing

A.11.1 Categorize text

A.11.1.1 Solution: find poems with TF-IDF

We basically just swap CountVectorizer out for TfidfVectorizer. Everything else will remain the same:

from sklearn.feature_extraction.text import TfidfVectorizer
vrizer = TfidfVectorizer()
X = vrizer.fit_transform([rudaki1, rudaki2, phu1, phu2])
newX = vrizer.transform([newdoc])

from sklearn.neighbors import NearestNeighbors
m = NearestNeighbors(n_neighbors=1, metric="cosine")
_ = m.fit(X)
d = m.kneighbors(newX, n_neighbors=4)
d
## (array([[0.44494572, 0.64788098, 0.78482153, 0.89982691]]), array([[2, 3, 1, 0]]))

The closest document by cosine distance is document 2 (the third document, as indexing starts from 0), at distance 0.445.


  1. To be more precise, they are not the richest, but those with the largest income in this data.