A Solutions

A.1 Python

A.1.1 Operators

A.1.1.1 Going out with friends

friends = 7
budget = 150
print("I am going out with", friends, "friends")
## I am going out with 7 friends
mealprice = 14
price = mealprice*(friends + 1)
total = price*1.15
print("total price:", total)
## total price: 128.79999999999998
if total <= budget:
    print("can afford")
else:
    print("cannot afford")
## can afford

A.1.2 Strings

A.1.2.1 Combine strings

h = "5'" + '3"'
print("height is", h)
## height is 5'3"

A.1.3 Collections

A.1.3.1 Extract sublists

l = list(range(1, 13))
for i in range(3):
    print(l[i::3])        
## [1, 4, 7, 10]
## [2, 5, 8, 11]
## [3, 6, 9, 12]

A.1.3.2 Combining lists

friends = ['Paulina', 'Severo']
others = []
others.append("Lai Ming")
others.append("Lynn")
people = friends + others
print("all people", people)
## all people ['Paulina', 'Severo', 'Lai Ming', 'Lynn']

A.1.3.3 Assign people to seats

Consider two lists:

names = ["Adam", "Ashin", "Inukai", "Tanaka", "Ikki"]
seats = [33, 12, 45, 2, 17]
assigneds = []
for i in range(len(names)):
    a = names[i] + ": " + str(seats[i])
    assigneds.append(a)
assigneds    
## ['Adam: 33', 'Ashin: 12', 'Inukai: 45', 'Tanaka: 2', 'Ikki: 17']
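
An alternative (not in the original solution) pairs the lists with zip inside a list comprehension:

[name + ": " + str(seat) for name, seat in zip(names, seats)]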

A.1.3.4 List comprehension

Create pizzas with different toppings:

[1 + i for i in range(10)]
## [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toppings = ['mushrooms', 'mozarella', 'ananas']
['pizza with ' + t for t in toppings]
## ['pizza with mushrooms', 'pizza with mozarella', 'pizza with ananas']

A.1.3.5 Dict mapping coordinates to cities

placenames = {(31.228611, 121.474722): "Shanghai",
              (23.763889, 90.388889): "Dhaka",
              (13.7525, 100.494167): "Bangkok"
          }
print(placenames)
## {(31.228611, 121.474722): 'Shanghai', (23.763889, 90.388889): 'Dhaka', (13.7525, 100.494167): 'Bangkok'}

Note that such lookup is usually not what we want because it only returns the correct city name if given the exact coordinates. If coordinates are just a tiny bit off, the dict cannot find the city. If you consider cities to be continuous areas, you should do geographic lookup, not dictionary lookup.
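
A minimal sketch of such a proximity lookup, under the simplifying assumption that plain Euclidean distance in coordinate degrees is good enough (the helper function is hypothetical, not part of the original):

import math
def nearest_city(lat, lon, places):
    # brute-force: find the coordinate key closest to (lat, lon)
    key = min(places, key=lambda c: math.hypot(c[0] - lat, c[1] - lon))
    return places[key]

nearest_city(31.23, 121.47, placenames)  # 'Shanghai', despite slightly-off coordinates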

A.1.3.6 Dict of dicts

tsinghua = {"house":30, "street":"Shuangqing Rd",
            "district":"Haidian", "city":"Beijing",
            "country":"CN"}
kenyatta = {"pobox":"43844-00100", "city":"Nairobi",
            "country":"KE"}
places = {"Tsinghua":tsinghua, "Kenyatta":kenyatta}
places["Teski Refugio"] = {"house":256,
                   "street":"Volcan Osorno",
                   "city":"La Ensenada",
                   "region":"Los Lagos",
                   "country":"CL"}
print(places)
## {'Tsinghua': {'house': 30, 'street': 'Shuangqing Rd', 'district': 'Haidian', 'city': 'Beijing', 'country': 'CN'}, 'Kenyatta': {'pobox': '43844-00100', 'city': 'Nairobi', 'country': 'KE'}, 'Teski Refugio': {'house': 256, 'street': 'Volcan Osorno', 'city': 'La Ensenada', 'region': 'Los Lagos', 'country': 'CL'}}

A.1.3.7 Find the total bill

payments = {"jan":1200, "feb":1200, "mar":1400, "apr":1200}
total = 0
for month in payments.keys():
    total += payments[month]
print("Total rent:", total)    
## Total rent: 5000
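
Equivalently (a shortcut not used in the original), one can sum the dict values directly:

print("Total rent:", sum(payments.values()))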

A.1.3.8 Sets: unique names

kings = ["Jun", "Gang", "An", "jun", "HYE", "JUN", "hyo", "yang", "WON", "WON", "Yang"]
Kings = [n.title() for n in kings]
s = set(Kings)
print(s)
## {'Hye', 'An', 'Won', 'Jun', 'Hyo', 'Yang', 'Gang'}
print(len(s), "unique kings")
## 7 unique kings

A.1.4 Language Constructs

A.1.4.1 Solution: odd or even

for n in range(1,11):
    parity = n % 2
    if parity == 1:
        print(n, "odd")
    else:
        print(n, "even")
## 1 odd
## 2 even
## 3 odd
## 4 even
## 5 odd
## 6 even
## 7 odd
## 8 even
## 9 odd
## 10 even

A.1.4.2 Solution: join names

.join() is a string method: it must be called on the separator string, with the list of names as its argument:

names = ["viola glabella", "monothropa hypopithys", "lomatium utriculatum"]
", ".join(names)
## 'viola glabella, monothropa hypopithys, lomatium utriculatum'

Note that the syntax is separator.join(names), not names.join(separator).
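
Indeed, trying it the other way around fails, because lists have no .join method:

# names.join(", ")  # AttributeError: 'list' object has no attribute 'join'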

A.1.5 Modules

A.1.5.1 Solution: access your file system

Below is the solution importing the whole library. You can also use, e.g., from os import getcwd, listdir, or other approaches.

import os
os.getcwd()
## '/home/siim/tyyq/lecturenotes/machinelearning-py'
os.listdir()
## ['text.rmd', 'logistic-regression.rmd', 'linear-regression.rmd', 'index.rmd', 'images.rmd', 'trees-forests.rmd', 'color-spiral-keras.png', 'svm.rmd', 'images.md', 'numpy-pandas.rmd', 'plotting.rmd', 'datasets.rmd', 'machinelearning-py.rds', '.fig', 'numpy-pandas.md', 'linear-algebra.rmd', 'descriptive-analysis.rmd', 'descriptive-analysis.md', 'descriptive-statistics.md', 'regularization.rmd', 'svm.rmd~', 'keras-color-spiral.py', 'Makefile', '_bookdown.yml', 'figs', 'web-scraping.rmd', 'solutions.rmd', 'neural-nets.rmd', 'predictions.rmd', 'figure', 'linear-algebra.md', 'files', 'unsupervised-learning.rmd', 'cleaning-data.md', 'keras-cats-vs-dogs.py', 'titanic-tree.png', 'logistic-regression.md', 'python.md', 'index.md', 'ml-workflow.rmd', 'build', '.cache', 'svm.md~', 'python.rmd', 'cleaning-data.rmd', 'datasets.md', 'descriptive-statistics.rmd', '_output.yml', 'datasets.rmd~', '_bookdown_files', 'overfitting-validation.md', 'overfitting-validation.rmd']

A.2 Numpy and Pandas

A.2.1 Numpy

A.2.1.1 Solution: concatenate arrays

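The exercise code is not reproduced here; a minimal sketch that produces the output below, building the rows with np.ones and np.zeros and stacking them with np.row_stack:

a = -np.ones(4)   # row of -1-s
b = np.zeros(4)   # row of zeros
c = 2*np.ones(4)  # row of 2-s
np.row_stack((a, b, c))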
## array([[-1., -1., -1., -1.],
##        [ 0.,  0.,  0.,  0.],
##        [ 2.,  2.,  2.,  2.]])

Obviously, you can use np.ones and np.zeros directly in np.row_stack and skip creating the temporary variables.

A.2.1.2 Solution: create matrix of even numbers

2 + 2*np.arange(20).reshape(4,5)
## array([[ 2,  4,  6,  8, 10],
##        [12, 14, 16, 18, 20],
##        [22, 24, 26, 28, 30],
##        [32, 34, 36, 38, 40]])

A.2.1.3 Solution: create matrix, play with columns

a = 10 + np.arange(20).reshape(4,5)*2
print(a[:,2], '\n')
## [14 24 34 44]
a[3] = 1 + np.arange(5)
a
## array([[10, 12, 14, 16, 18],
##        [20, 22, 24, 26, 28],
##        [30, 32, 34, 36, 38],
##        [ 1,  2,  3,  4,  5]])

A.2.1.4 Solution: Logical indexing

names = np.array(["Roxana", "Statira", "Roxana", "Statira", "Roxana"])
scores = np.array([126, 115, 130, 141, 132])

scores[scores < 130]  # extract < 130
## array([126, 115])
scores[names == "Statira"]  # extract by Statira
## array([115, 141])
scores[names == "Roxana"] = scores[names == "Roxana"] + 10  # add 10
scores
## array([136, 115, 140, 141, 142])

A.2.1.5 Solution: create a sequence of -1, 1

We can just multiply the (0, 1) sequence by 2 and subtract 1:

2*np.random.binomial(1, 0.5, size=10) - 1
## array([ 1, -1,  1, -1, -1, -1,  1, -1,  1,  1])
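
Alternatively (an approach not shown in the original), one can sample the two values directly:

np.random.choice([-1, 1], size=10)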

A.2.2 Pandas

A.2.2.1 Solution: series of capital cities

cities = pd.Series(["Brazzaville", "Libreville", "Malabo", "Yaoundé"],
                   index=["Congo", "Gabon", "Equatorial Guinea", "Cameroon"])
cities
## Congo                Brazzaville
## Gabon                 Libreville
## Equatorial Guinea         Malabo
## Cameroon                 Yaoundé
## dtype: object

A.2.2.2 Solution: extract capital cities

cities.iloc[0]
## 'Brazzaville'
cities.iloc[2]
## 'Malabo'
cities[["Gabon"]]
## Gabon    Libreville
## dtype: object

A.2.2.3 Solution: create city dataframe

To keep the code clean, we create the data as a separate dict and the index as a list. Thereafter we make a dataframe out of these two:

data = {"capital":["Brazzaville", "Libreville", "Malabo", "Yaoundé"],
        "population":[1696, 703, 297, 2765]}
countries = ["Congo", "Gabon", "Equatorial Guinea", "Cameroon"]
cityDF = pd.DataFrame(data, index=countries)
cityDF
##                        capital  population
## Congo              Brazzaville        1696
## Gabon               Libreville         703
## Equatorial Guinea       Malabo         297
## Cameroon               Yaoundé        2765

A.2.2.4 Solution: Variables in G.W.Bush approval data frame

Pandas prints five columns: the index, plus the four variables date, approve, disapprove, and dontknow.

A.2.2.5 Show files in current/parent folder

Remember that the parent folder is denoted by two dots .. and the current folder by a single dot .; the latter is usually not needed.

Current working directory:

import os
os.getcwd()
## '/home/siim/tyyq/lecturenotes/machinelearning-py'

Files in the current folder:

os.listdir()  # or os.listdir(".")
## ['text.rmd', 'logistic-regression.rmd', 'linear-regression.rmd', 'index.rmd', 'images.rmd', 'trees-forests.rmd', 'color-spiral-keras.png', 'svm.rmd', 'images.md', 'numpy-pandas.rmd', 'plotting.rmd', 'datasets.rmd', 'machinelearning-py.rds', '.fig', 'numpy-pandas.md', 'linear-algebra.rmd', 'descriptive-analysis.rmd', 'descriptive-analysis.md', 'descriptive-statistics.md', 'regularization.rmd', 'svm.rmd~', 'keras-color-spiral.py', 'Makefile', '_bookdown.yml', 'linear-regression.md', 'figs', 'web-scraping.rmd', 'solutions.rmd', 'neural-nets.rmd', 'predictions.rmd', 'figure', 'linear-algebra.md', 'files', 'unsupervised-learning.rmd', 'cleaning-data.md', 'keras-cats-vs-dogs.py', 'titanic-tree.png', 'logistic-regression.md', 'python.md', 'index.md', 'ml-workflow.rmd', 'build', '.cache', 'svm.md~', 'python.rmd', 'cleaning-data.rmd', 'datasets.md', 'descriptive-statistics.rmd', '_output.yml', 'datasets.rmd~', '_bookdown_files', 'overfitting-validation.md', 'overfitting-validation.rmd']

Files in the parent folder:

os.listdir("..")
## ['machineLearning.rip', 'literature', '.RData', 'machineLearning.mtc19', 'machineLearning.mtc8', 'machineLearning.mtc16', 'machineLearning.mtc9', 'machineLearning.exc', 'machinelearning-common', '.gitignore', 'datascience-intro', 'machineLearning.mtc12', '.Rhistory', 'machineLearning.mtc18', 'machineLearning.mtc20', 'lecturenotes.bib.sav', 'machineLearning.mtc2', 'README.md', 'machineLearning.mtc15', 'machineLearning.mtc13', 'scripts', 'machinelearning-R', 'machinelearning-R.html', 'machineLearning.tex', 'kbordermatrix.sty', 'test', 'solutions.rnw', '.git', 'lecturenotes.bib.bak', 'machineLearning.exm', 'videos', 'machineLearning.rnw', 'auto', 'machineLearning.mtc11', 'machineLearning.mtc22', 'machineLearning.mtc5', '.Rprofile', 'machineLearning.mtc', 'machineLearning.mtc1', 'machinelearning-py', 'Makefile', 'machineLearning.mtc17', 'bitbucket-pipelines.yml', 'figs', 'machineLearning.mtc0', 'machineLearning.mtc21', 'files', 'machineLearning.mtc6', 'isomath.tex', 'machineLearning.mtc23', 'machineLearning.mtc14', 'machineLearning.mtc7', 'lecturenotes.bib', 'machineLearning.mtc3', '_region_.tex', 'data', 'solutions.tex', 'latexmkrc', 'machineLearning.mtc10', 'machineLearning.maf', 'machineLearning.mtc4', 'img', 'machineLearning.pdf', 'material']

Obviously, everyone has different files in these folders.

A.2.2.6 Solution: presidents approval 88%

approval = pd.read_csv("../data/gwbush-approval.csv", sep="\t", nrows=10) 
approval[approval.approve >= 88][["date", "approve"]]  # only print date, approval rate
##              date  approve
## 5  2001 Oct 19-21       88
## 6  2001 Oct 11-14       89
## 8  2001 Sep 21-22       90
approval[approval.approve >= 88].shape[0]  # number of cases with approval at least 88
## 3

A.2.2.7 Solution: change index, convert to variable

The original data frame was created as

capitals = pd.DataFrame(
        {"capital":["Kuala Lumpur", "Jakarta", "Phnom Penh"],
         "population":[32.7, 267.7, 15.3]},  # in millions
        index=["MY", "ID", "KH"])

We can modify it as

capitals.index = ["Malaysia", "Indonesia", "Cambodia"]
capitals  # now the index is country names
##                 capital  population
## Malaysia   Kuala Lumpur        32.7
## Indonesia       Jakarta       267.7
## Cambodia     Phnom Penh        15.3
capitals = capitals.reset_index(names="country")

Note that the keyword argument is names (available for data frames since pandas 1.5). The singular name only exists for Series.reset_index; with a data frame it fails with TypeError: DataFrame.reset_index() got an unexpected keyword argument 'name'.
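
For older pandas versions, an equivalent two-step approach (a sketch; it assumes the index has no name, so the new column is first called "index"):

capitals.reset_index().rename(columns={"index": "country"})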

A.2.2.8 Solution: create city dataframe

cityM = np.array([[3.11, 5282, 19800],
              [18.9, 306, 46997],
              [4.497, 1886, 22000]])
names = ["Chittagong", "Dhaka", "Kolkata"]
vars = ["population", "area", "density"]
cityDF = pd.DataFrame(cityM, index=names,
                      columns=vars)
cityDF
##             population    area  density
## Chittagong       3.110  5282.0  19800.0
## Dhaka           18.900   306.0  46997.0
## Kolkata          4.497  1886.0  22000.0

A.2.2.9 Solution: extract city data

# density:
cityM[:,2]
## array([19800., 46997., 22000.])
cityDF.density
## Chittagong    19800.0
## Dhaka         46997.0
## Kolkata       22000.0
## Name: density, dtype: float64
# third city
cityM[2,:]
## array([4.497e+00, 1.886e+03, 2.200e+04])
cityDF.loc["Kolkata",:]
## population        4.497
## area           1886.000
## density       22000.000
## Name: Kolkata, dtype: float64
cityDF.iloc[2,:]
## population        4.497
## area           1886.000
## density       22000.000
## Name: Kolkata, dtype: float64
## second city area
cityM[1,1]
## 306.0
cityDF.loc["Kolkata","area"]
## 1886.0
cityDF.iloc[1,1]
## 306.0

A.2.2.10 Solution: Titanic line 1000

Extract the 1000th row (note: index 999!) as data frame, and thereafter extract name, survival status, and age. We have to extract twice as we cannot extract by row number and column names at the same time.

titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic.loc[[999]][["name", "survived", "age"]]
##                                   name  survived  age
## 999  McCarthy, Miss. Catherine "Katie"         1  NaN

A.2.2.11 Solution: Titanic male/female age distribution

The two relevant variables are clearly sex and age. If you have already loaded the data, you may just preserve these variables:

titanic[["sex", "age"]].sample(4)
##          sex   age
## 1199    male   NaN
## 540   female   2.0
## 207   female  33.0
## 495     male   NaN

Alternatively, you may read in only these two columns:

pd.read_csv("../data/titanic.csv.bz2",
            usecols=["sex", "age"]).sample(4)
##        sex   age
## 404   male  21.0
## 1145  male   8.0
## 518   male  19.0
## 1133  male  17.0

A.3 Descriptive analysis with Pandas

A.3.1 What are the values?

A.3.1.1 Solution: which methods can be applied to the whole data frame

There is no better way than just to try:

titanic.min()  # works
## <string>:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
## pclass                        1
## survived                      0
## name        Abbing, Mr. Anthony
## sex                      female
## age                      0.1667
## sibsp                         0
## parch                         0
## ticket                   110152
## fare                        0.0
## body                        1.0
## dtype: object
titanic.mean()  # works
## pclass        2.294882
## survived      0.381971
## age          29.881135
## sibsp         0.498854
## parch         0.385027
## fare         33.295479
## body        160.809917
## dtype: float64
titanic.unique()  # does not work
## AttributeError: 'DataFrame' object has no attribute 'unique'
titanic.nunique()  # works again
## pclass          3
## survived        2
## name         1307
## sex             2
## age            98
## sibsp           7
## parch           8
## ticket        929
## fare          281
## cabin         186
## embarked        3
## boat           27
## body          121
## home.dest     369
## dtype: int64

The mathematical functions work, but may skip the non-numeric variables. .min, however, also handles strings: it finds the first value in alphabetical order. It is not immediately clear why .unique does not work while .nunique does. The reason may be that each variable has a different number of unique elements, but that is no real obstacle: you can just use a more complex data structure and still compute it (see the sketch below).
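
For instance, a dict comprehension that maps each column to its array of unique values (a sketch, assuming titanic is loaded as above):

{c: titanic[c].unique() for c in titanic.columns}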

A.4 Cleaning and Manipulating Data

A.4.1 Missing Observations

A.4.1.1 Solution: Missings in fare in Titanic Data

We can count the NaN-s as

titanic.fare.isna().sum()
## 1

There is just a single missing value. As for non-reasonable values, we should first check the data type:

titanic.fare.dtype
## dtype('float64')

We see it is coded as numeric (64-bit float), so we can query its range:

titanic.fare.min(), titanic.fare.max()
## (0.0, 512.3292)

While 512 pounds seems a reasonable ticket price, the minimum value 0 is suspicious. We do not know whether some passengers really did not pay a fare; more likely, the data collectors simply did not have the information. So it is just a missing value, coded as 0. If we decide to treat it that way, we can recode it, as in the sketch below.
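
This recoding is not part of the original solution, just a possible follow-up:

titanic["fare"] = titanic.fare.replace(0, np.nan)
titanic.fare.isna().sum()  # now also counts the recoded zero fares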

A.4.2 Converting Variables

A.4.2.1 Solution: convert males’ dataset school to categories

Load the males dataset. It is instructive to test the code first on a sequence of years of schooling before converting the actual data. In particular, we want to ensure that we get the boundaries right: "12" should be HS while "13" should be "some college", and so on.

We choose to place the boundaries at integer values, e.g. "HS" is the interval \([12, 13)\). In order to ensure that "12" belongs to this interval while "13" does not, we specify right=False, i.e. each interval excludes its right boundary (which instead belongs to the interval above):

# test on years of schooling 10-17
school = np.arange(10,18)
# convert to categories
categories = pd.cut(school,
                    bins=[-np.Inf, 12, 13, 16, np.Inf],
                    labels=["Less than HS", "HS", "Some college", "College"],
                    right=False)
# print in a way that years and categories are next to each other
pd.Series(categories, index=school)
## 10    Less than HS
## 11    Less than HS
## 12              HS
## 13    Some college
## 14    Some college
## 15    Some college
## 16         College
## 17         College
## dtype: category
## Categories (4, object): ['Less than HS' < 'HS' < 'Some college' < 'College']

We see that it works correctly: e.g. 11 years of schooling is "Less than HS" while 12 years is "HS". Now we can do the actual conversion:

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
pd.cut(males.school,
       bins=[-np.Inf, 12, 13, 16, np.Inf],
       labels=["Less than HS", "HS", "Some college", "College"],
       right=False)
## 0       Some college
## 1       Some college
## 2       Some college
## 3       Some college
## 4       Some college
##             ...     
## 4355    Less than HS
## 4356    Less than HS
## 4357    Less than HS
## 4358    Less than HS
## 4359    Less than HS
## Name: school, Length: 4360, dtype: category
## Categories (4, object): ['Less than HS' < 'HS' < 'Some college' < 'College']

A.4.2.2 Solution: Convert Males’ dataset residence to dummies

The rest of the task pretty much repeats the examples in Converting categorical variables to dummies; you just have to use the prefix_sep argument to remove the underscore between the prefix "R" and the category name. The code might look like

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
residence = pd.get_dummies(males.residence, prefix="R", prefix_sep="")
residence.drop("Rsouth", axis=1).sample(8)
##       Rnorth_east  Rnothern_central  Rrural_area
## 2901            1                 0            0
## 3137            0                 0            0
## 2357            0                 0            0
## 2359            0                 0            0
## 2778            0                 1            0
## 2688            0                 0            0
## 638             0                 0            0
## 2959            0                 0            0

A.4.2.3 Solution: Convert Titanic’s age categories, sex, pclass to dummies

titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic = titanic[["age", "sex", "pclass"]]
titanic["age"] = pd.cut(titanic.age,
                        bins=[0, 14, 50, np.inf],
                        labels=["0-13", "14-49", "50-"],
                        right=False)
d = pd.get_dummies(titanic, columns=["age", "sex", "pclass"])
d.sample(7)
##       age_0-13  age_14-49  age_50-  sex_female  sex_male  pclass_1  pclass_2  pclass_3
## 879          0          0        0           0         1         0         0         1
## 1264         0          1        0           0         1         0         0         1
## 927          0          0        0           0         1         0         0         1
## 490          0          0        1           1         0         0         1         0
## 825          1          0        0           0         1         0         0         1
## 407          0          1        0           1         0         0         1         0
## 794          1          0        0           1         0         0         0         1

One may also want to drop one of the dummy levels with the drop_first argument.

A.4.2.4 Solution: Convert Males’ residence, ethn to dummies and concatenate

We create the dummies separately for residence and ethn and give them corresponding prefixes to make the variable names more descriptive.

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
residence = pd.get_dummies(males.residence, prefix="residence")
## remove the reference category
residence = residence.drop("residence_north_east", axis=1)
residence.sample(4)
##       residence_nothern_central  residence_rural_area  residence_south
## 1609                          1                     0                0
## 3704                          1                     0                0
## 2346                          0                     0                0
## 3672                          0                     0                0
## convert and remove using chaining
ethn = pd.get_dummies(males.ethn, prefix="ethn")\
         .drop("ethn_other", axis=1)
ethn.sample(4)
##       ethn_black  ethn_hisp
## 3502           0          1
## 1965           0          0
## 1093           0          0
## 3961           0          0
## combine these variables next to each other
d = pd.concat((males.wage, residence, ethn), axis=1)
d.sample(7)
##           wage  residence_nothern_central  residence_rural_area  residence_south  ethn_black  ethn_hisp
## 2071  1.800216                          0                     0                1           0          0
## 1296  1.554879                          0                     0                0           0          0
## 4157  1.754424                          1                     0                0           0          0
## 3343  1.524607                          0                     0                0           0          0
## 2025  1.397355                          0                     0                1           1          0
## 1343  1.646478                          0                     0                0           0          0
## 2064  1.802248                          0                     0                1           0          0

A.5 Descriptive Statistics

A.5.1 Inequality

A.5.1.1 Solution: 80-20 ratio of income

First load the data and compute total income:

treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
income = treatment.re78  # extract income for simplicity
total = income.sum()  # total income

Next, we can just start trying with the uppermost 1% (lowermost 99):

pct = 99
threshold = np.percentile(treatment.re78, pct)

The income share of the richest 1% is

share = income[income > threshold].sum()/total
share
## 0.04234557467220405

This is far less than 99%: we find that the richest 1% earn only about 4% of total income, so we should try a smaller percentile. For the top 2% we have

pct = 98
threshold = np.percentile(treatment.re78, pct)
income[income > threshold].sum()/total
## 0.07508822500042818

Again, this is less than 98%. Instead of manually trying to find the correct value, we can do a loop:

for pct in range(99, 50, -1):
    threshold = np.percentile(treatment.re78, pct)
    share = income[income > threshold].sum()/total
    if 100*share > pct:
        print(pct, share)
        break
## 63 0.6474562618876845

So in this data, the richest 37% earn approximately 65% of total income. If we want a more precise ratio, we can write another loop that goes down to fractions of a percent, as sketched below.
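
A sketch of such a finer loop (the 0.1-percentage-point step is our choice, not part of the original):

for pct in np.arange(99, 50, -0.1):
    threshold = np.percentile(income, pct)
    share = income[income > threshold].sum()/total
    if 100*share > pct:
        print(round(pct, 1), share)
        break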

A.6 Linear Algebra

A.6.1 Numpy Arrays as Vectors and Matrices

A.6.1.1 Solution: Berlin-Germany-France-Paris

berlin = np.array([-0.562, 0.630, -0.453, -0.299, -0.006])
germany = np.array([0.194, 0.507, 0.287,  0.132, -0.281])
france = np.array([0.605, -0.678, -0.436, -0.019, -0.291])
paris = np.array([-0.074, -0.855, -0.689, -0.057, -0.139])
v = berlin - germany + france
v
## array([-0.151, -0.555, -1.176, -0.45 , -0.016])
v - paris
## array([-0.077,  0.3  , -0.487, -0.393,  0.123])

By just eyeballing the result, it does not look very close to Paris. However, when doing the exercise with all 100 components, Paris indeed turns out to be the most similar word to Berlin - Germany + France. Such similarity is normally measured with cosine similarity, sketched below.
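
A minimal cosine-similarity sketch for these five-component vectors (the helper function is ours, not from the original exercise):

def cos_sim(a, b):
    # cosine of the angle between a and b: values near 1 mean very similar
    return np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))

cos_sim(v, paris)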

A.6.1.2 Solution: create and multiply row vectors

There are several ways to create the row vectors, here are a few different approaches

x1 = (1 + np.arange(5)).reshape((5,1))  # use 1 + arange().reshape
x2 = np.array([[-1, 2, -3, 4, -5]]).T  # create manually
x3 = np.arange(5,0,-1).reshape((5,1))  # use arange(arguments).reshape
# Stack all three row vectors into a matrix
X = np.row_stack((x1.T, x2.T, x3.T))
beta = 0.1*np.ones((5,1))  # use np.ones()
# and now compute
x1.T @ beta
## array([[1.5]])
x2.T @ beta
## array([[-0.3]])
x3.T @ beta
## array([[1.5]])
X @ beta
## array([[ 1.5],
##        [-0.3],
##        [ 1.5]])

Note that when multiplying \(\mathsf{X}\), which is all three vectors stacked underneath each other, we get all three answers stacked underneath each other.

A.7 Linear Regression

A.7.0.1 Solution: Compute \(MSE\) and plot regression line

The solution combines the \(MSE\) computation with plotting the results. One can do it like this:

iris = pd.read_csv("../data/iris.csv.bz2", sep="\t")
versicolor = iris[iris.Species == "versicolor"].rename(
    columns={"Petal.Length": "pl", "Petal.Width":"pw"})
def msePlot(beta0, beta1):
    hat_w = beta0 + beta1*versicolor.pl  # note: vectorized operators!
    e = versicolor.pw - hat_w
    mse = np.mean(e**2)
    plt.scatter(versicolor.pl, versicolor.pw, edgecolor="k")
    length = np.linspace(versicolor.pl.min(), versicolor.pl.max(), 11)
    hatw = beta0 + beta1*length
    plt.plot(length, hatw, c='r')
    plt.xlabel("Petal length")
    plt.ylabel("Petal width")
    plt.show()
    return mse

Next, we can attempt to follow the coordinate descent approach: keep one parameter constant while minimizing MSE over the other, and thereafter keep the other fixed while modifying the first:

msePlot(-0.5, 0.5)  # start with the old value as benchmark
## 0.11320000000000001

msePlot(-0.5, 0.6)  # did we improve or not?
## 0.5631599999999998

msePlot(-0.5, 0.4)  # 0.6 was worse, so try this instead
## 0.03051999999999999

# now modify beta0
msePlot(-0.4, 0.4)  # feel like we should lift the line a bit
## 0.016119999999999995

# Now we are pretty happy with the results

A.7.1 Linear Regression in python: statsmodels.formula.api and sklearn

A.7.1.1 Solution: Explain virginica sepal width using linear regression with sklearn

from sklearn.linear_model import LinearRegression
iris = pd.read_csv("../data/iris.csv.bz2", sep="\t")
virginica = iris[iris.Species == "virginica"]
## extract explanatory variables as X
X = virginica[["Sepal.Length", "Petal.Width", "Petal.Length"]]
## call the dependent variable 'y' for consistency
y = virginica[["Sepal.Width"]]
m = LinearRegression()
_ = m.fit(X, y)
## compute R2
m.score(X, y)

A.7.1.2 Solution: Male wages as function of schooling and ethnicity

from sklearn.linear_model import LinearRegression
males = pd.read_csv("../data/males.csv.bz2", sep="\t")
## extract explanatory variables as X
X = males[["school", "ethn"]]
X.head(2)  # to see what it is

For a check, let's see what the possible values of ethn are:

males.ethn.unique()

We have three different values. Hence we should have three coefficients as well: one for school and two for ethn, as we drop one category as the reference.

Now we convert \(\mathsf{X}\) to dummies and fit the model

X = pd.get_dummies(X, drop_first=True)
## For quick check
X.sample(2)
## call the dependent variable 'y' for consistency
y = males[["wage"]]
m = LinearRegression()
_ = m.fit(X, y)

Finally, check how many coefficients we have:

m.coef_

Indeed, we have three coefficients. Compute \(R^2\):

m.score(X, y)

A.7.2 Model Goodness

A.7.2.1 Solution: downward sloping, and well-aligned point clouds

Create a downward-trending point cloud: larger \(x\) must correspond to smaller \(y\), hence \(\beta_1\) should be negative. Choose \(\beta_1 = -1\):

n = 100
beta0, beta1 = 1, -1  # choose beta1 negative
x = np.random.normal(size=n) 
eps = np.random.normal(size=n)
y = beta0 + beta1*x + eps 
_ = plt.scatter(x, y, edgecolor="k")
_ = plt.xlabel("x")
_ = plt.ylabel("y")
_ = plt.show()

scale defines the variation of the random values; the default value is 1. As we add \(\epsilon\) when computing \(y\), it describes how far the \(y\) values are off from perfect alignment:

n = 100
beta0, beta1 = 1, -1
x = np.random.normal(size=n) 
eps = np.random.normal(scale=0.1, size=n)  # make eps small
y = beta0 + beta1*x + eps 
_ = plt.scatter(x, y, edgecolor="k")
_ = plt.xlabel("x")
_ = plt.ylabel("y")
_ = plt.show()

A.7.2.2 Solution: Manually compute \(R^2\) for regression

import statsmodels.formula.api as smf
# find TSS
TSS = np.sum((versicolor.pw - np.mean(versicolor.pw))**2)
# find predicted values
m = smf.ols("pw ~ pl", data=versicolor).fit()
beta0, beta1 = m.params[0], m.params[1]
pwhat = beta0 + beta1*versicolor.pl
# find ESS
ESS = np.sum((versicolor.pw - pwhat)**2)
R2 = 1 - ESS/TSS
R2
## 0.6188466815001422
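
As a sanity check (not part of the original solution), statsmodels reports the same figure directly:

m.rsquared  # should equal the manual R2 above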

A.7.2.3 Solution: Create a large and a small \(R^2\) value

A small variance (scale parameter) of \(\epsilon\) results in well-aligned data points and hence a high \(R^2\), e.g.

n = 100
beta0, beta1 = 1, -1  # choose beta1 negative
x = np.random.normal(size=n) 
eps = np.random.normal(scale=0.05, size=n)
y = beta0 + beta1*x + eps 
plt.scatter(x, y, edgecolor="k")
## <matplotlib.collections.PathCollection object at 0x7a346510fb50>
plt.xlabel("x")
## Text(0.5, 0, 'x')
plt.ylabel("y")
## Text(0, 0.5, 'y')
plt.show()

randomDF = pd.DataFrame({"x":x, "y":y})
m = smf.ols("y ~ x", data=randomDF).fit()  # fit the model
m.rsquared
## 0.9976016958673664

On the contrary, a large variance of the error term makes the points only weakly aligned with the regression line:

eps = np.random.normal(scale=5, size=n)
y = beta0 + beta1*x + eps 
plt.scatter(x, y, edgecolor="k")
## <matplotlib.collections.PathCollection object at 0x7a34651ef910>
plt.xlabel("x")
## Text(0.5, 0, 'x')
plt.ylabel("y")
## Text(0, 0.5, 'y')
plt.show()

randomDF = pd.DataFrame({"x":x, "y":y})
m = smf.ols("y ~ x", data=randomDF).fit()  # fit the model
m.rsquared
## 0.11010626058941486

A.8 Logistic Regression

A.8.1 Logistic Regression in python: statsmodels.formula.api and sklearn

A.8.1.1 Solution: Predict treatment using all variables

First we load the data and convert all non-numeric variables to dummies.

treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
treatment.head(2)
##    treat  age  educ   ethn  married  re74  re75      re78   u74   u75
## 0   True   37    11  black     True   0.0   0.0   9930.05  True  True
## 1   True   30    12  black    False   0.0   0.0  24909.50  True  True
y = treatment.treat
## There are categoricals: convert these to dummies
X = pd.get_dummies(treatment.drop("treat", axis=1), drop_first=True)
## get a glimpse of X--does it look good?
X.sample(3)
##       age  educ  married     re74      re75      re78    u74    u75  ethn_hispanic  ethn_other
## 645    27    11     True  10897.4   3938.71   7684.18  False  False              0           0
## 1274   54    12     True  15674.1  25064.50      0.00  False  False              0           1
## 448    50    12     True  29193.1  17903.20  25121.40  False  False              0           0

The design matrix looks reasonable. Note that logical values are kept as they are and not transformed to dummies: logical values are essentially dummies already. Now fit the model:

from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
_ = m.fit(X, y)
m.score(X, y)

Indeed, the score is now better than 93%.

A.9 Predictions and Model Goodness

A.9.1 Predicting using linear regression

A.9.1.1 Solution: manually predict distance to galaxies using modern data

hubble = pd.read_csv("../data/hubble.csv", sep="\t")
## fit with modern values
m = smf.ols("Rmodern ~ vModern", data=hubble).fit()
## assign estimated parameter values
beta0, beta1 = m.params
v7331 = hubble.vModern.iloc[17]
v7331  # velocity
beta0 + beta1*v7331  # distance
vs = np.array([500, 1000, 2000])
beta0 + beta1*vs  # 8 to 27 Mpc

A.9.1.2 Solution: predict petal width

from sklearn.linear_model import LinearRegression
iris = pd.read_csv("../data/iris.csv.bz2", sep="\t")
versicolor = iris[iris.Species == "versicolor"]
y = versicolor[["Petal.Width"]]
X = versicolor[["Sepal.Length", "Sepal.Width"]]
# note: length first, width second
## get a glimpse of X--does it look good?
X.sample(3)
## initialize and fit the model
m = LinearRegression()
_ = m.fit(X, y)
## predict
yhat = m.predict(X)
rmse = np.sqrt(np.mean((y - yhat)**2))
rmse

RMSE is rather small, 0.1391612 cm. This is because the range of petal width is small, only 0.8 cm.

In order to predict a flower of the given size, we can create a matrix with the requested numbers:

newX = np.array([[5, 2]])  # length first, width second
m.predict(newX)

A.9.1.3 Solution: predict hispanic male wage in service sector

As the first step, we should load the data and check how industry is coded:

males = pd.read_csv("../data/males.csv.bz2", sep="\t")
males.industry.unique()

We see that finance is coded just as "Finance". Now we can follow the example from the main text:

As the second step, we can fit the linear regression. We call the original data matrix with categoricals Xc in order to distinguish it from the final design matrix X below:

from sklearn.linear_model import LinearRegression
## Extract relevant variables in original categorical form
## we need it below
Xc = males[["school", "ethn", "industry"]]
y = males.wage
X = pd.get_dummies(Xc, drop_first=True)
m = LinearRegression()
_ = m.fit(X, y)

Now m is a fitted model that we can use for prediction. Next, we design the data frame for prediction using the categories and attach it on top of the original categorical data matrix Xc:

## design prediction
newDf = pd.DataFrame({"school":16, "ethn":"hisp",
                      "industry":"Finance"},
                     index=["c1"])
## merge new on top of original
newDf = pd.concat((newDf, Xc), axis=0, sort=False)
## Does it look good?
newDf.head(3)
## create dummies of everything
allX = pd.get_dummies(newDf, drop_first=True)
## do dummies look reasonable?
allX.head(3)

Finally, we extract the relevant cases from the extended design matrix and predict:

## extract the topmost part from X
newX = allX.loc[["c1"]]
newX  # looks good
m.predict(newX)

Predicted hourly salary is $2.16.

A.9.2 Predicting with Logistic Regression

A.9.2.1 Solution: Predict Titanic survival for two persons using statsmodels

Load data and fit the model:

titanic = pd.read_csv("../data/titanic.csv.bz2")
m = smf.logit("survived ~ sex + C(pclass)", data=titanic).fit()

Next, create data for the desired cases. A simple way is to use a dict of lists. We put the man at the first position and the woman at the second:

## create new data (a dict)
## man in 1st class, woman in 3rd class
newx = {"sex":["male", "female"], "pclass":[1, 3]}

Now we can predict using these data:

phat = m.predict(newx)
phat  # first for the man, second for the woman
yhat = phat > 0.5
yhat

As the results indicate, a 1st-class man has approximately a 40% chance to survive, while a 3rd-class woman has 60%. Hence we predict that the man died while the woman survived.

A.9.2.2 Solution: Test different thresholds with statsmodels

Load data and fit the model:

titanic = pd.read_csv("../data/titanic.csv.bz2")
m = smf.logit("survived ~ sex + C(pclass)", data=titanic).fit()

Predict the probabilities:

phat = m.predict()

Now try a few different thresholds:

yhat = phat > 0.5
np.mean(yhat == titanic.survived)
yhat = phat > 0.8
np.mean(yhat == titanic.survived)
yhat = phat > 0.3
np.mean(yhat == titanic.survived)

So far, 0.5 seems to be the best option.

For the plot, we can just create an array of thresholds and compute accuracy for each of these in a loop:

thresholds = np.linspace(0, 1, 100)
accuracies = np.empty(100)
for i in range(100):
    yhat = phat > thresholds[i]
    accuracies[i] = np.mean(yhat == titanic.survived)
_ = plt.plot(thresholds, accuracies)
_ = plt.xlabel("Threshold")
_ = plt.ylabel("Accuracy")
_ = plt.show()

The plot shows that the best accuracy is achieved at threshold values between 0.6 and 0.7. This is, obviously, data dependent.

A.9.2.3 Solution: Row of four zeros

The object is a pandas data frame, and hence the first column (the first zero) is its index. This means we are using the first row of data.

The second column corresponds to dummy sex_male. Hence the person in the first row is a female.

The 3rd and 4th columns are dummies, describing the passenger class. Hence together they indicate that the passenger was traveling in the 1st class.

To conclude: the zeros show that the passenger in the first line of data was a woman who was traveling in first class.

A.9.2.4 Solution: Predict treatment status using all variables

First load data and convert all non-numeric variables to dummies:

treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
y = treatment.treat
## drop outcome (treat) and convert all categoricals
## to dummies
X = pd.get_dummies(treatment.drop("treat", axis=1), drop_first=True)
## get a glimpse of X--does it look good?
X.sample(3)
##       age  educ  married     re74      re75     re78    u74    u75  ethn_hispanic  ethn_other
## 2667   29    10     True  41144.6   8951.61  31032.3  False  False              0           1
## 1233   28    13     True  15282.3  14322.60  14038.4  False  False              0           1
## 1346   22    12     True  17633.4  17903.20  14777.3  False  False              1           0

The design matrix looks reasonable; in particular, we can see that the logical married is still logical, but ethn with three categories is converted to two dummies (ethn_hispanic and ethn_other). Now fit the model:

from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
_ = m.fit(X, y)

Now we can extract the desired cases and predict values for those only:

newx = X.loc[[0, 9, 99]]  # extract desired rows
m.predict_proba(newx)[:,1]  # note: only output treatment probability

As the probabilities suggest, the first case is predicted not to participate, while the two latter cases are predicted to participate.
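
To double-check, one might compare these predictions with the actual treatment status of the same cases (a sketch, using the y defined above):

y.loc[[0, 9, 99]]  ## actual treatment status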

A.9.3 Confusion Matrix–Based Model Goodness Measures

A.9.3.1 Solution: Titanic Survival Confusion Matrix with Child Variable

We can create child with a simple logical operation:

child = titanic.age < 14

Note that we do not have to include child in the data frame; a variable in the working environment will do. Now we model and predict:

m = smf.logit("survived ~ sex + C(pclass) + child", data=titanic).fit()
yhat = m.predict(titanic) > 0.5

The confusion matrices, computed in two ways:

cm1 = pd.crosstab(titanic.survived, yhat)
cm1
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(titanic.survived, yhat)
cm2

Indeed, both ways to get the matrix yield the same result.

And finally we extract true negatives:

cm1.iloc[0,0]
cm2[0,0]

This is the same number as in the original model in Section 12.3.2, but now we have a slightly larger number of true positives.

A.9.3.2 Solution: Titanic Survival with Naive Model

We can create a naive model with the formula "survived ~ 1", but let's do it manually here: we just predict everyone to belong to the more common class:

titanic.survived.value_counts()

As one can see, survival 0 is the more common value, hence we predict that everyone died:

yhat = pd.Series(0, index=titanic.index)
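
For comparison, the formula-based naive model mentioned above behaves identically: its predicted survival probability is just the overall survival rate, which is below 0.5, so it predicts no survivors either. A small sketch:

m0 = smf.logit("survived ~ 1", data=titanic).fit()
yhat0 = m0.predict(titanic) > 0.5  ## all False: no-one predicted to survive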

In order to better understand the results, let’s also output the confusion matrix:

pd.crosstab(titanic.survived, yhat)

As one can see, pd.crosstab outputs only a single column here, as we do not predict anyone to survive. Essentially, the survivor column is all zeros, but pd.crosstab does not show it (confusion_matrix will show this column).
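
If one wants the full two-column matrix regardless, confusion_matrix accepts an explicit list of labels; a small sketch:

from sklearn.metrics import confusion_matrix
confusion_matrix(titanic.survived, yhat, labels=[0, 1])  ## keeps the all-zero survivor column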

We can compute the goodness values using functions from sklearn.metrics:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy_score(titanic.survived, yhat)
precision_score(titanic.survived, yhat)
recall_score(titanic.survived, yhat)
f1_score(titanic.survived, yhat)

The results come with two problems: as we do not predict any survivors, we cannot compute precision, because 0 out of 0 predicted survivors are correct. As the \(F\)-score is (partly) based on precision, we cannot compute it either. Additionally, recall \(R = 0\), as we do not capture any survivors.
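
In terms of the definitions: precision \(P = TP/(TP + FP) = 0/0\) is undefined, recall \(R = TP/(TP + FN) = 0\), and the \(F\)-score \(F = 2PR/(P + R)\) cannot be computed without \(P\).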

A.10 Trees and Forests

A.10.1 Regression Trees

A.10.1.1 Solution:

First load and prepare data:

from sklearn.tree import DecisionTreeRegressor
boston = pd.read_csv("../data/boston.csv.bz2", sep="\t")
## define outcome (medv) and design matrix (X)
y = boston.medv
X = boston[["rm", "age"]]

Now define and fit the model:

m = DecisionTreeRegressor(max_depth=2)
_ = m.fit(X, y)

Compute \(R^2\) on training data:

m.score(X, y)

This is a slight improvement: the original \(R^2\), using only rm, was 0.58. Hence the tree found it useful to include age in its decision-making. As a depth-2 tree is a very simple model, we are not really afraid of overfitting here.


Depth-2 Boston regression tree using both rm and age.

Here is the tree as seen by graphviz.
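
The rendered figure is not reproduced here, but the underlying dot description can be generated along these lines (a sketch using sklearn's export_graphviz; rendering it requires the graphviz tools):

from sklearn.tree import export_graphviz
print(export_graphviz(m, out_file=None, feature_names=X.columns))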

Indeed, now the tree considers age in the left node, with newer neighborhoods being somewhat more expensive.

Include all features:

X = boston.drop("medv", axis=1)
_ = m.fit(X, y)
m.score(X, y)

\(R^2\) improved noticeably. This hints that the tree was able to pick at least one other feature that describes the price better than rm and age.


Depth-2 Boston regression tree using all features.

Here is the tree when using all available features. rm is still the first feature to split the data, but the other one is now lstat, the percentage of lower-status population.
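
One can confirm this numerically via the fitted tree's feature importances; a minimal sketch:

## features the tree actually uses, most important first
pd.Series(m.feature_importances_, index=X.columns).sort_values(ascending=False)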

A.11 Natural Language Processing

A.11.1 Categorize text

A.11.1.1 Solution: find poems with TF-IDF


We basically just swap CountVectorizer out for TfidfVectorizer; everything else remains the same:

from sklearn.feature_extraction.text import TfidfVectorizer
vrizer = TfidfVectorizer()
X = vrizer.fit_transform([rudaki1, rudaki2, phu1, phu2])
newX = vrizer.transform([newdoc])
from sklearn.neighbors import NearestNeighbors
m = NearestNeighbors(n_neighbors=1, metric="cosine")
_ = m.fit(X)
d = m.kneighbors(newX, n_neighbors=4)
d

The closest document based on cosine distance is document 2 at distance 0.445.
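
Recall that kneighbors returns a pair of arrays, distances and document indices; a small sketch to print the neighbors in ranked order:

dist, idx = d  ## distances and document indices
for dd, ii in zip(dist[0], idx[0]):
    print("document", ii, "at distance", round(dd, 3))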


  1. To be more precise, they are not the richest, but those with the largest income in this data.