A Solutions
A.1 Python
A.1.1 Operators
A.1.1.1 Going out with friends
friends = 7
budget = 150
print("I am going out with", friends, "friends")
## I am going out with 7 friends
mealprice = 14
price = mealprice*(friends + 1)
total = price*1.15
print("total price:", total)
## total price: 128.79999999999998
if total <= budget:
    print("can afford")
else:
    print("cannot afford")
## can afford
A.1.3 Collections
A.1.3.1 Extract sublists
l = list(range(1, 13))
for i in range(3):
    print(l[i::3])
## [1, 4, 7, 10]
## [2, 5, 8, 11]
## [3, 6, 9, 12]
A.1.3.2 Combining lists
friends = ['Paulina', 'Severo']
others = []
others.append("Lai Ming")
others.append("Lynn")
people = friends + others
print("all people", people)
## all people ['Paulina', 'Severo', 'Lai Ming', 'Lynn']
A.1.3.3 Assign people to seats
Consider two lists:
= ["Adam", "Ashin", "Inukai", "Tanaka", "Ikki"]
names = [33, 12, 45, 2, 17]
seats = []
assigneds for i in range(len(names)):
= names[i] + ": " + str(seats[i])
a
assigneds.append(a) assigneds
## ['Adam: 33', 'Ashin: 12', 'Inukai: 45', 'Tanaka: 2', 'Ikki: 17']
A.1.3.4 List comprehension
Create pizzas with different toppings:
[1 + i for i in range(10)]
## [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toppings = ['mushrooms', 'mozarella', 'ananas']
['pizza with ' + t for t in toppings]
## ['pizza with mushrooms', 'pizza with mozarella', 'pizza with ananas']
A.1.3.5 Dict mapping coordinates to cities
placenames = {(31.228611, 121.474722): "Shanghai",
              (23.763889, 90.388889): "Dhaka",
              (13.7525, 100.494167): "Bangkok"
             }
print(placenames)
## {(31.228611, 121.474722): 'Shanghai', (23.763889, 90.388889): 'Dhaka', (13.7525, 100.494167): 'Bangkok'}
Note that such lookup is usually not what we want because it only returns the correct city name if given the exact coordinates. If coordinates are just a tiny bit off, the dict cannot find the city. If you consider cities to be continuous areas, you should do geographic lookup, not dictionary lookup.
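For illustration only, here is a minimal sketch of such a geographic (nearest-coordinate) lookup; the helper name and the crude squared-degree distance are mine, and a real implementation should use proper geographic distances and city boundaries:

def nearest_city(lat, lon, places=placenames):
    # distance (in squared degrees) from (lat, lon) to a stored coordinate pair
    def dist2(coord):
        return (coord[0] - lat)**2 + (coord[1] - lon)**2
    # return the city whose stored coordinates are closest
    return places[min(places, key=dist2)]

nearest_city(31.2, 121.5)  # slightly-off Shanghai coordinates still find 'Shanghai'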
A.1.3.6 Dict of dicts
= {"house":30, "street":"Shuangqing Rd",
tsinghua "district":"Haidan", "city":"Beijing",
"country":"CH"}
= {"pobox":"43844-00100", "city":"Nairobi",
kenyatta "country":"KE"}
= {"Tsinghua":tsinghua, "Kenyatta":kenyatta}
places "Teski Refugio"] = {"house":256,
places["street":"Volcan Osorno",
"city":"La Ensenada",
"region":"Los Lagos",
"country":"CL"}
print(places)
## {'Tsinghua': {'house': 30, 'street': 'Shuangqing Rd', 'district': 'Haidan', 'city': 'Beijing', 'country': 'CH'}, 'Kenyatta': {'pobox': '43844-00100', 'city': 'Nairobi', 'country': 'KE'}, 'Teski Refugio': {'house': 256, 'street': 'Volcan Osorno', 'city': 'La Ensenada', 'region': 'Los Lagos', 'country': 'CL'}}
A.1.4 Language Constructs
A.1.4.1 Solution: odd or even
for n in range(1,11):
    parity = n % 2
    if parity == 1:
        print(n, "odd")
    else:
        print(n, "even")
## 1 odd
## 2 even
## 3 odd
## 4 even
## 5 odd
## 6 even
## 7 odd
## 8 even
## 9 odd
## 10 even
A.1.4.2 Solution: join names
.join() is a string method and must be applied at the end of the separator string:
names = ["viola glabella", "monothropa hypopithys", "lomatium utriculatum"]
", ".join(names)
## 'viola glabella, monothropa hypopithys, lomatium utriculatum'
Note that the syntax is separator.join(names), not names.join(separator).
A.1.5 Modules
A.1.5.1 Solution: access your file system
Below is the solution by importing the whole module. You can also use, e.g., from os import getcwd, listdir, or other approaches.
import os
os.getcwd()
## '/home/siim/tyyq/lecturenotes/machinelearning-py'
os.listdir()
## ['text.rmd', 'logistic-regression.rmd', 'linear-regression.rmd', 'index.rmd', 'images.rmd', 'trees-forests.rmd', 'color-spiral-keras.png', 'svm.rmd', 'images.md', 'numpy-pandas.rmd', 'plotting.rmd', 'datasets.rmd', 'machinelearning-py.rds', '.fig', 'numpy-pandas.md', 'linear-algebra.rmd', 'descriptive-analysis.rmd', 'descriptive-analysis.md', 'descriptive-statistics.md', 'regularization.rmd', 'svm.rmd~', 'keras-color-spiral.py', 'Makefile', '_bookdown.yml', 'figs', 'web-scraping.rmd', 'solutions.rmd', 'neural-nets.rmd', 'predictions.rmd', 'figure', 'linear-algebra.md', 'files', 'unsupervised-learning.rmd', 'cleaning-data.md', 'keras-cats-vs-dogs.py', 'titanic-tree.png', 'logistic-regression.md', 'python.md', 'index.md', 'ml-workflow.rmd', 'build', '.cache', 'svm.md~', 'python.rmd', 'cleaning-data.rmd', 'datasets.md', 'descriptive-statistics.rmd', '_output.yml', 'datasets.rmd~', '_bookdown_files', 'overfitting-validation.md', 'overfitting-validation.rmd']
A.2 Numpy and Pandas
A.2.1 Numpy
A.2.1.1 Solution: concatenate arrays
## array([[-1., -1., -1., -1.],
## [ 0., 0., 0., 0.],
## [ 2., 2., 2., 2.]])
Obviously, you can use np.ones and np.zeros directly in np.row_stack and skip creating the temporary variables.
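The code that produced the output above is not shown in this extract; a minimal version (the variable names a, b, c are mine) might look like this:

a = -1*np.ones((1, 4))  # a row of -1-s
b = np.zeros((1, 4))    # a row of 0-s
c = 2*np.ones((1, 4))   # a row of 2-s
np.row_stack((a, b, c))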
A.2.1.2 Solution: create matrix of even numbers
2 + 2*np.arange(20).reshape(4,5)
## array([[ 2, 4, 6, 8, 10],
## [12, 14, 16, 18, 20],
## [22, 24, 26, 28, 30],
## [32, 34, 36, 38, 40]])
A.2.1.3 Solution: create matrix, play with columns
a = 10 + np.arange(20).reshape(4,5)*2
print(a[:,2], '\n')
## [14 24 34 44]
a[3] = 1 + np.arange(5)
a
## array([[10, 12, 14, 16, 18],
## [20, 22, 24, 26, 28],
## [30, 32, 34, 36, 38],
## [ 1, 2, 3, 4, 5]])
A.2.1.4 Solution: Logical indexing
names = np.array(["Roxana", "Statira", "Roxana", "Statira", "Roxana"])
scores = np.array([126, 115, 130, 141, 132])

scores[scores < 130]  # extract < 130
## array([126, 115])
== "Statira"] # extract by Statira scores[names
## array([115, 141])
== "Roxana"] = scores[names == "Roxana"] + 10 # add 10
scores[names scores
## array([136, 115, 140, 141, 142])
A.2.2 Pandas
A.2.2.1 Solution: series of capital cities
cities = pd.Series(["Brazzaville", "Libreville", "Malabo", "Yaoundé"],
                   index=["Congo", "Gabon", "Equatorial Guinea", "Cameroon"])
cities
## Congo Brazzaville
## Gabon Libreville
## Equatorial Guinea Malabo
## Cameroon Yaoundé
## dtype: object
A.2.2.2 Solution: extract capital cities
cities.iloc[0]
## 'Brazzaville'
cities.iloc[2]
## 'Malabo'
"Gabon"]] cities[[
## Gabon Libreville
## dtype: object
A.2.2.3 Solution: create city dataframe
To keep code clean, we create data as a separate dict, and index as a list. Thereafter we make a dataframe out of these two:
= {"capital":["Brazzaville", "Libreville", "Malabo", "Yaoundé"],
data "population":[1696, 703, 297, 2765]}
= ["Congo", "Gabon", "Equatorial Guinea", "Cameroon"]
countries = pd.DataFrame(data, index=countries)
cityDF cityDF
## capital population
## Congo Brazzaville 1696
## Gabon Libreville 703
## Equatorial Guinea Malabo 297
## Cameroon Yaoundé 2765
A.2.2.4 Solution: Variables in G.W.Bush approval data frame
Pandas prints five columns: the index and four variables (date, approve, disapprove, dontknow).
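For instance, this can be confirmed by loading the file (the same path and separator are used in the next solution) and listing the columns:

approval = pd.read_csv("../data/gwbush-approval.csv", sep="\t")
approval.columns  # date, approve, disapprove, dontknow; the index is printed separately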
A.2.2.5 Show files in current/parent folder
Remember that the parent folder is denoted by double dots (..). The current folder is a single dot (.), but usually it is not needed.
Current working directory:
import os
os.getcwd()
## '/home/siim/tyyq/lecturenotes/machinelearning-py'
Files in the current folder
os.listdir()  # or os.listdir(".")
## ['text.rmd', 'logistic-regression.rmd', 'linear-regression.rmd', 'index.rmd', 'images.rmd', 'trees-forests.rmd', 'color-spiral-keras.png', 'svm.rmd', 'images.md', 'numpy-pandas.rmd', 'plotting.rmd', 'datasets.rmd', 'machinelearning-py.rds', '.fig', 'numpy-pandas.md', 'linear-algebra.rmd', 'descriptive-analysis.rmd', 'descriptive-analysis.md', 'descriptive-statistics.md', 'regularization.rmd', 'svm.rmd~', 'keras-color-spiral.py', 'Makefile', '_bookdown.yml', 'linear-regression.md', 'figs', 'web-scraping.rmd', 'solutions.rmd', 'neural-nets.rmd', 'predictions.rmd', 'figure', 'linear-algebra.md', 'files', 'unsupervised-learning.rmd', 'cleaning-data.md', 'keras-cats-vs-dogs.py', 'titanic-tree.png', 'logistic-regression.md', 'python.md', 'index.md', 'ml-workflow.rmd', 'build', '.cache', 'svm.md~', 'python.rmd', 'cleaning-data.rmd', 'datasets.md', 'descriptive-statistics.rmd', '_output.yml', 'datasets.rmd~', '_bookdown_files', 'overfitting-validation.md', 'overfitting-validation.rmd']
Files in the parent folder
"..") os.listdir(
## ['machineLearning.rip', 'literature', '.RData', 'machineLearning.mtc19', 'machineLearning.mtc8', 'machineLearning.mtc16', 'machineLearning.mtc9', 'machineLearning.exc', 'machinelearning-common', '.gitignore', 'datascience-intro', 'machineLearning.mtc12', '.Rhistory', 'machineLearning.mtc18', 'machineLearning.mtc20', 'lecturenotes.bib.sav', 'machineLearning.mtc2', 'README.md', 'machineLearning.mtc15', 'machineLearning.mtc13', 'scripts', 'machinelearning-R', 'machinelearning-R.html', 'machineLearning.tex', 'kbordermatrix.sty', 'test', 'solutions.rnw', '.git', 'lecturenotes.bib.bak', 'machineLearning.exm', 'videos', 'machineLearning.rnw', 'auto', 'machineLearning.mtc11', 'machineLearning.mtc22', 'machineLearning.mtc5', '.Rprofile', 'machineLearning.mtc', 'machineLearning.mtc1', 'machinelearning-py', 'Makefile', 'machineLearning.mtc17', 'bitbucket-pipelines.yml', 'figs', 'machineLearning.mtc0', 'machineLearning.mtc21', 'files', 'machineLearning.mtc6', 'isomath.tex', 'machineLearning.mtc23', 'machineLearning.mtc14', 'machineLearning.mtc7', 'lecturenotes.bib', 'machineLearning.mtc3', '_region_.tex', 'data', 'solutions.tex', 'latexmkrc', 'machineLearning.mtc10', 'machineLearning.maf', 'machineLearning.mtc4', 'img', 'machineLearning.pdf', 'material']
Obviously, everyone has different files in these folders.
A.2.2.6 Solution: presidents approval 88%
approval = pd.read_csv("../data/gwbush-approval.csv", sep="\t", nrows=10)
approval[approval.approve >= 88][["date", "approve"]]  # only print date, approval rate
## date approve
## 5 2001 Oct 19-21 88
## 6 2001 Oct 11-14 89
## 8 2001 Sep 21-22 90
approval[approval.approve >= 88].shape[0]  # number of rows with at least 88% approval
## 3
A.2.2.7 Solution: change index, convert to variable
The original data frame was created as
capitals = pd.DataFrame(
    {"capital":["Kuala Lumpur", "Jakarta", "Phnom Penh"],
     "population":[32.7, 267.7, 15.3]},  # in millions
    index=["MY", "ID", "KH"])
We can modify it as
= ["Malaysia", "Indonesia", "Cambodia"]
capitals.index # now the index is country names capitals
## capital population
## Malaysia Kuala Lumpur 32.7
## Indonesia Jakarta 267.7
## Cambodia Phnom Penh 15.3
capitals = capitals.reset_index(name = "country")
## TypeError: DataFrame.reset_index() got an unexpected keyword argument 'name'
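The error arises because name is an argument of Series.reset_index(), not of the data frame method. One possible fix (a sketch, not the only option) is to name the index first and then reset it:

capitals = capitals.rename_axis("country").reset_index()
capitals  # 'country' is now an ordinary column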
A.2.2.8 Solution: create city dataframe
cityM = np.array([[3.11, 5282, 19800],
                  [18.9, 306, 46997],
                  [4.497, 1886, 22000]])
names = ["Chittagong", "Dhaka", "Kolkata"]
vars = ["population", "area", "density"]
cityDF = pd.DataFrame(cityM, index=names,
                      columns=vars)
cityDF
## population area density
## Chittagong 3.110 5282.0 19800.0
## Dhaka 18.900 306.0 46997.0
## Kolkata 4.497 1886.0 22000.0
A.2.2.9 Solution: extract city data
# density:
cityM[:, 2]
## array([19800., 46997., 22000.])
cityDF.density
## Chittagong 19800.0
## Dhaka 46997.0
## Kolkata 22000.0
## Name: density, dtype: float64
# third city
cityM[2,:]
## array([4.497e+00, 1.886e+03, 2.200e+04])
"Kolkata",:] cityDF.loc[
## population 4.497
## area 1886.000
## density 22000.000
## Name: Kolkata, dtype: float64
cityDF.iloc[2,:]
## population 4.497
## area 1886.000
## density 22000.000
## Name: Kolkata, dtype: float64
## second city area
cityM[1,1]
## 306.0
"Kolkata","area"] cityDF.loc[
## 1886.0
cityDF.iloc[1,1]
## 306.0
A.2.2.10 Solution: Titanic line 1000
Extract the 1000th row (note: index 999!) as data frame, and thereafter extract name, survival status, and age. We have to extract twice as we cannot extract by row number and column names at the same time.
titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic.loc[[999]][["name", "survived", "age"]]
## name survived age
## 999 McCarthy, Miss. Catherine "Katie" 1 NaN
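Alternatively, because the data frame still carries its default index whose labels coincide with the row numbers, a single .loc extraction gives the same result (a sketch relying on that assumption):

titanic.loc[[999], ["name", "survived", "age"]]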
A.2.2.11 Solution: Titanic male/female age distribution
The two relevant variables are clearly sex and age. If you have already loaded the data then you may just preserve these variables:
titanic[["sex", "age"]].sample(4)
## sex age
## 1199 male NaN
## 540 female 2.0
## 207 female 33.0
## 495 male NaN
Alternatively, you may also just read these two columns only:
"../data/titanic.csv.bz2",
pd.read_csv(=["sex", "age"]).sample(4) usecols
## sex age
## 404 male 21.0
## 1145 male 8.0
## 518 male 19.0
## 1133 male 17.0
A.3 Descriptive analysis with Pandas
A.3.1 What are the values?
A.3.1.1 Solution: which methods can be applied to the whole data frame
There is no better way than just to try:
titanic.min()  # works
## <string>:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
## pclass 1
## survived 0
## name Abbing, Mr. Anthony
## sex female
## age 0.1667
## sibsp 0
## parch 0
## ticket 110152
## fare 0.0
## body 1.0
## dtype: object
titanic.mean()  # works
## pclass 2.294882
## survived 0.381971
## age 29.881135
## sibsp 0.498854
## parch 0.385027
## fare 33.295479
## body 160.809917
## dtype: float64
titanic.unique()  # does not work
## AttributeError: 'DataFrame' object has no attribute 'unique'
titanic.nunique()  # works again
## pclass 3
## survived 2
## name 1307
## sex 2
## age 98
## sibsp 7
## parch 8
## ticket 929
## fare 281
## cabin 186
## embarked 3
## boat 27
## body 121
## home.dest 369
## dtype: int64
The mathematical functions work, but may skip the non-numeric variables. .min, however, also handles strings: it finds the first value when ordered alphabetically.
It is not immediately clear why .unique does not work while .nunique does. It may be because the number of unique elements differs from variable to variable, so the result would not fit into a single data frame; but you can still compute it with a more complex data structure, for instance a dict of arrays as sketched below.
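A minimal sketch that collects the unique values of every column into a dict of arrays (one entry per variable):

uniques = {col: titanic[col].unique() for col in titanic.columns}
uniques["sex"]  # e.g. the unique values of sex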
A.4 Cleaning and Manipulating Data
A.4.1 Missing Observations
A.4.1.1 Solution: Missings in fare in Titanic Data
We can count the NaN-s as
titanic.fare.isna().sum()
## 1
There is just a single missing value. About non-reasonable values: first we should check its data type:
titanic.fare.dtype
## dtype('float64')
We see it is coded as numeric (64-bit float), so we can query its range:
titanic.fare.min(), titanic.fare.max()
## (0.0, 512.3292)
While 512 pounds seems a reasonable ticket price, value 0 for minimum is suspicious. We do not know if any passengers really did not pay a fare, or more likely, it just means that the data collectors did not have the information. So it is just a missing value, coded as 0.
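If we decide to treat the zero fares as missing, we can recode them; a minimal sketch (overwriting the column):

titanic["fare"] = titanic.fare.replace(0, np.nan)
titanic.fare.isna().sum()  # now also counts the former zero fares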
A.4.2 Converting Variables
A.4.2.1 Solution: convert males’ dataset school to categories
Load the males dataset. It is instructive to test the code first on a sequence of years of schooling before converting actual data. In particular, we want to ensure that we get the boundaries right, “12” should be HS while “13” should be “some college” and so on.
We choose to specify the right boundary at integer values, e.g. “HS” is the interval \([12, 13)\). In order to ensure that “12” belongs to this interval while “13” does not, we specify right=False, i.e. remove the right boundary from the interval (and hence include it in the interval above):
# test on years of schooling 10-17
school = np.arange(10,18)
# convert to categories
categories = pd.cut(school,
                    bins=[-np.Inf, 12, 13, 16, np.Inf],
                    labels=["Less than HS", "HS", "Some college", "College"],
                    right=False)
# print in a way that years and categories are next to each other
pd.Series(categories, index=school)
## 10 Less than HS
## 11 Less than HS
## 12 HS
## 13 Some college
## 14 Some college
## 15 Some college
## 16 College
## 17 College
## dtype: category
## Categories (4, object): ['Less than HS' < 'HS' < 'Some college' < 'College']
Now we see it works correctly, e.g. 11 years of schooling is “Less than HS” while 12 years is “HS”. Now we can do the actual conversion:
males = pd.read_csv("../data/males.csv.bz2", sep="\t")
pd.cut(males.school,
       bins=[-np.Inf, 12, 13, 16, np.Inf],
       labels=["Less than HS", "HS", "Some college", "College"],
       right=False)
## 0 Some college
## 1 Some college
## 2 Some college
## 3 Some college
## 4 Some college
## ...
## 4355 Less than HS
## 4356 Less than HS
## 4357 Less than HS
## 4358 Less than HS
## 4359 Less than HS
## Name: school, Length: 4360, dtype: category
## Categories (4, object): ['Less than HS' < 'HS' < 'Some college' < 'College']
A.4.2.2 Solution: Convert Males’ dataset residence to dummies
The rest of the task pretty much repeats the examples in Converting categorical variables to dummies; you just have to find the prefix_sep argument to remove the underscore between the prefix “R” and the category name. The code might look like:
males = pd.read_csv("../data/males.csv.bz2", sep="\t")
residence = pd.get_dummies(males.residence, prefix="R", prefix_sep="")
residence.drop("Rsouth", axis=1).sample(8)
## Rnorth_east Rnothern_central Rrural_area
## 2901 1 0 0
## 3137 0 0 0
## 2357 0 0 0
## 2359 0 0 0
## 2778 0 1 0
## 2688 0 0 0
## 638 0 0 0
## 2959 0 0 0
A.4.2.3 Solution: Convert Titanic’s age categories, sex, pclass to dummies
titanic = pd.read_csv("../data/titanic.csv.bz2")
titanic = titanic[["age", "sex", "pclass"]]
titanic["age"] = pd.cut(titanic.age,
                        bins=[0, 14, 50, np.inf],
                        labels=["0-13", "14-49", "50-"],
                        right=False)
d = pd.get_dummies(titanic, columns=["age", "sex", "pclass"])
d.sample(7)
## age_0-13 age_14-49 age_50- sex_female sex_male pclass_1 pclass_2 pclass_3
## 879 0 0 0 0 1 0 0 1
## 1264 0 1 0 0 1 0 0 1
## 927 0 0 0 0 1 0 0 1
## 490 0 0 1 1 0 0 1 0
## 825 1 0 0 0 1 0 0 1
## 407 0 1 0 1 0 0 1 0
## 794 1 0 0 1 0 0 0 1
One may also want to drop one of the dummy levels with the drop_first argument; see the sketch below.
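For instance, building on the code above (a sketch, not part of the original solution):

d = pd.get_dummies(titanic, columns=["age", "sex", "pclass"], drop_first=True)
d.sample(3)  # one dummy per variable is dropped as the reference level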
A.4.2.4 Solution: Convert Males’ residence, ethn to dummies and concatenate
We create the dummies separately for residence and ethn and give them corresponding prefix to make the variable names more descriptive.
males = pd.read_csv("../data/males.csv.bz2", sep="\t")
residence = pd.get_dummies(males.residence, prefix="residence")
## remove the reference category
residence = residence.drop("residence_north_east", axis=1)
residence.sample(4)
## residence_nothern_central residence_rural_area residence_south
## 1609 1 0 0
## 3704 1 0 0
## 2346 0 0 0
## 3672 0 0 0
## convert and remove using chaining
ethn = pd.get_dummies(males.ethn, prefix="ethn")\
    .drop("ethn_other", axis=1)
ethn.sample(4)
## ethn_black ethn_hisp
## 3502 0 1
## 1965 0 0
## 1093 0 0
## 3961 0 0
## combine these variables next to each other
d = pd.concat((males.wage, residence, ethn), axis=1)
d.sample(7)
## wage residence_nothern_central residence_rural_area residence_south ethn_black ethn_hisp
## 2071 1.800216 0 0 1 0 0
## 1296 1.554879 0 0 0 0 0
## 4157 1.754424 1 0 0 0 0
## 3343 1.524607 0 0 0 0 0
## 2025 1.397355 0 0 1 1 0
## 1343 1.646478 0 0 0 0 0
## 2064 1.802248 0 0 1 0 0
A.5 Descriptive Statistics
A.5.1 Inequality
A.5.1.1 Solution: 80-20 ratio of income
First load the data and compute total income:
treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
income = treatment.re78  # extract income for simplicity
total = income.sum()  # total income
Next, we can just start trying with the uppermost 1% (lowermost 99):
pct = 99
threshold = np.percentile(treatment.re78, pct)
The income share of the richest 1% is
share = income[income > threshold].sum()/total
share
## 0.04234557467220405
This is less than 99%, so we should try a smaller number. Along the way we find that the richest 1% earn about 4% of total income. For the top 2% we have
pct = 98
threshold = np.percentile(treatment.re78, pct)
income[income > threshold].sum()/total
## 0.07508822500042818
Again, this is less than 98%. Instead of manually trying to find the correct value, we can do a loop:
for pct in range(99, 50, -1):
    threshold = np.percentile(treatment.re78, pct)
    share = income[income > threshold].sum()/total
    if 100*share > pct:
        print(pct, share)
        break
## 63 0.6474562618876845
So in this data, the richest 37% earn approximately 65% of total income. If we want a more precise ratio, we can write another loop that goes down to fractions of a percent, as sketched below.
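A possible sketch of such a finer search, stepping by 0.1 percentage points and using the same logic as the loop above:

for pct in np.arange(99, 50, -0.1):
    threshold = np.percentile(treatment.re78, pct)
    share = income[income > threshold].sum()/total
    if 100*share > pct:
        print(round(pct, 1), share)
        break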
A.6 Linear Algebra
A.6.1 Numpy Arrays as Vectors and Matrices
A.6.1.1 Solution: Berlin-Germany-France-Paris
berlin = np.array([-0.562, 0.630, -0.453, -0.299, -0.006])
germany = np.array([0.194, 0.507, 0.287, 0.132, -0.281])
france = np.array([0.605, -0.678, -0.436, -0.019, -0.291])
paris = np.array([-0.074, -0.855, -0.689, -0.057, -0.139])
v = berlin - germany + france
v
## array([-0.151, -0.555, -1.176, -0.45 , -0.016])
v - paris
## array([-0.077, 0.3 , -0.487, -0.393, 0.123])
By just eyeballing the result, it does not look very close to Paris. However, when doing the exercise with all 100 components, Paris indeed turns out to be the most similar word to Berlin - Germany + France.
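A sketch of such a similarity search with the full vectors; here vectors stands for a hypothetical dict that maps each word to its 100-dimensional array, which is not part of the data shown above:

def cosine(a, b):
    # cosine similarity between two vectors
    return np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))

# similarity of v = berlin - germany + france to every word in the vocabulary
sims = {word: cosine(v, vec) for word, vec in vectors.items()}
max(sims, key=sims.get)  # the word most similar to v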
A.6.1.2 Solution: create and multiply row vectors
There are several ways to create the row vectors, here are a few different approaches
x1 = (1 + np.arange(5)).reshape((5,1))  # use 1 + arange().reshape
x2 = np.array([[-1, 2, -3, 4, -5]]).T  # create manually
x3 = np.arange(5,0,-1).reshape((5,1))  # use arange(arguments).reshape
# Stack all three row vectors into a matrix
X = np.row_stack((x1.T, x2.T, x3.T))
beta = 0.1*np.ones((5,1))  # use np.ones()
# and now compute
x1.T @ beta
## array([[1.5]])
x2.T @ beta
## array([[-0.3]])
x3.T @ beta
## array([[1.5]])
X @ beta
## array([[ 1.5],
## [-0.3],
## [ 1.5]])
Note that when multiplying \(\mathsf{X}\), which is all three vectors stacked underneath each other, we get all three answers stacked underneath each other.
A.7 Linear Regression
A.7.0.1 Solution: Compute \(MSE\) and plot regression line
The solution is to combine the \(MSE\) computation with plotting the results. One can do it like this:
iris = pd.read_csv("../data/iris.csv.bz2", sep="\t")
versicolor = iris[iris.Species == "versicolor"].rename(
    columns={"Petal.Length": "pl", "Petal.Width":"pw"})
def msePlot(beta0, beta1):
    hat_w = beta0 + beta1*versicolor.pl  # note: vectorized operators!
    e = versicolor.pw - hat_w
    mse = np.mean(e**2)
    plt.scatter(versicolor.pl, versicolor.pw, edgecolor="k")
    length = np.linspace(versicolor.pl.min(), versicolor.pl.max(), 11)
    hatw = beta0 + beta1*length
    plt.plot(length, hatw, c='r')
    plt.xlabel("Petal length")
    plt.ylabel("Petal width")
    plt.show()
    return mse
Next, we can attempt to follow the coordinate descent approach: keep one parameter constant while minimizing the MSE over the other, and thereafter keep the other fixed while modifying the first:
msePlot(-0.5, 0.5)  # start with the old value as benchmark
## 0.11320000000000001
msePlot(-0.5, 0.6)  # did we improve or not?
## 0.5631599999999998
msePlot(-0.5, 0.4)  # 0.6 was worse, so try this instead
## 0.03051999999999999
# now modify beta0
msePlot(-0.4, 0.4)  # feel like we should lift the line a bit
## 0.016119999999999995
# Now we are pretty happy with the results
A.7.1 Linear Regression in python: statsmodels.formula.api and sklearn
A.7.1.1 Solution: Explain virginica sepal width using linear regression with sklearn
from sklearn.linear_model import LinearRegression
## /usr/lib/python3/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.1
## warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
iris = pd.read_csv("../data/iris.csv.bz2", sep="\t")
virginica = iris[iris.Species == "virginica"]
## extract explanatory variables as X
X = virginica[["Sepal.Length", "Petal.Width", "Petal.Length"]]
## call the dependent variable 'y' for consistency
y = virginica[["Sepal.Width"]]
m = LinearRegression()
_ = m.fit(X, y)
## compute R2
m.score(X, y)
A.7.1.2 Solution: Male wages as function of schooling and ethnicity
from sklearn.linear_model import LinearRegression
males = pd.read_csv("../data/males.csv.bz2", sep="\t")
## extract explanatory variables as X
X = males[["school", "ethn"]]
X.head(2)  # to see what it is
As a check, let's see what the possible values of ethn are:
males.ethn.unique()
We have three different values. Hence we should have three coefficients as well–one for school and two for ethn as we drop one category as the reference.
Now we convert \(\mathsf{X}\) to dummies and fit the model
X = pd.get_dummies(X, drop_first=True)
## For quick check
X.sample(2)
## call the dependent variable 'y' for consistency
y = males[["wage"]]
m = LinearRegression()
_ = m.fit(X, y)
Finally, check how many coefficients we have:
m.coef_
Indeed, we have three coefficients. Compute \(R^2\):
m.score(X, y)
A.7.2 Model Goodness
A.7.2.1 Solution: downward sloping, and well-aligned point clouds
Create downward trending point cloud: larger \(x\) must correspond to smaller \(y\), hence \(\beta_1\) should be negative. Choose \(\beta_1 = -1\):
n = 100
beta0, beta1 = 1, -1  # choose beta1 negative
x = np.random.normal(size=n)
eps = np.random.normal(size=n)
y = beta0 + beta1*x + eps
_ = plt.scatter(x, y, edgecolor="k")
_ = plt.xlabel("x")
_ = plt.ylabel("y")
_ = plt.show()
scale defines the variation of the random values; the default value is 1. As we add \(\epsilon\) when computing \(y\), it describes how much the \(y\) values are off from the perfect alignment:
n = 100
beta0, beta1 = 1, -1
x = np.random.normal(size=n)
eps = np.random.normal(scale=0.1, size=n)  # make eps small
y = beta0 + beta1*x + eps
_ = plt.scatter(x, y, edgecolor="k")
_ = plt.xlabel("x")
_ = plt.ylabel("y")
_ = plt.show()
A.7.2.2 Solution: Manually compute \(R^2\) for regression
import statsmodels.formula.api as smf
# find TSS
TSS = np.sum((versicolor.pw - np.mean(versicolor.pw))**2)
# find predicted values
m = smf.ols("pw ~ pl", data=versicolor).fit()
beta0, beta1 = m.params[0], m.params[1]
pwhat = beta0 + beta1*versicolor.pl
# find ESS
ESS = np.sum((versicolor.pw - pwhat)**2)
R2 = 1 - ESS/TSS
R2
## 0.6188466815001422
A.7.2.3 Solution: Create a large and a small \(R^2\) value
Small variance (the scale parameter) of \(\epsilon\) results in well-aligned datapoints and hence a high \(R^2\). E.g.
n = 100
beta0, beta1 = 1, -1  # choose beta1 negative
x = np.random.normal(size=n)
eps = np.random.normal(scale=0.05, size=n)
y = beta0 + beta1*x + eps
plt.scatter(x, y, edgecolor="k")
## <matplotlib.collections.PathCollection object at 0x7a346510fb50>
"x") plt.xlabel(
## Text(0.5, 0, 'x')
"y") plt.ylabel(
## Text(0, 0.5, 'y')
plt.show()
randomDF = pd.DataFrame({"x":x, "y":y})
m = smf.ols("y ~ x", data=randomDF).fit()  # fit the model
m.rsquared
## 0.9976016958673664
On the contrary, a large variance of the error term makes points only very weakly aligned with the regression line:
eps = np.random.normal(scale=5, size=n)
y = beta0 + beta1*x + eps
plt.scatter(x, y, edgecolor="k")
## <matplotlib.collections.PathCollection object at 0x7a34651ef910>
"x") plt.xlabel(
## Text(0.5, 0, 'x')
"y") plt.ylabel(
## Text(0, 0.5, 'y')
plt.show()
randomDF = pd.DataFrame({"x":x, "y":y})
m = smf.ols("y ~ x", data=randomDF).fit()  # fit the model
m.rsquared
## 0.11010626058941486
A.8 Logistic Regression
A.8.1 Logistic Regression in python: statsmodels.formula.api and sklearn
A.8.1.1 Solution: Predict treatment using all variables
First we load data, and convert all non-numeric variables to dummies.
treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
treatment.head(2)
## treat age educ ethn married re74 re75 re78 u74 u75
## 0 True 37 11 black True 0.0 0.0 9930.05 True True
## 1 True 30 12 black False 0.0 0.0 24909.50 True True
y = treatment.treat
## There are categoricals: convert these to dummies
X = pd.get_dummies(treatment.drop("treat", axis=1), drop_first=True)
## get a glimpse of X--does it look good?
X.sample(3)
## age educ married re74 re75 re78 u74 u75 ethn_hispanic ethn_other
## 645 27 11 True 10897.4 3938.71 7684.18 False False 0 0
## 1274 54 12 True 15674.1 25064.50 0.00 False False 0 1
## 448 50 12 True 29193.1 17903.20 25121.40 False False 0 0
The design matrix looks reasonable. Note that logical values are kept as they are and not transformed to dummies, as logical values are essentially dummies already. Now fit the model:
from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
_ = m.fit(X, y)
m.score(X, y)
Indeed, the score is now better than 93%.
A.9 Predictions and Model Goodness
A.9.1 Predicting using linear regression
A.9.1.1 Solution: manually predict distance to galaxies using modern data
hubble = pd.read_csv("../data/hubble.csv", sep="\t")
## fit with modern values
m = smf.ols("Rmodern ~ vModern", data=hubble).fit()
## assign estimated parameter values
beta0, beta1 = m.params
v7331 = hubble.vModern.iloc[17]
v7331  # velocity
beta0 + beta1*v7331  # distance
vs = np.array([500, 1000, 2000])
beta0 + beta1*vs  # 8 to 27 Mpc
A.9.1.2 Solution: predict petal width
from sklearn.linear_model import LinearRegression
iris = pd.read_csv("../data/iris.csv.bz2", sep="\t")
versicolor = iris[iris.Species == "versicolor"]
y = versicolor[["Petal.Width"]]
X = versicolor[["Sepal.Length", "Sepal.Width"]]
# note: length first, width second
## get a glimpse of X--does it look good?
X.sample(3)
## initialize and fit the model
m = LinearRegression()
_ = m.fit(X, y)
## predict
yhat = m.predict(X)
rmse = np.sqrt(np.mean((y - yhat)**2))
rmse
RMSE is rather small, 0.1391612 cm. This is because the range of petal width is small, only 0.8 cm.
In order to predict a flower of the given size, we can create a matrix with the requested numbers:
newX = np.array([[5, 2]])  # length first, width second
m.predict(newX)
A.9.1.3 Solution: predict hispanic male wage in service sector
As the first step we should load the data and check how industry is coded:
males = pd.read_csv("../data/males.csv.bz2", sep="\t")
males.industry.unique()
We see that finance is coded just as “Finance”. Now we can follow the example in the corresponding section.
As the second step we can fit the linear regression. We call the original data matrix with categoricals Xc in order to distinguish it from the final design matrix X below:
from sklearn.linear_model import LinearRegression
## Extract relevant variables in original categorical form
## we need it below
Xc = males[["school", "ethn", "industry"]]
y = males.wage
X = pd.get_dummies(Xc, drop_first=True)
m = LinearRegression()
_ = m.fit(X, y)
Now m is a fitted model that we can use for prediction. Next, we design the data frame for prediction using the categories and attach it on top of the original categorical data matrix Xc:
## design prediction
newDf = pd.DataFrame({"school":16, "ethn":"hisp",
                      "industry":"Finance"},
                     index=["c1"])
## merge new on top of original
newDf = pd.concat((newDf, Xc), axis=0, sort=False)
## Does it look good?
newDf.head(3)
## create dummies of everything
allX = pd.get_dummies(newDf, drop_first=True)
## do dummies look reasonable?
allX.head(3)
Finally, we extract the relevant cases from the extended design matrix and predict:
## extract the topmost part from X
newX = allX.loc[["c1"]]
newX  # looks good
m.predict(newX)
Predicted hourly salary is $2.16.
A.9.2 Predicting with Logistic Regression
A.9.2.1 Solution: Predict Titanic survival for two persons using statsmodels
Load data and fit the model:
titanic = pd.read_csv("../data/titanic.csv.bz2")
m = smf.logit("survived ~ sex + C(pclass)", data=titanic).fit()
Next, create data for the desired cases. A simple way is to use a dict of lists. We put the man at the first position and the woman at the second position:
## create new data (a dict)
## man in 1st class, woman in 3rd class
= {"sex":["male", "female"], "pclass":[1, 3]} newx
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
Now we can predict using these data:
phat = m.predict(newx)
phat  # first for the man, second for the woman
yhat = phat > 0.5
yhat
As the results indicate, a 1st class man has approximately a 40% chance to survive, while a 3rd class woman has 60%. Hence we predict that the man died while the woman survived.
A.9.2.2 Solution: Test different thresholds with statsmodels
Load data and fit the model:
titanic = pd.read_csv("../data/titanic.csv.bz2")
m = smf.logit("survived ~ sex + C(pclass)", data=titanic).fit()
Predict the probabilities:
phat = m.predict()
Now try a few different thresholds:
yhat = phat > 0.5
np.mean(yhat == titanic.survived)
yhat = phat > 0.8
np.mean(yhat == titanic.survived)
yhat = phat > 0.3
np.mean(yhat == titanic.survived)
So far, 0.5 seems to be the best option.
For the plot, we can just create an array of thresholds and compute accuracy for each of these in a loop:
thresholds = np.linspace(0, 1, 100)
accuracies = np.empty(100)
for i in range(100):
    yhat = phat > thresholds[i]
    accuracies[i] = np.mean(yhat == titanic.survived)
_ = plt.plot(thresholds, accuracies)
_ = plt.xlabel("Threshold")
_ = plt.ylabel("Accuracy")
_ = plt.show()
The plot shows that the best accuracy is achieved at a threshold value between 0.6 and 0.7. This is, obviously, data dependent.
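For instance, the best threshold and the corresponding accuracy can also be read directly from the arrays computed above:

best = np.argmax(accuracies)
thresholds[best], accuracies[best]  # best threshold and its accuracy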
A.9.2.3 Solution: Row of four zeros
The object is a pandas data frame, and hence the first column (the first zero) is its index. This means we are using the first row of data.
The second column corresponds to the dummy sex_male. Hence the person in the first row is a female.
The 3rd and 4th columns are dummies, describing the passenger class. Hence together they indicate that the passenger was traveling in the 1st class.
To conclude: the zeros show that the passenger in the first line of data was a woman who was traveling in first class.
A.9.2.4 Solution: Predict treatment status using all variables
First load data and convert all non-numeric variables to dummies:
treatment = pd.read_csv("../data/treatment.csv.bz2", sep="\t")
y = treatment.treat
## drop outcome (treat) and convert all categoricals
## to dummies
X = pd.get_dummies(treatment.drop("treat", axis=1), drop_first=True)
## get a glimpse of X--does it look good?
X.sample(3)
## age educ married re74 re75 re78 u74 u75 ethn_hispanic ethn_other
## 2667 29 10 True 41144.6 8951.61 31032.3 False False 0 1
## 1233 28 13 True 15282.3 14322.60 14038.4 False False 0 1
## 1346 22 12 True 17633.4 17903.20 14777.3 False False 1 0
The design matrix looks reasonable; in particular we can see that the logical married is still logical, but ethn with three categories is converted to two dummies (ethn_hispanic and ethn_other). Now fit the model:
from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
_ = m.fit(X, y)
Now we can extract the desired cases and predict values for those only:
newx = X.loc[[0, 9, 99]]  # extract desired rows
m.predict_proba(newx)[:, 1]  # note: only output treatment probability
As the probabilities suggest, the first case is predicted not to participate, and the two latter cases are predicted to participate.
A.9.3 Confusion Matrix–Based Model Goodness Measures
A.9.3.1 Solution: Titanic Survival Confusion Matrix with Child Variable
We can create child with a simple logical operation:
child = titanic.age < 14
Note that we do not have to include child in the data frame, a variable in the working environment will do. Now we model and predict
m = smf.logit("survived ~ sex + C(pclass) + child", data=titanic).fit()
yhat = m.predict(titanic) > 0.5
The confusion matrices, computed in two ways:
cm1 = pd.crosstab(titanic.survived, yhat)
cm1
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(titanic.survived, yhat)
cm2
Indeed, both ways to get the matrix yield the same result.
And finally we extract true negatives:
cm1.iloc[0, 0]
cm2[0, 0]
This is the same number as in the original model in Section 12.3.2. But now we have a slightly larger number of true positives.
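As a reminder of how these entries relate to the standard measures, here is a small sketch computing accuracy, precision, and recall directly from cm2 (rows are true values, columns predicted values, as returned by confusion_matrix):
tn, fp, fn, tp = cm2.ravel()  # unpack the 2x2 matrix: row = truth, column = prediction
accuracy = (tn + tp)/(tn + fp + fn + tp)
precision = tp/(tp + fp)
recall = tp/(tp + fn)
print(accuracy, precision, recall)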
A.9.3.2 Solution: Titanic Survival with Naive Model
We can create a naive model with the formula "survived ~ 1", but let's do it manually here. We just predict everyone to the class that is more common:
titanic.survived.value_counts()
As one can see, survival 0 is the more common value, hence we predict that everyone died:
yhat = pd.Series(0, index=titanic.index)
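For reference, the formula-based approach mentioned above is just an intercept-only logit. A sketch (assuming smf and titanic are loaded as before) that yields the same all-zero predictions, because the overall survival rate is below 0.5:
m0 = smf.logit("survived ~ 1", data=titanic).fit()  # intercept-only (naive) model
yhat0 = m0.predict(titanic) > 0.5  # all False, i.e. everyone predicted to die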
In order to better understand the results, let’s also output the confusion matrix:
pd.crosstab(titanic.survived, yhat)
As one can see, pd.crosstab only outputs a single column here, as we do not predict anyone to survive. Essentially the survivor column is all zeros, but pd.crosstab does not show this (confusion_matrix will show this column).
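To see the full 2-by-2 matrix, including the all-zero column, one can use confusion_matrix instead (a short sketch):
from sklearn.metrics import confusion_matrix
confusion_matrix(titanic.survived, yhat)  # the second column (predicted survivors) is all zeros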
We can compute the goodness values using sklearn.metrics:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy_score(titanic.survived, yhat)
precision_score(titanic.survived, yhat)
recall_score(titanic.survived, yhat)
f1_score(titanic.survived, yhat)
The results give us two errors: as we do not predict any survivors, we cannot compute precision, because 0 out of 0 predicted survivors being correct is an undefined ratio. As the \(F\)-score is (partly) based on precision, we cannot compute it either. Additionally, \(R = 0\), as we do not capture any survivors.
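If these errors are undesirable, newer versions of sklearn (0.22 and above) let one specify what to return in the undefined case via the zero_division argument; a sketch:
precision_score(titanic.survived, yhat, zero_division=0)  # report 0 for the undefined precision instead of an error
f1_score(titanic.survived, yhat, zero_division=0)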
A.10 Trees and Forests
A.10.1 Regression Trees
A.10.1.1 Solution:
First, load and prepare the data:
from sklearn.tree import DecisionTreeRegressor
= pd.read_csv("../data/boston.csv.bz2", sep="\t") boston
## define outcome (medv) and design matrix (X)
y = boston.medv
= boston[["rm", "age"]] X
Now define and fit the model:
m = DecisionTreeRegressor(max_depth=2)
_ = m.fit(X, y)
Compute \(R^2\) on training data:
m.score(X, y)
This is a slight improvement: the original \(R^2\), using only rm, was 0.58. Hence the tree found it useful to include age in the decision-making. As a depth-2 tree is a very simple model, we are not really afraid of overfitting here.
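If one wants to double-check the overfitting concern, a quick sketch is to evaluate the same kind of model on a held-out subset (train_test_split and the random seed here are assumptions, not part of the original solution):
from sklearn.model_selection import train_test_split
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.25, random_state=1)
mv = DecisionTreeRegressor(max_depth=2).fit(Xt, yt)
mv.score(Xv, yv)  # validation R^2; for such a simple tree it should stay close to the training value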
Here is the tree as seen by graphviz.
Indeed, now the tree considers age in the left node, with newer neighborhoods being somewhat more expensive.
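The figure is produced with graphviz in the text; a minimal sketch of a similar picture using sklearn's own plot_tree, which avoids the external graphviz dependency:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plot_tree(m, feature_names=list(X.columns), filled=True)  # draw the fitted depth-2 tree
plt.show()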
Include all features:
= boston.drop("medv", axis=1) X
_ = m.fit(X, y)
m.score(X, y)
\(R^2\) improved noticeably. This hints that the tree was able to pick at least one other feature that describes the price better than rm and age.
Here is the tree when using all available features. rm is still the first feature used to split the data, but the other one is now lstat, the percentage of lower-status population.
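One way to confirm which features the full tree actually uses is to inspect the fitted feature importances (a small sketch):
pd.Series(m.feature_importances_, index=X.columns).sort_values(ascending=False)  # importance of each feature in the fitted tree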
A.11 Natural Language Processing
A.11.1 Categorize text
A.11.1.1 Solution: find poems with TF-IDF
We basically just swap CountVectorizer out for TfidfVectorizer. Everything else will remain the same:
from sklearn.feature_extraction.text import TfidfVectorizer
vrizer = TfidfVectorizer()
X = vrizer.fit_transform([rudaki1, rudaki2, phu1, phu2])
newX = vrizer.transform([newdoc])
from sklearn.neighbors import NearestNeighbors
m = NearestNeighbors(n_neighbors=1, metric="cosine")
_ = m.fit(X)
d = m.kneighbors(newX, n_neighbors=4)
d
The closest document based on cosine distance is document 2 at distance 0.445.
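Note that kneighbors returns a pair of arrays, the distances and the corresponding document indices; a small sketch to unpack them:
dist, idx = m.kneighbors(newX, n_neighbors=4)
print(dist)  # cosine distances, closest first
print(idx)   # indices of the corresponding training documents (0-based)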
To be more precise, they are not the richest, but those with the largest income in this data.