A Solutions
A.1 Python
A.1.1 Operators
A.1.3 Collections
A.1.3.2 Combining lists
friends = ['Paulina', 'Severo']
others = []
others.append("Lai Ming")
others.append("Lynn")
people = friends + others
print("all people", people)
## all people ['Paulina', 'Severo', 'Lai Ming', 'Lynn']
A.1.3.3 Assign people to seats
Consider two lists:
names = ["Adam", "Ashin", "Inukai", "Tanaka", "Ikki"]
seats = [33, 12, 45, 2, 17]
assigneds = []
for i in range(len(names)):
    a = names[i] + ": " + str(seats[i])
    assigneds.append(a)
assigneds
## ['Adam: 33', 'Ashin: 12', 'Inukai: 45', 'Tanaka: 2', 'Ikki: 17']
A.1.3.4 List comprehension
- Make the list not of i but of i + 1:
## [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
- To make pizzas with different toppings you can just use string concatenation with +:
## ['pizza with mushrooms', 'pizza with mozarella', 'pizza with ananas']
- Here you can just loop over the lists, and each time pick out element [1] from the list:
## [2, 'b', <built-in function print>]
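The code for these three comprehensions did not survive in this section. The sketch below is one way to reproduce the outputs shown above. The input lists (toppings, triples) are assumptions chosen to match those outputs, not necessarily the lists from the exercise.

```python
# i + 1 for i = 0, ..., 9 gives the numbers 1, ..., 10
numbers = [i + 1 for i in range(10)]
print(numbers)

# string concatenation with + inside a comprehension
toppings = ["mushrooms", "mozarella", "ananas"]  # assumed input list
pizzas = ["pizza with " + t for t in toppings]
print(pizzas)

# loop over a list of lists and pick element [1] from each
# (hypothetical inner lists, chosen so that element [1] matches the output)
triples = [[1, 2, 3], ["a", "b", "c"], [sum, print, sorted]]
seconds = [lst[1] for lst in triples]
print(seconds)
```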
A.1.3.5 Dict mapping coordinates to cities
placenames = {(31.228611, 121.474722): "Shanghai",
              (23.763889, 90.388889): "Dhaka",
              (13.7525, 100.494167): "Bangkok"
              }
print(placenames)
## {(31.228611, 121.474722): 'Shanghai', (23.763889, 90.388889): 'Dhaka', (13.7525, 100.494167): 'Bangkok'}
Note that such a lookup is usually not what we want, because it only returns the city name when given the exact coordinates. If the coordinates are even a tiny bit off, the dict cannot find the city. If you consider cities to be continuous areas, you should do a geographic lookup, not a dictionary lookup.
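To illustrate the point about exact coordinates, here is a small sketch. The rounded coordinates in the second lookup are made up for the demonstration.

```python
placenames = {(31.228611, 121.474722): "Shanghai",
              (23.763889, 90.388889): "Dhaka",
              (13.7525, 100.494167): "Bangkok"}

# exact key: the lookup succeeds
exact = placenames.get((31.228611, 121.474722))
# slightly rounded coordinates (hypothetical): this is a different key,
# so .get() finds nothing and returns None
nearby = placenames.get((31.2286, 121.4747))
print(exact)
print(nearby)
```

Using .get() instead of square brackets avoids a KeyError for the missing key.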
A.1.3.6 Dict of dicts
tsinghua = {"house":30, "street":"Shuangqing Rd",
            "district":"Haidian", "city":"Beijing",
            "country":"CN"}
kenyatta = {"pobox":"43844-00100", "city":"Nairobi",
            "country":"KE"}
places = {"Tsinghua":tsinghua, "Kenyatta":kenyatta}
places["Teski Refugio"] = {"house":256,
                           "street":"Volcan Osorno",
                           "city":"La Ensenada",
                           "region":"Los Lagos",
                           "country":"CL"}
print(places)
## {'Tsinghua': {'house': 30, 'street': 'Shuangqing Rd', 'district': 'Haidian', 'city': 'Beijing', 'country': 'CN'}, 'Kenyatta': {'pobox': '43844-00100', 'city': 'Nairobi', 'country': 'KE'}, 'Teski Refugio': {'house': 256, 'street': 'Volcan Osorno', 'city': 'La Ensenada', 'region': 'Los Lagos', 'country': 'CL'}}
A.1.4 Language Constructs
A.1.4.1 Solution: odd or even
## 1 odd
## 2 even
## 3 odd
## 4 even
## 5 odd
## 6 even
## 7 odd
## 8 even
## 9 odd
## 10 even
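The code for this solution is not shown here. One way to produce the output above is a loop with an if/else test on the remainder. This is a sketch, and the variable names are our own.

```python
labels = []
for n in range(1, 11):
    # n % 2 is the remainder of division by 2: 0 for even numbers
    if n % 2 == 0:
        label = "even"
    else:
        label = "odd"
    labels.append(label)
    print(n, label)
```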
A.1.5 Modules
A.1.5.1 Solution: access your file system
Below is a solution that imports the whole module. You can also
use, e.g., from os import getcwd, listdir, or other approaches.
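A minimal sketch of the whole-module approach, assuming the task is to show the current working directory and the files in it (as the output suggests):

```python
import os

cwd = os.getcwd()      # current working directory, as a string
files = os.listdir()   # names of files and folders in the current directory
print(cwd)
print(files)
```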
## '/home/otoomet/tyyq/lecturenotes/machinelearning-py'
## ['index.md', 'images.rmd', 'Makefile', 'neural-nets.rmd', 'ml-techniques.rmd~', 'python.rmd', 'neural-nets.md', 'web-scraping.rmd', 'ml-workflow.md', 'text.rmd', 'descriptive-statistics.rmd', 'python.md', 'titanic-tree.png', 'predictions.md', 'solutions.rmd', '_bookdown.yml', 'datasets.md', 'descriptive-statistics.md', 'ml-techniques.md', 'index.rmd', 'predictions.rmd', 'build', '.cache', 'regularization.rmd', 'trees-forests.rmd', 'images.md', 'linear-regression.rmd', 'overfitting-validation.rmd', 'linear-algebra.md', 'linear-algebra.rmd', 'keras-cats-vs-dogs.py', 'svm.rmd', 'overfitting-validation.md', 'cleaning-data.md', 'numpy-pandas.rmd', 'plotting.rmd', '_output.yml', 'figs', 'logistic-regression.rmd', 'keras-color-spiral.py', 'descriptive-analysis.rmd', 'ml-techniques.rmd', 'linear-regression.md', 'cleaning-data.rmd', 'ml-workflow.rmd', 'unsupervised-learning.rmd', 'numpy-pandas.md', 'logistic-regression.md', 'files', 'descriptive-analysis.md', 'text.md', 'datasets.rmd', 'machinelearning-py.rds', '.fig', 'regularization.md', 'plotting.md']
A.2 Numpy and Pandas
A.2.1 Numpy
A.2.1.1 Solution: concatenate arrays
## <string>:1: DeprecationWarning: `row_stack` alias is deprecated. Use `np.vstack` directly.
## array([[-1., -1., -1., -1.],
## [ 0., 0., 0., 0.],
## [ 2., 2., 2., 2.]])
Obviously, you can call np.ones and np.zeros directly inside
np.row_stack and skip creating the temporary variables. In recent
NumPy versions row_stack is a deprecated alias; np.vstack does the same.
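The stacking described above can be sketched without temporary variables. The sketch uses np.vstack rather than np.row_stack, and assumes that the rows are built from np.ones and np.zeros of length 4, which matches the array shown.

```python
import numpy as np

# three rows of length 4: all -1, all 0, all 2
stacked = np.vstack((-np.ones(4), np.zeros(4), 2 * np.ones(4)))
print(stacked)
```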
A.2.1.2 Solution: create matrix of even numbers
## array([[ 2, 4, 6, 8, 10],
## [12, 14, 16, 18, 20],
## [22, 24, 26, 28, 30],
## [32, 34, 36, 38, 40]])
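One way to produce the matrix above is to create the even numbers with np.arange and reshape them. This is a sketch and not necessarily the approach the exercise had in mind.

```python
import numpy as np

# even numbers 2, 4, ..., 40 (the end point 41 is excluded), as 4 rows of 5
evens = np.arange(2, 41, 2).reshape(4, 5)
print(evens)
```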
A.2.1.3 Solution: create matrix, play with columns
## [14 24 34 44]
## array([[10, 12, 14, 16, 18],
## [20, 22, 24, 26, 28],
## [30, 32, 34, 36, 38],
## [ 1, 2, 3, 4, 5]])
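The two outputs above can be reproduced as follows. This is a sketch, assuming the task is to create a 4x5 matrix of the numbers 10, 12, ..., 48, print its third column, and overwrite its last row with 1, ..., 5.

```python
import numpy as np

a = np.arange(10, 50, 2).reshape(4, 5)  # 10, 12, ..., 48 in 4 rows
third_col = a[:, 2].copy()              # all rows, column with index 2
print(third_col)

a[3] = np.arange(1, 6)                  # replace the last row with 1, ..., 5
print(a)
```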
A.2.2 Pandas
A.2.2.1 Solution: series of capital cities
import pandas as pd
cities = pd.Series(["Brazzaville", "Libreville", "Malabo", "Yaoundé"],
                   index=["Congo", "Gabon", "Equatorial Guinea", "Cameroon"])
A.2.2.2 Solution: extract capital cities
A.2.2.3 Solution: create city dataframe
To keep the code clean, we create the data as a separate dict and the index as a list, and then make a data frame out of the two:
import pandas as pd
data = {"capital":["Brazzaville", "Libreville", "Malabo", "Yaoundé"],
        "population":[1696, 703, 297, 2765]}
countries = ["Congo", "Gabon", "Equatorial Guinea", "Cameroon"]
cityDF = pd.DataFrame(data, index=countries)
A.2.2.4 Solution: Variables in G.W.Bush approval data frame
Pandas prints five columns: the index and four variables (date, approve, disapprove, dontknow).
A.2.2.5 Show files in current/parent folder
Remember that the parent folder is denoted by double dots (..). The current
folder is a single dot (.), but usually it is not needed.
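A sketch of how the listings below can be obtained with the os module, passing the dot notation to os.listdir:

```python
import os

current_files = os.listdir(".")   # "." is the current folder, same as os.listdir()
parent_files = os.listdir("..")   # ".." is the parent folder
print(os.getcwd())
print(current_files)
print(parent_files)
```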
Current working directory:
## '/home/otoomet/tyyq/lecturenotes/machinelearning-py'
Files in the current folder
## ['index.md', 'images.rmd', 'Makefile', 'neural-nets.rmd', 'ml-techniques.rmd~', 'python.rmd', 'neural-nets.md', 'web-scraping.rmd', 'ml-workflow.md', 'text.rmd', 'descriptive-statistics.rmd', 'python.md', 'titanic-tree.png', 'predictions.md', 'solutions.rmd', '_bookdown.yml', 'datasets.md', 'descriptive-statistics.md', 'ml-techniques.md', 'index.rmd', 'predictions.rmd', 'build', 'trees-forests.md', '.cache', 'regularization.rmd', 'trees-forests.rmd', 'unsupervised-learning.md', 'images.md', 'linear-regression.rmd', 'overfitting-validation.rmd', 'web-scraping.md', 'linear-algebra.md', 'linear-algebra.rmd', 'keras-cats-vs-dogs.py', 'svm.md', 'svm.rmd', 'overfitting-validation.md', 'cleaning-data.md', 'numpy-pandas.rmd', 'plotting.rmd', '_output.yml', 'figs', 'logistic-regression.rmd', 'keras-color-spiral.py', 'descriptive-analysis.rmd', 'ml-techniques.rmd', 'linear-regression.md', 'cleaning-data.rmd', 'ml-workflow.rmd', 'unsupervised-learning.rmd', 'numpy-pandas.md', 'logistic-regression.md', 'files', 'descriptive-analysis.md', 'text.md', 'datasets.rmd', 'machinelearning-py.rds', '.fig', 'regularization.md', 'plotting.md']
Files in the parent folder
## ['machineLearning.mtc3', 'machineLearning.mtc17', 'machineLearning.mtc8', 'machineLearning.mtc2', 'data', 'Makefile', 'machineLearning.pdf', 'machineLearning.mtc14', 'machineLearning.mtc20', 'lecturenotes.bib.bak', 'machinelearning-R.html', 'machineLearning.mtc10', 'latexmkrc', 'tex', 'auto', '.RData', 'machineLearning.mtc9', 'preamble.tex', 'scripts', 'machineLearning.mtc18', 'machineLearning.exc', 'machineLearning.mtc13', 'intro-to-stats.rnw', 'machineLearning.mtc21', 'machineLearning.mtc19', 'machineLearning.mtc16', 'machineLearning.mtc5', '.cache', 'material', 'machineLearning.tex', 'solutions.rnw', 'machineLearning.maf', 'machineLearning.mtc0', 'machineLearning.mtc11', 'ml-models.rnw', 'machineLearning.mtc12', 'machinelearning-py', 'kbordermatrix.sty', 'machinelearning-R', 'solutions.tex', 'datascience-intro', 'machinelearning-common', 'machineLearning.chs', 'test', 'machineLearning.mtc1', 'machineLearning.exm', 'machineLearning.mtc23', 'machineLearning.mtc15', 'figs', 'machineLearning.mtc7', '.Rprofile', '.git', 'intro-to-stats.tex', 'machineLearning.xdv', 'machineLearning.mtc4', 'machineLearning.rnw', 'img', 'machineLearning.mtc', 'machineLearning.bbl', 'README.md', 'files', 'machineLearning.mtc22', '.Rhistory', 'videos', '.gitignore', 'machineLearning.rip', '.fig', 'ml-models.tex', 'literature', 'machineLearning.mtc6']
Obviously, everyone has different files in these folders.
A.2.2.6 Solution: presidents approval 88%
A.2.2.7 Solution: change index, convert to variable
The original data frame was created as
import pandas as pd
capitals = pd.DataFrame(
    {"capital":["Kuala Lumpur", "Jakarta", "Phnom Penh"],
     "population":[32.7, 267.7, 15.3]},  # in millions
    index=["MY", "ID", "KH"])
We can modify it as
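The modification code is missing from this section. Judging by the title, the task is to replace the index and then turn it into an ordinary variable. The sketch below does that under this assumption. The country names and the column name "country" are our own choices.

```python
import pandas as pd

capitals = pd.DataFrame(
    {"capital": ["Kuala Lumpur", "Jakarta", "Phnom Penh"],
     "population": [32.7, 267.7, 15.3]},  # in millions
    index=["MY", "ID", "KH"])

# replace the two-letter codes with full country names (assumed goal)
capitals.index = ["Malaysia", "Indonesia", "Cambodia"]
capitals.index.name = "country"
# convert the index into an ordinary variable; the new index is 0, 1, 2
capitals = capitals.reset_index()
print(capitals)
```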
A.2.2.8 Solution: create city dataframe
import numpy as np
import pandas as pd
cityM = np.array([[3.11, 5282, 19800],
                  [18.9, 306, 46997],
                  [4.497, 1886, 22000]])
names = ["Chittagong", "Dhaka", "Kolkata"]
vars = ["population", "area", "density"]
cityDF = pd.DataFrame(cityM, index=names,
                      columns=vars)
A.2.2.9 Solution: extract city data
## array([19800., 46997., 22000.])
## array([4.497e+00, 1.886e+03, 2.200e+04])
## np.float64(306.0)
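The extraction code is not shown in this section. The numeric outputs above are consistent with extracting the density column, the Kolkata row, and the area of Dhaka. The sketch below does this both on the matrix and on the data frame, under that assumption.

```python
import numpy as np
import pandas as pd

cityM = np.array([[3.11, 5282, 19800],
                  [18.9, 306, 46997],
                  [4.497, 1886, 22000]])
cityDF = pd.DataFrame(cityM,
                      index=["Chittagong", "Dhaka", "Kolkata"],
                      columns=["population", "area", "density"])

# density of all cities: third column
density_np = cityM[:, 2]
density_pd = cityDF["density"]
# all data for Kolkata: third row
kolkata_np = cityM[2]
kolkata_pd = cityDF.loc["Kolkata"]
# area of Dhaka: second row, second column
dhaka_area_np = cityM[1, 1]
dhaka_area_pd = cityDF.loc["Dhaka", "area"]
print(density_pd, kolkata_pd, dhaka_area_pd, sep="\n")
```

With numpy we address rows and columns by position, while the data frame lets us use the city and variable names.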
A.2.2.10 Solution: Titanic line 1000
Extract the 1000th row (note: index 999!) as a data frame, and thereafter extract name, survival status, and age. We extract in two steps because .iloc accepts only positions while .loc accepts only labels, so a row number and column names cannot be mixed in a single indexer.
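The two-step extraction can be sketched on a small made-up data frame. The toy values and the column names name, survived, age and fare are assumptions standing in for the Titanic data, and we pick the third row instead of the 1000th.

```python
import pandas as pd

# toy stand-in for the Titanic data (hypothetical values)
toy = pd.DataFrame({"name": ["A", "B", "C", "D"],
                    "survived": [1, 0, 1, 0],
                    "age": [29.0, 2.0, 30.0, 25.0],
                    "fare": [211.3, 151.6, 151.6, 26.0]})

# step 1: select the row by position; the double brackets keep it a data frame
row = toy.iloc[[2]]
# step 2: select the columns by name
picked = row[["name", "survived", "age"]]
print(picked)
```

For the actual task, the position would be 999 instead of 2.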
A.2.2.11 Solution: Titanic male/female age distribution
The two relevant variables are clearly sex and age. If you have already loaded the data, you may just keep these two variables, e.g. as titanic[["sex", "age"]].
Alternatively, you may read only these two columns, e.g. with the usecols argument of pd.read_csv.
A.3 Descriptive analysis with Pandas
A.3.1 What are the values?
A.3.1.1 Solution: which methods can be applied to the whole data frame
There is no better way than just to try:
The mathematical functions work, but may skip the non-numeric
variables. .min, however, finds the first value in alphabetical
order.
It is not immediately clear why .unique does not work while
.nunique works. It may be because there is a different number of
unique elements for each variable, but hey, you can just use a more
complex data structure and still compute that.
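The "more complex data structure" can be, for instance, a dict that maps each column name to the array of its unique values. A sketch on a small made-up data frame (the toy columns are assumptions, not the Titanic data):

```python
import pandas as pd

toy = pd.DataFrame({"sex": ["male", "female", "female"],
                    "pclass": [1, 3, 3]})

# .nunique works on the whole data frame: one count per column
counts = toy.nunique()
print(counts)

# data frames have no .unique method, but series do, so we
# collect the unique values column by column into a dict
uniques = {col: toy[col].unique() for col in toy.columns}
print(uniques)
```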
A.4 Cleaning and Manipulating Data
A.4.1 Missing Observations
A.4.1.1 Solution: Missings in fare in Titanic Data
We can count the NaN-s, e.g. as titanic.fare.isna().sum().
There is just a single missing value. As for unreasonable values, we should first check the data type of the variable, e.g. with titanic.fare.dtype.
We see it is coded as numeric (64-bit float), so we can query its range, e.g. with titanic.fare.min() and titanic.fare.max().
While 512 pounds seems a reasonable ticket price, the value 0 for the minimum is suspicious. We do not know if any passengers really did not pay a fare or, more likely, whether it just means that the data collectors did not have the information. So it is probably just a missing value, coded as 0.
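The steps above can be sketched on a small made-up series of fares. The toy values are assumptions, chosen to contain one NaN and one zero like the situation described.

```python
import numpy as np
import pandas as pd

# toy stand-in for the Titanic fare variable (hypothetical values)
toy = pd.DataFrame({"fare": [7.25, 0.0, np.nan, 512.3292]})

n_missing = toy.fare.isna().sum()    # number of NaN-s
fare_type = toy.fare.dtype           # data type of the variable
fare_min = toy.fare.min()            # min and max skip NaN-s by default
fare_max = toy.fare.max()
n_zero = (toy.fare == 0).sum()       # suspicious zeros
print(n_missing, fare_type, fare_min, fare_max, n_zero)
```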
A.4.2 Converting Variables
A.4.2.1 Solution: convert males’ dataset school to categories
Load the males dataset. It is instructive to test the code on a short sequence of years of schooling before converting the actual data. In particular, we want to get the boundaries right: “12” should be “HS”, “13” should be “Some college”, and so on.
We choose to put the bin boundaries at integer values, e.g. “HS”
is the interval \([12, 13)\). To ensure that “12” belongs to this
interval while “13” does not, we set right=False, i.e. exclude the
right boundary from each interval (and hence include it in the interval
above):
import numpy as np
import pandas as pd
# test on years of schooling 10-17
school = np.arange(10,18)
# convert to categories
categories = pd.cut(school,
                    bins=[-np.inf, 12, 13, 16, np.inf],
                    labels=["Less than HS", "HS", "Some college", "College"],
                    right=False)
# print in a way that years and categories are next to each other
pd.Series(categories, index=school)
This works correctly, e.g. 11 years of schooling is “Less than HS” while 12 years is “HS”. Now we can do the actual conversion:
# assumes the males dataset is already loaded as the data frame males
pd.cut(males.school,
       bins=[-np.inf, 12, 13, 16, np.inf],
       labels=["Less than HS", "HS", "Some college", "College"],
       right=False)
A.4.2.2 Solution: Convert Males’ dataset residence to dummies
The rest of the task pretty much repeats the examples in Converting
categorical variables to dummies; you just have to
find the prefix_sep argument to remove the underscore between the
prefix “R” and the category name. The code might look like
pd.get_dummies(males.residence, prefix="R", prefix_sep="").
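To see the effect of prefix_sep on the column names, here is a small sketch on made-up data. The category values are assumptions, not the actual values of residence in the males dataset.

```python
import pandas as pd

residence = pd.Series(["north", "south", "north", "east"])  # made-up categories

# default separator: prefix and category are joined with an underscore
with_underscore = pd.get_dummies(residence, prefix="R")
# empty separator: the category name follows the prefix directly
dummies = pd.get_dummies(residence, prefix="R", prefix_sep="")
print(list(with_underscore.columns))
print(list(dummies.columns))
```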
A.4.2.3 Solution: Convert Titanic’s age categories, sex, pclass to dummies
# assumes the Titanic data is already loaded as the data frame titanic
titanic["age"] = pd.cut(titanic.age,
                        bins=[0, 14, 50, np.inf],
                        labels=["0-13", "14-49", "50-"],
                        right=False)
One may also want to drop one of the dummy levels with the drop_first argument.
A.4.2.4 Solution: Convert Males’ residence, ethn to dummies and concatenate
We create the dummies separately for residence and ethn and give them corresponding prefix to make the variable names more descriptive.
# convert and remove using chaining
ethn = pd.get_dummies(males.ethn, prefix="ethn")\
         .drop("ethn_other", axis=1)
A.5 Descriptive Statistics
A.5.1 Inequality
A.5.1.1 Solution: 80-20 ratio of income
First load the data and compute total income. Next, we can just start by trying the uppermost 1% (lowermost 99%). The income share of the richest 1% is their summed income divided by the total income.
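The code and results for this solution are missing here. Below is a sketch of the trial-and-error approach on made-up incomes. It assumes the goal is to find the percentage p at which the richest p% receive at least (100 - p)% of total income. The function top_share and the toy data are our own and do not use the dataset from the exercise.

```python
import numpy as np

def top_share(income, percent):
    """Share of total income received by the richest `percent` % of people."""
    income = np.sort(np.asarray(income, dtype=float))
    n_top = int(round(len(income) * percent / 100))
    # the last n_top elements of the sorted array are the richest
    return income[len(income) - n_top:].sum() / income.sum()

# made-up incomes: 99 people earn 1, one person earns 101
income = np.array([1.0] * 99 + [101.0])

share_1 = top_share(income, 1)   # income share of the richest 1%
print(share_1)

# keep increasing the percentage until the richest p% get at least (100 - p)%
balance = None
for p in range(1, 51):
    if top_share(income, p) >= 1 - p / 100:
        balance = p
        break
print(balance)
```

With real data one would replace the toy array by the income variable of the dataset.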