Chapter 22 Dataset description
This section gives a brief overview of the datasets that are used in the book.
22.1 Boston housing
This is a popular dataset for machine learning, available from various sources. It is in the repo as boston.csv.bz2. This version is copied from R’s MASS package, but it is identical to other versions. It has 506 rows, 14 numeric variables, and no missing values. Each row contains data for one neighborhood (town/tract). The variable to be analyzed is typically medv, the median value of owner-occupied homes in that neighborhood. Variables:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- black: \(1000(B_{k} - 0.63)^2\) where \(B_{k}\) is the proportion of blacks by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
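The black variable is a nonlinear transformation of the underlying proportion, which makes its values hard to read directly. Below is a minimal base-R sketch of the formula from the list above. The helper name `black_index` and the input proportions are hypothetical, chosen for illustration only, and are not taken from the dataset.

```r
# Hypothetical helper: the transformation from the `black` variable
# description, 1000 * (Bk - 0.63)^2, where Bk is a proportion in [0, 1].
black_index <- function(bk) {
  1000 * (bk - 0.63)^2
}

# Illustrative proportions only (NOT values from the dataset)
bk <- c(0, 0.63, 1)
black_index(bk)
```

Note that the transformation cannot be inverted uniquely: under this formula, proportions of 0.53 and 0.73 both map to the same value, so the original proportion cannot always be recovered from black alone.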
Example (four randomly selected rows and a subset of the columns, using data.table syntax and assuming the data is loaded as boston):

    boston[sample(.N, 4), .(medv, age, rm, zn, crim, indus, nox, dis, rad, tax)]
22.2 Titanic
It is in the repo as titanic.csv.bz2.
A list of RMS Titanic passengers with their name, age, some other data, and whether they survived the shipwreck. It was collected by the investigation committee and contains most of the passengers on the ship. The dataset is available from various sources, e.g. Kaggle. The variables are:
- pclass: passenger class (1 = 1st; 2 = 2nd; 3 = 3rd).
- survived: survival (0 = No; 1 = Yes).
- name: name.
- sex: sex.
- age: age.
- sibsp: number of siblings/spouses aboard.
- parch: number of parents/children aboard.
- ticket: ticket number.
- fare: passenger fare.
- cabin: cabin.
- embarked: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).
- boat: lifeboat code (if survived).
- body: body number (if did not survive and body was recovered).
- home.dest: the home/final destination of the passenger.
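Several of the variables above are stored as codes. The following base-R sketch shows one possible way to turn them into readable labels, using only the code tables from the list. The helper `decode_titanic`, the new column names, and the toy rows are hypothetical and made up for illustration; they are not real passengers.

```r
# Code-to-label lookups, copied from the variable descriptions above
port_names <- c(C = "Cherbourg", Q = "Queenstown", S = "Southampton")
class_labels <- c("1st", "2nd", "3rd")

# Hypothetical helper: add readable columns to a data frame that has
# the columns pclass, survived, and embarked
decode_titanic <- function(df) {
  # Unknown or empty port codes become NA
  df$port <- unname(port_names[as.character(df$embarked)])
  df$class <- factor(df$pclass, levels = 1:3, labels = class_labels,
                     ordered = TRUE)
  df$survived_lbl <- ifelse(df$survived == 1, "Yes", "No")
  df
}

# Made-up rows for illustration only; these are NOT real passengers
toy <- data.frame(pclass = c(1, 3, 2), survived = c(1, 0, 1),
                  embarked = c("S", "Q", "C"))
decode_titanic(toy)
```

With the real data, the same helper could be applied after reading the file, for instance with `read.csv("titanic.csv.bz2")`, assuming the file is comma-separated and sits in the working directory.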
22.3 Yin-Yang
It is in the repo as yin-yang.csv.bz2.
This is a random point cloud of two categories, broadly resembling the well-known yin-yang pattern with a slightly noisy boundary. Variables:
- x, y: location on the plane.
- c: color, 0 or 1.
Example values:
x | y | c |
---|---|---|
0.6320557 | -1.5409452 | 0 |
0.6300944 | -0.6141974 | 0 |
0.2226683 | 2.1691457 | 1 |

Figure: the point cloud of the yin-yang data.
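As a small illustration of how the three variables fit together, the base-R sketch below rebuilds the three example rows from the table above and maps the category c to a plotting color. The data frame name `yy`, the column `col`, and the color choice are assumptions made for this sketch.

```r
# The three example rows shown in the table above
yy <- data.frame(
  x = c(0.6320557, 0.6300944, 0.2226683),
  y = c(-1.5409452, -0.6141974, 2.1691457),
  c = c(0, 0, 1)
)

# Map category 0/1 to a plotting color (the color choice is arbitrary)
palette2 <- c("black", "orange")
yy$col <- palette2[yy$c + 1]

# Count the points in each category
table(yy$c)

# With the full dataset loaded into `yy`, a scatterplot could be drawn as:
# plot(yy$x, yy$y, col = yy$col, pch = 16, asp = 1)
```

Setting `asp = 1` in the commented plot call keeps the two axes on the same scale, so that a circular pattern is not stretched into an ellipse.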