Chapter 14 Dataset Description
Here is a brief description of the datasets that are used in this book.
14.1 Hubble
In repo as hubble.csv.
Distance and radial velocity data for 24 galaxies from the 1929 publication by E. Hubble, “A Relation Between Distance and Radial Velocity Among Extra-Galactic Nebulae”, Proceedings of the National Academy of Sciences, 1929, 15, 168-173.
Variables are
- object: name of the galaxy
- ms: magnitude (brightness) of the brightest stars in that galaxy. For a few of the closest objects the magnitude is given as “..”, as in the original paper.
- R: distance (Mpc)
- v: radial velocity (km/s), negative values are toward us
- mt: visual magnitude (brightness)
- Mt: absolute magnitude
- D
- Rmodern: modern distance estimate (Mpc)
- vModern: modern radial velocity estimate (km/s)
The modern estimates are taken from Wikipedia.
Data example:
hubble <- read_delim("../data/hubble.csv")
hubble %>%
   sample_n(4)
## # A tibble: 4 × 9
## object ms R v mt Mt D Rmodern vModern
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 S.Mag. .. 0.032 170 1.5 -16 0.03 0.0617 158.
## 2 5194 17.3 0.5 270 7.4 -16.1 0.5 9.51 463.
## 3 4449 17.8 0.63 200 9.5 -14.5 0.63 3.6 204
## 4 7331 19 1.1 500 10.4 -14.8 1.1 12.2 816
The data was used for the famous Hubble diagram (Figure 1 in the 1929 paper):
hubble %>%
   ggplot(aes(R, v)) +
   geom_point(size = 2) +
   geom_smooth(method = "lm",
               se = FALSE)
(The original figure corrects for solar motion though.)
14.2 Iris
The iris dataset was published by Ronald Fisher in 1936, based on measurements collected by Edgar Anderson. It contains sepal and petal measures (see the figure for explanation) of 150 iris flowers of three species: setosa, versicolor and virginica (50 of each).
It is an R built-in dataset and can be loaded with
data(iris)
The variables are
- Sepal.Length: sepal length, in cm
- Sepal.Width: sepal width, in cm
- Petal.Length: petal length, in cm
- Petal.Width: petal width, in cm
- Species: setosa/versicolor/virginica
The widths and lengths are measured with a precision of one millimeter, and hence quite a few flowers share exactly the same values.
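The overlap mentioned above can be checked directly. The following is a minimal sketch in base R that counts how many flowers share their sepal length and width with an earlier row; the variable names (sepal, n_distinct_pairs and so on) are made up for this illustration.

```r
# Load the built-in dataset
data(iris)

# Keep only the two sepal measurements
sepal <- iris[, c("Sepal.Length", "Sepal.Width")]

n_flowers <- nrow(sepal)                  # number of flowers
n_distinct_pairs <- nrow(unique(sepal))   # distinct (length, width) pairs
n_overlapping <- sum(duplicated(sepal))   # rows that repeat an earlier pair

cat(n_flowers, "flowers,",
    n_distinct_pairs, "distinct sepal pairs,",
    n_overlapping, "repeated pairs\n")
```

In a scatterplot such repeated pairs are drawn on top of each other, so the plot may show fewer points than there are flowers. Jittering the points is one common remedy.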
Data example:
iris %>%
   sample_n(4)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.0 3.4 1.5 0.2 setosa
## 2 6.0 3.4 4.5 1.6 versicolor
## 3 5.7 2.8 4.1 1.3 versicolor
## 4 5.2 3.4 1.4 0.2 setosa
14.3 Orange trees
In repo as orange-trees.csv.
Orange tree size (circumference) as a function of age for five different trees.
This is a version of R's built-in dataset Orange. However, it is stored as a CSV file to ensure that the tree id is an integer.
Variables:
- tree: tree id (1-5)
- age: age of the tree, in days since 1968/12/31
- circumference: trunk circumference (mm). This is probably “circumference at breast height”, a standard measurement in forestry.
Data example:
orange <- read_delim("../data/orange-trees.csv")
orange %>%
   slice(c(1,2, 8,9))
## # A tibble: 4 × 3
## tree age circumference
## <dbl> <dbl> <dbl>
## 1 1 118 30
## 2 1 484 58
## 3 2 118 33
## 4 2 484 69
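For comparison, here is a sketch of why the tree id needs care in the built-in version. In R's Orange dataset the column Tree is an ordered factor, so converting it to a number directly returns the internal level codes rather than the tree labels. The snippet converts through character instead. It is only an illustration (the name orange_int is made up here) and not necessarily how orange-trees.csv was produced.

```r
# Load the built-in dataset
data(Orange)

# Tree is an ordered factor, not a number
class(Orange$Tree)

# Go through character so that the labels "1".."5" become the integers,
# not the factor's internal level codes
orange_int <- data.frame(
   tree = as.integer(as.character(Orange$Tree)),
   age = Orange$age,
   circumference = Orange$circumference
)

head(orange_int, 2)
```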
14.4 Titanic
In repo as titanic.csv.bz2.
A list of RMS Titanic passengers with their name, age and some other data, and whether they survived the shipwreck. It was collected by the investigation committee and contains most of the passengers on the ship. The dataset is available from various sources, e.g. Kaggle. The variables are
- pclass: passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
- survived: survival (0 = No; 1 = Yes)
- name: name
- sex: sex
- age: age
- sibsp: number of siblings/spouses aboard
- parch: number of parents/children aboard
- ticket: ticket number
- fare: passenger fare
- cabin: cabin
- embarked: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- boat: lifeboat code (if survived)
- body: body number (if did not survive and body was recovered)
- home.dest: the home/final destination of the passenger
14.5 Treatment
In repo as treatment.csv.bz2. Originates from the R package Ecdat. A U.S. dataset from 1974, used for evaluating the treatment effect of job training on earnings.
Variables:
- treat: treated, participated in the job training program (TRUE/FALSE)
- age: age
- educ: education in years
- ethn: ethnicity, three categories: “other”, “black”, “hispanic”
- married: married (TRUE/FALSE)
- re74: real annual earnings in 1974 (USD, pre-treatment)
- re75: real annual earnings in 1975 (USD, pre-treatment)
- re78: real annual earnings in 1978 (USD, post-treatment)
- u74: unemployed in 1974 (TRUE/FALSE)
- u75: unemployed in 1975 (TRUE/FALSE)
Data example:
treatment <- read_delim("../data/treatment.csv.bz2")
treatment %>%
   sample_n(5)
## # A tibble: 5 × 10
## treat age educ ethn married re74 re75 re78 u74 u75
## <lgl> <dbl> <dbl> <chr> <lgl> <dbl> <dbl> <dbl> <lgl> <lgl>
## 1 FALSE 44 3 black TRUE 25470. 21484. 26599. FALSE FALSE
## 2 TRUE 26 11 black TRUE 0 2755. 26372. FALSE TRUE
## 3 FALSE 50 7 black TRUE 13715. 12532. 16255 FALSE FALSE
## 4 FALSE 22 13 other FALSE 7739. 3760. 16255 FALSE FALSE
## 5 FALSE 34 16 other TRUE 35267. 34016. 35465. FALSE FALSE