Chapter 14 Dataset Description

This chapter briefly describes the datasets used in this book.

14.1 Hubble

In repo as hubble.csv

Distance and radial velocity data for 24 galaxies from E. Hubble's 1929 publication “A Relation Between Distance and Radial Velocity Among Extra-Galactic Nebulae”, Proceedings of the National Academy of Sciences, 1929, 15, 168-173.

Variables are
  • object: name of the galaxy
  • ms: magnitude (brightness) of the brightest stars in that galaxy. For a few of the closest objects the magnitude is given as “..”, as in the original paper.
  • R: distance (Mpc)
  • v: radial velocity (km/s), negative values are toward us
  • mt: visual magnitude (brightness)
  • Mt: absolute magnitude
  • D
  • Rmodern: modern distance estimate (Mpc)
  • vModern: modern radial velocity estimate (km/s)

The modern estimates are taken from Wikipedia.

Data example:

library(tidyverse)  # provides read_delim(), sample_n(), %>% and ggplot()
hubble <- read_delim("../data/hubble.csv")
hubble %>%
   sample_n(4)
## # A tibble: 4 × 9
##   object ms        R     v    mt    Mt     D Rmodern vModern
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>
## 1 S.Mag. ..    0.032   170   1.5 -16    0.03  0.0617    158.
## 2 5194   17.3  0.5     270   7.4 -16.1  0.5   9.51      463.
## 3 4449   17.8  0.63    200   9.5 -14.5  0.63  3.6       204 
## 4 7331   19    1.1     500  10.4 -14.8  1.1  12.2       816

Hubble diagram, based on his 1929 data.

The data was used for the famous Hubble diagram (Figure 1 in the 1929 paper):

hubble %>%
   ggplot(aes(R, v)) +
   geom_point(size = 2) +
   geom_smooth(method = "lm",
               se = FALSE)

(The original figure corrects for solar motion though.)
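The slope of the fitted line is what the diagram is about: it gives the velocity increase per unit of distance. As a rough sketch, the same linear model can be fitted directly with lm(). To keep the example self-contained it uses only the four rows printed in the data example above, so the resulting number is purely illustrative; hubble_sample, fit and slope are names introduced here.

```r
# Four rows copied from the data example above (not the full 24 galaxies)
hubble_sample <- data.frame(
  object = c("S.Mag.", "5194", "4449", "7331"),
  R = c(0.032, 0.5, 0.63, 1.1),  # distance (Mpc)
  v = c(170, 270, 200, 500)      # radial velocity (km/s)
)

# The same kind of model that geom_smooth(method = "lm") fits: v = a + b * R
fit <- lm(v ~ R, data = hubble_sample)

# Slope b, in km/s per Mpc
slope <- unname(coef(fit)["R"])
slope
```

With the full file loaded, replacing hubble_sample by hubble should give the slope of the line shown in the diagram.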

14.2 Iris

Iris virginica flower. The image shows large sepals and smaller petals. Sepals typically function as support and cover for flowers in bud and are green, while petals are brightly colored to attract pollinators. Iris flowers, however, have both types of leaves colored in a similar fashion.

Source: Eric Hunt, CC BY-SA 4.0, via Wikimedia Commons.

The iris dataset was made famous by Ronald Fisher in 1936; the measurements themselves were collected by Edgar Anderson. It contains sepal and petal measurements (see the figure for explanation) of 150 iris flowers of three species: setosa, versicolor and virginica (50 of each).

It is an R built-in dataset and can be loaded with

data(iris)

The variables are

  • Sepal.Length: sepal length, in cm
  • Sepal.Width: sepal width, in cm
  • Petal.Length: petal length, in cm
  • Petal.Width: petal width, in cm
  • Species: setosa/versicolor/virginica

The widths and lengths are measured with a precision of one millimeter, and hence quite a few flowers share identical values.
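To get a feel for how much overlap this precision produces, the following sketch counts how many flowers repeat a (Sepal.Length, Sepal.Width) pair that another flower already has. It uses base R only, so it runs without extra packages; sepal_pairs, n_distinct_pairs and n_overlapping are names introduced here for illustration.

```r
data(iris)

# Keep only the two sepal measurements
sepal_pairs <- iris[, c("Sepal.Length", "Sepal.Width")]

# Number of distinct (length, width) combinations among the 150 flowers
n_distinct_pairs <- nrow(unique(sepal_pairs))

# Number of flowers whose pair was already seen in an earlier row
n_overlapping <- sum(duplicated(sepal_pairs))

c(distinct = n_distinct_pairs, overlapping = n_overlapping)
```

Such overlaps matter mainly for plotting: identical points are drawn on top of each other in a scatterplot, so some jitter or transparency may be useful.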

Data example:

iris %>%
   sample_n(4)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          5.0         3.4          1.5         0.2     setosa
## 2          6.0         3.4          4.5         1.6 versicolor
## 3          5.7         2.8          4.1         1.3 versicolor
## 4          5.2         3.4          1.4         0.2     setosa

14.3 Orange trees

In repo as orange-trees.csv

Orange tree size (circumference) as a function of age for five different trees.

This is a version of R's built-in dataset Orange. It is stored as a csv file so that the tree id is an integer; in the built-in version it is an ordered factor.
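The sketch below illustrates why the integer id needs some care. In the built-in Orange data the column Tree is a factor, and as.integer() on a factor returns internal level codes rather than the tree numbers. Converting through character is one possible way to get integer ids; this is an assumption for illustration, not necessarily how orange-trees.csv was produced, and tree_id and orange_int are hypothetical names.

```r
data(Orange)

# In the built-in dataset the tree id is an ordered factor, not a number
class(Orange$Tree)

# as.integer() on a factor gives the level codes, so convert
# through character to recover the tree labels as integers
tree_id <- as.integer(as.character(Orange$Tree))

# Hypothetical reconstruction with the lowercase column names used in the csv
orange_int <- data.frame(
  tree = tree_id,
  age = Orange$age,
  circumference = Orange$circumference
)
head(orange_int, 2)
```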

Variables:

  • tree: tree id (1-5)
  • age: in days (days since 1968/12/31)
  • circumference: trunk circumferences (mm). This is probably “circumference at breast height”, a standard measurement in forestry.

Data example:

orange <- read_delim("../data/orange-trees.csv")
orange %>%
   slice(c(1,2, 8,9))
## # A tibble: 4 × 3
##    tree   age circumference
##   <dbl> <dbl>         <dbl>
## 1     1   118            30
## 2     1   484            58
## 3     2   118            33
## 4     2   484            69

14.4 Titanic

In repo as titanic.csv.bz2.

A list of RMS Titanic passengers with their name, age and some other data, including whether they survived the shipwreck. It was collected by the investigation committee and contains most of the passengers on the ship. The dataset is available from various sources, e.g. Kaggle. The variables are

  • pclass: passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • survived: survival (0 = no; 1 = yes)
  • name: name
  • sex: sex
  • age: age
  • sibsp: number of siblings/spouses aboard
  • parch: number of parents/children aboard
  • ticket: ticket number
  • fare: passenger fare
  • cabin: cabin
  • embarked: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • boat: lifeboat code (if survived)
  • body: body number (if did not survive and body was recovered)
  • home.dest: the home/final destination of the passenger

14.5 Treatment

In repo as treatment.csv.bz2. It originates from the R package Ecdat. This is a U.S. dataset from 1974, used for evaluating the treatment effect of job training on earnings.

  • treat: treated, participated in the job training program (TRUE/FALSE)
  • age: age
  • educ: education in years
  • ethn: three categories: “other”, “black”, “hispanic”
  • married: married (TRUE/FALSE)
  • re74: real annual earnings in 1974 (USD, pre-treatment)
  • re75: real annual earnings in 1975 (USD, pre-treatment)
  • re78: real annual earnings in 1978 (USD, post-treatment)
  • u74: unemployed in 1974 (TRUE/FALSE)
  • u75: unemployed in 1975 (TRUE/FALSE)

Data example:

treatment <- read_delim("../data/treatment.csv.bz2")
treatment %>%
   sample_n(5)
## # A tibble: 5 × 10
##   treat   age  educ ethn  married   re74   re75   re78 u74   u75  
##   <lgl> <dbl> <dbl> <chr> <lgl>    <dbl>  <dbl>  <dbl> <lgl> <lgl>
## 1 FALSE    44     3 black TRUE    25470. 21484. 26599. FALSE FALSE
## 2 TRUE     26    11 black TRUE        0   2755. 26372. FALSE TRUE 
## 3 FALSE    50     7 black TRUE    13715. 12532. 16255  FALSE FALSE
## 4 FALSE    22    13 other FALSE    7739.  3760. 16255  FALSE FALSE
## 5 FALSE    34    16 other TRUE    35267. 34016. 35465. FALSE FALSE