B Dataset Description

Here is a brief description of the datasets that are used in this book.

B.1 Benefits

This dataset describes unemployment benefits in the U.S. for workers who lost their jobs between 1982 and 1991. It is included in the R package Ecdat, so you need to install the package (use install.packages("Ecdat")) in order to use it.

The content seems to be a subsample, containing blue-collar workers only, of the data in McCall (1995), The Impact of Unemployment Insurance Benefit Levels on Recipiency, Journal of Business & Economic Statistics, 13, pp. 189-198.

Some of the codes are missing or not explained, e.g. it is unclear what the state codes mean. Also, logical variables are coded as “yes”/“no”, not as TRUE/FALSE.

  • stateur: state unemployment rate (in %)
  • statemb: state maximum benefit level
  • state: state of residence code. It is unclear which coding scheme is used; it is not FIPS.
  • age: age in years
  • tenure: years of tenure in job lost
  • joblost: a factor with levels (slack_work,position_abolished,seasonal_job_ended,other)
  • nwhite: non-white?
  • school12: more than 12 years of school?
  • sex: a factor with levels (male,female)
  • bluecol: blue-collar worker? The data contains only “yes” values.
  • smsa: lives in SMSA?
  • married: married?
  • dkids: has kids?
  • dykids: has young kids (0-5 yrs)?
  • yrdispl: year of job displacement (1982=1,…, 1991=10)
  • rr: replacement rate
  • head: is head of household?
  • ui: applied for (and received) UI benefits? (“yes”/“no”)

A small sample of data:

##      stateur statemb state age tenure    joblost nwhite school12    sex bluecol
## 4573     5.2     145    86  22      1 slack_work     no       no   male     yes
## 649      6.8     232    23  25      3 slack_work     no      yes female     yes
## 2134     4.4     200    35  30      1 slack_work     no       no   male     yes
## 1380     6.9     215    71  29      1 slack_work     no      yes   male     yes
##      smsa married dkids dykids yrdispl        rr head  ui
## 4573   no      no   yes     no       8 0.5230770   no yes
## 649   yes      no    no     no       5 0.5204082  yes yes
## 2134  yes      no    no     no       8 0.3339731   no  no
## 1380   no     yes   yes    yes       9 0.5000000  yes yes
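
The “yes”/“no” coding mentioned above can be turned into TRUE/FALSE with a simple comparison. The following is a minimal sketch, not the book's own code: the data frame `benefits` is a hypothetical toy object that copies a few values from the sample above, so that the example runs without the Ecdat package.

```r
# Toy data frame with a few values copied from the sample above.
# (Assumption: with Ecdat installed, the full data can be loaded with
#  data(Benefits, package = "Ecdat") and treated the same way.)
benefits <- data.frame(
  age     = c(22, 25, 30, 29),
  married = factor(c("no", "no", "no", "yes")),
  ui      = factor(c("yes", "yes", "no", "yes"))
)

# Comparing with "yes" gives a logical vector (works for factors and characters)
benefits$married <- benefits$married == "yes"
benefits$ui      <- benefits$ui == "yes"

# Logical variables can be averaged directly:
# share of these workers who received UI benefits
mean(benefits$ui)
```

Once the variables are logical, they can be used directly in filters such as `benefits[benefits$ui, ]`.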

B.2 HadCRUT

Available in the repository.

HadCRUT (Hadley Centre/Climatic Research Unit Temperature) is a temperature dataset maintained by the UK Met Office. This book uses HadCRUT5.0 data. It is one of the “big” global temperature datasets and is based on temperature measurements at ground level on both land and sea. It covers the years from 1850 onward, although the earlier results are less precise. The full dataset includes monthly measures at different geographic locations (5-degree squares); this dataset contains only the global average.

The variables are (note the complex names):

  • Time: year
  • Anomaly (deg C): relative to 1961-1990 average
  • Lower confidence limit (2.5%)
  • Upper confidence limit (97.5%)

A sample from the data:

## # A tibble: 4 × 4
##    Time `Anomaly (deg C)` `Lower confidence limit (2.5%)` Upper confidence lim…¹
##   <dbl>             <dbl>                           <dbl>                  <dbl>
## 1  1878           -0.0113                          -0.131                0.109  
## 2  1860           -0.390                           -0.539               -0.241  
## 3  1975           -0.111                           -0.151               -0.0702 
## 4  1948           -0.125                           -0.259                0.00981
## # ℹ abbreviated name: ¹​`Upper confidence limit (97.5%)`
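
Because the variable names contain spaces and parentheses, they must be quoted with backticks (or accessed by a character string) in R. The sketch below is illustrative only: `hadcrut` is a hypothetical toy data frame built from the first two columns of the sample above, and the short name `anomaly` is an assumption, not a name used in the file.

```r
# Toy data frame with two columns copied from the sample above.
# check.names = FALSE keeps the original, non-syntactic column name.
hadcrut <- data.frame(
  Time = c(1878, 1860, 1975, 1948),
  `Anomaly (deg C)` = c(-0.0113, -0.390, -0.111, -0.125),
  check.names = FALSE
)

# Non-syntactic names need backticks with $ ...
hadcrut$`Anomaly (deg C)`
# ... or a character string with [[ ]]
hadcrut[["Anomaly (deg C)"]]

# Renaming the column to a short name makes later code easier to write
names(hadcrut)[2] <- "anomaly"

# Years in the toy sample with an anomaly above -0.12 deg C
warm <- hadcrut[hadcrut$anomaly > -0.12, ]
warm
```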

B.3 Heart attack

In repo.

The data originates from Kaggle. It is claimed to be CC0: Public Domain, but downloading requires signing in.

Variables:

  • Age: Age of the patient
  • Sex: Sex of the patient
  • exang: exercise induced angina (1 = yes; 0 = no)
  • ca: number of major vessels (0-3)
  • cp: chest pain type
    • Value 1: typical angina
    • Value 2: atypical angina
    • Value 3: non-anginal pain
    • Value 4: asymptomatic
  • trtbps: resting blood pressure (in mm Hg)
  • chol: cholesterol in mg/dl, fetched via BMI sensor
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • rest_ecg: resting electrocardiographic results
    • Value 0: normal
    • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • thalach: maximum heart rate achieved
  • output: 0 no heart attack, 1 heart attack

No information is provided about how the data was collected. Not all actual values correspond to the documentation. The results are odd, e.g. women and young people appear to have a higher chance of heart attack.

B.4 Ice extent

This dataset shows the sea ice extent and area in the northern and southern hemisphere, based on satellite measurements. The original data is from a University of Colorado webpage; the current file can be downloaded from the repository.

It contains the following variables:

  • year: 1978–
  • month: 1–12
  • data-type: looks like the name of the satellite or another info provider
  • region: “N” for northern, “S” for southern hemisphere
  • extent: sea ice extent, in M km2. Extent is the sea surface area where the ice concentration is at least 15%. This is easier to measure from satellites than area.
  • area: sea ice surface area, M km2
  • time: a continuous time variable, made of year and month, suitable for visualization.

A sample from the dataset:

## # A tibble: 5 × 7
##    year month `data-type` region extent  area  time
##   <dbl> <dbl> <chr>       <chr>   <dbl> <dbl> <dbl>
## 1  2021     5 Goddard     N       12.7  10.9  2021.
## 2  2007     4 Goddard     N       13.8  11.8  2007.
## 3  2016    12 Goddard     S        8.28  5.51 2017.
## 4  1990    11 Goddard     N       11.1   9.64 1991.
## 5  2002     1 Goddard     S        4.74  2.95 2002.
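
A continuous time variable like `time` can be built from `year` and `month`. The exact formula used for this file is not documented here, so the sketch below is only one possible construction: it assumes that each month is placed at its midpoint within the year. The data frame `ice` is a hypothetical toy object with a few year/month pairs taken from the sample above.

```r
# Toy data frame with year/month pairs copied from the sample above
ice <- data.frame(
  year  = c(2021, 2016, 2002),
  month = c(5, 12, 1)
)

# Assumed formula: put each month at its midpoint,
# e.g. January is year + 0.5/12, December is year + 11.5/12
ice$time <- ice$year + (ice$month - 0.5) / 12
ice
```

With this choice the values stay within the calendar year, and consecutive months are exactly 1/12 apart, which is what a line plot over time needs.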

B.5 Iris

The iris dataset was published by Ronald Fisher in 1936, based on measurements collected by Edgar Anderson. It contains sepal and petal measures of 150 iris flowers of the species setosa, versicolor and virginica (50 of each). It is an R built-in dataset and can be loaded with data(iris).

The variables are

  • Sepal.Length: sepal length, in cm
  • Sepal.Width: sepal width, in cm
  • Petal.Length: petal length, in cm
  • Petal.Width: petal width, in cm
  • Species: setosa/versicolor/virginica
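
As a quick check of the description above, the following short sketch loads the built-in data and confirms its size and the number of flowers per species. It uses base R only.

```r
# Load the built-in dataset into the workspace
data(iris)

# 150 rows (flowers) and 5 columns (4 measures + species)
dim(iris)

# Number of flowers per species: 50 of each
table(iris$Species)

# Average sepal length (cm) by species
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
```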

B.6 Ncbirths: births in North Carolina

Available in the repository as ncbirths.csv.

It can also be downloaded from the OpenIntro webpage.

In 2004, the state of North Carolina released to the public a large data set containing information on births recorded in this state. This data set has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from this data set.

Variables:

  • fage: Father’s age in years.
  • mage: Mother’s age in years.
  • mature: Maturity status of mother.
  • weeks: Length of pregnancy in weeks.
  • premie: Whether the birth was classified as premature (premie) or full-term.
  • visits: Number of hospital visits during pregnancy.
  • gained: Weight gained by mother during pregnancy (lb).
  • weight: Weight of the baby at birth (lb).
  • lowbirthweight: Whether baby was classified as low birthweight (low) or not (not low).
  • gender: Gender of the baby, female or male.
  • habit: Status of the mother as a nonsmoker or a smoker.
  • marital: Whether mother is married or not married at birth.
  • whitemom: Whether mom is white or not white.

An example of the data:

## # A tibble: 4 × 13
##    fage  mage mature    weeks premie visits marital gained weight lowbirthweight
##   <dbl> <dbl> <chr>     <dbl> <chr>   <dbl> <chr>    <dbl>  <dbl> <chr>         
## 1    27    25 younger …    40 full …     15 married     32   8.38 not low       
## 2    26    27 younger …    39 full …      8 married     20   6.63 not low       
## 3    31    25 younger …    41 full …     14 married     27   7.38 not low       
## 4    38    31 younger …    37 full …     11 married     34   6.31 not low       
## # ℹ 3 more variables: gender <chr>, habit <chr>, whitemom <chr>

B.7 Titanic

In repo.

A list of RMS Titanic passengers with their name, age and some other data, including whether they survived the shipwreck. It was collected by the investigation committee and contains most of the passengers on board. The dataset is available from various sources, e.g. on Kaggle. The variables are:

  • pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • survived: Survival (0 = No; 1 = Yes)
  • name: Name
  • sex: Sex
  • age: Age
  • sibsp: Number of Siblings/Spouses Aboard
  • parch: Number of Parents/Children Aboard
  • ticket: Ticket Number
  • fare: Passenger Fare
  • cabin: Cabin
  • embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • boat: Lifeboat code (if survived)
  • body: Body number (if did not survive and body was recovered)
  • home.dest: The home/final destination of passenger