B Dataset Description
Here is a brief description of the datasets that are used in this book.
B.1 Benefits
This is a dataset about unemployment benefits in the U.S.; the job
displacements it records took place between 1982 and 1991. The
dataset is included in the R package Ecdat, so you need to install
the package (use install.packages("Ecdat")) in order to use it.
The content seems to be a subsample, containing blue-collar workers only, of the data in McCall (1995), “The Impact of Unemployment Insurance Benefit Levels on Recipiency”, Journal of Business & Economic Statistics, 13, pp. 189-198.
Some of the codes are missing or not explained, e.g. the meaning of the state codes. Also, logical variables are coded as “yes”/“no”, not TRUE/FALSE.
- stateur: state unemployment rate (in %)
- statemb: state maximum benefit level
- state: state of residence code. The coding scheme is not documented; it is not FIPS.
- age: age in years
- tenure: years of tenure in job lost
- joblost: a factor with levels (slack_work, position_abolished, seasonal_job_ended, other)
- nwhite: non-white?
- school12: more than 12 years of school?
- sex: a factor with levels (male, female)
- bluecol: blue collar worker? Only “yes” answers.
- smsa: lives in an SMSA (Standard Metropolitan Statistical Area)?
- married: married?
- dkids: has kids?
- dykids: has young kids (0-5 yrs)?
- yrdispl: year of job displacement (1982=1,…, 1991=10)
- rr: replacement rate
- head: is head of household?
- ui: applied for (and received) UI benefits? (“yes”/“no”)
A small sample of data:
## stateur statemb state age tenure joblost nwhite school12 sex bluecol
## 4573 5.2 145 86 22 1 slack_work no no male yes
## 649 6.8 232 23 25 3 slack_work no yes female yes
## 2134 4.4 200 35 30 1 slack_work no no male yes
## 1380 6.9 215 71 29 1 slack_work no yes male yes
## smsa married dkids dykids yrdispl rr head ui
## 4573 no no yes no 8 0.5230770 no yes
## 649 yes no no no 5 0.5204082 yes yes
## 2134 yes no no no 8 0.3339731 no no
## 1380 no yes yes yes 9 0.5000000 yes yes
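The “yes”/“no” coding and the coded displacement year are easy to convert into more convenient forms. Below is a minimal sketch that uses a toy data frame built from a few columns of the four sample rows above, so it runs without Ecdat. It assumes the “yes”/“no” columns are stored as factors, as joblost and sex are; the helper name yes_no_cols and the new column year are our own choices.

```r
# Toy data frame: four rows copied from the sample printed above
# (only a few of the columns are reproduced here).
benefits <- data.frame(
  joblost = c("slack_work", "slack_work", "slack_work", "slack_work"),
  married = c("no", "no", "no", "yes"),
  yrdispl = c(8, 5, 8, 9),
  ui      = c("yes", "yes", "no", "yes"),
  stringsAsFactors = TRUE  # assumption: yes/no columns are factors
)

# Convert the "yes"/"no" columns to logical TRUE/FALSE
yes_no_cols <- c("married", "ui")
benefits[yes_no_cols] <- lapply(benefits[yes_no_cols], function(x) x == "yes")

# Decode the displacement year: 1982 = 1, ..., 1991 = 10
benefits$year <- 1981 + benefits$yrdispl

print(benefits)
```

With the real data, the same two steps can be applied to the data frame loaded from Ecdat, listing all the “yes”/“no” columns in yes_no_cols.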
B.2 HadCRUT
The dataset is available in the repository.
HadCRUT (Hadley Centre/Climatic Research Unit Temperature) is temperature data collected and maintained by the UK Met Office. This book uses the HadCRUT5.0 data. It is one of the “big” global temperature datasets, and is based on temperature measurements at ground level on both land and sea. It covers the years from 1850 onward, although the earlier results are less precise. The full dataset includes monthly measures at different geographic locations (5-degree squares); this dataset contains only the global average.
The variables are (note the complex names):
- Time: year
- Anomaly (deg C): temperature anomaly relative to the 1961-1990 average
- Lower confidence limit (2.5%)
- Upper confidence limit (97.5%)
A sample from the data:
## # A tibble: 4 × 4
## Time `Anomaly (deg C)` `Lower confidence limit (2.5%)` Upper confidence lim…¹
## <dbl> <dbl> <dbl> <dbl>
## 1 1878 -0.0113 -0.131 0.109
## 2 1860 -0.390 -0.539 -0.241
## 3 1975 -0.111 -0.151 -0.0702
## 4 1948 -0.125 -0.259 0.00981
## # ℹ abbreviated name: ¹`Upper confidence limit (97.5%)`
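Because the variable names contain spaces, parentheses and percent signs, they cannot be typed as ordinary R names. The sketch below shows two ways to access such columns and how to rename them. It uses a toy data frame with two rows copied from the sample above; the short replacement names are our own choice, not part of the dataset.

```r
# Toy data frame with the same column names as the HadCRUT file;
# the two rows are copied from the sample printed above.
# check.names = FALSE keeps the names exactly as written.
hadcrut <- data.frame(
  "Time" = c(1878, 1975),
  "Anomaly (deg C)" = c(-0.0113, -0.111),
  "Lower confidence limit (2.5%)" = c(-0.131, -0.151),
  "Upper confidence limit (97.5%)" = c(0.109, -0.0702),
  check.names = FALSE
)

# Non-syntactic names need backticks when used with `$` ...
hadcrut$`Anomaly (deg C)`

# ... or can be passed as ordinary strings to `[[`
hadcrut[["Upper confidence limit (97.5%)"]]

# Renaming once is often more convenient (short names are our own choice)
names(hadcrut) <- c("year", "anomaly", "lower", "upper")
hadcrut$upper - hadcrut$lower   # width of the 95% interval
```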
B.3 Heart attack
The dataset is available in the repository.
It originates from Kaggle. It is claimed to be CC0: Public Domain, but downloading it requires signing in.
Variables:
- Age: Age of the patient
- Sex: Sex of the patient
- exang: exercise induced angina (1 = yes; 0 = no)
- ca: number of major vessels (0-3)
- cp: chest pain type
  - Value 1: typical angina
  - Value 2: atypical angina
  - Value 3: non-anginal pain
  - Value 4: asymptomatic
- trtbps: resting blood pressure (in mm Hg)
- chol: cholesterol in mg/dl, fetched via BMI sensor
- fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- rest_ecg: resting electrocardiographic results
  - Value 0: normal
  - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  - Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
- thalach: maximum heart rate achieved
- output: 0 no heart attack, 1 heart attack
No information is provided on how the data were collected. Not all actual values correspond to the documentation. The results are implausible, e.g. women and young people appear to have a higher chance of heart attack.
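One way to spot values that do not match the documentation is to convert a coded variable into a factor with the documented levels only. The sketch below does this for the chest pain codes; the input vector is hypothetical and invented for illustration, not taken from the dataset.

```r
# Hypothetical codes, invented for illustration (not taken from the data).
# The value 0 is not listed in the documentation above.
cp <- c(1, 4, 0, 3, 2)

cp_label <- factor(
  cp,
  levels = 1:4,
  labels = c("typical angina", "atypical angina",
             "non-anginal pain", "asymptomatic")
)

# Undocumented codes become NA, so they are easy to count
table(cp_label, useNA = "ifany")
sum(is.na(cp_label))
```

The same approach works for rest_ecg with levels 0 to 2.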
B.4 Ice extent
This dataset shows the sea ice extent and area in the northern and southern hemispheres, based on satellite measurements. The original data is from the University of Colorado webpage; the current file can be downloaded from the repository.
It contains the following variables:
- year: 1978–
- month: 1–12
- data-type: appears to be the name of the satellite or other data provider
- region: “N” for northern, “S” for southern hemisphere
- extent: sea ice extent, in million km². Extent is the sea surface area where the ice concentration is at least 15%. This is easier to measure from satellites than area.
- area: sea ice surface area, in million km²
- time: a continuous time variable, made of year and month, suitable for visualization.
A sample from the dataset:
## # A tibble: 5 × 7
## year month `data-type` region extent area time
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2021 5 Goddard N 12.7 10.9 2021.
## 2 2007 4 Goddard N 13.8 11.8 2007.
## 3 2016 12 Goddard S 8.28 5.51 2017.
## 4 1990 11 Goddard N 11.1 9.64 1991.
## 5 2002 1 Goddard S 4.74 2.95 2002.
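A continuous time variable of this kind can be built from year and month. The exact formula behind the time column is not documented here, so the sketch below shows only one plausible construction, under the assumption that each month adds one twelfth of a year. The toy data frame holds two rows copied from the sample above.

```r
# Two rows copied from the sample above (year, month and region only)
ice <- data.frame(
  year   = c(2021, 2016),
  month  = c(5, 12),
  region = c("N", "S")
)

# Assumption: months are spread evenly over the year. The exact formula
# behind the `time` column is not documented; this is one possibility.
ice$time <- ice$year + ice$month / 12

print(ice)
```

Other conventions, such as placing each observation at the start or the middle of the month, shift the values by less than one month and make little difference for plotting.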
B.5 Iris
The iris dataset was published by Ronald Fisher in 1936; the measurements were collected by Edgar Anderson. It contains sepal and petal measures of 150 iris flowers of the species setosa, versicolor and virginica (50 of each). It is an R built-in dataset and can be loaded with data(iris).
The variables are
- Sepal.Length: sepal length, in cm
- Sepal.Width: sepal width, in cm
- Petal.Length: petal length, in cm
- Petal.Width: petal width, in cm
- Species: setosa/versicolor/virginica
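A short sketch of loading the data and checking that it matches the description above. It uses only base R.

```r
data(iris)            # load the built-in dataset into the workspace

dim(iris)             # 150 rows, 5 columns
table(iris$Species)   # 50 flowers of each species

# Average measures (in cm) by species
aggregate(. ~ Species, data = iris, FUN = mean)
```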
B.6 Ncbirths: births in North Carolina
The dataset is available in the repository as ncbirths.csv.
It can also be downloaded from the OpenIntro webpage.
In 2004, the state of North Carolina released to the public a large data set containing information on births recorded in this state. This data set has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from this data set.
Variables:
- fage: father’s age in years
- mage: mother’s age in years
- mature: maturity status of mother
- weeks: length of pregnancy in weeks
- premie: whether the birth was classified as premature (premie) or full-term
- visits: number of hospital visits during pregnancy
- gained: weight gained by mother during pregnancy (lb)
- weight: weight of the baby at birth (lb)
- lowbirthweight: whether baby was classified as low birthweight (low) or not (not low)
- gender: gender of the baby, female or male
- habit: status of the mother as a nonsmoker or a smoker
- marital: whether mother is married or not married at birth
- whitemom: whether mom is white or not white
An example of the data:
## # A tibble: 4 × 13
## fage mage mature weeks premie visits marital gained weight lowbirthweight
## <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 27 25 younger … 40 full … 15 married 32 8.38 not low
## 2 26 27 younger … 39 full … 8 married 20 6.63 not low
## 3 31 25 younger … 41 full … 14 married 27 7.38 not low
## 4 38 31 younger … 37 full … 11 married 34 6.31 not low
## # ℹ 3 more variables: gender <chr>, habit <chr>, whitemom <chr>
B.7 Titanic
The dataset is available in the repository.
This is a list of RMS Titanic passengers: their name, age and some more data, and whether they survived the shipwreck. It was collected by the investigation committee, and contains most of the passengers on the ship. The dataset is available from various sources, e.g. on Kaggle. The variables are:
- pclass: passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
- survived: survival (0 = No; 1 = Yes)
- name: name
- sex: sex
- age: age
- sibsp: number of siblings/spouses aboard
- parch: number of parents/children aboard
- ticket: ticket number
- fare: passenger fare
- cabin: cabin
- embarked: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- boat: lifeboat code (if survived)
- body: body number (if did not survive and body was recovered)
- home.dest: the home/final destination of passenger