I Dataset Description

Here is a brief description of the datasets that are used in this book.

I.1 Alcohol disorders

In the book repo: alcohol-disorders.csv.

Share of males and females suffering from alcohol use disorders (in percent). Alcohol dependence is defined by the International Classification of Diseases as the presence of three or more indicators of dependence for at least a month within the previous year. The share is given as the age-standardized prevalence, which assumes a constant age structure and so allows for comparison by sex, by country and through time.

IHME, Global Burden of Disease Study (2019) – processed by Our World in Data.
Downloaded from OWiD.

Variables:

  • country: country name (Entity in the original OWiD data); only Argentina, Kenya, Taiwan, Ukraine and the U.S. are included.
  • Code: 3-letter country code
  • Year: 2015–2019 (only a subset of the original)
  • disordersM: number of cases of alcohol use disorders per 100 people, in males, age-standardized
  • disordersF: number of cases of alcohol use disorders per 100 people, in females, age-standardized
  • population: population (historical estimates)

Example:

read_delim("data/alcohol-disorders.csv") %>%
   sample_n(4)
## # A tibble: 4 × 6
##   country   Code   Year disordersM disordersF population
##   <chr>     <chr> <dbl>      <dbl>      <dbl>      <dbl>
## 1 Ukraine   UKR    2018      3.53       1.33    44446952
## 2 Taiwan    TWN    2017      0.886      0.260   23665028
## 3 Argentina ARG    2015      3.07       1.17    43257064
## 4 Ukraine   UKR    2015      3.90       1.43    44982568

I.2 Babynames

The R package babynames contains a dataset babynames. It includes all baby names given in the U.S. between 1880 and 2017 that were used at least 5 times in a given year for a given sex. The data originates from the U.S. Social Security Administration.

You can load it with library(babynames), which makes a single data frame, babynames, available.

Example:

library(babynames)
babynames %>%
   sample_n(3)
## # A tibble: 3 × 5
##    year sex   name          n       prop
##   <dbl> <chr> <chr>     <int>      <dbl>
## 1  1982 M     Santino      38 0.0000201 
## 2  2001 M     Collis        5 0.00000242
## 3  1961 F     Kimberlea    17 0.00000819

Variables:

  • year: 1880-2017
  • name: the name
  • sex: “F” or “M”
  • n: how many babies got this name (within year/sex)
  • prop: proportion of babies who got this name in the given year (within year/sex).

I.3 Country-concept similarity

In the book repo: country-concept-similarity.csv.bz2. This dataset shows the similarity between country names and a set of different words. It is calculated from texts that were scraped from the internet around 2015. The dataset looks like this:

similarity <- read_delim("data/country-concept-similarity.csv.bz2")
similarity %>%
   head(2)
## # A tibble: 2 × 12
##   country     terrorism nuclear  trade battery  regime
##   <chr>           <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
## 1 aruba          0.0891  -0.011 0.0504 -0.01   -0.0356
## 2 afghanistan    0.447    0.220 0.109   0.0578  0.180 
##   volcano  palm    fir  flood drought mountain
##     <dbl> <dbl>  <dbl>  <dbl>   <dbl>    <dbl>
## 1   0.166 0.293 0.0965 0.0158  0.0581    0.107
## 2   0.129 0.116 0.129  0.159   0.160     0.161

One can see that “Afghanistan” and “terrorism” are much more similar (similarity 0.447) than e.g. “Afghanistan” and “trade” (similarity 0.109). We do not go into details here about how the similarity is measured, but broadly, it describes how frequently these words are used in a similar context as the corresponding country names.
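As a minimal sketch of how such similarities can be compared, the code below builds a tiny stand-in data frame from a few of the values printed above (only two countries and three concept columns, so it is not the full dataset) and picks, for each country, the concept with the largest similarity. The names concepts and closest are made up for this illustration.

```r
# Stand-in for the real data: two rows and three of the concept
# columns, with values copied from the printout above
similarity <- data.frame(
   country = c("aruba", "afghanistan"),
   terrorism = c(0.0891, 0.447),
   trade = c(0.0504, 0.109),
   palm = c(0.293, 0.116)
)
# All columns except 'country' are concepts
concepts <- setdiff(names(similarity), "country")
# For each row, find the position of the largest similarity
# and translate it into the concept name
closest <- concepts[apply(similarity[concepts], 1, which.max)]
data.frame(country = similarity$country, closest)
```

With these three columns, the closest concept is “palm” for Aruba and “terrorism” for Afghanistan.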

I.4 Covid in Scandinavia

In the book repo: covid-scandinavia.csv.bz2

Data downloaded from github data.

The file is a subset of the original data: it contains only the Scandinavian countries, only national-level figures, and only deaths and confirmed cases. Per capita counts and daily growth numbers were added.

Variables:

  • code2: 2-letter country code
  • country: country name
  • state: federal state, just NA in case of Scandinavian countries
  • date: date of the count
  • type: count type
  • count: how many persons have confirmed covid/died
  • lockdown: date when major lockdowns began
  • population: country population (only one number)
  • countPC: count per capita
  • growth: growth in count
  • growthPC: growth in count per capita
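
The derived variables can be sketched as follows. This is only an illustration with made-up numbers, not the code that was used to create the file. In particular, it assumes that growth is the day-to-day change in count, which the description above does not spell out.

```r
# Made-up counts for a single country and count type (not real data)
covid <- data.frame(
   date = as.Date("2020-03-01") + 0:3,
   count = c(10, 14, 21, 30),
   population = 5e6
)
# Count per capita: count divided by the country's population
covid$countPC <- covid$count / covid$population
# Assumption: growth is the change from the previous day
# (NA for the first day, which has no previous value)
covid$growth <- c(NA, diff(covid$count))
# Growth per capita, computed in the same way as countPC
covid$growthPC <- covid$growth / covid$population
covid
```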

I.5 Diamonds

It is a built-in dataset in the ggplot2 library, so it is available as soon as you load the library. It contains price, shape, color and other information for 53940 diamonds. A sample of it looks like this:

diamonds %>%
   sample_n(5)
## # A tibble: 5 × 10
##   carat cut       color clarity depth table price     x     y
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
## 1  1.14 Very Good E     SI2      61.5    58  5236  6.66  6.73
## 2  0.36 Very Good E     SI1      60.8    59   631  4.56  4.59
## 3  1.21 Ideal     F     VS2      61.5    56  8348  6.87  6.83
## 4  1.1  Ideal     G     SI2      62.1    58  5361  6.61  6.63
## 5  0.38 Ideal     E     VS1      62.2    55  1112  4.67  4.63
##       z
##   <dbl>
## 1  4.12
## 2  2.78
## 3  4.21
## 4  4.11
## 5  2.89

Variables:

  • carat: mass of the diamond in carats (ct), 1 ct = 0.2 g

  • cut: describes the shape of the diamond. There are five different cuts: Ideal is the best and Fair is the worst in these data. Better cuts make diamonds more brilliant.

    Note: cut is an ordered factor, see Section 17.2.1.

  • color: There are 7 color levels; D (no color) is the best and J the worst. Any color hue is considered undesirable.

  • clarity: measures the defects in diamonds, IF (internally flawless) is the best, and I1 is the worst.

  • depth, table: measures of the diamond shape

  • price: in $

  • x, y, z: diamond size, mm
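
Because cut is an ordered factor, its values can be compared with each other. Here is a minimal base-R sketch that does not need ggplot2: it creates a small ordered factor with the same five levels as cut, listed from worst to best. The variable cuts is made up for this illustration.

```r
# A small ordered factor with the same levels as 'cut' in diamonds,
# listed from the worst (Fair) to the best (Ideal)
cuts <- factor(c("Ideal", "Fair", "Very Good", "Good"),
               levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
               ordered = TRUE)
# Comparison operators respect the order of the levels
atLeastVeryGood <- cuts >= "Very Good"
atLeastVeryGood
# max() returns the best cut in this small sample
best <- max(cuts)
best
```

The same comparisons work on the real data, e.g. diamonds$cut >= "Very Good".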

I.6 Height-weight

In the book repo: height-weight.csv.

I.7 Icecream

The dataset Icecream is located in the package Ecdat. It contains 30 four-weekly observations of ice cream consumption in the 1950s in the U.S. Example:

library(Ecdat)
Icecream %>%
   sample_n(4)
##     cons income price temp
## 23 0.284     94 0.277   32
## 12 0.298     85 0.270   26
## 5  0.406     76 0.272   69
## 13 0.329     86 0.272   32

Variables:

  • cons: consumption of ice cream per head (in pints)
  • income: average family income per week (in US dollars)
  • price: price of ice cream (per pint)
  • temp: average temperature (in Fahrenheit)

I.8 Ice extent

TBD: an explanatory figure of area/extent

In the book repo: ice-extent.csv.bz2.

National Snow & Ice Data Center (NSIDC) data about sea ice extent and area. Downloaded from U Colorado.

A sample of data:

read_delim("data/ice-extent.csv.bz2") %>%
   sample_n(5)
## # A tibble: 5 × 7
##    year month `data-type` region extent  area  time
##   <dbl> <dbl> <chr>       <chr>   <dbl> <dbl> <dbl>
## 1  2005     8 Goddard     S       17.9  14.0  2006.
## 2  1996     2 Goddard     S        2.98  1.77 1996.
## 3  1990     5 Goddard     N       13.2  10.9  1990.
## 4  2020     3 Goddard     N       14.7  13.0  2020.
## 5  1998     4 Goddard     S        6.85  5.12 1998.

I haven’t found a description of the variables, but they are fairly self-explanatory:

  • year
  • month (1-12)
  • data-type: looks like the name of the satellite or another info provider
  • region: “N” for northern, “S” for southern hemisphere
  • extent: sea ice extent, in million km². Extent is the sea surface area where the ice concentration is at least 15%.
  • area: sea ice surface area, in million km²
  • time: a continuous time variable, made of year and month \(\mathit{time} = \mathit{year} + \mathit{month}/12 - 1/24\). This describes roughly the middle of each month as measured in years.
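
As a small sketch of the formula for time, here it is applied to the year and month of the first three rows of the sample above:

```r
# Year and month of the first three rows in the sample above
year <- c(2005, 1996, 1990)
month <- c(8, 2, 5)
# Middle of the month, measured in years
time <- year + month/12 - 1/24
time
## [1] 2005.625 1996.125 1990.375
```

The tibble printout above shows only a few significant digits, which is why the first value appears there as 2006.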

I.9 Iris

The iris dataset was made famous by Ronald Fisher in 1936; the measurements were collected by Edgar Anderson. It contains sepal and petal measures of 150 iris flowers of the species setosa, versicolor and virginica (50 of each). It is an R built-in dataset and does not even have to be loaded: you can just use the variable iris.

The variables are

  • Sepal.Length: sepal length, in cm
  • Sepal.Width: sepal width, in cm
  • Petal.Length: petal length, in cm
  • Petal.Width: petal width, in cm
  • Species: setosa/versicolor/virginica

A small example of it:

iris %>%
   sample_n(4)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          4.4         3.2          1.3         0.2
## 2          5.1         3.8          1.6         0.2
## 3          5.5         2.3          4.0         1.3
## 4          6.7         3.1          4.4         1.4
##      Species
## 1     setosa
## 2     setosa
## 3 versicolor
## 4 versicolor

I.10 Orange tree growth

It is an R built-in dataset. However, as the built-in version uses more complex data structures, a copy of it is in the repo as a plain csv file: orange-trees.csv

Variables:

  • Tree: an ordered factor indicating the tree on which the measurement is made. The ordering is according to increasing maximum diameter.

  • age: a numeric vector giving the age of the tree (days since 1968/12/31)

  • circumference: a numeric vector of trunk circumferences (mm). This is probably “circumference at breast height”, a standard measurement in forestry.

You normally do not need to load it (just use Orange), but other libraries (e.g. Ecdat) may mask it with a dataset of the same name. In that case you may use

data(Orange, package = "datasets")  # load 'Orange' from main R data package
Orange %>%
   sample_n(4)
##   Tree  age circumference
## 1    3  664            75
## 2    1 1582           145
## 3    2  484            69
## 4    1  664            87

I.11 US States

R has multiple small datasets about the US states. They are built-in variables, so you do not need to do anything special to load these. Examples:

## Full names of the states
state.name[1:5]
## [1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
## 2-letter abbreviations
state.abb[1:5]
## [1] "AL" "AK" "AZ" "AR" "CA"

Importantly, all these vectors contain data in the same order, so you can use names to find the value for the corresponding state.
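
A minimal sketch of such a lookup, using only the two built-in vectors shown above (the variable names az and two are made up for this illustration):

```r
# Look up the abbreviation of a single state by its full name
az <- state.abb[state.name == "Arizona"]
az
## [1] "AZ"
# Look up several states at once: match() returns the positions
# of the names in 'state.name', in the order they were asked for
two <- state.abb[match(c("California", "Alaska"), state.name)]
two
## [1] "CA" "AK"
```

The same approach works for the other built-in state vectors, as they are all in the same order.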

I.12 Titanic

In the repo as titanic.csv.bz2.

List of RMS Titanic passengers: their name, age and some more data, and whether they survived the shipwreck. It was collected by the investigation committee, and contains most of the passengers on the ship. The dataset is available from various sources, e.g. at Kaggle. The variables are

  • pclass: passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • survived: survival (0 = No; 1 = Yes)
  • name: name
  • sex: sex
  • age: age
  • sibsp: number of siblings/spouses aboard
  • parch: number of parents/children aboard
  • ticket: ticket number
  • fare: passenger fare
  • cabin: cabin
  • embarked: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • boat: lifeboat code (if survived)
  • body: body number (if did not survive and body was recovered)
  • home.dest: the home/final destination of the passenger

A small example of it:

pclass  survived  name                           sex     age  sibsp  parch  ticket            fare     cabin  embarked  boat  body  home.dest
1       1         Compton, Miss. Sara Rebecca    female  39   1      1      PC 17756          83.1583  E49    C         14    NA    Lakewood, NJ
3       0         Dennis, Mr. Samuel             male    22   0      0      A/5 21172         7.2500   NA     S         NA    NA    NA
2       0         Banfield, Mr. Frederick James  male    28   0      0      C.A./SOTON 34068  10.5000  NA     S         NA    NA    Plymouth, Dorset / Houghton, MI
2       1         Christy, Mrs. (Alice Frances)  female  45   0      2      237789            30.0000  NA     S         12    NA    London

I.13 Ukraine’s regional population

In the repo as ukraine-oblasts-population.csv. Copied from the Wikipedia table on 2024-03-03. Population as of 2015.

Example:

read_delim("data/ukraine-oblasts-population.csv") %>%
   head(3)
## # A tibble: 3 × 4
##   Prefecture            Population `Urban population` `Rural population`
##   <chr>                      <dbl>              <dbl>              <dbl>
## 1 Donetsk Oblast           4387702            3973317             414385
## 2 Dnipropetrovsk Oblast    3258705            2724872             533833
## 3 Kyiv                     2900920            2900920                 NA

The variables are self-explanatory.

I.14 Ukraine with regions

In the repo as ukraine-with-regions_1530.geojson. The national borders and regional (oblast) borders of Ukraine in geojson format. Provided by Cartography Vectors.

The map:

library(sf)
library(ggplot2)
map <- read_sf("data/ukraine-with-regions_1530.geojson")
ggplot(map) +
   geom_sf()

National and regional (oblast) borders of Ukraine. Provided by Cartography Vectors.