I Dataset Description

Here is a brief description of the datasets that are used in this book.

I.1 Alcohol disorders

In the book repo: alcohol-disorders.csv.

Share of males and females suffering from alcohol use disorders (percentage of population). Alcohol dependence is defined by the International Classification of Diseases as the presence of three or more indicators of dependence for at least a month within the previous year. This is given as the age-standardized prevalence, which assumes a constant age structure and so allows comparison by sex, country and through time.

IHME, Global Burden of Disease Study (2019) – processed by Our World in Data.
Downloaded from OWiD.

Variables:

  • country: country name; only Argentina, Kenya, Taiwan, Ukraine and the U.S. are included
  • Code: 3-letter country code
  • Year: 2015–2019 (only a subset of the original)
  • disordersM: number of cases of alcohol use disorders per 100 people, in males, age-standardized
  • disordersF: number of cases of alcohol use disorders per 100 people, in females, age-standardized
  • population: population (historical estimates)

Example:

read_delim("data/alcohol-disorders.csv") %>%
   sample_n(4)
## # A tibble: 4 × 6
##   country Code   Year disordersM disordersF population
##   <chr>   <chr> <dbl>      <dbl>      <dbl>      <dbl>
## 1 Taiwan  TWN    2018      0.892      0.259   23726186
## 2 Taiwan  TWN    2016      0.889      0.261   23594476
## 3 Kenya   KEN    2018      0.752      0.662   49953300
## 4 Kenya   KEN    2017      0.742      0.658   48948140

I.2 Babynames

The R package babynames contains a dataset babynames. It includes all baby names given in the U.S. between 1880 and 2017 that were used at least 5 times in a year for a given sex. The data originates from the U.S. Social Security Administration.

You can load it with library(babynames), which loads a single data frame babynames.

Example:

library(babynames)
babynames %>%
   sample_n(3)
## # A tibble: 3 × 5
##    year sex   name        n       prop
##   <dbl> <chr> <chr>   <int>      <dbl>
## 1  2011 M     Mysean      5 0.00000246
## 2  1940 M     Burnes      5 0.00000422
## 3  2014 F     Alyanna   109 0.0000558
Variables:
  • year: 1880-2017
  • name: the name
  • sex: “F” or “M”
  • n: how many babies got this name (within year/sex)
  • prop: proportion of babies who got this name in the given year (within year/sex).

I.3 Country-concept similarity

In the book repo: country-concept-similarity.csv.bz2. This dataset shows the similarity between country names and a set of different words. It is calculated from texts that were scraped from the internet around 2015. The dataset looks like this:

similarity <- read_delim("data/country-concept-similarity.csv.bz2")
similarity %>%
   head(2)
## # A tibble: 2 × 12
##   country     terrorism nuclear  trade battery  regime volcano  palm    fir  flood
##   <chr>           <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>  <dbl>
## 1 aruba          0.0891  -0.011 0.0504 -0.01   -0.0356   0.166 0.293 0.0965 0.0158
## 2 afghanistan    0.447    0.220 0.109   0.0578  0.180    0.129 0.116 0.129  0.159 
##   drought mountain
##     <dbl>    <dbl>
## 1  0.0581    0.107
## 2  0.160     0.161

One can see that “Afghanistan” and “terrorism” are much more similar (similarity 0.447) than e.g. “Afghanistan” and “trade” (similarity 0.109). We do not go into details here about how the similarity is measured, but broadly, it describes how frequently these words are used in a context similar to that of the corresponding country names.
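To illustrate how such a table can be read, here is a minimal sketch that finds the most similar concept for each country. It uses only the two rows printed above, typed in by hand. The column name closest and the helper vector concepts are made up for this example.

```r
# Similarity values copied from the two rows printed above
similarity <- data.frame(
  country   = c("aruba", "afghanistan"),
  terrorism = c(0.0891, 0.447),
  nuclear   = c(-0.011, 0.220),
  trade     = c(0.0504, 0.109),
  battery   = c(-0.01, 0.0578),
  regime    = c(-0.0356, 0.180),
  volcano   = c(0.166, 0.129),
  palm      = c(0.293, 0.116),
  fir       = c(0.0965, 0.129),
  flood     = c(0.0158, 0.159),
  drought   = c(0.0581, 0.160),
  mountain  = c(0.107, 0.161)
)

# All columns except 'country' are concepts
concepts <- setdiff(names(similarity), "country")
# For each row, pick the concept with the largest similarity
similarity$closest <- concepts[apply(similarity[concepts], 1, which.max)]
similarity[c("country", "closest")]
```

For these two rows, the closest concept is “palm” for Aruba and “terrorism” for Afghanistan.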

I.4 Covid in Scandinavia

In the book repo: covid-scandinavia.csv.bz2

Data downloaded from GitHub.

This is a subset that contains only the Scandinavian countries, only national-level data, and only deaths and confirmed cases. Per capita counts and daily growth numbers have been added. Example:

read_delim("data/covid-scandinavia.csv.bz2") %>%
   sample_n(4)
## # A tibble: 4 × 11
##   code2 country state date       type      count lockdown   population    countPC
##   <chr> <chr>   <lgl> <date>     <chr>     <dbl> <date>          <dbl>      <dbl>
## 1 DK    Denmark NA    2020-09-27 Confirmed 26637 2020-03-11    5837213 0.00456   
## 2 DK    Denmark NA    2020-03-20 Deaths        9 2020-03-11    5837213 0.00000154
## 3 DK    Denmark NA    2020-07-03 Deaths      606 2020-03-11    5837213 0.000104  
## 4 SE    Sweden  NA    2020-08-22 Confirmed 83114 NA           10377781 0.00801   
##   growth    growthPC
##    <dbl>       <dbl>
## 1    424 0.0000726  
## 2      3 0.000000514
## 3      0 0          
## 4    160 0.0000154
Variables:
  • code2: 2-letter country code
  • country: country name
  • state: federal state, just NA in case of Scandinavian countries
  • date: date of the count
  • type: count type: Confirmed/Deaths
  • count: how many persons have confirmed covid/died
  • lockdown: date of the country's major lockdown (a date, as in the sample above; NA if there was none, as for Sweden)
  • population: country population (only one number)
  • countPC: count per capita
  • growth: growth in count
  • growthPC: growth in count per capita
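The per capita variables can be checked by hand. The sketch below assumes that "per capita" simply means division by population, which is consistent with the sample rows above. The numbers are copied from the first sample row.

```r
# Values copied from the first sample row above (Denmark, 2020-09-27)
count      <- 26637
growth     <- 424
population <- 5837213

countPC  <- count / population    # confirmed cases per person
growthPC <- growth / population   # growth in confirmed cases per person

signif(countPC, 3)    # 0.00456, as in the sample
signif(growthPC, 3)   # 7.26e-05, as in the sample
```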

I.5 CS-GO

A dataset of reviews of CS-GO (a video game): each line is a review. It was scraped from the Steam website by mulhod; see the original repo on GitHub. The dataset is not really documented, but you can guess the meaning from the column names.

Sample:

read_delim("data/csgo-reviews.csv.bz2") %>%
   sample_n(3)
## # A tibble: 3 × 8
##   rating      nHelpful nFunny nScreenshots date                 hours nGames nReviews
##   <chr>          <dbl>  <dbl>        <dbl> <chr>                <dbl>  <dbl>    <dbl>
## 1 Recommended        5      0           64 Jun 4, 2014, 10:51AM 1497.     74        5
## 2 Recommended        6      0           13 Jul 13, 2014, 9:29PM  387.     14        1
## 3 Recommended        5      0          107 Aug 9, 2014, 11:03AM  998.     78        5
Variables:
  • rating: Recommended/Not recommended
  • nHelpful: number voted helpful
  • nFunny: number found funny
  • nScreenshots: number of screenshots
  • date: date posted
  • hours: total game hours by the reviewer
  • nGames: number of games
  • nReviews: number of reviews
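Note that date is stored as text (chr) in the sample above. Below is a minimal sketch of one way to convert such strings into date-times in base R. The format string is inferred from the sample rows, and the time zone (UTC here) is an assumption because the dataset does not document it. Parsing month names and AM/PM also depends on an English locale.

```r
# Date strings copied from the sample above
dates <- c("Jun 4, 2014, 10:51AM", "Jul 13, 2014, 9:29PM")

# %b = abbreviated month name, %I = hour on the 12-hour clock, %p = AM/PM
# The time zone is not documented; UTC is an assumption
parsed <- as.POSIXct(dates, format = "%b %d, %Y, %I:%M%p", tz = "UTC")

# Print in an unambiguous 24-hour format
format(parsed, "%Y-%m-%d %H:%M")
```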

TBD: does anyone know Steam well enough to help here?

I.6 Diamonds

It is a built-in dataset in the ggplot2 library, so it is available as soon as you load the library. It contains price, shape, color and other information for 53940 diamonds. A sample of it looks like this:

diamonds %>%
   sample_n(5)
## # A tibble: 5 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.83 Ideal     F     SI1      62      55  3774  6.03  6.07  3.75
## 2  0.61 Very Good E     SI1      63.3    60  1726  5.36  5.29  3.37
## 3  1.03 Ideal     E     SI1      58.9    57  4873  6.62  6.56  3.88
## 4  1.07 Premium   H     SI2      62.2    59  4119  6.47  6.53  4.04
## 5  0.61 Good      F     SI2      62.5    65  3807  5.36  5.29  3.33
Variables:
  • carat: mass of the diamond in carats (ct), 1 ct = 0.2 g

  • cut: cut describes the shape of diamond. There are five different cuts: Ideal is the best and Fair is the worst in these data. Better cuts make diamonds that are more brilliant.

    Note: cut is an ordered factor, see Section 17.2.1.

  • color: There are 7 color levels: D (no color) is the best and J the worst. Any color hue is considered undesirable.

  • clarity: measures the defects in diamonds, IF (internally flawless) is the best, and I1 is the worst.

  • depth, table: measures of the diamond shape

  • price: in $

  • x, y, z: diamond size, mm
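Because cut is an ordered factor, its values can be compared with each other. Here is a minimal sketch that does not need ggplot2: it builds a small ordered factor by hand, using the same level order as the cut variable (Fair is the worst, Ideal the best). The vector cuts is made up for the example.

```r
# A few cut values with the same level order as in the diamonds data
cuts <- factor(
  c("Ideal", "Very Good", "Premium", "Good"),
  levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
  ordered = TRUE
)

# Ordered factors support comparisons
cuts >= "Premium"   # TRUE FALSE TRUE FALSE
# ... and finding the best (or worst) value
max(cuts)           # Ideal
```

With an unordered factor the same comparison would produce NA values and a warning.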

I.7 Fatalities

U.S. traffic fatalities by state in the 1980s. This is a subset of the Fatalities dataset in the AER package. Example:

read_delim("data/fatalities.csv") %>%
   sample_n(4)
## # A tibble: 4 × 4
##    year state fatal      pop
##   <dbl> <chr> <dbl>    <dbl>
## 1  1982 MN      571 4133009 
## 2  1983 OR      550 2659999 
## 3  1985 OR      559 2686996.
## 4  1984 WA      746 4348992
Variables:
  • year: 1982-1988
  • state: only MN, OR, WA
  • fatal: total number of traffic fatalities
  • pop: population

I.8 Height-weight

In the book repo: height-weight.csv.

A synthetic dataset of five rows that demonstrates certain data properties. Here is the whole dataset; the meaning of the columns is obvious:

read_delim("data/height-weight.csv")
## # A tibble: 5 × 4
##   sex      age height weight
##   <chr>  <dbl>  <dbl>  <dbl>
## 1 Female    16    173   58.5
## 2 Female    17    165   56.7
## 3 Male      17    170   61.2
## 4 Male      16    163   54.4
## 5 Male      18    170   63.5

I.9 Icecream

It is located in the package Ecdat. It contains 30 four-weekly observations of ice cream consumption in the U.S. in the 1950s. Example:

data(Icecream, package = "Ecdat")  # 'Ecdat' must be installed
Icecream %>%
   sample_n(4)
##     cons income price temp
## 2  0.374     79 0.282   56
## 10 0.256     79 0.277   24
## 22 0.307     87 0.287   40
## 26 0.359     96 0.265   33
Variables:
  • cons: consumption of ice cream per head (in pints);
  • income: average family income per week (in US Dollars);
  • price: price of ice cream (per pint);
  • temp: average temperature (in Fahrenheit);

I.10 Ice extent

TBD: an explanatory figure of area/extent

In the book repo: ice-extent.csv.bz2.

National Snow & Ice Data Center (NSIDC) data about sea ice extent and area. Downloaded from the University of Colorado.

A sample of data:

read_delim("data/ice-extent.csv.bz2") %>%
   sample_n(5)
## # A tibble: 5 × 7
##    year month `data-type` region extent  area  time
##   <dbl> <dbl> <chr>       <chr>   <dbl> <dbl> <dbl>
## 1  2002    10 Goddard     N        8.16  6.24 2003.
## 2  2013     4 Goddard     S        7.62  5.75 2013.
## 3  1996     8 Goddard     S       17.7  13.9  1997.
## 4  1994     4 Goddard     S        7.22  5.45 1994.
## 5  1990    10 Goddard     S       18.0  13.8  1991.
I haven’t found a description of the variables, but they are fairly self-explanatory:
  • year
  • month: (1-12)
  • data-type: looks like the name of the satellite or another info provider
  • region: “N” for northern, “S” for southern hemisphere
  • extent: sea ice extent, in million km². Extent is the sea surface area where the ice concentration is at least 15%.
  • area: sea ice surface area, in million km²
  • time: a continuous time variable, made of year and month \(\mathit{time} = \mathit{year} + \mathit{month}/12 - 1/24\). This describes roughly the middle of each month as measured in years.
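As a worked example of the time formula, the sketch below recomputes time for three (year, month) pairs copied from the sample above. The data frame ice is created by hand for the illustration.

```r
# Year and month copied from the first three sample rows above
ice <- data.frame(year = c(2002, 2013, 1996), month = c(10, 4, 8))

# Middle of each month, measured in years
ice$time <- ice$year + ice$month / 12 - 1 / 24
round(ice$time, 3)   # 2002.792 2013.292 1996.625
```

The sample above shows the same values rounded to whole years, because tibbles print only a few significant digits.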

I.11 Iris

Figure (iris species): Iris flowers are beautiful. Setosa, virginica and versicolor.

The iris dataset was published by Ronald Fisher in 1936, based on measurements collected by Edgar Anderson. It contains sepal and petal measures of 150 iris flowers of species setosa, versicolor and virginica (50 of each). It is an R built-in dataset and does not even have to be loaded: you can just use the variable iris.

Figure (sepal and petal definitions): Petals and sepals are parts of the flower (virginica).

The variables are
  • Sepal.Length: sepal length, in cm
  • Sepal.Width: sepal width, in cm
  • Petal.Length: petal length, in cm
  • Petal.Width: petal width, in cm
  • Species: setosa/versicolor/virginica

A small example of it:

iris %>%
   sample_n(4)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          5.4         3.4          1.7         0.2     setosa
## 2          6.4         3.1          5.5         1.8  virginica
## 3          5.0         3.2          1.2         0.2     setosa
## 4          5.4         3.0          4.5         1.5 versicolor
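Because iris is built in, it can be used right away without loading anything. As a small illustration (one possible summary, not the only way to do it), the sketch below computes the average of each measurement by species with base R. The name means is chosen for the example.

```r
# Mean of every measurement (in cm) for each species
means <- aggregate(. ~ Species, data = iris, FUN = mean)
means
```

The result has one row per species and one column per measurement.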

I.12 Orange tree growth

It is an R built-in dataset. However, as the built-in version uses more complex data structures, a copy of it is in the repo as a plain csv file: orange-trees.csv

Variables:
  • Tree: an ordered factor indicating the tree on which the measurement is made. The ordering is according to increasing maximum diameter.
  • age: a numeric vector giving the age of the tree (days since 1968/12/31)
  • circumference: a numeric vector of trunk circumferences (mm). This is probably “circumference at breast height”, a standard measurement in forestry.
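The "more complex data structures" are visible when you inspect the built-in Orange object. The sketch below shows this, and then builds a plain data frame from it. This is only an illustration. It is not necessarily how orange-trees.csv was produced, and the name orange is made up for the example.

```r
# The built-in version carries several classes on top of "data.frame"
class(Orange)
# Tree is an ordered factor, not a plain number or text
is.ordered(Orange$Tree)

# A plain version: ordinary data frame, with Tree stored as text
orange <- data.frame(
  Tree             = as.character(Orange$Tree),
  age              = Orange$age,
  circumference    = Orange$circumference,
  stringsAsFactors = FALSE
)
str(orange)
```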

I.13 US States

R has multiple small datasets about the US states. They are built-in variables, so you do not need to do anything special to load these. Examples:

## Full names of the states
state.name[1:5]
## [1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
## 2-letter abbreviations
state.abb[1:5]
## [1] "AL" "AK" "AZ" "AR" "CA"

Importantly, all these vectors contain data in the same order, so you can use the position of a state's name to find the corresponding value in another vector.
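A minimal sketch of such a lookup, using the built-in vectors state.name, state.abb and state.region (the index variable i is made up for the example):

```r
# Position of the state in the vector of full names
i <- match("California", state.name)
# The same position holds the matching values in the other vectors
state.abb[i]      # "CA"
state.region[i]   # West
```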

I.14 Titanic

In repo as titanic.csv.bz2.

List of RMS Titanic passengers: their name, age and some more data, and whether they survived the shipwreck. It was collected by the investigation committee, and contains most of the passengers on the ship. The dataset is available from various sources, e.g. at Kaggle. The variables are

  • pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • survived: Survival (0 = No; 1 = Yes)
  • name: Name
  • sex: Sex
  • age: Age
  • sibsp: Number of Siblings/Spouses Aboard
  • parch: Number of Parents/Children Aboard
  • ticket: Ticket Number
  • fare: Passenger Fare
  • cabin: Cabin
  • embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • boat: Lifeboat code (if survived)
  • body: Body number (if did not survive and body was recovered)
  • home.dest: The home/final destination of passenger

A small example of it:

pclass  survived  name                          sex     age  sibsp  parch  ticket            fare    cabin  embarked  boat  body  home.dest
3       0         Patchett, Mr. George          male    19   0      0      358585            14.500  NA     S         NA    NA    NA
3       0         Heininen, Miss. Wendla Maria  female  23   0      0      STON/O2. 3101290  7.925   NA     S         NA    NA    NA
3       1         Drapkin, Miss. Jennie         female  23   0      0      SOTON/OQ 392083   8.050   NA     S         NA    NA    London New York, NY
1       1         Bowerman, Miss. Elsie Edith   female  22   0      1      113505            55.000  E33    S         6     NA    St Leonards-on-Sea, England Ohio

I.15 Ukraine’s regional population

In repo as ukraine-oblasts-population.csv. Copied from the Wikipedia table on 2024-03-03. Population as of 2015.

Example:

read_delim("data/ukraine-oblasts-population.csv") %>%
   head(3)
## # A tibble: 3 × 4
##   Prefecture            Population `Urban population` `Rural population`
##   <chr>                      <dbl>              <dbl>              <dbl>
## 1 Donetsk Oblast           4387702            3973317             414385
## 2 Dnipropetrovsk Oblast    3258705            2724872             533833
## 3 Kyiv                     2900920            2900920                 NA

The variables are self-explanatory.

I.16 Ukraine with regions

In repo as ukraine-with-regions_1530.geojson. The national borders and regional (oblast) borders of Ukraine in geojson format. Provided by Cartography Vectors.

The map:
library(sf)
library(ggplot2)
map <- read_sf("data/ukraine-with-regions_1530.geojson")
ggplot(map) +
   geom_sf()
Figure: National and regional (oblast) borders of Ukraine. Provided by Cartography Vectors.