I Dataset Description
Here is a brief description of the datasets that are used in this book.
I.1 Alcohol disorders
In the book repo: alcohol-disorders.csv.
Share of males and females suffering from alcohol use disorders (percent). Alcohol dependence is defined by the International Classification of Diseases as the presence of three or more indicators of dependence for at least a month within the previous year. The figures are age-standardized prevalence, which assumes a constant age structure and so allows comparison by sex, by country and through time.
IHME, Global Burden of Disease Study (2019) – processed by Our World in Data.
Downloaded from OWiD.
Variables:
- country (Entity in the original data): country name; only Argentina, Kenya, Taiwan, Ukraine and the U.S. are included.
- Code: 3-letter country code
- Year: 2015–2019 (only a subset of the original)
- disordersM: number of cases of alcohol use disorders per 100 people, in males, age-standardized
- disordersF: number of cases of alcohol use disorders per 100 people, in females, age-standardized
- population: population (historical estimates)
Example:
read_delim("data/alcohol-disorders.csv") %>%
sample_n(4)
## # A tibble: 4 × 6
## country Code Year disordersM disordersF population
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Ukraine UKR 2018 3.53 1.33 44446952
## 2 Taiwan TWN 2017 0.886 0.260 23665028
## 3 Argentina ARG 2015 3.07 1.17 43257064
## 4 Ukraine UKR 2015 3.90 1.43 44982568
I.2 Babynames
The R package babynames contains a dataset babynames. It includes all baby names given in the U.S. between 1880 and 2017 that were used at least 5 times in a given year for a given sex. The data originate from the U.S. Social Security Administration.
You can load it with library(babynames), which loads a single data frame, babynames.
Example:
library(babynames)
babynames %>%
  sample_n(3)
## # A tibble: 3 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1982 M Santino 38 0.0000201
## 2 2001 M Collis 5 0.00000242
## 3 1961 F Kimberlea 17 0.00000819
Variables:
- year: 1880-2017
- name: the name
- sex: “F” or “M”
- n: how many babies got this name (within year/sex)
- prop: proportion of babies who got this name in the given year (within year/sex).
I.3 Country-concept similarity
In the book repo: country-concept-similarity.csv.bz2. This dataset shows the similarity between country names and a set of different words. It is calculated from texts that were scraped from the internet around 2015. The dataset looks like this:
similarity <- read_delim("data/country-concept-similarity.csv.bz2")
similarity %>%
  head(2)
## # A tibble: 2 × 12
## country terrorism nuclear trade battery regime
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aruba 0.0891 -0.011 0.0504 -0.01 -0.0356
## 2 afghanistan 0.447 0.220 0.109 0.0578 0.180
## volcano palm fir flood drought mountain
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.166 0.293 0.0965 0.0158 0.0581 0.107
## 2 0.129 0.116 0.129 0.159 0.160 0.161
One can see that “Afghanistan” and “terrorism” are much more similar (similarity 0.447) than e.g. “Afghanistan” and “trade” (similarity 0.109). We do not go into details here about how the similarity is measured, but broadly, it reflects how frequently these words are used in a context similar to that of the corresponding country names.
I.5 Diamonds
This is a built-in dataset in the ggplot2 library, so it is available as soon as you load the library. It contains price, shape, color and other information for 53940 diamonds. A sample of it looks like this:
diamonds %>%
  sample_n(5)
## # A tibble: 5 × 10
## carat cut color clarity depth table price x y
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1.14 Very Good E SI2 61.5 58 5236 6.66 6.73
## 2 0.36 Very Good E SI1 60.8 59 631 4.56 4.59
## 3 1.21 Ideal F VS2 61.5 56 8348 6.87 6.83
## 4 1.1 Ideal G SI2 62.1 58 5361 6.61 6.63
## 5 0.38 Ideal E VS1 62.2 55 1112 4.67 4.63
## z
## <dbl>
## 1 4.12
## 2 2.78
## 3 4.21
## 4 4.11
## 5 2.89
Variables:
- carat: mass of the diamond in carats (ct), 1 ct = 0.2 g
- cut: describes the shape of the diamond. There are five different cuts: Ideal is the best and Fair is the worst in these data. Better cuts make more brilliant diamonds. Note: cut is an ordered factor, see Section 17.2.1.
- color: there are 7 color levels; D (no color) is the best and J the worst. Any color hue is considered undesirable.
- clarity: measures the defects in diamonds; IF (internally flawless) is the best and I1 is the worst.
- depth, table: measures of the diamond shape
- price: price in US dollars
- x, y, z: diamond size, in mm
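Because cut is an ordered factor, its values can be compared with < and >. The following is a minimal base-R sketch of that behavior. It does not use the diamonds data itself: the level names are typed in by hand, and the vector cuts is a made-up illustration, on the assumption that the levels run Fair < Good < Very Good < Premium < Ideal as in diamonds$cut.

```r
# Level names typed in by hand; assumed to match the ordering of diamonds$cut
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# A small made-up vector of cuts, stored as an ordered factor
cuts <- factor(c("Ideal", "Fair", "Very Good"),
               levels = cut_levels, ordered = TRUE)

cuts > "Good"   # TRUE FALSE TRUE: comparisons follow the level order
max(cuts)       # Ideal: the best cut in this small vector
```

With an ordinary (unordered) factor the comparison would return NA with a warning, which is why the ordering matters when filtering by quality.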
I.6 Height-weight
In the book repo: height-weight.csv.
I.7 Icecream
This dataset is in the package Ecdat. It contains 30 four-weekly observations of ice cream consumption in the U.S. in the 1950s. Example:
library(Ecdat)
Icecream %>%
  sample_n(4)
## cons income price temp
## 23 0.284 94 0.277 32
## 12 0.298 85 0.270 26
## 5 0.406 76 0.272 69
## 13 0.329 86 0.272 32
Variables:
- cons: consumption of ice cream per head (in pints)
- income: average family income per week (in US dollars)
- price: price of ice cream (per pint)
- temp: average temperature (in Fahrenheit)
I.8 Ice extent
TBD: an explanatory figure of area/extent
In the book repo: ice-extent.csv.bz2.
National Snow & Ice Data Center (NSIDC) data about sea ice extent and area. Downloaded from U Colorado.
A sample of data:
read_delim("data/ice-extent.csv.bz2") %>%
sample_n(5)
## # A tibble: 5 × 7
## year month `data-type` region extent area time
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2005 8 Goddard S 17.9 14.0 2006.
## 2 1996 2 Goddard S 2.98 1.77 1996.
## 3 1990 5 Goddard N 13.2 10.9 1990.
## 4 2020 3 Goddard N 14.7 13.0 2020.
## 5 1998 4 Goddard S 6.85 5.12 1998.
I have not found a description of the variables, but they are fairly self-explanatory:
- year
- month (1-12)
- data-type: apparently the name of the satellite or other data provider
- region: “N” for northern, “S” for southern hemisphere
- extent: sea ice extent, in million km². Extent is the sea surface area where the ice concentration is at least 15%.
- area: sea ice surface area, in million km²
- time: a continuous time variable computed from year and month as \(\mathit{time} = \mathit{year} + \mathit{month}/12 - 1/24\). This is roughly the middle of each month, measured in years.
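The formula for time can be checked with a short calculation. This is only a sketch; the function name mid_month_time is made up for the illustration and is not part of the dataset or of any package.

```r
# Hypothetical helper (the name is an assumption): mid-month time in years,
# time = year + month/12 - 1/24
mid_month_time <- function(year, month) {
  year + month / 12 - 1 / 24
}

mid_month_time(2005, 8)   # 2005.625, i.e. mid-August 2005
mid_month_time(1996, 1)   # 1996 + 1/24, i.e. mid-January 1996
```

Subtracting 1/24 (half a month) moves the value from the end of the month to its middle, so January falls at year + 1/24 and December at year + 23/24.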
I.9 Iris
The iris dataset was collected by Edgar Anderson and published by Ronald Fisher in 1936. It contains sepal and petal measurements of 150 iris flowers of the species setosa, versicolor and virginica (50 of each). It is an R built-in dataset and does not even have to be loaded; you can just use the variable iris.
The variables are
- Sepal.Length: sepal length, in cm
- Sepal.Width: sepal width, in cm
- Petal.Length: petal length, in cm
- Petal.Width: petal width, in cm
- Species: setosa/versicolor/virginica
A small example of it:
iris %>%
  sample_n(4)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 4.4 3.2 1.3 0.2
## 2 5.1 3.8 1.6 0.2
## 3 5.5 2.3 4.0 1.3
## 4 6.7 3.1 4.4 1.4
## Species
## 1 setosa
## 2 setosa
## 3 versicolor
## 4 versicolor
I.10 Orange tree growth
This is an R built-in dataset. However, because the built-in version uses more complex data structures, a copy of it is in the repo as a plain csv file: orange-trees.csv
Variables:
- Tree: an ordered factor indicating the tree on which the measurement is made. The ordering is according to increasing maximum diameter.
- age: a numeric vector giving the age of the tree (days since 1968/12/31)
- circumference: a numeric vector of trunk circumferences (mm). This is probably “circumference at breast height”, a standard measurement in forestry.
You normally do not need to load it (just use Orange), but other libraries (e.g. Ecdat) may override it. In that case you may use
data(Orange, package = "datasets")  # load 'Orange' from main R data package
Orange %>%
  sample_n(4)
## Tree age circumference
## 1 3 664 75
## 2 1 1582 145
## 3 2 484 69
## 4 1 664 87
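The “more complex data structures” of the built-in version can be inspected directly. The sketch below uses only base R; the variable names orange and orange_df are made up for the illustration. It shows that the built-in object carries extra classes on top of data.frame, and one possible way to reduce it to a plain data frame.

```r
# Take the built-in version explicitly from the 'datasets' package,
# so that a same-named object from another package cannot interfere
orange <- datasets::Orange

class(orange)                   # several classes, the last one is "data.frame"
inherits(orange, "groupedData") # TRUE: it is more than a plain data frame

# Drop the extra classes and keep an ordinary data frame
orange_df <- as.data.frame(orange)
class(orange_df)                # "data.frame"
is.ordered(orange_df$Tree)      # TRUE: Tree is still an ordered factor
```

The csv copy in the repo avoids these extra classes altogether, at the price of losing the factor ordering of Tree when the file is read back in.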
I.11 US States
R has multiple small datasets about the US states. They are built-in variables, so you do not need to do anything special to load these. Examples:
## Full names of the states
state.name[1:5]
## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
## 2-letter abbreviations
state.abb[1:5]
## [1] "AL" "AK" "AZ" "AR" "CA"
Importantly, all these vectors contain data in the same order, so you can use a state's position in state.name to find the corresponding value in any of the other vectors.
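Since the vectors share the same order, a lookup amounts to finding a position in one vector and reading the same position from another. A minimal sketch using only the built-in state.name and state.abb:

```r
# Find where a state sits in one vector,
# then read the same position from another
state.abb[state.name == "Alaska"]            # "AK"

# match() does the same for several values at once, in the order requested
state.name[match(c("CA", "AZ"), state.abb)]  # "California" "Arizona"
```

The logical comparison is convenient for a single state, while match() keeps the order of the requested values, which helps when adding a column to a data frame.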
I.12 Titanic
In repo as titanic.csv.bz2.
A list of RMS Titanic passengers with their name, age and some other data, and whether they survived the shipwreck. It was collected by the investigation committee, and contains most of the passengers on the ship. The dataset is available from various sources, e.g. on Kaggle. The variables are
- pclass: passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
- survived: survival (0 = No; 1 = Yes)
- name: name
- sex: sex
- age: age
- sibsp: number of siblings/spouses aboard
- parch: number of parents/children aboard
- ticket: ticket number
- fare: passenger fare
- cabin: cabin
- embarked: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- boat: lifeboat code (if survived)
- body: body number (if did not survive and body was recovered)
- home.dest: the home/final destination of the passenger
A small example of it:
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Compton, Miss. Sara Rebecca | female | 39 | 1 | 1 | PC 17756 | 83.1583 | E49 | C | 14 | NA | Lakewood, NJ |
3 | 0 | Dennis, Mr. Samuel | male | 22 | 0 | 0 | A/5 21172 | 7.2500 | NA | S | NA | NA | NA |
2 | 0 | Banfield, Mr. Frederick James | male | 28 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NA | S | NA | NA | Plymouth, Dorset / Houghton, MI |
2 | 1 | Christy, Mrs. (Alice Frances) | female | 45 | 0 | 2 | 237789 | 30.0000 | NA | S | 12 | NA | London |
I.13 Ukraine’s regional population
In repo as ukraine-oblasts-population.csv. Copied from the Wikipedia table on 2024-03-03. Population as of 2015.
Example:
read_delim("data/ukraine-oblasts-population.csv") %>%
head(3)
## # A tibble: 3 × 4
## Prefecture Population `Urban population` `Rural population`
## <chr> <dbl> <dbl> <dbl>
## 1 Donetsk Oblast 4387702 3973317 414385
## 2 Dnipropetrovsk Oblast 3258705 2724872 533833
## 3 Kyiv 2900920 2900920 NA
The variables are self-explanatory.
I.14 Ukraine with regions
In repo as ukraine-with-regions_1530.geojson. The national borders and regional (oblast) borders of Ukraine in geojson format. Provided by Cartography Vectors.
The map:
library(sf)
library(ggplot2)
map <- read_sf("data/ukraine-with-regions_1530.geojson")
ggplot(map) +
  geom_sf()
![fig: plot of chunk unnamed-chunk-13](.fig/datasets/unnamed-chunk-13-1.png)
National and regional (oblast) borders of Ukraine. Provided by Cartography Vectors.