Chapter 5 First steps with data: descriptive analysis
When you start working with a new dataset, the first task is to do some descriptive analysis. This serves two purposes: to ensure that you loaded the data correctly, and to understand whether it contains the information you need for the analysis.
5.1 Basic data description
The first steps usually involve questions about the number of rows and columns, the names of the variables, and basic summary information. All of these can be answered with the basic data frame syntax (see Section 2.4).
The number of rows and columns can be queried with dim(), nrow() and ncol(). Let’s demonstrate this with the titanic data:
library(tidyverse) # provides read_delim(), %>% and sample_n()
titanic <- read_delim("../data/titanic.csv.bz2")
dim(titanic) # rows, columns
## [1] 1309 14
nrow(titanic) # rows
## [1] 1309
ncol(titanic) # columns
## [1] 14
So the dataset contains 1309 rows and 14 columns.
For quick interactive analysis, it may be best to use dim(), as a single command gives both rows and columns. But when you need these figures in code, ncol() or nrow() may be better, because each returns a single number.
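When these figures are needed in code, nrow() and ncol() return single numbers that can be stored and tested directly. Here is a minimal sketch, using R's built-in mtcars data frame as a stand-in (an assumption made so that the example runs without the titanic file):

```r
# mtcars ships with base R; it stands in for your own dataset here
n <- nrow(mtcars)   # number of rows
k <- ncol(mtcars)   # number of columns

# Stop early if the data looks empty, e.g. after a failed load
stopifnot(n > 0, k > 0)

cat("Loaded", n, "rows and", k, "columns\n")
## Loaded 32 rows and 11 columns
```

The same check works on any data frame, including titanic above.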
Next, we may want to know the variable names. This can be easily done with names():
names(titanic)
## [1] "pclass" "survived" "name" "sex" "age" "sibsp"
## [7] "parch" "ticket" "fare" "cabin" "embarked" "boat"
## [13] "body" "home.dest"
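Because names() returns an ordinary character vector, it can also be used to check in code whether the variables you need are present. Here is a small sketch, again on the built-in mtcars data; the vector needed is a hypothetical list of required variables, not something from the titanic example:

```r
# Hypothetical list of variables the analysis requires
needed <- c("mpg", "cyl", "price")

# TRUE for each required variable that exists in the data frame
present <- needed %in% names(mtcars)

# Variables that are missing from the data
needed[!present]
## [1] "price"
```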
But sometimes it is more useful not just to look at the names but also to print a few lines of data. Here head(), tail(), and sample_n() come in handy:
titanic %>%
  head(2) # first few lines
## # A tibble: 2 × 14
## pclass survived name sex age sibsp parch ticket
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1 1 Allen, Mi… fema… 29 0 0 24160
## 2 1 1 Allison, … male 0.917 1 2 113781
## # ℹ 6 more variables: fare <dbl>, cabin <chr>,
## # embarked <chr>, boat <chr>, body <dbl>, home.dest <chr>
titanic %>%
  tail(2) # last few lines
## # A tibble: 2 × 14
## pclass survived name sex age sibsp parch ticket fare
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 3 0 Zaka… male 27 0 0 2670 7.22
## 2 3 0 Zimm… male 29 0 0 315082 7.88
## # ℹ 5 more variables: cabin <chr>, embarked <chr>,
## # boat <chr>, body <dbl>, home.dest <chr>
titanic %>%
  sample_n(2) # random few lines
## # A tibble: 2 × 14
## pclass survived name sex age sibsp parch ticket fare
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 1 Brow… fema… 44 0 0 PC 17… 27.7
## 2 3 1 Lang… male 26 0 0 1601 56.5
## # ℹ 5 more variables: cabin <chr>, embarked <chr>,
## # boat <chr>, body <dbl>, home.dest <chr>
The exact output depends on the type of data frame: if it was loaded or manipulated through tidyverse functions (such as read_delim() or the dplyr verbs), the result is a data frame of the tibble flavor. Tibbles are printed in a more compact way, but unfortunately this leaves out some of the columns and shortens the others. If you prefer the full output, you can convert the result into a base-R data frame with as.data.frame() before printing the lines:
titanic %>%
  as.data.frame() %>%
  head(2)
## pclass survived name sex age sibsp parch
## 1 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0
## 2 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2
## ticket fare cabin embarked boat body home.dest
## 1 24160 211.3375 B5 S 2 NA St Louis, MO
## 2 113781 151.5500 C22 C26 S 11 NA Montreal, PQ / Chesterville, ON
Now the output includes all columns and full-length fields, but it is very wide and is typically wrapped into multiple lines.
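If you are unsure which flavor of data frame you have, class() will tell you. The sketch below assumes the tibble package (part of the tidyverse) is installed and uses the built-in mtcars data as a stand-in; the object names tb and df are chosen for illustration only:

```r
library(tibble)  # part of the tidyverse

tb <- as_tibble(mtcars)    # tibble flavor, prints compactly
df <- as.data.frame(tb)    # base-R data frame, prints in full

class(tb)
## [1] "tbl_df"     "tbl"        "data.frame"
class(df)
## [1] "data.frame"
```

A tibble is still a data frame, so functions like dim() and names() work the same way on both.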