Chapter 5 First steps with data: descriptive analysis
When you start working with a new dataset, the first task is descriptive analysis. It serves two purposes: to check that you loaded the data correctly, and to find out whether it contains the information you need for the analysis.
5.1 Basic data description
The first steps usually address a few basic questions: the number of rows and columns, the names of the variables, and basic summary information. These can be answered with the basic data frame syntax (see Section 2.4).
The number of rows and columns can be queried with dim(), nrow() and ncol().
Let’s demonstrate this with the Titanic data:
titanic <- read_delim("../data/titanic.csv.bz2")
dim(titanic)  # rows, columns
## [1] 1309   14
nrow(titanic)  # rows
## [1] 1309
ncol(titanic)  # columns
## [1] 14
So the dataset contains 1309 rows and 14 columns.
For quick interactive analysis, dim() may be best, as a single
command gives both the number of rows and columns. But when you need
these figures in code, nrow() or ncol() may be better, as each
returns a single number.
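As a minimal sketch of using these figures in code, the example below stores the row and column counts and uses the row count to compute a share. The data frame `passengers` and its values are hypothetical, made up so that the example runs without any data file.

```r
# Hypothetical toy data frame, standing in for a real dataset
passengers <- data.frame(
   name = c("A", "B", "C", "D"),
   age = c(29, NA, 11, 27)
)

n <- nrow(passengers)  # a single number: rows
k <- ncol(passengers)  # a single number: columns
cat("The dataset contains", n, "rows and", k, "columns\n")

# Share of rows where age is missing
missingShare <- sum(is.na(passengers$age)) / n
missingShare
```

Because nrow() returns one number, it can be used directly in the calculation, without first picking the right element out of the dim() result.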
Next, we may want to know the variable names. This can be easily done
with names():
names(titanic)
##  [1] "pclass"    "survived"  "name"      "sex"       "age"       "sibsp"
##  [7] "parch"     "ticket"    "fare"      "cabin"     "embarked"  "boat"     
## [13] "body"      "home.dest"
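The variable names can also be checked in code, for instance to see whether the data contains everything the analysis needs. The sketch below assumes a hypothetical toy data frame `passengers` and a made-up list of required variables; neither comes from the Titanic file.

```r
# Hypothetical toy data frame, standing in for a freshly loaded dataset
passengers <- data.frame(
   name = c("A", "B"),
   sex = c("female", "male"),
   age = c(29, 27)
)

# Variables the later analysis is assumed to need
required <- c("age", "sex", "fare")

required %in% names(passengers)  # which required variables are present?

# Required variables that the data lacks
missingVars <- setdiff(required, names(passengers))
missingVars
```

An empty result from setdiff() would mean that all required variables are present.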
But sometimes it is more useful not just to look at the names but
also to print a few lines of data. Here head(), tail(), and sample_n()
come in handy:
titanic %>%
   head(2)  # first few lines
## # A tibble: 2 × 14
##   pclass survived name       sex      age sibsp parch ticket
##    <dbl>    <dbl> <chr>      <chr>  <dbl> <dbl> <dbl> <chr> 
## 1      1        1 Allen, Mi… fema… 29         0     0 24160 
## 2      1        1 Allison, … male   0.917     1     2 113781
## # ℹ 6 more variables: fare <dbl>, cabin <chr>,
## #   embarked <chr>, boat <chr>, body <dbl>, home.dest <chr>
titanic %>%
   tail(2)  # last few lines
## # A tibble: 2 × 14
##   pclass survived name  sex     age sibsp parch ticket  fare
##    <dbl>    <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl>
## 1      3        0 Zaka… male     27     0     0 2670    7.22
## 2      3        0 Zimm… male     29     0     0 315082  7.88
## # ℹ 5 more variables: cabin <chr>, embarked <chr>,
## #   boat <chr>, body <dbl>, home.dest <chr>
titanic %>%
   sample_n(2)  # random few lines
## # A tibble: 2 × 14
##   pclass survived name  sex     age sibsp parch ticket  fare
##    <dbl>    <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl>
## 1      3        0 Moor… male     NA     0     0 A4. 5…  8.05
## 2      3        0 Ande… fema…    11     4     2 347082 31.3 
## # ℹ 5 more variables: cabin <chr>, embarked <chr>,
## #   boat <chr>, body <dbl>, home.dest <chr>
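A random sample shows different lines each time it is run. If you want the same lines again, one option is to fix the random seed first. The sketch below assumes that dplyr is installed and uses slice_sample(), the newer dplyr function that supersedes sample_n(). The tibble `passengers`, its values and the seed are hypothetical.

```r
library(dplyr)  # assumed installed; provides tibble(), %>% and slice_sample()

# Hypothetical toy data, standing in for a real dataset
passengers <- tibble(
   id = 1:10,
   age = c(29, 1, 30, 25, 48, 63, 39, 53, 71, 47)
)

set.seed(1)  # arbitrary seed; fixing it makes the draw repeatable
first <- passengers %>%
   slice_sample(n = 2)

set.seed(1)  # same seed again
second <- passengers %>%
   slice_sample(n = 2)

identical(first, second)  # the same seed gives the same two rows
```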
The exact output depends on the type of data frame: if the data is
loaded or manipulated through tidyverse functions (such as read_delim()
or the dplyr functions), the results are data frames of the tibble
flavor. These are printed in a more compact way, but unfortunately
some of the columns are left out and others are shortened. If you
prefer the longer full output, you can force the results into base-R
data frames with as.data.frame() before printing the lines:
titanic %>%
   as.data.frame() %>%
   head(2)
##   pclass survived                           name    sex     age sibsp parch
## 1      1        1  Allen, Miss. Elisabeth Walton female 29.0000     0     0
## 2      1        1 Allison, Master. Hudson Trevor   male  0.9167     1     2
##   ticket     fare   cabin embarked boat body                       home.dest
## 1  24160 211.3375      B5        S    2   NA                    St Louis, MO
## 2 113781 151.5500 C22 C26        S   11   NA Montreal, PQ / Chesterville, ON
Now the output includes all columns and full-length fields. However, it is very wide, so it is typically wrapped into multiple lines.
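Two other ways to see every column without converting the data are sketched below: printing the tibble with width = Inf, and glimpse(), which prints one line per variable. The sketch assumes that dplyr is installed. The toy tibble `passengers` is hypothetical, although its values are copied from the two lines printed above.

```r
library(dplyr)  # assumed installed; also provides tibble() and glimpse()

# Hypothetical toy tibble with a few long values
passengers <- tibble(
   name = c("Allen, Miss. Elisabeth Walton", "Allison, Master. Hudson Trevor"),
   sex = c("female", "male"),
   age = c(29, 0.9167),
   home.dest = c("St Louis, MO", "Montreal, PQ / Chesterville, ON")
)

# Print all columns of the tibble, however wide the result is
print(passengers, width = Inf)

# One line per variable: name, type, and the first few values
glimpse(passengers)
```

The glimpse() output grows downward rather than sideways, so it may be easier to read when the data has many columns.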