Chapter 5 First steps with data: descriptive analysis

When you start working with a new dataset, the first task is to do some descriptive analysis. It serves two purposes: to check that the data was loaded correctly, and to find out whether it contains the information you need for the analysis.

5.1 Basic data description

The first steps usually address questions about the number of rows and columns, the names of the variables, and basic summary information. For this, one can use the basic data frame syntax (see Section 2.4).

The number of rows and columns can be queried with dim(), nrow() and ncol(). Let’s demonstrate this with the Titanic data:

library(tidyverse)  # provides read_delim(), %>% and sample_n()
titanic <- read_delim("../data/titanic.csv.bz2")
dim(titanic)  # rows, columns

## [1] 1309   14

nrow(titanic)  # rows

## [1] 1309

ncol(titanic)  # columns

## [1] 14

So the dataset contains 1309 rows and 14 columns.

For quick interactive analysis, dim() may be best, as a single command gives both the number of rows and the number of columns. But when you need these figures in code, nrow() or ncol() may be better, as each returns a single number.
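As a minimal sketch of what "using these figures in code" can look like, the example below computes the share of missing values in one column. The data frame `passengers` and its values are made up for illustration and are not part of the Titanic data.

```r
# Toy data frame standing in for a freshly loaded dataset (made-up values)
passengers <- data.frame(
   name = c("A", "B", "C", "D"),
   age = c(29, NA, 41, 5),
   survived = c(1, 0, 0, 1)
)

n <- nrow(passengers)  # a single number, easy to use in arithmetic
k <- ncol(passengers)

# share of rows where age is missing
missingShare <- sum(is.na(passengers$age)) / n

cat("rows:", n, "columns:", k, "\n")
cat("share of missing age:", missingShare, "\n")
```

The same numbers are available as dim(passengers)[1] and dim(passengers)[2], but nrow() and ncol() make the intent easier to read.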

Next, we may want to know the variable names. This can be easily done with names():

names(titanic)

##  [1] "pclass"    "survived"  "name"      "sex"       "age"       "sibsp"    
##  [7] "parch"     "ticket"    "fare"      "cabin"     "embarked"  "boat"     
## [13] "body"      "home.dest"
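Because names() returns an ordinary character vector, it can also be used to check whether the data contains the variables you need. The sketch below assumes a hypothetical list of required variables, `needed`, and a small made-up data frame that only borrows a few column names from the output above.

```r
# Toy data frame; column names mimic a few of those above, values are made up
passengers <- data.frame(
   pclass = c(1, 3),
   survived = c(1, 0),
   age = c(29, 27)
)

# variables the planned analysis needs (hypothetical list)
needed <- c("survived", "age", "fare")

# which of the needed variables are present?
present <- needed %in% names(passengers)
present

# which ones are absent?
absent <- setdiff(needed, names(passengers))
absent
```

Here `absent` contains "fare", as the toy data frame has no such column.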

But sometimes it is more useful not just to look at the names but also to print a few lines of data. Here head(), tail(), and sample_n() come in handy:

titanic %>%
   head(2)  # first few lines

## # A tibble: 2 × 14
##   pclass survived name       sex      age sibsp parch ticket
##    <dbl>    <dbl> <chr>      <chr>  <dbl> <dbl> <dbl> <chr> 
## 1      1        1 Allen, Mi… fema… 29         0     0 24160 
## 2      1        1 Allison, … male   0.917     1     2 113781
## # ℹ 6 more variables: fare <dbl>, cabin <chr>,
## #   embarked <chr>, boat <chr>, body <dbl>, home.dest <chr>

titanic %>%
   tail(2)  # last few lines

## # A tibble: 2 × 14
##   pclass survived name  sex     age sibsp parch ticket  fare
##    <dbl>    <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl>
## 1      3        0 Zaka… male     27     0     0 2670    7.22
## 2      3        0 Zimm… male     29     0     0 315082  7.88
## # ℹ 5 more variables: cabin <chr>, embarked <chr>,
## #   boat <chr>, body <dbl>, home.dest <chr>

titanic %>%
   sample_n(2)  # random few lines

## # A tibble: 2 × 14
##   pclass survived name  sex     age sibsp parch ticket  fare
##    <dbl>    <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl>
## 1      3        0 Moor… male     NA     0     0 A4. 5…  8.05
## 2      3        0 Ande… fema…    11     4     2 347082 31.3 
## # ℹ 5 more variables: cabin <chr>, embarked <chr>,
## #   boat <chr>, body <dbl>, home.dest <chr>
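Note that sample_n() picks different rows on every run. The following is a small sketch of how to make the draw reproducible with set.seed(). It uses slice_sample(), which the dplyr documentation (version 1.0.0 and later) recommends in place of the superseded sample_n(). The data frame here is a made-up stand-in, not the Titanic data.

```r
library(dplyr)

# Toy data frame (made-up values)
passengers <- data.frame(
   id = 1:10,
   age = c(29, 1, 2, 30, 25, 48, 63, 39, 53, 71)
)

set.seed(1)  # fix the random number generator state
s1 <- passengers %>%
   slice_sample(n = 2)  # two random rows

set.seed(1)  # same seed again
s2 <- passengers %>%
   slice_sample(n = 2)

identical(s1, s2)  # the same rows are drawn both times
```

Without the set.seed() calls, the two samples would typically differ.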

The exact output depends on the type of data frame: if the data is loaded or manipulated through tidyverse functions (here readr's read_delim() and dplyr), the results are data frames of the tibble flavor. These are printed in a more compact way, which unfortunately leaves out some of the columns and shortens others. If you prefer the full output, you can force the results into base-R data frames with as.data.frame() before printing the lines:

titanic %>%
   as.data.frame() %>%
   head(2)

##   pclass survived                           name    sex     age sibsp parch
## 1      1        1  Allen, Miss. Elisabeth Walton female 29.0000     0     0
## 2      1        1 Allison, Master. Hudson Trevor   male  0.9167     1     2
##   ticket     fare   cabin embarked boat body                       home.dest
## 1  24160 211.3375      B5        S    2   NA                    St Louis, MO
## 2 113781 151.5500 C22 C26        S   11   NA Montreal, PQ / Chesterville, ON

Now the output includes all columns and full-length fields, but it is very wide and is typically wrapped into multiple lines.
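If the base-R printout is too unwieldy, there are other ways to see all variables while staying with tibbles. The sketch below shows two of them, glimpse() and print() with width = Inf. The two rows are copied from the output above, restricted to four columns to keep the example short; the name `passengers` is just a placeholder.

```r
library(dplyr)

# Two rows copied from the output above, four columns only
passengers <- tibble(
   name = c("Allen, Miss. Elisabeth Walton",
            "Allison, Master. Hudson Trevor"),
   sex = c("female", "male"),
   age = c(29, 0.9167),
   home.dest = c("St Louis, MO",
                 "Montreal, PQ / Chesterville, ON")
)

# one line per variable: name, type, and the first few values
passengers %>%
   glimpse()

# tibble-style printing, but without dropping any columns
passengers %>%
   print(width = Inf)
```

glimpse() turns the table on its side, so it stays readable even with many columns. print(width = Inf) keeps the usual layout but, like the base-R printout, may wrap across several lines.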