15 Summarizing Data

library(tidyverse) # Load tidyverse packages
library(palmerpenguins) # Load penguins data

Once you have wrangled and subset your data, you may want to compute summary information about that data. There are several quick and flexible ways to do so!

15.1 Table and Count

R has many tools to create quick and easy summaries and descriptive statistics for your data. You saw previously how n_distinct() could be used to get the number of distinct values for a variable. If you instead want the number of instances of each of those unique values, you can use table() (for a table output) or count() (for a df/tibble output).

penguins %>%
  pull(species) %>%
  table()
#> .
#>    Adelie Chinstrap    Gentoo 
#>       152        68       124
penguins %>%
  count(species)
#> # A tibble: 3 × 2
#>   species       n
#>   <fct>     <int>
#> 1 Adelie      152
#> 2 Chinstrap    68
#> 3 Gentoo      124

15.1.1 Proportions

If you wanted proportions instead of counts, there are several ways to get that and which method depends on the type of object you are working with (table vs df).

For tables, you simply add prop.table() to your pipe chain.

penguins %>%
  pull(species) %>%
  table() %>%
  prop.table()
#> .
#>    Adelie Chinstrap    Gentoo 
#> 0.4418605 0.1976744 0.3604651

For dfs, you have to manually compute the proportions and add them as a new variable.

penguins %>%
  count(species) %>%
  mutate(proportion = n / length(penguins$species))
#> # A tibble: 3 × 3
#>   species       n proportion
#>   <fct>     <int>      <dbl>
#> 1 Adelie      152      0.442
#> 2 Chinstrap    68      0.198
#> 3 Gentoo      124      0.360

The mutate call here is manually computing the proportion for each unique value of species. n is a column name generated automatically by count(). This is taking the number of entries of each unique value and dividing by the total number of ALL entries (which is why the original dataframe’s species must be used in the length() call!).

When computing proportions, it’s always important you check to make sure you are doing things correctly. If so, they should sum to 1:

penguins %>%
  pull(species) %>%
  table() %>%
  prop.table() %>%
  sum()
#> [1] 1
penguins %>%
  count(species) %>%
  mutate(proportion = n / length(penguins$species)) %>%
  pull(proportion) %>%
  sum()
#> [1] 1

15.2 Summary

If you want a snapshot of your data, you can use summary() for a quick quantitative summary of each variable. summary() works on a number of different data objects in R. When applied to a dataframe, summary() will give you summary statistics on each variable. For categorical variables, it will give a count of each value. For continuous variables, it will give basic summary statistics.

summary(penguins)
#>       species          island    bill_length_mm 
#>  Adelie   :152   Biscoe   :168   Min.   :32.10  
#>  Chinstrap: 68   Dream    :124   1st Qu.:39.23  
#>  Gentoo   :124   Torgersen: 52   Median :44.45  
#>                                  Mean   :43.92  
#>                                  3rd Qu.:48.50  
#>                                  Max.   :59.60  
#>                                  NA's   :2      
#>  bill_depth_mm   flipper_length_mm  body_mass_g  
#>  Min.   :13.10   Min.   :172.0     Min.   :2700  
#>  1st Qu.:15.60   1st Qu.:190.0     1st Qu.:3550  
#>  Median :17.30   Median :197.0     Median :4050  
#>  Mean   :17.15   Mean   :200.9     Mean   :4202  
#>  3rd Qu.:18.70   3rd Qu.:213.0     3rd Qu.:4750  
#>  Max.   :21.50   Max.   :231.0     Max.   :6300  
#>  NA's   :2       NA's   :2         NA's   :2     
#>      sex           year     
#>  female:165   Min.   :2007  
#>  male  :168   1st Qu.:2007  
#>  NA's  : 11   Median :2008  
#>               Mean   :2008  
#>               3rd Qu.:2009  
#>               Max.   :2009  
#>

15.3 Summarize

This works, but is a little messy. You will likely want to be a little more specific with what you want. To get specific summaries of your data, use summarize() or summarise() (they are the equivalent, just depends on how you like to spell it). summarize() has a similar format to mutate(), in that you define new variables and how their values are computed. What summarize() does is return a new dataframe with one column for each name passed to it, and the value of that column will be the result of the R expression that name is equal to. Consider the example below that uses a subset of the penguins data:

penguins_example
#> # A tibble: 6 × 3
#>   species    mass flp_mm
#>   <fct>     <int>  <int>
#> 1 Adelie     3750    181
#> 2 Adelie     3800    186
#> 3 Chinstrap  3500    192
#> 4 Chinstrap  3900    196
#> 5 Gentoo     4500    211
#> 6 Gentoo     5700    230

penguins_example %>%
  summarize(flipper_m = mean(flp_mm),
            mass_m = mean(mass),
            ratio = mean(mass / flp_mm))
#> # A tibble: 1 × 3
#>   flipper_m mass_m ratio
#>       <dbl>  <dbl> <dbl>
#> 1      199.  4192.  20.9

species	mass	flp_mm
Adelie	3750	181
Adelie	3800	186
Chinstrap	3500	192
Chinstrap	3900	196
Gentoo	4500	211
Gentoo	5700	230

test

mass_m	flipper_m	ratio
4191.667	199.3333	20.89751

Above, the name of one column is flipper_m, and its value is the mean of the flp_mm variable. Likewise, the mass_m column is the mean of the mass variable. The third variable, ratio, is mean(mass / flp_mm). mass and flp_mm are vectors a quotient was computed for, so the quotient of the first value of each is found, then the second… ith, then the mean of that new set of numbers is taken. It should come as no surprise then that summary is designed to work with summary functions – those that output a single value.

To verify if that this is true:

firstCol = c(181, 186, 192, 196, 211, 230)
mean(firstCol)
#> [1] 199.3333
secondCol = c(3750, 3800, 3500, 3900, 4500, 5700)
mean(secondCol)
#> [1] 4191.667
thirdCol = c(3750/181, 3800/186, 3500/192, 3900/196,
             4500/211, 5700/230)
mean(thirdCol)
#> [1] 20.89751

15.4 Grouping Data

The power of summarize() comes from pairing it with group_by(). group_by() organizes your data into subgroups based on shared values. Compare the output of the following:

penguins %>%
  glimpse()
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, …
#> $ island            <fct> Torgersen, Torgersen, Torgersen,…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3…
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6…
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650…
#> $ sex               <fct> male, female, female, NA, female…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 20…
penguins %>%
  class()
#> [1] "tbl_df"     "tbl"        "data.frame"

penguins %>%
  group_by(species) %>%
  glimpse()
#> Rows: 344
#> Columns: 8
#> Groups: species [3]
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, …
#> $ island            <fct> Torgersen, Torgersen, Torgersen,…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3…
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6…
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650…
#> $ sex               <fct> male, female, female, NA, female…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 20…
penguins %>%
  group_by(species) %>%
  class()
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

There are two things to note here. First, these two are slightly different classes. Applying group_by() converts your object to a grouped df/tibble, which basically just denotes that the data is organized into subgroups. Second, the additional line in the second output: Groups: species [3], denotes what that grouping is. There are 3 unique species in this dataset, so this line says each species value is separated into its own group and there are 3 of those groups. A df can be grouped by a single or multiple variables.

15.5 Grouped Summaries

group_by() allows you to perform operations group-wise and helps unlock to true power of summarize(). It tells R that you want to analyze your data separately according to the different levels of some grouping variable that you specify. The following example will again use a subset of the penguins data:

penguins_example %>%
  group_by(species) %>%
  summarize(m = mean(flp_mm))

#> # A tibble: 3 × 2
#>   species   Flp_m
#>   <fct>     <dbl>
#> 1 Adelie     184.
#> 2 Chinstrap  194 
#> 3 Gentoo     220.

species	flp
Adelie	181
Adelie	186
Chinstrap	192
Chinstrap	196
Gentoo	211
Gentoo	230

test

species	flp
Adelie	181
Adelie	186
Adelie	195
species______flp
Chinstrap	192
Chinstrap	196
Chinstrap	193
species______flp
Gentoo	211
Gentoo	230

test

species	flp_m
Adelie	183.5
Chinstrap	194.0
Gentoo	220.5

Put a different way, when summarize() gets passed a grouped df, it will:

Treat all groups of data as though they are a distinct dataset
Apply the code to each group individually, resulting in separate summary statistics for each
Combine the results into a new data frame.

You see can this process illustrated in the figure above.

Note: When you are not using summarize(), it is very important to remember to ungroup() your data when you are finished. Otherwise, subsequent functions will be unintentionally applied to individual groups rather than the entire dataset! This is not relevant when using summarize() because the resulting output will be a new dataframe.

summarize() has some handy functions it easily works with. For example n() will give you the number of values in the vector. This is particularly useful to find the number of observations in different groups:

penguins %>%
  group_by(island, species) %>%
  summarize(n = n())
#> `summarise()` has grouped output by 'island'. You can
#> override using the `.groups` argument.
#> # A tibble: 5 × 3
#> # Groups:   island [3]
#>   island    species       n
#>   <fct>     <fct>     <int>
#> 1 Biscoe    Adelie       44
#> 2 Biscoe    Gentoo      124
#> 3 Dream     Adelie       56
#> 4 Dream     Chinstrap    68
#> 5 Torgersen Adelie       52

This first groups by island then species within island, and finds the number of observations for each. From the output, you can see that not all islands have observations from each species. In fact, some islands only contain obsercvations from a single species!

15.5.1 Using `across()` with `summarize()`

In lieu of manual specification, you can use across() within a summarize() call and make use of all the helper functions introduced previously. This allows you to perform computations over several columns at once! For example, if you wanted to get grouped means on all the bill related variables, instead of calling mean(x) on each individually, like this:

penguins %>%
  drop_na() %>%
  group_by(species) %>%
  summarize(bill_length_mean = mean(bill_length_mm),
            bill_depth_mean = mean(bill_depth_mm))
#> # A tibble: 3 × 3
#>   species   bill_length_mean bill_depth_mean
#>   <fct>                <dbl>           <dbl>
#> 1 Adelie                38.8            18.3
#> 2 Chinstrap             48.8            18.4
#> 3 Gentoo                47.6            15.0

you could do the following:

penguins %>%
  group_by(species) %>%
  summarize(across(starts_with("bill"),
                   mean, na.rm = TRUE))
#> Warning: There was 1 warning in `summarize()`.
#> ℹ In argument: `across(starts_with("bill"), mean, na.rm =
#>   TRUE)`.
#> ℹ In group 1: `species = Adelie`.
#> Caused by warning:
#> ! The `...` argument of `across()` is deprecated as of
#>   dplyr 1.1.0.
#> Supply arguments directly to `.fns` through an anonymous
#> function instead.
#> 
#>   # Previously
#>   across(a:b, mean, na.rm = TRUE)
#> 
#>   # Now
#>   across(a:b, \(x) mean(x, na.rm = TRUE))
#> # A tibble: 3 × 3
#>   species   bill_length_mm bill_depth_mm
#>   <fct>              <dbl>         <dbl>
#> 1 Adelie              38.8          18.3
#> 2 Chinstrap           48.8          18.4
#> 3 Gentoo              47.5          15.0
# penguins %>%
#   group_by(species) %>%
#   summarize(across(starts_with("bill"), ~ mean(.x, na.rm = TRUE)))

# ^Long form of the same thing.

Note: Instead of using drop_na(), the na.rm argument was used. These both accomplish the same thing.

You can perform multiple summary functions at the same time by passing them as a list rather than just the function name:

penguins %>%
  group_by(island) %>%
  summarize(across(starts_with("bill"), 
                   list(mean = mean, sd = sd), na.rm = TRUE))
#> # A tibble: 3 × 5
#>   island    bill_length_mm_mean bill_length_mm_sd
#>   <fct>                   <dbl>             <dbl>
#> 1 Biscoe                   45.3              4.77
#> 2 Dream                    44.2              5.95
#> 3 Torgersen                39.0              3.03
#> # ℹ 2 more variables: bill_depth_mm_mean <dbl>,
#> #   bill_depth_mm_sd <dbl>

As another example, say you wanted to find the number of unique levels for the different factors in your data. This could be done on the entire dataset:

penguins %>%
  drop_na() %>%
  summarize(across(where(is.factor), n_distinct))
#> # A tibble: 1 × 3
#>   species island   sex
#>     <int>  <int> <int>
#> 1       3      3     2

Or by a grouping variable:

penguins %>% 
  drop_na() %>%
  group_by(species) %>% 
  summarize(across(where(is.factor), n_distinct))
#> # A tibble: 3 × 3
#>   species   island   sex
#>   <fct>      <int> <int>
#> 1 Adelie         3     2
#> 2 Chinstrap      1     2
#> 3 Gentoo         1     2

The combination of group_by(), summarize(), and across(), allow for some quick and powerful code that is relatively short. In just a few lines above, you were able to get some very specific and nuanced information about this dataset!

15.6 Extra Resources

dplyr documentation for row-wise and column-wise operations
across documentation
data wrangling cheatsheet

14 Advanced Wrangling

16 Text Cleaning