1 Introduction

  • Some environments, such as Microsoft Azure, do not support the default viewer mode.
  • Consequently, using packages like gtsummary becomes impractical.
  • It is therefore preferable to write user-defined functions for data summarisation.
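library(palmerpenguins) # provides the penguins dataset (344 rows, 8 columns)
library(dplyr)          # provides %>% and the verbs used in the functions below
penguins = penguins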

1.1 Structure of the dataframe

skim_data <- function(df, vars = NULL) {
  # Default to every column when no subset is requested
  if (is.null(vars)) vars <- names(df)
  # Class of each column (e.g. "factor", "numeric")
  variable_type <- sapply(vars,
                          function(x) is(df[, x][[1]])[1])
  # Number of missing values per column
  missing_count <- sapply(vars,
                          function(x) sum(!complete.cases(df[, x])))
  # Number of distinct values per column (NA counts as a value)
  unique_count <- sapply(vars,
                         function(x) dplyr::n_distinct(df[, x]))
  data_count <- nrow(dplyr::as_tibble(df))
  # First observation of each column, shown as an example value
  Example <- sapply(vars,
                    function(x) (df[1, x]))
  dplyr::tibble(variables = vars, types = variable_type,
                example = Example,
                missing_count = missing_count,
                missing_percent = (missing_count / data_count) * 100,
                unique_count = unique_count,
                total_data = data_count - missing_count) # non-missing rows
}
  • An example: Assess the structure of penguins data
skim_data(penguins) |> knitr::kable(caption = "Structure of the penguin species dataset")
Table 1.1: Structure of the penguin species dataset
variables          types    example  missing_count  missing_percent  unique_count  total_data
species            factor   1        0              0.0000000        3             344
island             factor   3        0              0.0000000        3             344
bill_length_mm     numeric  39.1     2              0.5813953        165           342
bill_depth_mm      numeric  18.7     2              0.5813953        81            342
flipper_length_mm  integer  181      2              0.5813953        56            342
body_mass_g        integer  3750     2              0.5813953        95            342
sex                factor   2        11             3.1976744        3             333
year               integer  2007     0              0.0000000        3             344

2 Summary tables for numeric variables

explore_numeric <- function(df, ...) {
  df %>%
    # one summary row: every statistic for every numeric column
    dplyr::summarise(dplyr::across(
      .cols = where(is.numeric), # checks if a variable is numeric
      .fns = list(Min = min, Max = max, Median = median, Mean = mean, SD = sd), na.rm = TRUE,
      .names = "{col}_{fn}"
    ))
}
  • An example: Summarise penguins data.frame
(table1 = explore_numeric(penguins))
table1 |> knitr::kable(caption = 'Summary statistics for numerical variables in a DF for penguin species')
Table 2.1: Summary statistics for numerical variables in a DF for penguin species
bill_length_mm_Min bill_length_mm_Max bill_length_mm_Median bill_length_mm_Mean bill_length_mm_SD bill_depth_mm_Min bill_depth_mm_Max bill_depth_mm_Median bill_depth_mm_Mean bill_depth_mm_SD flipper_length_mm_Min flipper_length_mm_Max flipper_length_mm_Median flipper_length_mm_Mean flipper_length_mm_SD body_mass_g_Min body_mass_g_Max body_mass_g_Median body_mass_g_Mean body_mass_g_SD year_Min year_Max year_Median year_Mean year_SD
32.1 59.6 44.45 43.92193 5.459584 13.1 21.5 17.3 17.15117 1.974793 172 231 197 200.9152 14.06171 2700 6300 4050 4201.754 801.9545 2007 2009 2008 2008.029 0.8183559
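A one-row table with 25 columns is hard to read in a console. As a rough sketch (not part of the original functions), a summary of this shape could be reshaped so that each variable gets its own row. The example uses the built-in iris data as a stand-in, and the object names wide and long are placeholders. It assumes every column name follows the `variable_Statistic` pattern produced by `.names` above.

```r
library(dplyr)
library(tidyr)

# Stand-in for the one-row output of a function like explore_numeric()
wide <- iris %>%
  summarise(across(
    where(is.numeric),
    list(Min  = ~ min(.x, na.rm = TRUE),
         Max  = ~ max(.x, na.rm = TRUE),
         Mean = ~ mean(.x, na.rm = TRUE)),
    .names = "{.col}_{.fn}"
  ))

# Split each column name at its last underscore: the part before it
# becomes `variable`, the part after it becomes a statistic column
long <- wide %>%
  pivot_longer(
    cols = everything(),
    names_to = c("variable", ".value"),
    names_pattern = "^(.*)_([^_]+)$"
  )

long
```

The result has one row per numeric variable and one column per statistic, which prints far more compactly through knitr::kable().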

3 Summary tables for categorical variables

explore_factors <- function(df, ...){
  df %>%
    dplyr::select(...) %>%                    # keep only the requested variables
    tidyr::gather(., "variable", "variable_level") %>%
    dplyr::count(variable, variable_level) %>%
    dplyr::group_by(variable) %>%
    dplyr::mutate(proportion = round(prop.table(n)*(100), digits=2))%>%
    dplyr::mutate(proportion_count = paste(n,"(",proportion,"%)")) %>%
    dplyr::arrange(dplyr::desc(n),.by_group = TRUE)%>%
    dplyr::rename("frequency" = "n")
}
  • An example: Summarise penguins data.frame
(table2 = explore_factors(penguins, species, island, sex))
table2 |> knitr::kable(caption = 'Summary statistics for factor variables in a DF for penguin species', align = "c")
Table 3.1: Summary statistics for factor variables in a DF for penguin species

4 Combine the two summary tables

  • knitr::kable() is a convenient way to display tables like these two on a platform such as Microsoft Azure.
knitr::kable(
  list(table2, table1),
  caption = 'Summary statistics for penguins DF',
  booktabs = TRUE, valign = 't'
)
Table 4.1: Summary statistics for penguins DF
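As a rough illustration of why kable() suits environments without a viewer, the sketch below prints a small table as plain text. It uses the built-in mtcars data rather than the penguin tables, and the object names demo and tbl are placeholders.

```r
library(knitr)

# Small built-in data frame used purely for illustration
demo <- head(mtcars[, c("mpg", "cyl", "hp")], 3)

# format = "pipe" yields a markdown pipe table; format = "simple" yields a
# Pandoc simple table. Both are plain character output, so they print in
# any console or log without a viewer pane.
tbl <- kable(demo, format = "pipe", caption = "First rows of mtcars")
print(tbl)
```

Because the result is ordinary text, it can also be written to a file or a log with writeLines(tbl).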

5 Other functions

  • Sometimes we need to determine the mode, for example to find the most common International Statistical Classification of Diseases, 10th Revision (ICD-10) codes associated with a patient.
    • To accomplish this, we need to calculate the mode of the variable.
  • Regrettably, base R has no built-in function for the statistical mode (mode() returns an object's storage mode instead).
  • Therefore, we write our own function to calculate it.
getmode <- function(v) {
  uniqv <- unique(v)
  tab <- tabulate(match(v, uniqv)) # count how often each unique value occurs
  uniqv[tab == max(tab)]           # return every value tied for the highest count
}
  • An example: What are the most common Petal.Length and Sepal.Length values for the different species?
iris = iris
(mode_example = iris %>% 
  group_by(Species) %>% 
  summarise(sepal_length_mode = getmode(Sepal.Length), petal_length_mode = getmode(Petal.Length)) %>% 
  knitr::kable(caption = "Example of mode", align = "c"))
Table 5.1: Example of mode
Species     sepal_length_mode  petal_length_mode
setosa      5.1                1.4
setosa      5.0                1.5
versicolor  5.5                4.5
versicolor  5.7                4.5
versicolor  5.6                4.5
virginica   6.3                5.1
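The ICD-10 use case mentioned earlier can be sketched the same way. The codes below are made up for a single hypothetical patient, and mode_values is a hypothetical variant of getmode that additionally drops missing values, which is an assumption of this sketch rather than behaviour of the original function.

```r
# Variant of getmode(): ignores NA and returns every value that
# attains the highest count (so ties yield more than one value)
mode_values <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) return(v)      # nothing to count
  uniq <- unique(v)
  counts <- tabulate(match(v, uniq)) # frequency of each unique value
  uniq[counts == max(counts)]
}

# Made-up ICD-10 codes for one hypothetical patient
codes <- c("E11.9", "I10", "E11.9", "J45.909", NA, "I10", "E11.9")
mode_values(codes)             # "E11.9"

# A tie returns both values, which is why Table 5.1 has repeated species
mode_values(c(1, 1, 2, 2, 3))  # 1 2

# Base R's mode() answers a different question: the storage mode
mode(codes)                    # "character"
```

When a single value per group is required, one option is to take the first element of the result, accepting that the choice among tied values is then arbitrary.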

6 R package: summarytools

  • The function summarytools::dfSummary proves to be valuable in performing basic descriptive statistics for both numeric variables and categorical variables.
  • Additionally, it attempts to generate visual representations of the variable distributions, but these plots lack utility.

  • Furthermore, the function also identifies duplicate values and missing values within the dataset.

No need for Viewer mode! 😃 🙌

# create a summary table using dfSummary function
(table_stat = summarytools::dfSummary(penguins))
## Data Frame Summary  
## penguins  
## Dimensions: 344 x 8  
## Duplicates: 0  
## --------------------------------------------------------------------------------------------------------------------
## No   Variable            Stats / Values             Freqs (% of Valid)    Graph                 Valid      Missing  
## ---- ------------------- -------------------------- --------------------- --------------------- ---------- ---------
## 1    species             1. Adelie                  152 (44.2%)           IIIIIIII              344        0        
##      [factor]            2. Chinstrap                68 (19.8%)           III                   (100.0%)   (0.0%)   
##                          3. Gentoo                  124 (36.0%)           IIIIIII                                   
## 2    island              1. Biscoe                  168 (48.8%)           IIIIIIIII             344        0        
##      [factor]            2. Dream                   124 (36.0%)           IIIIIII               (100.0%)   (0.0%)   
##                          3. Torgersen                52 (15.1%)           III                                       
## 3    bill_length_mm      Mean (sd) : 43.9 (5.5)     164 distinct values       .     . :         342        2        
##      [numeric]           min < med < max:                                   . : : : : :         (99.4%)    (0.6%)   
##                          32.1 < 44.5 < 59.6                                 : : : : : :                             
##                          IQR (CV) : 9.3 (0.1)                               : : : : : : .                           
##                                                                           : : : : : : : : .                         
## 4    bill_depth_mm       Mean (sd) : 17.2 (2)       80 distinct values              :           342        2        
##      [numeric]           min < med < max:                                         : :           (99.4%)    (0.6%)   
##                          13.1 < 17.3 < 21.5                                 : . : : : .                             
##                          IQR (CV) : 3.1 (0.1)                             . : : : : : :                             
##                                                                           : : : : : : : . .                         
## 5    flipper_length_mm   Mean (sd) : 200.9 (14.1)   55 distinct values          :               342        2        
##      [integer]           min < med < max:                                     . :               (99.4%)    (0.6%)   
##                          172 < 197 < 231                                      : : :   . .                           
##                          IQR (CV) : 23 (0.1)                                . : : :   : : :                         
##                                                                             : : : : : : : : :                       
## 6    body_mass_g         Mean (sd) : 4201.8 (802)   94 distinct values        :                 342        2        
##      [integer]           min < med < max:                                   . :                 (99.4%)    (0.6%)   
##                          2700 < 4050 < 6300                                 : : : :                                 
##                          IQR (CV) : 1200 (0.2)                              : : : : : .                             
##                                                                           . : : : : : :                             
## 7    sex                 1. female                  165 (49.5%)           IIIIIIIII             333        11       
##      [factor]            2. male                    168 (50.5%)           IIIIIIIIII            (96.8%)    (3.2%)   
## 8    year                Mean (sd) : 2008 (0.8)     2007 : 110 (32.0%)    IIIIII                344        0        
##      [integer]           min < med < max:           2008 : 114 (33.1%)    IIIIII                (100.0%)   (0.0%)   
##                          2007 < 2008 < 2009         2009 : 120 (34.9%)    IIIIII                                    
##                          IQR (CV) : 2 (0)                                                                           
## --------------------------------------------------------------------------------------------------------------------
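Since the text plots add little, one option is to leave them out. The sketch below assumes the graph.col argument of dfSummary and uses the built-in iris data as a stand-in for penguins; compact_summary is a placeholder name.

```r
library(summarytools)

# graph.col = FALSE omits the Graph column from the summary table
compact_summary <- dfSummary(iris, graph.col = FALSE)
print(compact_summary)
```

The remaining columns (variable, statistics, frequencies, valid and missing counts) are printed as before, in a narrower table.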

7 Acknowledgement

  1. Dr. Belay Birlie Yimer, Centre for Epidemiology VS Arthritis, UoM, major contributor in writing the functions skim_data, explore_numeric and explore_factors.

  2. Lana Bojanic, Centre for Mental Health and Safety, UoM, Manchester