R.U.M - Publication Ready Tables
User-defined functions to create summary tables
1 Introduction
- Occasionally, we encounter environments like Microsoft Azure that do not support the default viewer mode.
- Consequently, packages like gtsummary become impractical.
- It is therefore preferable to write user-defined functions for data summarisation.
penguins = palmerpenguins::penguins  # assumes the palmerpenguins package is installed
1.1 Structure of the data frame
skim_data <- function(df, vars = NULL) {
  df <- dplyr::as_tibble(df)
  # Default to every column when no subset is given
  if (is.null(vars)) vars <- names(df)
  # First class of each column, e.g. "factor" or "numeric"
  variable_type <- sapply(vars,
                          function(x) is(df[, x][[1]])[1])
  # Number of missing values per column
  missing_count <- sapply(vars,
                          function(x) sum(!complete.cases(df[, x])))
  # Number of distinct values per column (NA counts as a value)
  unique_count <- sapply(vars,
                         function(x) dplyr::n_distinct(df[, x]))
  data_count <- nrow(df)
  # First observation of each column; factors appear as their integer codes
  Example <- sapply(vars,
                    function(x) (df[1, x]))
  dplyr::tibble(variables = vars, types = variable_type,
                example = Example,
                missing_count = missing_count,
                missing_percent = (missing_count / data_count) * 100,
                unique_count = unique_count,
                total_data = data_count - missing_count)  # non-missing observations
}
- An example: assess the structure of the penguins data
skim_data(penguins) |> knitr::kable(caption = "Structure of the penguin species dataset")
| variables | types | example | missing_count | missing_percent | unique_count | total_data |
|---|---|---|---|---|---|---|
| species | factor | 1 | 0 | 0.0000000 | 3 | 344 |
| island | factor | 3 | 0 | 0.0000000 | 3 | 344 |
| bill_length_mm | numeric | 39.1 | 2 | 0.5813953 | 165 | 342 |
| bill_depth_mm | numeric | 18.7 | 2 | 0.5813953 | 81 | 342 |
| flipper_length_mm | integer | 181 | 2 | 0.5813953 | 56 | 342 |
| body_mass_g | integer | 3750 | 2 | 0.5813953 | 95 | 342 |
| sex | factor | 2 | 11 | 3.1976744 | 3 | 333 |
| year | integer | 2007 | 0 | 0.0000000 | 3 | 344 |
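Where dplyr itself cannot be installed, a similar overview can be assembled in base R. The following is a minimal sketch rather than one of the original functions: `skim_base` is a hypothetical name, and the built-in `airquality` data stands in for `penguins`.

```r
# Hypothetical base-R counterpart to skim_data(); for illustration only
skim_base <- function(df, vars = names(df)) {
  n_rows <- nrow(df)
  # Missing values per column
  missing_count <- vapply(vars, function(x) sum(is.na(df[[x]])), integer(1))
  data.frame(
    variables        = vars,
    types            = vapply(vars, function(x) class(df[[x]])[1], character(1)),
    missing_count    = missing_count,
    missing_percent  = round(missing_count / n_rows * 100, 2),
    # unique() keeps NA, so NA counts as one distinct value
    unique_count     = vapply(vars, function(x) length(unique(df[[x]])), integer(1)),
    total_data       = n_rows - missing_count,
    row.names        = NULL,
    stringsAsFactors = FALSE
  )
}

airquality_overview <- skim_base(airquality)
airquality_overview
```

The result is an ordinary data frame, so it can be passed to `knitr::kable()` in the same way as the output of `skim_data()`.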
2 Summary tables for numeric variables
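explore_numeric <- function(df, ...) {  # `...` is accepted but not used
  df <- dplyr::as_tibble(df)
  df |>
    dplyr::summarise(dplyr::across(
      .cols = where(is.numeric),  # keep only the numeric columns
      # na.rm = TRUE sits inside each function; passing it through across() is deprecated
      .fns = list(
        Min    = \(x) min(x, na.rm = TRUE),
        Max    = \(x) max(x, na.rm = TRUE),
        Median = \(x) median(x, na.rm = TRUE),
        Mean   = \(x) mean(x, na.rm = TRUE),
        SD     = \(x) sd(x, na.rm = TRUE)
      ),
      .names = "{col}_{fn}"
    ))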
}
- An example: summarise the penguins data frame
(table1 = explore_numeric(penguins))
table1 |> knitr::kable(caption = 'Summary statistics for numerical variables in a DF for penguin species')
| bill_length_mm_Min | bill_length_mm_Max | bill_length_mm_Median | bill_length_mm_Mean | bill_length_mm_SD | bill_depth_mm_Min | bill_depth_mm_Max | bill_depth_mm_Median | bill_depth_mm_Mean | bill_depth_mm_SD | flipper_length_mm_Min | flipper_length_mm_Max | flipper_length_mm_Median | flipper_length_mm_Mean | flipper_length_mm_SD | body_mass_g_Min | body_mass_g_Max | body_mass_g_Median | body_mass_g_Mean | body_mass_g_SD | year_Min | year_Max | year_Median | year_Mean | year_SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32.1 | 59.6 | 44.45 | 43.92193 | 5.459584 | 13.1 | 21.5 | 17.3 | 17.15117 | 1.974793 | 172 | 231 | 197 | 200.9152 | 14.06171 | 2700 | 6300 | 4050 | 4201.754 | 801.9545 | 2007 | 2009 | 2008 | 2008.029 | 0.8183559 |
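A single row with 25 columns is hard to read. One alternative layout puts each variable on its own row with one column per statistic. The sketch below shows that layout in base R on the built-in `iris` data; `summarise_numeric_long` is a hypothetical helper, not one of the original functions.

```r
# Hypothetical helper: one row per numeric variable, one column per statistic
summarise_numeric_long <- function(df) {
  # Keep only the numeric columns
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  stats <- vapply(
    numeric_cols,
    function(x) c(
      Min    = min(x, na.rm = TRUE),
      Max    = max(x, na.rm = TRUE),
      Median = median(x, na.rm = TRUE),
      Mean   = mean(x, na.rm = TRUE),
      SD     = sd(x, na.rm = TRUE)
    ),
    numeric(5)
  )
  # vapply() puts the statistics in rows; transpose so each variable is a row
  data.frame(variable = colnames(stats), t(stats), row.names = NULL)
}

iris_summary <- summarise_numeric_long(iris)
iris_summary
```

The result has four rows and six columns for `iris`, and can be passed to `knitr::kable()` like the other tables.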
3 Summary tables for categorical variables
explore_factors <- function(df, ...) {
  df <- dplyr::as_tibble(df)
  df |>
    dplyr::select(...) |>
    # Stack the selected columns into variable / level pairs;
    # pivot_longer() replaces the superseded gather()
    tidyr::pivot_longer(
      cols = dplyr::everything(),
      names_to = "variable",
      values_to = "variable_level",
      values_transform = list(variable_level = as.character)
    ) |>
    dplyr::count(variable, variable_level) |>
    dplyr::group_by(variable) |>
    dplyr::mutate(proportion = round(prop.table(n) * 100, digits = 2)) |>
    dplyr::mutate(proportion_count = paste(n, "(", proportion, "%)")) |>
    dplyr::arrange(dplyr::desc(n), .by_group = TRUE) |>
    dplyr::rename(frequency = n)
}
- An example: summarise the penguins data frame
(table2 = explore_factors(penguins, species, island, sex))
table2 |> knitr::kable(caption = 'Summary statistics for factor variables in a DF for penguin species', align = "c")
| variable | variable_level | frequency | proportion | proportion_count |
|---|---|---|---|---|
| island | Biscoe | 168 | 48.84 | 168 ( 48.84 %) |
| island | Dream | 124 | 36.05 | 124 ( 36.05 %) |
| island | Torgersen | 52 | 15.12 | 52 ( 15.12 %) |
| sex | male | 168 | 48.84 | 168 ( 48.84 %) |
| sex | female | 165 | 47.97 | 165 ( 47.97 %) |
| sex | NA | 11 | 3.20 | 11 ( 3.2 %) |
| species | Adelie | 152 | 44.19 | 152 ( 44.19 %) |
| species | Gentoo | 124 | 36.05 | 124 ( 36.05 %) |
| species | Chinstrap | 68 | 19.77 | 68 ( 19.77 %) |
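`paste()` separates its arguments with spaces and drops trailing zeros, which is why the last column reads `11 ( 3.2 %)` next to `168 ( 48.84 %)`. If a uniform label is preferred, `sprintf()` is one option. A small sketch, with `format_count_pct` as a hypothetical helper name:

```r
# Hypothetical helper: build "count (percent%)" labels with fixed decimals
format_count_pct <- function(n, pct) {
  # %d needs whole numbers, %.2f keeps two decimals, %% prints a literal percent sign
  sprintf("%d (%.2f%%)", as.integer(n), pct)
}

labels <- format_count_pct(c(168, 165, 11), c(48.84, 47.97, 3.20))
labels
# "168 (48.84%)" "165 (47.97%)" "11 (3.20%)"
```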
4 Combine the two summary tables
- knitr::kable() offers a convenient way to display tables such as these two on a platform like Microsoft Azure.
knitr::kable(
list(table2, table1),
caption = 'Summary statistics for penguins DF',
booktabs = TRUE, valign = 't'
)
5 Other functions
- Sometimes we need the mode of a variable, for example to find the most common International Statistical Classification of Diseases, 10th Revision (ICD-10) codes associated with a patient.
- Base R has no built-in function for the statistical mode; mode() returns an object's storage mode instead.
- A custom function is therefore one solution.
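getmode <- function(v) {
  uniqv <- unique(v)                # distinct values, in order of first appearance
  tab <- tabulate(match(v, uniqv))  # how often each distinct value occurs
  uniqv[tab == max(tab)]            # every value tied for the highest count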
}
- An example: what are the most common Petal.Length and Sepal.Length values for each species?
iris = iris
(mode_example = iris |>
  dplyr::group_by(Species) |>
  # reframe() permits several rows per group, which occurs when modes tie (needs dplyr >= 1.1.0)
  dplyr::reframe(sepal_length_mode = getmode(Sepal.Length), petal_length_mode = getmode(Petal.Length)) |>
  knitr::kable(caption = "Example of mode", align = "c"))
| Species | sepal_length_mode | petal_length_mode |
|---|---|---|
| setosa | 5.1 | 1.4 |
| setosa | 5.0 | 1.5 |
| versicolor | 5.5 | 4.5 |
| versicolor | 5.7 | 4.5 |
| versicolor | 5.6 | 4.5 |
| virginica | 6.3 | 5.1 |
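Because `getmode()` returns every tied value, a group can occupy several rows, as setosa and versicolor do above. Where exactly one value per group is wanted, one possible variant keeps only the first of the tied modes. This is a sketch: `getmode_first` is a hypothetical name, the tie-breaking rule (first value encountered) is an assumption, and the ICD-10 codes in the first call are made-up illustration data.

```r
# Hypothetical variant of getmode(): returns a single value even when modes tie
getmode_first <- function(v) {
  uniqv <- unique(v)
  tab <- tabulate(match(v, uniqv))
  # which.max() returns the position of the first maximum only
  uniqv[which.max(tab)]
}

# Two codes tie with two occurrences each; the first one encountered is kept
first_code <- getmode_first(c("I10", "E11", "I10", "E11", "J45"))
first_code

# One mode per species, using base R instead of dplyr
sepal_modes <- tapply(iris$Sepal.Length, iris$Species, getmode_first)
sepal_modes
```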
6 R package: summarytools
- The function summarytools::dfSummary() provides basic descriptive statistics for both numeric and categorical variables.
- It also attempts to draw the distribution of each variable, but these text plots are of limited use.
- In addition, it reports duplicates and missing values in the dataset.
No need for Viewer mode!
# create a summary table using dfSummary function
(table_stat = summarytools::dfSummary(penguins))
Data Frame Summary: penguins
Dimensions: 344 x 8
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | species [factor] | 1. Adelie; 2. Chinstrap; 3. Gentoo | 152 (44.2%); 68 (19.8%); 124 (36.0%) | [text bar chart] | 344 (100.0%) | 0 (0.0%) |
| 2 | island [factor] | 1. Biscoe; 2. Dream; 3. Torgersen | 168 (48.8%); 124 (36.0%); 52 (15.1%) | [text bar chart] | 344 (100.0%) | 0 (0.0%) |
| 3 | bill_length_mm [numeric] | Mean (sd): 43.9 (5.5); min < med < max: 32.1 < 44.5 < 59.6; IQR (CV): 9.3 (0.1) | 164 distinct values | [text histogram] | 342 (99.4%) | 2 (0.6%) |
| 4 | bill_depth_mm [numeric] | Mean (sd): 17.2 (2); min < med < max: 13.1 < 17.3 < 21.5; IQR (CV): 3.1 (0.1) | 80 distinct values | [text histogram] | 342 (99.4%) | 2 (0.6%) |
| 5 | flipper_length_mm [integer] | Mean (sd): 200.9 (14.1); min < med < max: 172 < 197 < 231; IQR (CV): 23 (0.1) | 55 distinct values | [text histogram] | 342 (99.4%) | 2 (0.6%) |
| 6 | body_mass_g [integer] | Mean (sd): 4201.8 (802); min < med < max: 2700 < 4050 < 6300; IQR (CV): 1200 (0.2) | 94 distinct values | [text histogram] | 342 (99.4%) | 2 (0.6%) |
| 7 | sex [factor] | 1. female; 2. male | 165 (49.5%); 168 (50.5%) | [text bar chart] | 333 (96.8%) | 11 (3.2%) |
| 8 | year [integer] | Mean (sd): 2008 (0.8); min < med < max: 2007 < 2008 < 2009; IQR (CV): 2 (0) | 2007: 110 (32.0%); 2008: 114 (33.1%); 2009: 120 (34.9%) | [text bar chart] | 344 (100.0%) | 0 (0.0%) |
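If the text graphs add nothing, the Graph column can be left out of the summary. The sketch below assumes the summarytools package is installed and relies on the `graph.col` argument of `dfSummary()`; the built-in `iris` data stands in for `penguins`, and `table_stat_no_graph` is a name chosen for illustration.

```r
# Runs only when summarytools is available (assumption: it may not be installed)
if (requireNamespace("summarytools", quietly = TRUE)) {
  # graph.col = FALSE asks dfSummary() to omit the graph column
  table_stat_no_graph <- summarytools::dfSummary(iris, graph.col = FALSE)
  print(table_stat_no_graph)
}
```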
7 Acknowledgement
Dr. Belay Birlie Yimer, Centre for Epidemiology VS Arthritis, UoM, major contributor in writing the functions skim_data, explore_numeric and explore_factors.
Lana Bojanic, Centre for Mental Health and Safety, UoM, Manchester