R.U.M - Publication Ready Tables
User-defined functions to create summary tables
1 Introduction
- Occasionally, we encounter environments like Microsoft Azure that do not support the default viewer mode.
- Consequently, packages like gtsummary become impractical.
- Therefore, it is preferable to develop user-defined functions for data summarisation.
penguins = palmerpenguins::penguins # assumes the palmerpenguins package is installed
1.1 Structure of the dataframe
skim_data <- function(df, vars = NULL) {
  df <- dplyr::as_tibble(df)
  if (is.null(vars) == TRUE) vars <- names(df)
  variable_type <- sapply(vars,
                          function(x) is(df[, x][[1]])[1])
  missing_count <- sapply(vars,
                          function(x) sum(!complete.cases(df[, x])))
  unique_count <- sapply(vars,
                         function(x) dplyr::n_distinct(df[, x]))
  data_count <- nrow(dplyr::as_tibble(df))
  Example <- sapply(vars,
                    function(x) (df[1, x]))
  dplyr::tibble(variables = vars, types = variable_type,
                example = Example,
                missing_count = missing_count,
                missing_percent = (missing_count / data_count) * 100,
                unique_count = unique_count,
                total_data = data_count - missing_count)
}
- An example: Assess the structure of the penguins data
skim_data(penguins) |> knitr::kable(caption = "Structure of the penguin species dataset")
variables | types | example | missing_count | missing_percent | unique_count | total_data |
---|---|---|---|---|---|---|
species | factor | 1 | 0 | 0.0000000 | 3 | 344 |
island | factor | 3 | 0 | 0.0000000 | 3 | 344 |
bill_length_mm | numeric | 39.1 | 2 | 0.5813953 | 165 | 342 |
bill_depth_mm | numeric | 18.7 | 2 | 0.5813953 | 81 | 342 |
flipper_length_mm | integer | 181 | 2 | 0.5813953 | 56 | 342 |
body_mass_g | integer | 3750 | 2 | 0.5813953 | 95 | 342 |
sex | factor | 2 | 11 | 3.1976744 | 3 | 333 |
year | integer | 2007 | 0 | 0.0000000 | 3 | 344 |
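In the table above, the example column shows the internal integer codes of the factor columns (1, 3, 2) rather than their labels. The following is a minimal base R sketch of the same kind of structure table that prints the first value as text, so factors appear with their labels. The name `skim_base` is hypothetical and is not one of the original functions; the built-in `airquality` and `iris` datasets are used only so that the block runs without extra packages.

```r
# Hypothetical base R variant of a structure table (not the original skim_data()).
skim_base <- function(df, vars = names(df)) {
  df <- as.data.frame(df)
  n <- nrow(df)
  # number of missing values per variable
  missing <- vapply(vars, function(x) sum(is.na(df[[x]])), integer(1))
  data.frame(
    variables = vars,
    types = vapply(vars, function(x) class(df[[x]])[1], character(1)),
    # as.character() turns a factor into its label instead of its integer code
    example = vapply(vars, function(x) as.character(df[[x]][1]), character(1)),
    missing_count = missing,
    missing_percent = missing / n * 100,
    # NA counts as its own distinct value here
    unique_count = vapply(vars, function(x) length(unique(df[[x]])), integer(1)),
    total_data = n - missing,
    row.names = NULL
  )
}

skim_base(airquality)
skim_base(iris, vars = c("Species", "Sepal.Length"))
```

The `vars` argument can be used to restrict the table to a subset of columns, as in the second call.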
2 Summary tables for numeric variables
explore_numeric <- function(df, ...) {
  df <- dplyr::as_tibble(df)
  df %>%
    summarise(across(
      .cols = where(is.numeric), # checks if a variable is numeric
      .fns = list(Min = min, Max = max, Median = median, Mean = mean, SD = sd), na.rm = TRUE,
      .names = "{col}_{fn}"
    ))
}
- An example: Summarise penguins data.frame
(table1 = explore_numeric(penguins))
table1 |> knitr::kable(caption = 'Summary statistics for numerical variables in a DF for penguin species')
bill_length_mm_Min | bill_length_mm_Max | bill_length_mm_Median | bill_length_mm_Mean | bill_length_mm_SD | bill_depth_mm_Min | bill_depth_mm_Max | bill_depth_mm_Median | bill_depth_mm_Mean | bill_depth_mm_SD | flipper_length_mm_Min | flipper_length_mm_Max | flipper_length_mm_Median | flipper_length_mm_Mean | flipper_length_mm_SD | body_mass_g_Min | body_mass_g_Max | body_mass_g_Median | body_mass_g_Mean | body_mass_g_SD | year_Min | year_Max | year_Median | year_Mean | year_SD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32.1 | 59.6 | 44.45 | 43.92193 | 5.459584 | 13.1 | 21.5 | 17.3 | 17.15117 | 1.974793 | 172 | 231 | 197 | 200.9152 | 14.06171 | 2700 | 6300 | 4050 | 4201.754 | 801.9545 | 2007 | 2009 | 2008 | 2008.029 | 0.8183559 |
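A single row with 25 columns is hard to read in a narrow output pane. As an illustration, the same five statistics can be arranged with one row per variable. This is a base R sketch under the assumption that a long layout is acceptable; `numeric_summary_long` is a hypothetical helper name, and `iris` is used only to keep the block self-contained.

```r
# Hypothetical helper: one row per numeric variable, one column per statistic.
numeric_summary_long <- function(df) {
  df <- as.data.frame(df)
  num <- df[vapply(df, is.numeric, logical(1))]  # keep numeric columns only
  stats <- vapply(
    num,
    function(x) c(
      Min = min(x, na.rm = TRUE),
      Max = max(x, na.rm = TRUE),
      Median = median(x, na.rm = TRUE),
      Mean = mean(x, na.rm = TRUE),
      SD = sd(x, na.rm = TRUE)
    ),
    numeric(5)
  )
  # stats is a 5 x p matrix; transpose it so that variables become rows
  data.frame(variable = colnames(stats), t(stats), row.names = NULL)
}

numeric_summary_long(iris)
```

The result is an ordinary data frame, so it can be passed to knitr::kable() in the same way as table1.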
3 Summary tables for categorical variables
explore_factors <- function(df, ...){
  df <- dplyr::as_tibble(df)

  df %>%
    dplyr::select(...) %>%
    tidyr::gather(., "variable", "variable_level") %>%
    dplyr::count(variable, variable_level) %>%
    dplyr::group_by(variable) %>%
    dplyr::mutate(proportion = round(prop.table(n)*(100), digits=2)) %>%
    dplyr::mutate(propotion_count = paste(n,"(",proportion,"%)")) %>%
    dplyr::group_by(variable) %>%
    dplyr::arrange(desc(n), .by_group = TRUE) %>%
    dplyr::rename("frequency" = "n")
}
- An example: Summarise penguins data.frame
(table2 = explore_factors(penguins, species, island, sex))
table2 |> knitr::kable(caption = 'Summary statistics for factor variables in a DF for penguin species', align = "c")
variable | variable_level | frequency | proportion | propotion_count |
---|---|---|---|---|
island | Biscoe | 168 | 48.84 | 168 ( 48.84 %) |
island | Dream | 124 | 36.05 | 124 ( 36.05 %) |
island | Torgersen | 52 | 15.12 | 52 ( 15.12 %) |
sex | male | 168 | 48.84 | 168 ( 48.84 %) |
sex | female | 165 | 47.97 | 165 ( 47.97 %) |
sex | NA | 11 | 3.20 | 11 ( 3.2 %) |
species | Adelie | 152 | 44.19 | 152 ( 44.19 %) |
species | Gentoo | 124 | 36.05 | 124 ( 36.05 %) |
species | Chinstrap | 68 | 19.77 | 68 ( 19.77 %) |
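For environments where dplyr and tidyr are also unavailable, a comparable frequency table can be sketched in base R with table(). This is only an illustration: `factor_summary` is a hypothetical name, variables are passed as a character vector rather than unquoted, and the small data frame is made up for the example.

```r
# Hypothetical base R helper (not the original explore_factors()).
factor_summary <- function(df, vars) {
  pieces <- lapply(vars, function(v) {
    counts <- table(df[[v]], useNA = "ifany")  # keep NA as its own level
    out <- data.frame(
      variable = v,
      variable_level = names(counts),
      frequency = as.integer(counts),
      proportion = round(100 * as.numeric(counts) / sum(counts), 2)
    )
    out[order(-out$frequency), ]               # most frequent level first
  })
  do.call(rbind, pieces)
}

# Made-up data, for illustration only
toy <- data.frame(g = c("a", "b", "a", NA), h = c("x", "x", "y", "y"))
factor_summary(toy, c("g", "h"))
```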
4 Combine the two summary tables
- knitr::kable() is useful for conveniently displaying datasets such as these two tables on a platform like Microsoft Azure.
knitr::kable(
  list(table2, table1),
  caption = 'Summary statistics for penguins DF',
  booktabs = TRUE, valign = 't'
)
5 Other functions
- Sometimes we need to determine the mode, for example to find the most common International Statistical Classification of Diseases, 10th Revision (ICD-10) code associated with a patient.
- To accomplish this, we need to calculate the mode of the variable.
- Regrettably, base R has no built-in function for the statistical mode; base::mode() returns the storage mode of an object instead.
- Therefore, we create our own function to calculate the mode.
getmode <- function(v) {
  uniqv <- unique(v)
  tab <- tabulate(match(v, uniqv)) # frequency of each unique value
  uniqv[tab == max(tab)] # returns every value tied for the highest frequency
}
- An example: What is the most common Petal.Length and Sepal.Length for the different species?
iris = iris
(mode_example = iris %>%
  group_by(Species) %>%
  summarise(sepal_length_mode = getmode(Sepal.Length), petal_length_mode = getmode(Petal.Length)) %>%
  kable(caption = "Example of mode", align = "c"))
Species | sepal_length_mode | petal_length_mode |
---|---|---|
setosa | 5.1 | 1.4 |
setosa | 5.0 | 1.5 |
versicolor | 5.5 | 4.5 |
versicolor | 5.7 | 4.5 |
versicolor | 5.6 | 4.5 |
virginica | 6.3 | 5.1 |
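The table above has several rows per species because tied values are all returned as modes. When exactly one value per group is wanted, ties can be broken by first occurrence. The sketch below assumes that this tie-breaking rule is acceptable; `first_mode` is a hypothetical helper, and the ICD-10 codes are made up for illustration.

```r
# Hypothetical helper: returns exactly one mode per input vector.
first_mode <- function(v, na.rm = TRUE) {
  if (na.rm) v <- v[!is.na(v)]
  uniqv <- unique(v)
  tab <- tabulate(match(v, uniqv))  # frequency of each unique value
  uniqv[which.max(tab)]             # which.max() picks the first maximum, so ties go to the earliest value
}

# Made-up ICD-10 codes for one patient, for illustration only
codes <- c("I10", "E11", "I10", "J45", "E11", "I10")
first_mode(codes)

# base::mode() reports the storage mode, not the most frequent value
mode(codes)
```

Here first_mode(codes) picks "I10", which occurs three times, while mode(codes) only reports that the vector is of type character.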
6 R package: summarytools
- The function summarytools::dfSummary is useful for basic descriptive statistics of both numeric and categorical variables.
- It also attempts to draw the distribution of each variable, but these text plots are of limited use.
- Furthermore, the function reports the number of duplicates and the missing values within the dataset.
No need for Viewer mode!
# create a summary table using dfSummary function
(table_stat = dfSummary(penguins))
## Data Frame Summary
## penguins
## Dimensions: 344 x 8
## Duplicates: 0
##
## --------------------------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- ------------------- -------------------------- --------------------- --------------------- ---------- ---------
## 1 species 1. Adelie 152 (44.2%) IIIIIIII 344 0
## [factor] 2. Chinstrap 68 (19.8%) III (100.0%) (0.0%)
## 3. Gentoo 124 (36.0%) IIIIIII
##
## 2 island 1. Biscoe 168 (48.8%) IIIIIIIII 344 0
## [factor] 2. Dream 124 (36.0%) IIIIIII (100.0%) (0.0%)
## 3. Torgersen 52 (15.1%) III
##
## 3 bill_length_mm Mean (sd) : 43.9 (5.5) 164 distinct values . . : 342 2
## [numeric] min < med < max: . : : : : : (99.4%) (0.6%)
## 32.1 < 44.5 < 59.6 : : : : : :
## IQR (CV) : 9.3 (0.1) : : : : : : .
## : : : : : : : : .
##
## 4 bill_depth_mm Mean (sd) : 17.2 (2) 80 distinct values : 342 2
## [numeric] min < med < max: : : (99.4%) (0.6%)
## 13.1 < 17.3 < 21.5 : . : : : .
## IQR (CV) : 3.1 (0.1) . : : : : : :
## : : : : : : : . .
##
## 5 flipper_length_mm Mean (sd) : 200.9 (14.1) 55 distinct values : 342 2
## [integer] min < med < max: . : (99.4%) (0.6%)
## 172 < 197 < 231 : : : . .
## IQR (CV) : 23 (0.1) . : : : : : :
## : : : : : : : : :
##
## 6 body_mass_g Mean (sd) : 4201.8 (802) 94 distinct values : 342 2
## [integer] min < med < max: . : (99.4%) (0.6%)
## 2700 < 4050 < 6300 : : : :
## IQR (CV) : 1200 (0.2) : : : : : .
## . : : : : : :
##
## 7 sex 1. female 165 (49.5%) IIIIIIIII 333 11
## [factor] 2. male 168 (50.5%) IIIIIIIIII (96.8%) (3.2%)
##
## 8 year Mean (sd) : 2008 (0.8) 2007 : 110 (32.0%) IIIIII 344 0
## [integer] min < med < max: 2008 : 114 (33.1%) IIIIII (100.0%) (0.0%)
## 2007 < 2008 < 2009 2009 : 120 (34.9%) IIIIII
## IQR (CV) : 2 (0)
## --------------------------------------------------------------------------------------------------------------------
7 Acknowledgement
Dr. Belay Birlie Yimer, Centre for Epidemiology VS Arthritis, UoM, major contributor in writing the functions skim_data, explore_numeric and explore_factors.
Lana Bojanic, Centre for Mental Health and Safety, UoM, Manchester