[R] How to compute ggplot2 boxplot outliers after Reshape/Melt in LONG/stacked format (in aloop)

Diego Bellavia me@@@du@ @end|ng |rom y@hoo@|t
Tue Feb 22 09:51:29 CET 2022


Here it is a test Dataframe (just a subset of the real one), that is in LONG format (each subject / PID has several repeated measures over time)
dput(head(testdf))
structure(list(pid = c(1, 1, 1, 1, 1, 2), age = c(52, 52, 52, 
52, 52, 76), height = c(175, 175, 175, 175, 175, 164), weight = c(68, 
70, 68, 68, 68, 95.5), bsa = c(1.82549041118621, 1.84811901382349, 
1.82549041118621, 1.82549041118621, 1.82549041118621, 2.01198114104261
), lab_bnp_log = c(5.96100533962327, 5.29330482472449, 5.9597158487934, 
5.76205138278018, 5.03695260241363, 6.85583020611722), lab_tni_log = c(-5.49676830527187, 
-5.36019277026612, -5.80914299031403, -5.5727542122498, -5.59942245933196, 
-4.56594947283481), lab_crp_log = c(NA, 1.50407739677627, 0.955511445027436, 
1.54756250871601, 1.54756250871601, 0.8754687373539), lab_bun = c(57, 
47, 57, 49, 61, 52), lab_creat = c(1.2, 1, 1.3, 1.2, 1.4, 1.5
)), row.names = c(NA, 6L), class = "data.frame"
I am trying to correctly label boxplot outliers using GGPLOT2 package, as a step of my exploratory analysis.
To plot an Histogram and a Boxplot I use a For loop, that starts after having "melted" the dataset
library(ggplot2)
library(reshape2)
library(tidyverse)
library(egg)

myplots <- melt(testdf, id.vars = "pid", na.rm = TRUE)

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

pdf(paste("EXPL_QUANT.pdf",sep=""))
# Start the FOR Loop, where i = Variable name and value = variable value (number)
for(i in 1:length(levels(myplots$variable)))
{
  MyHisto <- ggplot(myplots[myplots$variable == levels(myplots$variable)[i],], aes(x=value)) + 
    geom_histogram(aes(y = ..density..),
                   binwidth = 0.7, # Modify it to have more/less bins in the histo
                   fill = 'yellow',
                   alpha = 0.7,
                   col = 'black') +
    geom_density(colour="#00000040", lwd = 1, fill="lightyellow", alpha=0.5) +
    geom_vline(aes(xintercept = summarised_value, color = stat), 
               size = 1,
               data = . %>% 
                 summarise(mean = mean(value), median = median(value)) %>% 
                 gather ("stat", "summarised_value", mean:median)) +
    scale_color_manual(values = c("blue", "red")) +
    theme_bw() +
    theme(axis.text.x=element_text(size=14), axis.title.x=element_text(size=16),
          axis.text.y=element_text(size=14), axis.title.y=element_text(size=16),
          legend.position = c(.95, .95),
          legend.justification = c(1,1),
          legend.background = element_rect(color = "black")) +
    ggtitle(paste0("Explorative Analysis of ",levels(myplots$variable)[i]))
  
  }
  
  # Plot the BoxPlot
  dat <- myplots[myplots$variable == levels(myplots$variable)[i],]  %>% 
                tibble::rownames_to_column(var="outlier") %>%  
                  mutate(outlier = ifelse(is_outlier(myplots$value[i]), myplots$value[i], as.numeric(NA)))
  dat$outlier[which(is.na(dat$is_outlier))] <- as.numeric(NA)
  MyBox <- ggplot(dat, aes(x = i, y = value)) +
    geom_boxplot(fill = "royalblue"
                 , colour = "black"
                 , outlier.colour = "red" 
                 , outlier.shape = 20 # 
                 , outlier.size = 1
    ) +
    coord_flip() + # Horizontal Boxplot 
    scale_y_continuous(name = levels(myplots$variable)[i]) + # Variable name on the Y Axis 
    scale_x_continuous(name = levels(myplots$variable)[i]) + # Variable name on the X Axis 
    geom_text(aes(label=outlier),na.rm=TRUE,nudge_y=0.05)
 

  egg::ggarrange(MyHisto, MyBox, heights = 2:1) # Combine and Line up Histogram and Boxplot (EGG Package)
  
}

dev.off()The loop complete with no errors and the graphs are programmatically built, I can see the outliers in the boxplots depicted as red dots (that is fine) BUT I cannot see the labels. I believe the problem is that outliers column contains just NAs for each variable, so of course I cannot label anything, and I think the following line

mutate(outlier = ifelse(is_outlier(myplots$value[i]), myplots$value[i], as.numeric(NA))should be modified so to provide the is_outlier() function with a proper vector of values, each grouped by PID and by variable, but, even if I am correct I do not know how to do that. 
Any help in understanding why the loop is not working and how to fix it would be greatly appreciated
Best, 
Diego 

	[[alternative HTML version deleted]]



More information about the R-help mailing list