[R] How to compute ggplot2 boxplot outliers after Reshape/Melt in LONG/stacked format (in aloop)
Diego Bellavia
me@@@du@ @end|ng |rom y@hoo@|t
Tue Feb 22 09:51:29 CET 2022
Here it is a test Dataframe (just a subset of the real one), that is in LONG format (each subject / PID has several repeated measures over time)
dput(head(testdf))
structure(list(pid = c(1, 1, 1, 1, 1, 2), age = c(52, 52, 52,
52, 52, 76), height = c(175, 175, 175, 175, 175, 164), weight = c(68,
70, 68, 68, 68, 95.5), bsa = c(1.82549041118621, 1.84811901382349,
1.82549041118621, 1.82549041118621, 1.82549041118621, 2.01198114104261
), lab_bnp_log = c(5.96100533962327, 5.29330482472449, 5.9597158487934,
5.76205138278018, 5.03695260241363, 6.85583020611722), lab_tni_log = c(-5.49676830527187,
-5.36019277026612, -5.80914299031403, -5.5727542122498, -5.59942245933196,
-4.56594947283481), lab_crp_log = c(NA, 1.50407739677627, 0.955511445027436,
1.54756250871601, 1.54756250871601, 0.8754687373539), lab_bun = c(57,
47, 57, 49, 61, 52), lab_creat = c(1.2, 1, 1.3, 1.2, 1.4, 1.5
)), row.names = c(NA, 6L), class = "data.frame"
I am trying to correctly label boxplot outliers using GGPLOT2 package, as a step of my exploratory analysis.
To plot an Histogram and a Boxplot I use a For loop, that starts after having "melted" the dataset
library(ggplot2)
library(reshape2)
library(tidyverse)
library(egg)
myplots <- melt(testdf, id.vars = "pid", na.rm = TRUE)
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
pdf(paste("EXPL_QUANT.pdf",sep=""))
# Start the FOR Loop, where i = Variable name and value = variable value (number)
for(i in 1:length(levels(myplots$variable)))
{
MyHisto <- ggplot(myplots[myplots$variable == levels(myplots$variable)[i],], aes(x=value)) +
geom_histogram(aes(y = ..density..),
binwidth = 0.7, # Modify it to have more/less bins in the histo
fill = 'yellow',
alpha = 0.7,
col = 'black') +
geom_density(colour="#00000040", lwd = 1, fill="lightyellow", alpha=0.5) +
geom_vline(aes(xintercept = summarised_value, color = stat),
size = 1,
data = . %>%
summarise(mean = mean(value), median = median(value)) %>%
gather ("stat", "summarised_value", mean:median)) +
scale_color_manual(values = c("blue", "red")) +
theme_bw() +
theme(axis.text.x=element_text(size=14), axis.title.x=element_text(size=16),
axis.text.y=element_text(size=14), axis.title.y=element_text(size=16),
legend.position = c(.95, .95),
legend.justification = c(1,1),
legend.background = element_rect(color = "black")) +
ggtitle(paste0("Explorative Analysis of ",levels(myplots$variable)[i]))
}
# Plot the BoxPlot
dat <- myplots[myplots$variable == levels(myplots$variable)[i],] %>%
tibble::rownames_to_column(var="outlier") %>%
mutate(outlier = ifelse(is_outlier(myplots$value[i]), myplots$value[i], as.numeric(NA)))
dat$outlier[which(is.na(dat$is_outlier))] <- as.numeric(NA)
MyBox <- ggplot(dat, aes(x = i, y = value)) +
geom_boxplot(fill = "royalblue"
, colour = "black"
, outlier.colour = "red"
, outlier.shape = 20 #
, outlier.size = 1
) +
coord_flip() + # Horizontal Boxplot
scale_y_continuous(name = levels(myplots$variable)[i]) + # Variable name on the Y Axis
scale_x_continuous(name = levels(myplots$variable)[i]) + # Variable name on the X Axis
geom_text(aes(label=outlier),na.rm=TRUE,nudge_y=0.05)
egg::ggarrange(MyHisto, MyBox, heights = 2:1) # Combine and Line up Histogram and Boxplot (EGG Package)
}
dev.off()The loop complete with no errors and the graphs are programmatically built, I can see the outliers in the boxplots depicted as red dots (that is fine) BUT I cannot see the labels. I believe the problem is that outliers column contains just NAs for each variable, so of course I cannot label anything, and I think the following line
mutate(outlier = ifelse(is_outlier(myplots$value[i]), myplots$value[i], as.numeric(NA))should be modified so to provide the is_outlier() function with a proper vector of values, each grouped by PID and by variable, but, even if I am correct I do not know how to do that.
Any help in understanding why the loop is not working and how to fix it would be greatly appreciated
Best,
Diego
[[alternative HTML version deleted]]
More information about the R-help
mailing list