[R] about ECDF display in ggplot2

Jeff Newmiller jdnewmil using dcn.davis.ca.us
Mon Jul 9 10:41:58 CEST 2018


Thank you for making the effort... but most attachments get stripped on 
the mailing list. Using the reprex package as I suggested and putting the 
result into the email is by far the safest approach. Since I received your 
email directly, I did get the attachments. Below is my reproducible 
example... to serve as an example for how you can get help from everyone 
on the list rather than just the few you are responding to.

My summary comment is that you have to decide whether the LENGTH values 
greater than 500 are relevant... and if they are not, you REALLY SHOULD 
create a data set that excludes them before plotting. Then you won't have 
to create "fake" axes, and you won't get ggplot warnings.

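To see why the filtered and unfiltered ECDFs look different, here is a 
minimal sketch using a made-up vector x (an assumption for illustration, 
not the LENGTH data). Dropping the large values before computing the ECDF 
shrinks the denominator, so every remaining point gets a larger cumulative 
fraction.

```r
# Toy values, purely illustrative (not taken from the LENGTH data)
x <- c(10, 20, 300, 400, 1000, 50000)

# ECDF of all values: 4 of the 6 values are <= 400
F_all <- ecdf(x)
F_all(400)

# ECDF of only the values below 500: all 4 remaining values are <= 400
F_sub <- ecdf(x[x < 500])
F_sub(400)
```

Setting xlim on a base plot of F_all only changes the window; it does not 
turn F_all into F_sub.
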
Note: The reprex package allows you to confirm that the example is in fact 
reproducible, so technically it is not necessary to include the plot 
images in the question. However, reprex used to conveniently support 
putting the images on the imgur website, and for some reason it no longer 
does that, so just run the example interactively to see the graphs.

#######
############################################################
############################################################

library("ggplot2")

# "file" is the name of a very fundamental function in base R. Re-using
# that name for a data value is at best confusing to anyone reading your
# code and at worst will prevent you from using that function.
#file <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)

# Instead of giving us a file, keep the data within the example
# DF <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)
# set.seed( 42 )
# also shrink the size of the data for the example... we almost
# never need all of it
# dput( DF[ sample( seq.int( nrow( DF ) ), size = 200 ), , drop=FALSE ] )
DF <- structure(list(LENGTH = c(6813L, 56035L, 123997L, 281L, 851L, 1072L,
           72196L, 21L, 304L, 110L, 198L, 5922L, 283L, 199348L, 109L,
           3317104L, 106L, 37642146L, 82641L, 20L, 125911L, 354L, 11625388L,
           330L, 9811711L, 18L, 35L, 39897L, 27L, 277L, 79L, 2657L, 17L,
           26L, 23L, 248L, 3634L, 21L, 324L, 206L, 328L, 42L, 286L, 6042409L,
           24L, 36L, 2879L, 18L, 301L, 90684L, 4296636L, 43L, 1222L, 4536L,
           3281L, 324L, 393L, 3754L, 98824541L, 459L, 18L, 1081L, 175L,
           970L, 17L, 219L, 235558L, 1167315L, 25L, 623L, 2517515L, 32L,
           217L, 29L, 17L, 1744L, 18L, 39L, 26L, 77L, 41L, 22L, 311L, 119015225L,
           146413L, 22L, 19L, 301L, 373L, 2240L, 6439L, 128L, 18L, 257L,
           783L, 5169L, 31608038L, 325L, 1533L, 25L, 69344L, 54L, 10651L,
           31L, 335062L, 1854019L, 7153L, 38605567L, 51L, 23L, 16L, 301L,
           79L, 313L, 18L, 29L, 39L, 22L, 17L, 306L, 67L, 280L, 324L, 158L,
           93L, 2561L, 302L, 134578L, 328L, 9002L, 969051L, 34L, 20L, 309L,
           355L, 28L, 9461327L, 18627013L, 305L, 64L, 18L, 2730L, 28L, 246L,
           911L, 28L, 241483L, 154691L, 58891L, 55L, 456362L, 281L, 276L,
           51L, 26L, 106821L, 313L, 78L, 29L, 400L, 61171382L, 200L, 101L,
           220331L, 128L, 325L, 28L, 22L, 325L, 2330L, 5879L, 24L, 36L,
           23L, 51L, 26L, 32584707L, 1672L, 13939L, 315L, 20L, 580785L,
           42795L, 49193543L, 695L, 48568156L, 55634L, 207L, 318L, 22056L,
           3670420L, 4815387L, 309L, 17L, 3143160L, 431L, 1164L, 33L, 5503L,
           4166L)), .Names = "LENGTH", row.names = c(8283L, 8484L, 2591L,
           7517L, 5808L, 4698L, 6665L, 1219L, 5944L, 6378L, 4140L, 6503L,
           8452L, 2310L, 4180L, 8497L, 8842L, 1062L, 4293L, 5063L, 8168L,
           1253L, 8932L, 8550L, 745L, 4643L, 3523L, 8177L, 4035L, 7545L,
           6657L, 7319L, 3502L, 6181L, 36L, 7513L, 67L, 1873L, 8174L, 5516L,
           3422L, 3928L, 338L, 8773L, 3891L, 8627L, 7997L, 5765L, 8745L,
           5573L, 3003L, 3122L, 3588L, 7064L, 351L, 6739L, 6095L, 1541L,
           2349L, 4628L, 6077L, 8839L, 6830L, 5094L, 7639L, 1704L, 2439L,
           7443L, 6230L, 2162L, 387L, 1262L, 1944L, 4306L, 1773L, 6460L,
           71L, 3371L, 4618L, 15L, 5220L, 1417L, 3222L, 5792L, 6960L, 5056L,
           2096L, 807L, 768L, 2737L, 5983L, 3L, 1870L, 8361L, 8294L, 6577L,
           2984L, 4614L, 6664L, 5545L, 5608L, 1945L, 1939L, 3482L, 8435L,
           8615L, 6621L, 6561L, 4793L, 21L, 5447L, 7484L, 6721L, 4048L,
           4790L, 4804L, 13L, 3179L, 5471L, 7407L, 3187L, 3669L, 5123L,
           5267L, 6427L, 3527L, 8207L, 8593L, 2085L, 6467L, 8065L, 5385L,
           5635L, 8363L, 7587L, 5172L, 7326L, 1015L, 6817L, 5560L, 1324L,
           716L, 4136L, 6945L, 6536L, 7281L, 1516L, 8415L, 2616L, 1328L,
           6406L, 2886L, 6933L, 3511L, 6040L, 6905L, 1672L, 259L, 1208L,
           6051L, 8315L, 4896L, 5351L, 1752L, 4759L, 1597L, 4017L, 2818L,
           1033L, 1654L, 6483L, 3659L, 3678L, 4266L, 3797L, 1212L, 7322L,
           5258L, 7052L, 6826L, 8147L, 7655L, 2813L, 2300L, 6584L, 6629L,
           8140L, 7034L, 1183L, 2551L, 1726L, 6950L, 1143L, 1144L, 641L,
           471L, 4712L, 995L, 6582L, 6476L), class = "data.frame")


############################# display with PLOT FUNCTION:


# saving files should be avoided in reproducible examples... especially files
# that cannot be transmitted through the R-help mailing list such as pdf files
#pdf("display.R.ecdf.LENGTH.pdf", width=10, height=6, paper='special')

# Your original plot commands below create a fake impression of the data by
# falsifying the axes. If you really are only interested in data points less
# than 500, you should be explicit about creating a data set containing only
# such constrained values before plotting them.
plot(ecdf(DF$LENGTH), xlab="DEL SIZE",
                      ylab="fraction of DEL",
                      main="LENGTH of DEL",
                      xlim=c(0,500),
                      col = "dark red", axes = FALSE)
ticks_y <- c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4)
axis(2, at=ticks_y, labels=ticks_y, col.axis="red")
ticks_x <- c(0, 100, 200, 400, 500, 600, 700, 800)
axis(1, at=ticks_x, labels=ticks_x, col.axis="blue")

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-1.png)

# my recommendation
DF500 <- subset( DF, LENGTH < 500 )
plot( ecdf( DF500$LENGTH )
     , xlab = "DEL SIZE"
     , ylab = "fraction of DEL"
     , main = "LENGTH of DEL"
     , col = "dark red"
     )

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-2.png)

# alternatively
plot( ecdf( DF$LENGTH )
     , xlab = "DEL SIZE"
     , ylab = "fraction of DEL"
     , main = "LENGTH of DEL"
     , col = "dark red"
     , xlim=c( 1, 1e9 )
     , log="x"
     )

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-3.png)



#dev.off()

############################# display in GGPLOT2 :

BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
            1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)

barfill <- "#4271AE"
barlines <- "#1F3552"

#pdf("display.ggplot2.ecdf.LENGTH.pdf", width=10, height=6, paper='special')

# ggplot's scale limits are what enable this false representation of the data:
# values outside the limits are dropped before the ECDF is computed, though
# ggplot does warn you about the removal
ggplot(DF, aes(LENGTH)) +
           stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
           scale_x_continuous(name = "LENGTH of DEL",
                              breaks = BREAKS,
                              limits=c(0, 500)
                              ) +
           scale_y_continuous(name = "FRACTION") +
           ggtitle("ECDF of LENGTH") +
           theme_bw() +
           theme(legend.position = "bottom", legend.direction = "horizontal",
                legend.box = "horizontal",
                legend.key.size = unit(1, "cm"),
                axis.title = element_text(size = 12),
                legend.text = element_text(size = 9),
                legend.title=element_text(face = "bold", size = 9))
#> Warning: Removed 80 rows containing non-finite values (stat_ecdf).

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-4.png)


# my recommendation
ggplot(DF500, aes(LENGTH)) +
   stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
   scale_x_continuous(name = "LENGTH of DEL",
                      breaks = BREAKS ) +
   scale_y_continuous(name = "FRACTION") +
   ggtitle("ECDF of LENGTH") +
   theme_bw() +
   theme(legend.position = "bottom", legend.direction = "horizontal",
         legend.box = "horizontal",
         legend.key.size = unit(1, "cm"),
         axis.title = element_text(size = 12),
         legend.text = element_text(size = 9),
         legend.title=element_text(face = "bold", size = 9))

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-5.png)

# or for the un-filtered data
ggplot(DF, aes(LENGTH)) +
   stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
   scale_x_log10( name = "LENGTH of DEL") +
   scale_y_continuous(name = "FRACTION") +
   ggtitle("ECDF of LENGTH") +
   theme_bw() +
   theme(legend.position = "bottom", legend.direction = "horizontal",
         legend.box = "horizontal",
         legend.key.size = unit(1, "cm"),
         axis.title = element_text(size = 12),
         legend.text = element_text(size = 9),
         legend.title=element_text(face = "bold", size = 9))

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-6.png)


#dev.off()

#' Created on 2018-07-09 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
#######
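
The coord_cartesian() suggestion quoted further down can be sketched as 
follows. This is only an illustration on made-up values, not the LENGTH 
data; toy, p_limits, p_zoom and max_in_window are names introduced for 
this sketch. Scale limits discard out-of-range values before stat_ecdf() 
runs, whereas coord_cartesian() only zooms the view, so the ECDF is still 
computed from all of the data.

```r
library(ggplot2)

# Toy values, purely illustrative (not taken from the LENGTH data)
toy <- data.frame(LENGTH = c(10, 20, 300, 400, 1000, 50000))

# Scale limits: values above 500 become NA and are dropped (with a warning)
# before the ECDF is computed, so the curve reaches 1 inside the window.
p_limits <- ggplot(toy, aes(LENGTH)) +
  stat_ecdf(geom = "point") +
  scale_x_continuous(limits = c(0, 500))

# coord_cartesian: the ECDF is computed from all six values and the plot is
# only zoomed, so the curve stops at 4/6 inside the window.
p_zoom <- ggplot(toy, aes(LENGTH)) +
  stat_ecdf(geom = "point") +
  coord_cartesian(xlim = c(0, 500))

# Largest ECDF value among the points that fall inside the 0-500 window
max_in_window <- function(p) {
  d <- layer_data(p, 1)
  max(d$y[is.finite(d$x) & d$x <= 500])
}
max_in_window(p_limits)
max_in_window(p_zoom)
```

Printing p_limits and p_zoom interactively shows the same difference 
graphically.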

On Sun, 8 Jul 2018, Bogdan Tanasa wrote:

> Dear Jeff, 
> thank you for your email. 
> 
> Yes, in order to be more descriptive/comprehensive, please find attached to
> my email the following files (my apologies ... I am sending these as
> attachments, as I do not have a web server running at this moment) : 
> 
> -- the R script (R_script_display_ECDF.R) that reads the file "LENGTH" and
> outputs ECDF figure by using the standard R function or ggplot2.
> 
> -- the display of ECDF by using standard R function
> ("display.R.ecdf.LENGTH.pdf")
> 
> -- the display of ECDF by using ggplot2 ("display.ggplot2.ecdf.LENGTH.pdf")
> 
> The ECDF over xlim(0,500) looks very different (contrasting plot(ecdf) vs
> ggplot2).  Please would you advise why ? what shall I change in my ggplot2
> code ?
> 
> thanks a lot, 
> 
> - bogdan
> 
> ps : the R code is also written below :
>
>       library("ggplot2")
>
>       file <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)
>
>       ############################# display with PLOT FUNCTION:
>
>       pdf("display.R.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>
>       plot(ecdf(file$LENGTH), xlab="DEL SIZE",
>                            ylab="fraction of DEL",
>                            main="LENGTH of DEL",
>                            xlim=c(0,500),
>                            col = "dark red", axes = FALSE)
>       ticks_y <- c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4)
>       axis(2, at=ticks_y, labels=ticks_y, col.axis="red")
>       ticks_x <- c(0, 100, 200, 400, 500, 600, 700, 800)
>       axis(1, at=ticks_x, labels=ticks_x, col.axis="blue")
>
>       dev.off()
>
>       ############################# display in GGPLOT2 :
>
>       BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
>                  1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)
>
>       barfill <- "#4271AE"
>       barlines <- "#1F3552"
>
>       pdf("display.ggplot2.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>
>       ggplot(file, aes(LENGTH)) +
>                 stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
>                 scale_x_continuous(name = "LENGTH of DEL",
>                                    breaks = BREAKS,
>                                    limits=c(0, 500)) +
>                 scale_y_continuous(name = "FRACTION") +
>                 ggtitle("ECDF of LENGTH") +
>                 theme_bw() +
>                 theme(legend.position = "bottom", legend.direction = "horizontal",
>                      legend.box = "horizontal",
>                      legend.key.size = unit(1, "cm"),
>                      axis.title = element_text(size = 12),
>                      legend.text = element_text(size = 9),
>                      legend.title=element_text(face = "bold", size = 9))
>
>       dev.off()
>
> On Sat, Jul 7, 2018 at 9:47 PM, Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
> wrote:
>       It is a feature of ggplot that points excluded by limits raise
>       warnings, while base graphics do not.
>
>       You may find that using coord_cartesian with the xlim=c(0,500)
>       argument works better with ggplot by showing the consequences of
>       points out of the limits on lines within the viewport.
>
>       There are other possible problems with your data that your
>       non-reproducible example does not show, and sending R code in
>       HTML-formatted email usually corrupts it, so please follow the
>       recommendations in the Posting Guide next time you post.
>
>       On July 6, 2018 4:32:41 PM PDT, Bogdan Tanasa <tanasa using gmail.com>
>       wrote:
>       >Dear all,
>       >
>       >I would appreciate having your advice/suggestions/comments on the
>       >following:
>       >
>       >1 -- starting from a vector that contains LENGTHS (numerically, the
>       >values are from 1 to 10 000)
>       >
>       >2 -- if I display the ECDF by using the R code and some "limits" :
>       >
>       >BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
>       >         1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)
>       >
>       >ggplot(x, aes(LENGTH)) +
>       >          stat_ecdf(geom = "point") +
>       >          scale_x_continuous(name = "LENGTH of DEL",
>       >                             breaks = BREAKS,
>       >                             limits=c(0, 500))
>       >
>       >3 -- I am getting the following warning message : "Warning message:
>       >Removed 109 rows containing non-finite values (stat_ecdf)."
>       >
>       >The question is : are these 109 values removed from VISUALIZATION as I
>       >set up the "limits", or are these 109 values removed from statistical
>       >CALCULATION?
>       >
>       >4 -- in contrast, if I use the standard R function plot(ecdf), there is
>       >no "warning message"
>       >
>       >plot(ecdf(x$LENGTH), xlab="DEL LENGTH",
>       >                     ylab="Fraction of DEL", main="DEL", xlim=c(0,500),
>       >                     col = "dark red")
>       >
>       >Thanks a lot !
>       >
>       >-- bogdan
>       >

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil using dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------