[R] about ECDF display in ggplot2

Bogdan Tanasa tanasa using gmail.com
Mon Jul 9 16:09:18 CEST 2018


Dear Jeff,

Thank you for all your time and for your very valuable help.

with best regards.

-- bogdan

On Mon, Jul 9, 2018 at 1:41 AM, Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
wrote:

> Thank you for making the effort... but most attachments get stripped on
> the mailing list. Using the reprex package as I suggested and putting the
> result into the email is by far the safest approach. Since I received your
> email directly, I did get the attachments. Below is my reproducible
> example... to serve as an example for how you can get help from everyone on
> the list rather than just the few you are responding to.
>
> My summary comment is that you have to decide whether the LENGTH values
> greater than 500 are relevant... and if they are, you REALLY SHOULD create
> a data set that is limited in this fashion. Then you won't have to create
> "fake" axes, and you won't get ggplot warnings.
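
To make the difference concrete, here is a minimal sketch using base R only. The vector `toy` is made up for illustration and is not the LENGTH data from this thread. It shows that an ECDF of the full data and an ECDF of the filtered data give different fractions at the same x value:

```r
# Hypothetical toy lengths, chosen only for illustration
toy <- c(10, 50, 200, 400, 1000, 50000)

# ECDF of all six values: 4 of the 6 are <= 400
F_all <- ecdf(toy)
frac_all <- F_all(400)

# ECDF of only the values below 500: all 4 remaining values are <= 400
toy500 <- toy[toy < 500]
F_500 <- ecdf(toy500)
frac_500 <- F_500(400)

print(c(all = frac_all, below500 = frac_500))
```

Calling plot(ecdf(toy), xlim = c(0, 500)) draws the first curve cropped to the window, so it tops out at about 0.67, while plot(ecdf(toy500)) reaches 1. Which of the two is appropriate depends on whether the long values belong to the population being described.
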
>
> Note: The reprex package allows you to confirm that the example is in fact
> reproducible, so technically it is not necessary to include the plot images
> in the question. However, reprex used to conveniently support putting the
> images on the imgur website, and for some reason it no longer does that, so
> just run the example interactively to see the graphs.
>
> #######
> ############################################################
> ############################################################
>
> library("ggplot2")
>
> # "file" is the name of a very fundamental function in base R. Re-using
> # that name for a data value is at best confusing to anyone reading your
> # code and at worst will prevent you from using that function.
> #file <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)
>
> # Instead of giving us a file, keep the data within the example
> # DF <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)
> # set.seed( 42 )
> # also shrink the size of the data for the example... we almost
> # never need all of it
> # dput( DF[ sample( seq.int( nrow( DF ) ), size = 200 ), , drop=FALSE ] )
> DF <- structure(list(LENGTH = c(6813L, 56035L, 123997L, 281L, 851L, 1072L,
>           72196L, 21L, 304L, 110L, 198L, 5922L, 283L, 199348L, 109L,
>           3317104L, 106L, 37642146L, 82641L, 20L, 125911L, 354L, 11625388L,
>           330L, 9811711L, 18L, 35L, 39897L, 27L, 277L, 79L, 2657L, 17L,
>           26L, 23L, 248L, 3634L, 21L, 324L, 206L, 328L, 42L, 286L, 6042409L,
>           24L, 36L, 2879L, 18L, 301L, 90684L, 4296636L, 43L, 1222L, 4536L,
>           3281L, 324L, 393L, 3754L, 98824541L, 459L, 18L, 1081L, 175L,
>           970L, 17L, 219L, 235558L, 1167315L, 25L, 623L, 2517515L, 32L,
>           217L, 29L, 17L, 1744L, 18L, 39L, 26L, 77L, 41L, 22L, 311L, 119015225L,
>           146413L, 22L, 19L, 301L, 373L, 2240L, 6439L, 128L, 18L, 257L,
>           783L, 5169L, 31608038L, 325L, 1533L, 25L, 69344L, 54L, 10651L,
>           31L, 335062L, 1854019L, 7153L, 38605567L, 51L, 23L, 16L, 301L,
>           79L, 313L, 18L, 29L, 39L, 22L, 17L, 306L, 67L, 280L, 324L, 158L,
>           93L, 2561L, 302L, 134578L, 328L, 9002L, 969051L, 34L, 20L, 309L,
>           355L, 28L, 9461327L, 18627013L, 305L, 64L, 18L, 2730L, 28L, 246L,
>           911L, 28L, 241483L, 154691L, 58891L, 55L, 456362L, 281L, 276L,
>           51L, 26L, 106821L, 313L, 78L, 29L, 400L, 61171382L, 200L, 101L,
>           220331L, 128L, 325L, 28L, 22L, 325L, 2330L, 5879L, 24L, 36L,
>           23L, 51L, 26L, 32584707L, 1672L, 13939L, 315L, 20L, 580785L,
>           42795L, 49193543L, 695L, 48568156L, 55634L, 207L, 318L, 22056L,
>           3670420L, 4815387L, 309L, 17L, 3143160L, 431L, 1164L, 33L, 5503L,
>           4166L)), .Names = "LENGTH", row.names = c(8283L, 8484L, 2591L,
>           7517L, 5808L, 4698L, 6665L, 1219L, 5944L, 6378L, 4140L, 6503L,
>           8452L, 2310L, 4180L, 8497L, 8842L, 1062L, 4293L, 5063L, 8168L,
>           1253L, 8932L, 8550L, 745L, 4643L, 3523L, 8177L, 4035L, 7545L,
>           6657L, 7319L, 3502L, 6181L, 36L, 7513L, 67L, 1873L, 8174L, 5516L,
>           3422L, 3928L, 338L, 8773L, 3891L, 8627L, 7997L, 5765L, 8745L,
>           5573L, 3003L, 3122L, 3588L, 7064L, 351L, 6739L, 6095L, 1541L,
>           2349L, 4628L, 6077L, 8839L, 6830L, 5094L, 7639L, 1704L, 2439L,
>           7443L, 6230L, 2162L, 387L, 1262L, 1944L, 4306L, 1773L, 6460L,
>           71L, 3371L, 4618L, 15L, 5220L, 1417L, 3222L, 5792L, 6960L, 5056L,
>           2096L, 807L, 768L, 2737L, 5983L, 3L, 1870L, 8361L, 8294L, 6577L,
>           2984L, 4614L, 6664L, 5545L, 5608L, 1945L, 1939L, 3482L, 8435L,
>           8615L, 6621L, 6561L, 4793L, 21L, 5447L, 7484L, 6721L, 4048L,
>           4790L, 4804L, 13L, 3179L, 5471L, 7407L, 3187L, 3669L, 5123L,
>           5267L, 6427L, 3527L, 8207L, 8593L, 2085L, 6467L, 8065L, 5385L,
>           5635L, 8363L, 7587L, 5172L, 7326L, 1015L, 6817L, 5560L, 1324L,
>           716L, 4136L, 6945L, 6536L, 7281L, 1516L, 8415L, 2616L, 1328L,
>           6406L, 2886L, 6933L, 3511L, 6040L, 6905L, 1672L, 259L, 1208L,
>           6051L, 8315L, 4896L, 5351L, 1752L, 4759L, 1597L, 4017L, 2818L,
>           1033L, 1654L, 6483L, 3659L, 3678L, 4266L, 3797L, 1212L, 7322L,
>           5258L, 7052L, 6826L, 8147L, 7655L, 2813L, 2300L, 6584L, 6629L,
>           8140L, 7034L, 1183L, 2551L, 1726L, 6950L, 1143L, 1144L, 641L,
>           471L, 4712L, 995L, 6582L, 6476L), class = "data.frame")
>
>
> ############################# display with PLOT FUNCTION:
>
>
> # saving files should be avoided in reproducible examples... especially
> # files that cannot be transmitted through the R-help mailing list, such
> # as pdf files
> #pdf("display.R.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>
> # Your original plot commands below create a fake impression of the data by
> # falsifying the axes. If you really are only interested in data points
> # less than 500, you should be explicit about creating a data set containing
> # only such constrained values before plotting them.
> plot(ecdf(DF$LENGTH), xlab="DEL SIZE",
>                      ylab="fraction of DEL",
>                      main="LENGTH of DEL",
>                      xlim=c(0,500),
>                      col = "dark red", axes = FALSE)
> ticks_y <- c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4)
> axis(2, at=ticks_y, labels=ticks_y, col.axis="red")
> ticks_x <- c(0, 100, 200, 400, 500, 600, 700, 800)
> axis(1, at=ticks_x, labels=ticks_x, col.axis="blue")
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-1.png)
>
> # my recommendation
> DF500 <- subset( DF, LENGTH < 500 )
> plot( ecdf( DF500$LENGTH )
>     , xlab = "DEL SIZE"
>     , ylab = "fraction of DEL"
>     , main = "LENGTH of DEL"
>     , col = "dark red"
>     )
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-2.png)
>
> # alternatively
> plot( ecdf( DF$LENGTH )
>     , xlab = "DEL SIZE"
>     , ylab = "fraction of DEL"
>     , main = "LENGTH of DEL"
>     , col = "dark red"
>     , xlim=c( 1, 1e9 )
>     , log="x"
>     )
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-3.png)
>
>
>
> #dev.off()
>
> ############################# display in GGPLOT2 :
>
> BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
>            1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)
>
> barfill <- "#4271AE"
> barlines <- "#1F3552"
>
> #pdf("display.ggplot2.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>
> # ggplot's limits behavior is enabling your false representation of the
> # data, but it warns you of the data removal
> ggplot(DF, aes(LENGTH)) +
>           stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
>           scale_x_continuous(name = "LENGTH of DEL",
>                              breaks = BREAKS,
>                              limits=c(0, 500)
>                              ) +
>           scale_y_continuous(name = "FRACTION") +
>           ggtitle("ECDF of LENGTH") +
>           theme_bw() +
>           theme(legend.position = "bottom",
>                legend.direction = "horizontal",
>                legend.box = "horizontal",
>                legend.key.size = unit(1, "cm"),
>                axis.title = element_text(size = 12),
>                legend.text = element_text(size = 9),
>                legend.title=element_text(face = "bold", size = 9))
> #> Warning: Removed 80 rows containing non-finite values (stat_ecdf).
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-4.png)
>
>
> # my recommendation
> ggplot(DF500, aes(LENGTH)) +
>   stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
>   scale_x_continuous(name = "LENGTH of DEL",
>                      breaks = BREAKS ) +
>   scale_y_continuous(name = "FRACTION") +
>   ggtitle("ECDF of LENGTH") +
>   theme_bw() +
>   theme(legend.position = "bottom", legend.direction = "horizontal",
>         legend.box = "horizontal",
>         legend.key.size = unit(1, "cm"),
>         axis.title = element_text(size = 12),
>         legend.text = element_text(size = 9),
>         legend.title=element_text(face = "bold", size = 9))
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-5.png)
>
> # or for the un-filtered data
> ggplot(DF, aes(LENGTH)) +
>   stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
>   scale_x_log10( name = "LENGTH of DEL") +
>   scale_y_continuous(name = "FRACTION") +
>   ggtitle("ECDF of LENGTH") +
>   theme_bw() +
>   theme(legend.position = "bottom", legend.direction = "horizontal",
>         legend.box = "horizontal",
>         legend.key.size = unit(1, "cm"),
>         axis.title = element_text(size = 12),
>         legend.text = element_text(size = 9),
>         legend.title=element_text(face = "bold", size = 9))
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-6.png)
>
>
> #dev.off()
>
> #' Created on 2018-07-09 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
> #######
>
>
> On Sun, 8 Jul 2018, Bogdan Tanasa wrote:
>
>> Dear Jeff,
>> thank you for your email.
>>
>> Yes, in order to be more descriptive/comprehensive, please find attached to
>> my email the following files (my apologies ... I am sending these as
>> attachments, as I do not have a web server running at this moment) :
>>
>> -- the R script (R_script_display_ECDF.R) that reads the file "LENGTH" and
>> outputs ECDF figure by using the standard R function or ggplot2.
>>
>> -- the display of ECDF by using standard R function
>> ("display.R.ecdf.LENGTH.pdf")
>>
>> -- the display of ECDF by using ggplot2 ("display.ggplot2.ecdf.LENGTH.pdf")
>>
>> The ECDF over xlim(0,500) looks very different (contrasting plot(ecdf) vs
>> ggplot2). Could you please advise why? What should I change in my ggplot2
>> code?
>>
>> thanks a lot,
>>
>> - bogdan
>>
>> ps : the R code is also written below :
>>
>>        library("ggplot2")
>>
>>
>>       file <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)
>>
>>
>>       ############################# display with PLOT FUNCTION:
>>
>>
>>       pdf("display.R.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>>
>>
>>       plot(ecdf(file$LENGTH), xlab="DEL SIZE",
>>                            ylab="fraction of DEL",
>>                            main="LENGTH of DEL",
>>                            xlim=c(0,500),
>>                            col = "dark red", axes = FALSE)
>>
>>
>>       ticks_y <- c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4)
>>
>>
>>       axis(2, at=ticks_y, labels=ticks_y, col.axis="red")
>>
>>
>>       ticks_x <- c(0, 100, 200, 400, 500, 600, 700, 800)
>>
>>
>>       axis(1, at=ticks_x, labels=ticks_x, col.axis="blue")
>>
>>
>>       dev.off()
>>
>>
>>       ############################# display in GGPLOT2 :
>>
>>
>>       BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
>>                  1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)
>>
>>
>>       barfill <- "#4271AE"
>>       barlines <- "#1F3552"
>>
>>
>>       pdf("display.ggplot2.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>>
>>
>>       ggplot(file, aes(LENGTH)) +
>>                 stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
>>                 scale_x_continuous(name = "LENGTH of DEL",
>>                                    breaks = BREAKS,
>>                                    limits=c(0, 500)) +
>>                 scale_y_continuous(name = "FRACTION") +
>>                 ggtitle("ECDF of LENGTH") +
>>                 theme_bw() +
>>                 theme(legend.position = "bottom",
>>                      legend.direction = "horizontal",
>>                      legend.box = "horizontal",
>>                      legend.key.size = unit(1, "cm"),
>>                      axis.title = element_text(size = 12),
>>                      legend.text = element_text(size = 9),
>>                      legend.title=element_text(face = "bold", size = 9))
>>
>>
>>       dev.off()
>>
>>
>>
>>
>>
>>
>>
>> On Sat, Jul 7, 2018 at 9:47 PM, Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
>> wrote:
>>       It is a feature of ggplot that points excluded by limits raise
>>       warnings, while base graphics do not.
>>
>>       You may find that using coord_cartesian with the xlim=c(0,500)
>>       argument works better with ggplot by showing the consequences of
>>       points out of the limits on lines within the viewport.
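
A small sketch of this suggestion, again with a hypothetical toy data frame rather than the real LENGTH file. It assumes the usual ggplot2 behavior that limits set on a scale turn out-of-range values into NA before stat_ecdf runs, whereas coord_cartesian only zooms the finished plot. layer_data() is used to inspect what each plot actually computed:

```r
library(ggplot2)

# Hypothetical toy data, not the LENGTH file from this thread
toy <- data.frame(LENGTH = c(10, 50, 200, 400, 1000, 50000))

# Limits on the scale: values above 500 are dropped before the ECDF is computed
p_scale <- ggplot(toy, aes(LENGTH)) +
  stat_ecdf(geom = "point") +
  scale_x_continuous(limits = c(0, 500))

# Limits on the coordinate system: the ECDF uses all rows, only the view is zoomed
p_coord <- ggplot(toy, aes(LENGTH)) +
  stat_ecdf(geom = "point") +
  coord_cartesian(xlim = c(0, 500))

# layer_data() returns the data computed for the first layer of each plot
d_scale <- suppressWarnings(layer_data(p_scale))  # suppresses the removed-rows warning
d_coord <- layer_data(p_coord)

# Compare the ECDF value each plot assigns to LENGTH = 400
y_scale <- d_scale$y[d_scale$x == 400]
y_coord <- d_coord$y[d_coord$x == 400]
print(c(scale_limits = y_scale, coord_cartesian = y_coord))
```

In this sketch the scale-limited plot assigns a fraction of 1 to LENGTH 400, because the ECDF was computed on the four remaining rows, while the coord_cartesian plot assigns 4/6. So the rows named in the warning are removed from the calculation, not only from the display.
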
>>
>>       There are other possible problems with your data that your
>>       non-reproducible example does not show, and sending R code in
>>       HTML-formatted email usually corrupts it... so please follow the
>>       recommendations in the Posting Guide next time you post.
>>
>>       On July 6, 2018 4:32:41 PM PDT, Bogdan Tanasa <tanasa using gmail.com>
>>       wrote:
>>       >Dear all,
>>       >
>>       >I would appreciate having your advice/suggestions/comments on the
>>       >following:
>>       >
>>       >1 -- starting from a vector that contains LENGTHS (numerically, the
>>       >values are from 1 to 10 000)
>>       >
>>       >2 -- shall I display the ECDF by using the R code and some "limits" :
>>       >
>>       >BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400,
>>       >           500, 1000, 10000, 100000, 1000000, 10000000, 100000000,
>>       >           1000000000)
>>       >
>>       >ggplot(x, aes(LENGTH)) +
>>       >          stat_ecdf(geom = "point") +
>>       >          scale_x_continuous(name = "LENGTH of DEL",
>>       >                             breaks = BREAKS,
>>       >                             limits=c(0, 500))
>>       >
>>       >3 -- I am getting the following warning message : "Warning message:
>>       >Removed 109 rows containing non-finite values (stat_ecdf)."
>>       >
>>       >The question is : are these 109 values removed from VISUALIZATION as I
>>       >set up the "limits", or are these 109 values removed from statistical
>>       >CALCULATION?
>>       >
>>       >4 -- in contrast, shall I use the standard R functions plot(ecdf),
>>       >there is no "warning message"
>>       >
>>       >plot(ecdf(x$LENGTH), xlab="DEL LENGTH",
>>       >                     ylab="Fraction of DEL", main="DEL", xlim=c(0,500),
>>       >                     col = "dark red")
>>       >
>>       >Thanks a lot !
>>       >
>>       >-- bogdan
>>       >
>>
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>>
>>
>>
>>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil using dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
