[R] about ECDF display in ggplot2
Bogdan Tanasa
tanasa using gmail.com
Mon Jul 9 16:09:18 CEST 2018
Dear Jeff,
Thank you for all your time and for your very valuable help.
With best regards,
-- bogdan
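
A note for the archive: the question asked at the start of this thread (are
the values beyond the limits dropped from the picture, or from the
calculation?) can be checked on a toy vector. The sketch below is only an
illustration under stated assumptions: the vector `len` is made up and is not
the LENGTH data, and the subsetting step imitates what the `limits` argument
of scale_x_continuous does by default (out-of-range values become NA before
stat_ecdf sees them) rather than calling ggplot2 itself.

```r
# Toy data (hypothetical, not the LENGTH file from the thread):
# 4 values at or below 500 and 6 values above it.
len <- c(10, 20, 30, 40, 1000, 2000, 3000, 4000, 5000, 6000)

# ECDF computed on ALL values. Zooming the axis afterwards, as
# plot(..., xlim = c(0, 500)) or coord_cartesian(xlim = c(0, 500)) do,
# leaves these values unchanged.
F_all <- ecdf(len)

# ECDF computed after discarding the values above 500. This imitates
# scale_x_continuous(limits = c(0, 500)), which drops them before the
# statistic is calculated (hence the "Removed ... rows" warning).
F_cut <- ecdf(len[len <= 500])

F_all(40)  # 0.4: four of the ten values are <= 40
F_cut(40)  # 1:   all four retained values are <= 40
```

So with limits set on the scale the curve reaches 1 inside the 0..500 window,
while a zoom only crops the view and the curve stays at 0.4 at the same point.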
On Mon, Jul 9, 2018 at 1:41 AM, Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
wrote:
> Thank you for making the effort... but most attachments get stripped on
> the mailing list. Using the reprex package as I suggested and putting the
> result into the email is by far the safest approach. Since I received your
> email directly, I did get the attachments. Below is my reproducible
> example... to serve as an example for how you can get help from everyone on
> the list rather than just the few you are responding to.
>
> My summary comment is that you have to decide whether the LENGTH values
> greater than 500 are relevant... and if they are not, you REALLY SHOULD
> create a data set that excludes them. Then you won't have to create
> "fake" axes, and you won't get ggplot warnings.
>
> Note: The reprex package allows you to confirm that the example is in fact
> reproducible, so technically it is not necessary to include the plot images
> in the question. However, reprex used to conveniently support putting the
> images on the imgur website, and for some reason it no longer does that, so
> just run the example interactively to see the graphs.
>
> #######
> ############################################################
> ############################################################
>
> library("ggplot2")
>
> # "file" is the name of a very fundamental function in base R. Re-using
> # that name for a data value is at best confusing to anyone reading your
> # code and at worst will prevent you from using that function.
> #file <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)
>
> # Instead of giving us a file, keep the data within the example
> # DF <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)
> # set.seed( 42 )
> # also shrink the size of the data for the example... we almost
> # never need all of it
> # dput( DF[ sample( seq.int( nrow( DF ) ), size = 200 ), , drop=FALSE ] )
> DF <- structure(list(LENGTH = c(6813L, 56035L, 123997L, 281L, 851L, 1072L,
> 72196L, 21L, 304L, 110L, 198L, 5922L, 283L, 199348L, 109L,
> 3317104L, 106L, 37642146L, 82641L, 20L, 125911L, 354L, 11625388L,
> 330L, 9811711L, 18L, 35L, 39897L, 27L, 277L, 79L, 2657L, 17L,
> 26L, 23L, 248L, 3634L, 21L, 324L, 206L, 328L, 42L, 286L,
> 6042409L,
> 24L, 36L, 2879L, 18L, 301L, 90684L, 4296636L, 43L, 1222L, 4536L,
> 3281L, 324L, 393L, 3754L, 98824541L, 459L, 18L, 1081L, 175L,
> 970L, 17L, 219L, 235558L, 1167315L, 25L, 623L, 2517515L, 32L,
> 217L, 29L, 17L, 1744L, 18L, 39L, 26L, 77L, 41L, 22L, 311L,
> 119015225L,
> 146413L, 22L, 19L, 301L, 373L, 2240L, 6439L, 128L, 18L, 257L,
> 783L, 5169L, 31608038L, 325L, 1533L, 25L, 69344L, 54L, 10651L,
> 31L, 335062L, 1854019L, 7153L, 38605567L, 51L, 23L, 16L, 301L,
> 79L, 313L, 18L, 29L, 39L, 22L, 17L, 306L, 67L, 280L, 324L, 158L,
> 93L, 2561L, 302L, 134578L, 328L, 9002L, 969051L, 34L, 20L, 309L,
> 355L, 28L, 9461327L, 18627013L, 305L, 64L, 18L, 2730L, 28L, 246L,
> 911L, 28L, 241483L, 154691L, 58891L, 55L, 456362L, 281L, 276L,
> 51L, 26L, 106821L, 313L, 78L, 29L, 400L, 61171382L, 200L, 101L,
> 220331L, 128L, 325L, 28L, 22L, 325L, 2330L, 5879L, 24L, 36L,
> 23L, 51L, 26L, 32584707L, 1672L, 13939L, 315L, 20L, 580785L,
> 42795L, 49193543L, 695L, 48568156L, 55634L, 207L, 318L, 22056L,
> 3670420L, 4815387L, 309L, 17L, 3143160L, 431L, 1164L, 33L, 5503L,
> 4166L)), .Names = "LENGTH", row.names = c(8283L, 8484L, 2591L,
> 7517L, 5808L, 4698L, 6665L, 1219L, 5944L, 6378L, 4140L, 6503L,
> 8452L, 2310L, 4180L, 8497L, 8842L, 1062L, 4293L, 5063L, 8168L,
> 1253L, 8932L, 8550L, 745L, 4643L, 3523L, 8177L, 4035L, 7545L,
> 6657L, 7319L, 3502L, 6181L, 36L, 7513L, 67L, 1873L, 8174L, 5516L,
> 3422L, 3928L, 338L, 8773L, 3891L, 8627L, 7997L, 5765L, 8745L,
> 5573L, 3003L, 3122L, 3588L, 7064L, 351L, 6739L, 6095L, 1541L,
> 2349L, 4628L, 6077L, 8839L, 6830L, 5094L, 7639L, 1704L, 2439L,
> 7443L, 6230L, 2162L, 387L, 1262L, 1944L, 4306L, 1773L, 6460L,
> 71L, 3371L, 4618L, 15L, 5220L, 1417L, 3222L, 5792L, 6960L, 5056L,
> 2096L, 807L, 768L, 2737L, 5983L, 3L, 1870L, 8361L, 8294L, 6577L,
> 2984L, 4614L, 6664L, 5545L, 5608L, 1945L, 1939L, 3482L, 8435L,
> 8615L, 6621L, 6561L, 4793L, 21L, 5447L, 7484L, 6721L, 4048L,
> 4790L, 4804L, 13L, 3179L, 5471L, 7407L, 3187L, 3669L, 5123L,
> 5267L, 6427L, 3527L, 8207L, 8593L, 2085L, 6467L, 8065L, 5385L,
> 5635L, 8363L, 7587L, 5172L, 7326L, 1015L, 6817L, 5560L, 1324L,
> 716L, 4136L, 6945L, 6536L, 7281L, 1516L, 8415L, 2616L, 1328L,
> 6406L, 2886L, 6933L, 3511L, 6040L, 6905L, 1672L, 259L, 1208L,
> 6051L, 8315L, 4896L, 5351L, 1752L, 4759L, 1597L, 4017L, 2818L,
> 1033L, 1654L, 6483L, 3659L, 3678L, 4266L, 3797L, 1212L, 7322L,
> 5258L, 7052L, 6826L, 8147L, 7655L, 2813L, 2300L, 6584L, 6629L,
> 8140L, 7034L, 1183L, 2551L, 1726L, 6950L, 1143L, 1144L, 641L,
> 471L, 4712L, 995L, 6582L, 6476L), class = "data.frame")
>
>
> ############################# display with PLOT FUNCTION:
>
>
> # saving files should be avoided in reproducible examples... especially
> # files that cannot be transmitted through the R-help mailing list, such
> # as pdf files
> #pdf("display.R.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>
> # Your original plot commands below create a fake impression of the data by
> # falsifying the axes. If you really are only interested in data points
> # less than 500, you should be explicit about creating a data set containing
> # only such constrained values before plotting them.
> plot(ecdf(DF$LENGTH), xlab="DEL SIZE",
> ylab="fraction of DEL",
> main="LENGTH of DEL",
> xlim=c(0,500),
> col = "dark red", axes = FALSE)
> ticks_y <- c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4)
> axis(2, at=ticks_y, labels=ticks_y, col.axis="red")
> ticks_x <- c(0, 100, 200, 400, 500, 600, 700, 800)
> axis(1, at=ticks_x, labels=ticks_x, col.axis="blue")
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-1.png)
>
> # my recommendation
> DF500 <- subset( DF, LENGTH < 500 )
> plot( ecdf( DF500$LENGTH )
> , xlab = "DEL SIZE"
> , ylab = "fraction of DEL"
> , main = "LENGTH of DEL"
> , col = "dark red"
> )
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-2.png)
>
> # alternatively
> plot( ecdf( DF$LENGTH )
> , xlab = "DEL SIZE"
> , ylab = "fraction of DEL"
> , main = "LENGTH of DEL"
> , col = "dark red"
> , xlim=c( 1, 1e9 )
> , log="x"
> )
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-3.png)
>
>
>
> #dev.off()
>
> ############################# display in GGPLOT2 :
>
> BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
> 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)
>
> barfill <- "#4271AE"
> barlines <- "#1F3552"
>
> #pdf("display.ggplot2.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>
> # ggplot's limits behavior is enabling your false representation of the
> # data, but it warns you of the data removal
> ggplot(DF, aes(LENGTH)) +
> stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
> scale_x_continuous(name = "LENGTH of DEL",
> breaks = BREAKS,
> limits=c(0, 500)
> ) +
> scale_y_continuous(name = "FRACTION") +
> ggtitle("ECDF of LENGTH") +
> theme_bw() +
> theme(legend.position = "bottom", legend.direction = "horizontal",
> legend.box = "horizontal",
> legend.key.size = unit(1, "cm"),
> axis.title = element_text(size = 12),
> legend.text = element_text(size = 9),
> legend.title=element_text(face = "bold", size = 9))
> #> Warning: Removed 80 rows containing non-finite values (stat_ecdf).
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-4.png)
>
>
> # my recommendation
> ggplot(DF500, aes(LENGTH)) +
> stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
> scale_x_continuous(name = "LENGTH of DEL",
> breaks = BREAKS ) +
> scale_y_continuous(name = "FRACTION") +
> ggtitle("ECDF of LENGTH") +
> theme_bw() +
> theme(legend.position = "bottom", legend.direction = "horizontal",
> legend.box = "horizontal",
> legend.key.size = unit(1, "cm"),
> axis.title = element_text(size = 12),
> legend.text = element_text(size = 9),
> legend.title=element_text(face = "bold", size = 9))
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-5.png)
>
> # or for the un-filtered data
> ggplot(DF, aes(LENGTH)) +
> stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
> scale_x_log10( name = "LENGTH of DEL") +
> scale_y_continuous(name = "FRACTION") +
> ggtitle("ECDF of LENGTH") +
> theme_bw() +
> theme(legend.position = "bottom", legend.direction = "horizontal",
> legend.box = "horizontal",
> legend.key.size = unit(1, "cm"),
> axis.title = element_text(size = 12),
> legend.text = element_text(size = 9),
> legend.title=element_text(face = "bold", size = 9))
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-6.png)
>
>
> #dev.off()
>
> #' Created on 2018-07-09 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
> #######
>
>
> On Sun, 8 Jul 2018, Bogdan Tanasa wrote:
>
>> Dear Jeff,
>> thank you for your email.
>>
>> Yes, in order to be more descriptive/comprehensive, please find attached to
>> my email the following files (my apologies ... I am sending these as
>> attachments, as I do not have a web server running at this moment) :
>>
>> -- the R script (R_script_display_ECDF.R) that reads the file "LENGTH" and
>> outputs ECDF figure by using the standard R function or ggplot2.
>>
>> -- the display of ECDF by using standard R function
>> ("display.R.ecdf.LENGTH.pdf")
>>
>> -- the display of ECDF by using ggplot2 ("display.ggplot2.ecdf.LENGTH.pdf")
>>
>> The ECDF over xlim(0,500) looks very different between plot(ecdf) and
>> ggplot2. Could you please advise why, and what I should change in my
>> ggplot2 code?
>>
>> thanks a lot,
>>
>> - bogdan
>>
>> ps : the R code is also written below :
>>
>> library("ggplot2")
>>
>>
>> file <- read.delim("LENGTH", sep="\t", header=T,
>> stringsAsFactors=F)
>>
>>
>> ############################# display with PLOT FUNCTION:
>>
>>
>> pdf("display.R.ecdf.LENGTH.pdf", width=10, height=6,
>> paper='special')
>>
>>
>> plot(ecdf(file$LENGTH), xlab="DEL SIZE",
>> ylab="fraction of DEL",
>> main="LENGTH of DEL",
>> xlim=c(0,500),
>> col = "dark red", axes = FALSE)
>>
>>
>> ticks_y <- c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4)
>>
>>
>> axis(2, at=ticks_y, labels=ticks_y, col.axis="red")
>>
>>
>> ticks_x <- c(0, 100, 200, 400, 500, 600, 700, 800)
>>
>>
>> axis(1, at=ticks_x, labels=ticks_x, col.axis="blue")
>>
>>
>> dev.off()
>>
>>
>> ############################# display in GGPLOT2 :
>>
>>
>> BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300,
>> 400, 500,
>> 1000, 10000, 100000, 1000000, 10000000, 100000000,
>> 1000000000)
>>
>>
>> barfill <- "#4271AE"
>> barlines <- "#1F3552"
>>
>>
>> pdf("display.ggplot2.ecdf.LENGTH.pdf", width=10, height=6,
>> paper='special')
>>
>>
>> ggplot(file, aes(LENGTH)) +
>> stat_ecdf(geom = "point", colour = barlines, fill =
>> barfill) +
>> scale_x_continuous(name = "LENGTH of DEL",
>> breaks = BREAKS,
>> limits=c(0, 500)) +
>> scale_y_continuous(name = "FRACTION") +
>> ggtitle("ECDF of LENGTH") +
>> theme_bw() +
>> theme(legend.position = "bottom", legend.direction =
>> "horizontal",
>> legend.box = "horizontal",
>> legend.key.size = unit(1, "cm"),
>> axis.title = element_text(size = 12),
>> legend.text = element_text(size = 9),
>> legend.title=element_text(face = "bold", size =
>> 9))
>>
>>
>> dev.off()
>>
>> On Sat, Jul 7, 2018 at 9:47 PM, Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
>> wrote:
>> It is a feature of ggplot that points excluded by limits raise
>> warnings, while base graphics do not.
>>
>> You may find that using coord_cartesian with the xlim=c(0,500)
>> argument works better with ggplot by showing the consequences of
>> points out of the limits on lines within the viewport.
>>
>> There are other possible problems with your data that your
>> non-reproducible example does not show, and sending R code in
>> HTML-formatted email usually corrupts it, so please follow the
>> recommendations in the Posting Guide next time you post.
>>
>> On July 6, 2018 4:32:41 PM PDT, Bogdan Tanasa <tanasa using gmail.com>
>> wrote:
>> >Dear all,
>> >
>> >I would appreciate having your advice/suggestions/comments on the
>> >following:
>> >
>> >1 -- starting from a vector that contains LENGTHS (numerically, the
>> >values are from 1 to 10 000)
>> >
>> >2 -- shall I display the ECDF by using the R code and some "limits" :
>> >
>> >BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
>> >           1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)
>> >
>> >ggplot(x, aes(LENGTH)) +
>> >  stat_ecdf(geom = "point") +
>> >  scale_x_continuous(name = "LENGTH of DEL",
>> >                     breaks = BREAKS,
>> >                     limits=c(0, 500))
>> >
>> >3 -- I am getting the following warning message : "Warning message:
>> >Removed 109 rows containing non-finite values (stat_ecdf)."
>> >
>> >The question is : are these 109 values removed from VISUALIZATION as I
>> >set up the "limits", or are these 109 values removed from statistical
>> >CALCULATION?
>> >
>> >4 -- in contrast, shall I use the standard R function plot(ecdf), there
>> >is no "warning message"
>> >
>> >plot(ecdf(x$LENGTH), xlab="DEL LENGTH",
>> >     ylab="Fraction of DEL", main="DEL", xlim=c(0,500),
>> >     col = "dark red")
>> >
>> >Thanks a lot !
>> >
>> >-- bogdan
>> >
>> > [[alternative HTML version deleted]]
>> >
>> >______________________________________________
>> >R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >https://stat.ethz.ch/mailman/listinfo/r-help
>> >PLEASE do read the posting guide
>> >http://www.R-project.org/posting-guide.html
>> >and provide commented, minimal, self-contained, reproducible code.
>>
>> --
>> Sent from my phone. Please excuse my brevity.
>>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil using dcn.davis.ca.us>  Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------