[R] ggplot2: multiple box plots, different tibbles/dataframes

Avi Gross
Thu Nov 11 19:26:26 CET 2021


As I said to Rich privately in reply to another message, you may well be
able to fit what you need in memory if you are careful. But my main point is
that when you have this much data, you do not need all of it to make a
representative graph. A boxplot made from 100,000 data points may well show
so many outliers that they merge into a bushy tail, and it will not be much
more accurate than one made from 10,000 randomly chosen points from the same
data.
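
To make that concrete, here is a rough sketch comparing the numbers a boxplot
is built from, for a full vector and for a random subset of it. The data are
simulated (a skewed log-normal stand-in, my assumption, not Rich's flow
data), and the sizes are just the ones mentioned above.

```r
set.seed(42)                              # arbitrary seed, only for repeatability
full  <- rlnorm(100000, meanlog = 3)      # simulated, skewed stand-in data
small <- sample(full, 10000)              # 10,000 values drawn without replacement

# boxplot.stats() returns the five numbers a boxplot draws ($stats)
# and the points that fall beyond the whiskers ($out)
stats_full  <- boxplot.stats(full)
stats_small <- boxplot.stats(small)

# Whisker ends, hinges and median, side by side
round(rbind(full = stats_full$stats, small = stats_small$stats), 1)

# Number of points that would be drawn individually as outliers
c(full = length(stats_full$out), small = length(stats_small$out))
```

The box and whiskers should land in nearly the same place, while the subset
has far fewer outlier points to draw.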

So the idea would be to read df1 into memory, trim away any columns you do
not need, use something like sample() to make a smaller version, and rm()
the original. Then repeat with the second, the third, and so on.
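
Something along these lines, as a sketch only. The file is a temporary CSV I
write first so the example runs on its own; the column names site_nbr, year
and cfs come from Rich's message, while the "notes" column, the row counts
and n_keep are made up for illustration.

```r
set.seed(2021)                            # arbitrary seed so the sample is repeatable

# Stand-in for one large file on disk; a real file would already exist
path1 <- tempfile(fileext = ".csv")
write.csv(data.frame(site_nbr = 1L,
                     year     = sample(1990:2020, 50000, replace = TRUE),
                     cfs      = rlnorm(50000, meanlog = 5),
                     notes    = "not needed"),
          path1, row.names = FALSE)

n_keep <- 10000                           # assumed sample size per site

df1_full <- read.csv(path1)
df1_full <- df1_full[, c("site_nbr", "year", "cfs")]   # drop unneeded columns
df1 <- df1_full[sample(nrow(df1_full), n_keep), ]      # random subset of rows
rm(df1_full)                              # release the large copy
invisible(gc())                           # optional: ask R to free the memory now
```

The same few lines would then be repeated (or wrapped in a function) for each
remaining file.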

Now add a PLACE column to each of df1 through dfN and then rbind() them
together (stacking rows, not cbind(), since every data frame has the same
columns), and again throw away anything no longer needed.
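
A small sketch of the labelling and stacking step. The data frames are
stacked row-wise with rbind(), since each one has the same columns and PLACE
tells the rows apart. The three tiny data frames and the "site_1" style
labels are invented for the example.

```r
set.seed(1)                               # arbitrary seed for the simulated values

# Tiny stand-ins for the already-sampled data frames
df1 <- data.frame(year = 2001:2005, cfs = rlnorm(5, meanlog = 5))
df2 <- data.frame(year = 2001:2005, cfs = rlnorm(5, meanlog = 6))
df3 <- data.frame(year = 2001:2005, cfs = rlnorm(5, meanlog = 4))

# Label each data frame with where it came from
df1$PLACE <- "site_1"
df2$PLACE <- "site_2"
df3$PLACE <- "site_3"

# Stack the rows into one long data frame, then drop the pieces
combined <- rbind(df1, df2, df3)
rm(df1, df2, df3)

str(combined)
```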

Finally, you can use factors, as already discussed, including as a way to
use less memory, since a factor is just an integer vector attached to a sort
of dictionary containing one copy of each distinct text value.
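
You can check the effect yourself with object.size(). This is only a sketch
with an invented label vector; on a 64-bit build of R I would expect the
factor to come out noticeably smaller, because it stores 4-byte integer codes
rather than 8-byte pointers to strings.

```r
# Invented labels: 300,000 values but only three distinct strings
place_chr <- rep(c("site_1", "site_2", "site_3"), each = 100000)
place_fct <- factor(place_chr)

typeof(place_fct)                         # storage type of the factor's codes
levels(place_fct)                         # the "dictionary" of distinct labels

# Compare memory use of the two representations
print(object.size(place_chr), units = "Kb")
print(object.size(place_fct), units = "Kb")
```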

Then call ggplot and ...
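
For what it is worth, one possible shape for that call, as a sketch and not
necessarily what Rich will want. It assumes a combined data frame with a
PLACE factor and a cfs column; the data here are simulated, and the log
scale on the y axis is my own choice for skewed values.

```r
library(ggplot2)

set.seed(123)                             # arbitrary seed for the simulated data

# Simulated stand-in for the combined, sampled data
combined <- data.frame(
  PLACE = factor(rep(c("site_1", "site_2", "site_3"), each = 200)),
  cfs   = rlnorm(600, meanlog = 5)
)

# One box per level of PLACE, all in a single panel
p <- ggplot(combined, aes(x = PLACE, y = cfs)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(x = "Place", y = "cfs (log scale)")

print(p)
```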

The results may vary with the sample size chosen, and it is wise to call
set.seed() with a fixed value so that the same sample is drawn each time you
run it.
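
A quick illustration of what set.seed() buys you. The seed value and the
vector are arbitrary choices of mine.

```r
values <- 1:100000

set.seed(2021)                            # any fixed value will do
first_draw <- sample(values, 10)

set.seed(2021)                            # same seed again
second_draw <- sample(values, 10)

identical(first_draw, second_draw)        # same seed, same sample

third_draw <- sample(values, 10)          # seed not reset, so a fresh draw
```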

Your idea of making separate boxplots can also use as much memory, or more,
if you keep everything in memory as you go along.

And, BTW, for people using truly big data, there are approaches that get
them huge amounts of memory either within their own machines, or using web
services.

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Rich Shepard
Sent: Thursday, November 11, 2021 12:56 PM
To: R-help <r-help at r-project.org>
Subject: Re: [R] ggplot2: multiple box plots, different tibbles/dataframes

On Thu, 11 Nov 2021, Bert Gunter wrote:

> You can always create a graphics layout and then plot different
> ggplot objects in the separate regions of the layout. See ?grid.layout
> (since ggplots are grobs) and ?plot.ggplot. This also **may** be
> useful by showing examples using grid.arrange()
>
> https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html
>
> Still, I suspect that Jeff Newmiller may be right about needing to 
> structure your data more appropriately for what you wish to do.

Bert,

For this plot I could create a new data set with only site_nbr, year and cfs
columns; it would be 3,016,005 rows long.

Or, I could create separate boxplots and arrange them in a row. That might
be the easiest.

Thanks,

Rich
