[R] splitting data

eliza botto eliza_botto at hotmail.com
Tue Sep 9 19:08:50 CEST 2014


Dear David, Dennis and John,
Thank you for replying. I'll take care of it in the future.

Eliza

> Date: Mon, 8 Sep 2014 17:36:30 -0700
> Subject: Re: [R] splitting data
> From: djmuser at gmail.com
> To: eliza_botto at hotmail.com
> 
> Hi Eliza:
> 
> Here are a few potential solutions. Given that you have 100 years of
> monthly data, it's likely that a package such as dplyr or data.table
> would be significantly faster than some of the alternatives offered to
> date. I'm assuming the game is to generate the monthly sums for each
> of A, B, C below.
> 
> # Fake data set intended to replicate four years of data
> # To keep it simple, I only use 30 days a month.
> d <- data.frame(expand.grid(year = 1961:1964, month = seq(12), day = seq(30)),
>                 A = rpois(1440, 10), B = rpois(1440, 15), C = rpois(1440, 25))
> 
> 
> # plyr package solution - colwise() applies the same function to each
> # of the variables named within .()
> 
> library(plyr)
> ymsums <- ddply(d, .(year, month), colwise(sum, .(A, B, C)))
> 
> 
> 
> # dplyr package solution using the new piping operator %>% from the
> # magrittr package. (Think of  a %>% b as: take the data in a and
> # then call function b on it. This idea can be strung in sequence:
> # the term on the left of %>% supplies the input data for the
> # function call on the right.)
> 
> library(dplyr)
> # library(magrittr)
> 
> ymsums2 <- d %>% group_by(year, month) %>%
>                  summarise(Atot = sum(A), Btot = sum(B), Ctot = sum(C))
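
The left-to-right reading described in the comment above can be sketched
without any package. The operator %then% below is a hypothetical toy
written only for this illustration. It is not magrittr's %>%, and unlike
%>% it expects a bare function on its right-hand side rather than a call.

```r
# Hypothetical toy pipe: pass the left-hand value to the function on the right.
`%then%` <- function(lhs, f) f(lhs)

x <- c(1, 4, 9)

nested <- sum(sqrt(x))              # read inside-out
piped  <- x %then% sqrt %then% sum  # read left to right

identical(nested, piped)  # TRUE: both compute 1 + 2 + 3
```

Custom infix operators in R associate left to right, which is why the
chain evaluates sqrt first and sum second.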
> 
> 
> 
> # data.table package solution
> 
> library(data.table)
> 
> dt <- data.table(d, key = c("year", "month"))
> ymsums3 <- dt[, list(Atot = sum(A), Btot = sum(B), Ctot = sum(C)),
>                 by = key(dt)]
> 
> head(ymsums)
> head(ymsums2)
> head(ymsums3)
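
For comparison, the same monthly sums can be sketched in base R with
aggregate(), which needs no extra packages. This is not one of the
solutions offered in the thread, only a cross-check. The set.seed() call
and the name ymsums0 are assumptions added for the illustration.

```r
set.seed(1)  # assumption: seed added only so the fake data are reproducible

# Same fake data layout as above: four years, 12 months, 30 days per month
d <- data.frame(expand.grid(year = 1961:1964, month = seq(12), day = seq(30)),
                A = rpois(1440, 10), B = rpois(1440, 15), C = rpois(1440, 25))

# Sum A, B and C within each year/month combination
ymsums0 <- aggregate(cbind(A, B, C) ~ year + month, data = d, FUN = sum)

# aggregate() lets the first grouping variable vary fastest, so reorder
# by year and then month to match the layout of the other results
ymsums0 <- ymsums0[order(ymsums0$year, ymsums0$month), ]
head(ymsums0)
```

The result has one row per year/month pair (48 rows here), and its column
totals equal the column totals of d.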
> 
> dplyr was about 2.5 times faster than data.table and almost 30 times
> faster than plyr for this example. To be honest, though, I don't think
> I used the most efficient code for either dplyr or data.table, so
> the relative timings may be somewhat misleading. OTOH, for this
> 1440-line fake data set, dplyr took 0.1 sec. with the code I used
> and data.table took 0.24 sec. If your data frame covers 100 years,
> it should be approximately 25 times the length of mine, so we'd be
> talking about 2.5 sec. with dplyr and somewhere between 3.5 and 5 sec.
> with data.table, since keyed grouping becomes relatively more
> efficient as the data set grows. That's not bad no matter which one
> you choose.
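
Timings like these depend heavily on the machine and package versions, so
it may be worth measuring on data of the real size. Below is a minimal
sketch using system.time() on a hypothetical 100-year version of the fake
data (the years 1915:2014 and the names big, res and timing are
assumptions). It times the base R aggregate() call only. The same wrapper
could be put around the dplyr or data.table calls above.

```r
# Hypothetical 100-year data set with 30-day months, as in the fake data above
n <- 100 * 12 * 30   # 36000 rows, i.e. 25 times the 1440-row example
big <- data.frame(expand.grid(year = 1915:2014, month = seq(12), day = seq(30)),
                  A = rpois(n, 10), B = rpois(n, 15), C = rpois(n, 25))

# Time a single run of the grouped sum; elapsed time is in seconds
timing <- system.time(
  res <- aggregate(cbind(A, B, C) ~ year + month, data = big, FUN = sum)
)
timing["elapsed"]
```

A single run is a rough guide at best. Repeating the call several times
and comparing the elapsed values gives a steadier picture.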
> 
> BTW, it's possible to do it with reshape2 as follows:
> 
> library(reshape2)
> 
> # stack variables A-C, producing the long form
> dm <- melt(d, id.vars = c("year", "month", "day"))
> 
> # reshape
> drt <- dcast(dm, year + month ~ variable, fun.aggregate = sum,
>                  value.var = "value")
> head(drt)
> 
> This is approximately 4 times faster than the plyr solution and about
> 3 times slower than data.table. This is about as fast as you can get
> it in reshape2.
> 
> HTH,
> Dennis
> 
> PS: I agree with David about the HTML postings. You've been on this
> list long enough to know what is expected. All it takes is a change or
> two in the settings of your mailing client. I use gmail, and one
> change of setting is all it took for me...five years ago, the one and
> only time I was admonished to do so.
> 
> On Mon, Sep 8, 2014 at 12:08 PM, eliza botto <eliza_botto at hotmail.com> wrote:
> > Dear R members,
> >
> > I have a data frame covering 100 years of daily data in the following format
> >
> > year            month       day         A           B           C         D
> >
> > where A, B, C and D are the numbers of items sold each day. I am trying to
> >
> > 1- split the data by month within each year
> >
> > 2- then sum the values up
> >
> > I am pasting just a part of the data here to make it clearer
> >
> > structure(list(year = c(1961, 1961, 1961, 1961, 1961, 1961, 1961,
> > 1961, 1961, 1961, 1961, 1961), month = c(1, 1, 1, 1, 1, 1, 1,
> > 1, 1, 1, 1, 1), day = 1:12, A = 1:12, B = 3:14, C = 6:17, D = 16:27), .Names = c("year",
> > "month", "day", "A", "B", "C", "D"), row.names = c(NA, 12L), class = "data.frame")
> >
> > I initially tried to use the "dcast" command, but to no avail.
> >
> > Your kind help is needed.
> >
> > Thanks in advance
> >
> > Eliza
> >
> >
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
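
As a small worked check, the monthly sums can be computed by hand for the
12-row sample posted in the original question, which also has a column D
that the fake data in the reply leaves out. The sketch below rebuilds the
sample with data.frame() instead of the posted structure() call (the
values are the same) and uses base R aggregate(). The names sales and tot
are assumptions.

```r
# Rebuild the posted sample: January 1961, days 1 to 12
sales <- data.frame(year = 1961, month = 1, day = 1:12,
                    A = 1:12, B = 3:14, C = 6:17, D = 16:27)

# Monthly totals for all four item columns
tot <- aggregate(cbind(A, B, C, D) ~ year + month, data = sales, FUN = sum)
tot
# One row for 1961, month 1: A = 78, B = 102, C = 138, D = 258
```

With the full data frame, the same call returns one row per year/month
combination.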