[R] Selecting ranges of dates from a dataframe
David Winsemius
dwinsemius at comcast.net
Fri Mar 11 14:58:52 CET 2011
On Mar 11, 2011, at 8:41 AM, Benjamin Stier wrote:
> Hi Francisco,
>
> Thanks for your solution. It runs pretty fast compared to my for loop.
> Here is a comparison of system.time():
>
> system.time(splitVals <- by(serv, dates, aggregateDf ))
>    user  system elapsed
>   1.129   0.218   1.348
>
> system.time(... my long for loop...)
>    user  system elapsed
> 276.987   1.544 278.698
>
>
> I also tried David's solution with "aggregate", but I can't get it to
> work because I have to add as.numeric() into the sum(), since the data
> is very big.
This comment doesn't make any sense. Unless you have character vectors
that need coercion because of malformed values (which was NOT part of
the example posed), `sum` should not need any pre-processing or
post-processing with `as.numeric`.
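One way to check whether coercion is really needed is to look at the column classes right after reading the file. What follows is only a sketch on made-up data: the two-column input and the "n/a" entry are assumptions for illustration, not taken from cut.inp.

```r
# Hypothetical tab-delimited input; "n/a" stands in for a malformed value
txt <- "read\twrite\n10\t100\n20\tn/a\n"
bad <- read.delim(text = txt, stringsAsFactors = FALSE)

# A column holding a malformed entry is read as character, not as a number
classes <- sapply(bad, class)
print(classes)

# Rows that fail numeric conversion point at the offending values
bad_rows <- which(is.na(suppressWarnings(as.numeric(bad$write))))
print(bad[bad_rows, ])
```

If every column you want to sum already shows up as integer or numeric, the malformed-value explanation can be ruled out.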
> serv <- read.delim("cut.inp")
> serv$datum <- strptime(serv$datum, "%Y-%m-%d %H:%M:%S")
> dates.serv <- unique(strptime(serv$datum, format="%Y-%m-%d"))
> aggregate(serv[, c("read", "write")], list(format(serv$datum, "%Y-%m-%d")), sum)
     Group.1    read    write
1 2011-01-29 1021439 11726356
2 2011-01-30 1089534  4634910
Perhaps what you really needed was to read the file with colClasses to
define the date-time and numeric fields properly. Try this:
serv <- read.delim("cut.inp", colClasses=c("POSIXct", "integer",
                   "integer", "numeric", "numeric"))
aggregate(serv[, c("read", "write")], list(format(serv$datum, "%Y-%m-%d")), sum)
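For anyone without the original file, the same idea can be tried on a small inline data set. This is a minimal sketch: the three rows, the three-column layout, and the name `day` are invented for illustration, and only the column names datum, read and write come from the thread.

```r
# Hypothetical stand-in for "cut.inp": tab-delimited, one timestamp, two counters
txt <- paste(
  "datum\tread\twrite",
  "2011-01-29 10:05:00\t10\t100",
  "2011-01-29 12:30:00\t20\t200",
  "2011-01-30 08:15:00\t5\t50",
  sep = "\n"
)

# colClasses converts each column while the file is being read
serv <- read.delim(text = txt,
                   colClasses = c("POSIXct", "numeric", "numeric"))

# Group on the date part of the timestamp and sum both counters
day <- format(serv$datum, "%Y-%m-%d")
res <- aggregate(serv[, c("read", "write")], list(day = day), sum)
print(res)
```

Naming the list element (`day = day`) replaces the default Group.1 column name in the result.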
> I will now try to understand how the by()-function works and what it
> does.
> Thanks again for helping me!
If you read the help(tapply) page you are told that both `by` and
`aggregate` are just convenience functions using tapply "under the
hood".
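To make the relationship concrete, here is a small sketch on invented data (the `day`, `read` and `write` values are assumptions, not the poster's file) that computes the same per-day sums with `tapply` directly and with `aggregate`.

```r
# Hypothetical data: one grouping column and two numeric columns
serv <- data.frame(
  day   = c("2011-01-29", "2011-01-29", "2011-01-30"),
  read  = c(10, 20, 5),
  write = c(100, 200, 50),
  stringsAsFactors = FALSE
)

# tapply: one vector, one grouping factor, one function
read_by_day <- tapply(serv$read, serv$day, sum)
print(read_by_day)

# Looping tapply over several columns gives one column of sums per variable
both_by_day <- sapply(serv[c("read", "write")],
                      function(col) tapply(col, serv$day, sum))
print(both_by_day)

# aggregate returns the same sums, packaged as a data.frame
agg <- aggregate(serv[c("read", "write")], list(day = serv$day), sum)
print(agg)
```

The `tapply` result is a named array, while `aggregate` hands back a data.frame with the grouping values as a column, which is often easier to merge or plot.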
>
> Regards,
>
> Benjamin
>
>
> On Thu, Mar 10, 2011 at 04:26:57PM +0000, Francisco Gochez wrote:
>> Benjamin,
>>
>> A more elegant "R-style" solution would be to use one of R's
>> "apply"/aggregation routines, of which there are many. For example,
>> the "by" function can split a data.frame by some factor/categorical
>> variable(s) and then apply a function to each "slice". The result can
>> then be pieced back together. See below for an example in which this
>> factor is simply a parallel vector of pure dates:
>>
>> # extract pure date component of time and date
>> dates <- format(serv$datum, "%Y-%m-%d")
>>
>> # write auxiliary function to aggregate a "slice" of the data.frame
>> # x will be a "slice" of data from a single day
>> aggregateDf <- function(x)
>> {
>>   # return a one-row data.frame
>>   data.frame(datum = format(x$datum[1], "%Y-%m-%d"),
>>              write = sum(x$write),
>>              read = sum(x$read))
>> }
>>
>> # now process each "slice" of the serv data.frame using "by"
>> splitVals <- by(serv, dates, aggregateDf )
>>
>> # bind back into a single data.frame
>> values <- do.call(rbind, splitVals)
>>
>>
>> The difference in execution speed is pretty negligible on my machine,
>> so it's a more concise solution, but I don't know if it is much faster.
>>
>> HTH,
>>
>> Francisco
>
David Winsemius, MD
West Hartford, CT