[R] exercise in frustration: applying a function to subsamples

jim holtman jholtman at gmail.com
Mon Jul 12 22:02:30 CEST 2010


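Try 'drop=TRUE' in the call to split.  By default split creates a group
for every combination of the factor levels, including combinations that
never occur in the data, so fitdist is handed empty subsets.  With
drop=TRUE those empty groups are discarded before the function is called.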
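
A minimal sketch of the difference, using made-up data in place of
moreinfo.  The values are invented, and fit_rate is a hypothetical
stand-in for the fitdist call (it uses the fact that the maximum
likelihood estimate of an exponential rate is 1/mean), so treat this as
an illustration under those assumptions rather than a drop-in fix:

```r
# Toy data standing in for 'moreinfo'; the values are made up.
moreinfo <- data.frame(
  m_id         = factor(c("a", "a", "a", "b", "b", "b")),
  sale_year    = factor(c(2009, 2009, 2009, 2010, 2010, 2010)),
  sale_week    = factor(c(26, 26, 27, 1, 1, 1)),
  elapsed_time = c(1.2, 0.7, 2.5, 0.4, 3.1, 1.8)
)

keys <- list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week)

# Default (drop = FALSE): one group per combination of factor levels,
# including combinations with no rows at all.
all_groups <- split(moreinfo, keys)

# drop = TRUE: only combinations that actually occur in the data.
used_groups <- split(moreinfo, keys, drop = TRUE)

length(all_groups)   # 2 ids * 2 years * 3 weeks = 12 groups, most empty
length(used_groups)  # 3 groups, none empty

# Hypothetical stand-in for fitdist(df$elapsed_time, "exp"):
# the MLE of an exponential rate is 1 / mean(x).
fit_rate <- function(df) 1 / mean(df$elapsed_time)

# The error message asks for length greater than 1, so also skip
# groups that contain a single observation.
big_enough <- Filter(function(df) nrow(df) > 1, used_groups)
rates <- sapply(big_enough, fit_rate)
```

Note that drop=TRUE only removes the empty combinations.  A group with a
single row would still fail the "length greater than 1" check, which is
why the sketch filters on nrow before fitting.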

On Mon, Jul 12, 2010 at 3:10 PM, Ted Byers <r.ted.byers at gmail.com> wrote:
> From the documentation I have found, it seems that one of the functions from
> package plyr, or a combination of functions like split and lapply would
> allow me to have a really short R script to analyze all my data (I have
> reduced it to a couple hundred thousand records with about half a dozen
> fields).
>
> I get the same result from ddply and split/lapply:
>
>> ddply(moreinfo,c("m_id","sale_year","sale_week"),
>> +       function(df) data.frame(res = fitdist(df$elapsed_time,"exp"),est =
>> res$estimate,sd = res$sd))
>> Error in fitdist(df$elapsed_time, "exp") :
>>   data must be a numeric vector of length greater than 1
>>
>
> and
>
>>
>> lapply(split(moreinfo,list(moreinfo$m_id,moreinfo$sale_year,moreinfo$sale_week)),
>> +       function(df) fitdist(df$elapsed_time,"exp"))
>> Error in fitdist(df$elapsed_time, "exp") :
>>   data must be a numeric vector of length greater than 1
>>
>
> Now, in retrospect, unless I misunderstood the properties of a data.frame, I
> suppose a data.frame might not have been entirely appropriate as the m_id
> samples start and end on very different dates, but I would have thought a
> list data structure should have been able to handle that.  It would seem
> that split is making groups that have the same start and end dates (or that
> if, for example, I have sale data for precisely the last year, split would
> insist on both 2009 and 2010 having weeks from 0 through 52 instead of just
> the weeks in each year that actually have data: 26 through 52 for last year
> and 1 through 25 for this year).  I don't see how else the data passed to
> fitdist could have a sample size of 0.
>
> I'd appreciate understanding how to resolve this.  However, it isn't a show
> stopper as it now seems trivial to just break it out into a loop (followed
> by a lapply/split combo using only sale year and sale month).
>
> While I am asking, is there a better way to split such temporally ordered
> data into weekly samples that respect the year in which the sample is
> taken as well as the week in which it is taken?
>
> Thanks
>
> Ted
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


