[R] User-defined functions in dplyr

Tue Nov 3 03:07:51 CET 2015

Nice example of the issue, Bill. Thank you.

Is this a known issue? Are there plans to fix it?

Thanks again,
Axel.

> On Nov 2, 2015, at 8:58 PM, William Dunlap <wdunlap at tibco.com> wrote:
> 
> dplyr::mutate does not collapse factor variables well.  They seem to get their levels from the levels
> computed for the first group and mutate does not check for them having different levels.
> 
> > data.frame(group=rep(c("A","B","C"),each=2), value=rep(c("X","Y","Z"),3:1)) %>% dplyr::group_by(group) %>% dplyr::mutate(fv=factor(value)) 
> Source: local data frame [6 x 3]
> Groups: group [3]
> 
>    group  value     fv
>   (fctr) (fctr) (fctr)
> 1      A      X      X
> 2      A      X      X
> 3      B      X      X
> 4      B      Y     NA
> 5      C      Y      X
> 6      C      Z     NA
> > levels(.Last.value$fv)
> [1] "X"
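
One possible workaround for the level collapse shown above, sketched under the assumption that it is acceptable to build the factor once after the grouped step: have the per-group expression return character labels, then convert to a factor on the ungrouped data. The names d and res are illustrative and not from the thread.

```r
library(dplyr)

# Same toy data as in the example above
d <- data.frame(group = rep(c("A", "B", "C"), each = 2),
                value = rep(c("X", "Y", "Z"), 3:1))

res <- d %>%
  group_by(group) %>%
  # Return plain character labels from the per-group computation,
  # so no per-group factor levels have to be combined
  mutate(fv = as.character(factor(value))) %>%
  ungroup() %>%
  # Build the factor once, over all rows
  mutate(fv = factor(fv))

levels(res$fv)
```

Because the factor is created only once, its levels are taken from every label that occurs in any group rather than from the first group alone.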
> 
> 
> 
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com <http://tibco.com/>
> On Mon, Nov 2, 2015 at 5:38 PM, Axel Urbiz <axel.urbiz at gmail.com> wrote:
> Actually, the results are not the same. It looks like in the code below (see "Attempt using dplyr"), the function create_bins2 is not being applied separately to each group defined by group_by. That is surprising to me, unless I'm misunderstanding dplyr.
> 
> ### Create some data
> 
> set.seed(4)
> df <- data.frame(pred = rnorm(100), models = gl(2, 50, 100, labels = c("model1", "model2")))
> 
> ### This is the code using plyr, which I'd like to change using dplyr
> 
> create_bins <- function(x, nBins) {
>   Breaks <- unique(quantile(x$pred, probs = seq(0, 1, 1/nBins)))
>   dfB <- data.frame(pred = x$pred,
>                     bin = cut(x$pred, breaks = Breaks, include.lowest = TRUE))
>   dfB
> }
> 
> nBins = 10
> res_plyr <- plyr::ddply(df, plyr::.(models), create_bins, nBins)
> head(res_plyr)
> 
> ### Attempt using dplyr
> 
> create_bins2 <- function (pred, nBins) {
>   Breaks <- unique(quantile(pred, probs = seq(0, 1, 1/nBins)))
>   bin <- cut(pred, breaks = Breaks, include.lowest = TRUE)
>   bin
> }
> 
> res_dplyr <- dplyr::mutate(dplyr::group_by(df, models),
>                            bin = create_bins2(pred, nBins))
> 
> 
> identical(res_plyr, as.data.frame(res_dplyr))
> [1] FALSE
> #levels(res_dplyr$bin) == levels(res_plyr$bin)
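
A base R cross-check can show whether each group gets its own breaks, independently of how plyr or dplyr combine factor levels. This is only a sketch: bin_labels is a hypothetical helper, not from the thread, and it returns character labels so that factor levels do not enter the comparison.

```r
set.seed(4)
df <- data.frame(pred = rnorm(100),
                 models = gl(2, 50, 100, labels = c("model1", "model2")))
nBins <- 10

# Hypothetical helper: same binning as create_bins2, but returns the
# bin labels as character strings instead of a factor
bin_labels <- function(pred, nBins) {
  breaks <- unique(quantile(pred, probs = seq(0, 1, 1 / nBins)))
  as.character(cut(pred, breaks = breaks, include.lowest = TRUE))
}

# Apply the helper to each group, then restore the original row order
labels_by_group <- lapply(split(df$pred, df$models), bin_labels, nBins = nBins)
df$bin <- unsplit(labels_by_group, df$models)

# Number of distinct bin labels found within each group
tapply(df$bin, df$models, function(b) length(unique(b)))
```

If the labels in df$bin match as.character(res_plyr$bin) row for row, the binning itself was done per group and any remaining difference lies in how the factor columns were assembled.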
> 
> Thanks,
> Axel.
> 
> 
> 
>> On Oct 30, 2015, at 12:19 PM, William Dunlap <wdunlap at tibco.com> wrote:
>> 
>> dplyr::mutate is probably what you want instead of dplyr::summarize:
>> 
>> create_bins3 <- function (xpred, nBins) 
>> {
>>     Breaks <- unique(quantile(xpred, probs = seq(0, 1, 1/nBins)))
>>     bin <- cut(xpred, breaks = Breaks, include.lowest = TRUE)
>>     bin
>> }
>> dplyr::group_by(df, models) %>% dplyr::mutate(Bin=create_bins3(pred,nBins))
>> #Source: local data frame [100 x 3]
>> #Groups: models [2]
>> #
>> #         pred models               Bin
>> #        (dbl) (fctr)            (fctr)
>> #1   0.2167549 model1     (0.167,0.577]
>> #2  -0.5424926 model1   (-0.869,-0.481]
>> ...
>> 
>> 
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com <http://tibco.com/>
>> On Fri, Oct 30, 2015 at 9:06 AM, William Dunlap <wdunlap at tibco.com> wrote:
>> The error message is not very helpful, and the stack trace is pretty inscrutable as well:
>> > dplyr::group_by(df, models) %>% dplyr::summarize(create_bins)
>> Error: not a vector
>> > traceback()
>> 14: stop(list(message = "not a vector", call = NULL, cppstack = NULL))
>> 13: .Call("dplyr_summarise_impl", PACKAGE = "dplyr", df, dots)
>> 12: summarise_impl(.data, dots)
>> 11: summarise_.tbl_df(.data, .dots = lazyeval::lazy_dots(...))
>> 10: summarise_(.data, .dots = lazyeval::lazy_dots(...))
>> 9: dplyr::summarize(., create_bins)
>> 8: function_list[[k]](value)
>> 7: withVisible(function_list[[k]](value))
>> 6: freduce(value, `_function_list`)
>> 5: `_fseq`(`_lhs`)
>> 4: eval(expr, envir, enclos)
>> 3: eval(quote(`_fseq`(`_lhs`)), env, env)
>> 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
>> 1: dplyr::group_by(df, models) %>% dplyr::summarize(create_bins)
>> 
>> 
>> It does not mean that your function, create_bins, does not return a vector --
>> passing the sum function itself gives the same error. help(summarize,package="dplyr")
>> says:
>>      ...: Name-value pairs of summary functions like ‘min()’, ‘mean()’,
>>           ‘max()’ etc.
>> It apparently means calls to summary functions, not summary functions
>> themselves.  The examples in the help file show the proper usage.
>> 
>> Use a call to your function and you will get a more informative error:
>>    > dplyr::group_by(df, models) %>% dplyr::summarize(create_bins(pred,nBins))
>>    Error: $ operator is invalid for atomic vectors
>> The traceback again is not very useful, because the call information was
>> stripped by dplyr (by the call=NULL in the call to stop()):  
>>   > traceback()
>>   14: stop(list(message = "$ operator is invalid for atomic vectors", 
>>           call = NULL, cppstack = NULL))
>>   13: .Call("dplyr_summarise_impl", PACKAGE = "dplyr", df, dots)
>> However it is clear that the fault is in your function, which is expecting a
>> data.frame x with a column called pred but gets pred itself.  Change x to xpred
>> in the argument list and x$pred to xpred in the body of the function.
>> 
>> You will run into more problems because your function returns a vector
>> the length of its input but summarize expects a summary function - one
>> that returns a scalar for any size vector input.
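
The difference in shape can be seen without dplyr. A minimal sketch, assuming create_bins has been rewritten to take the vector xpred as suggested above (the standalone pred vector here is illustrative):

```r
set.seed(4)
pred <- rnorm(50)
nBins <- 10

# create_bins rewritten to take the vector itself rather than a data frame
create_bins <- function(xpred, nBins) {
  Breaks <- unique(quantile(xpred, probs = seq(0, 1, 1 / nBins)))
  cut(xpred, breaks = Breaks, include.lowest = TRUE)
}

length(mean(pred))                # a summary function: one value for the whole input
length(create_bins(pred, nBins))  # a binning function: one value per input element
```

A function of the second kind fits a step that keeps every row, while a function of the first kind fits a step that collapses each group to a single row.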
>> 
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com <http://tibco.com/>
>> 
>> On Fri, Oct 30, 2015 at 4:04 AM, Axel Urbiz <axel.urbiz at gmail.com> wrote:
>> So in this case, "create_bins" returns a vector and I still get the same
>> error.
>> 
>> 
>> create_bins <- function(x, nBins)
>> {
>>   Breaks <- unique(quantile(x$pred, probs = seq(0, 1, 1/nBins)))
>>   bin <- cut(x$pred, breaks = Breaks, include.lowest = TRUE)
>>   bin
>> }
>> 
>> 
>> ### Using dplyr (fails)
>> nBins = 10
>> by_group <- dplyr::group_by(df, models)
>> res_dplyr <- dplyr::summarize(by_group, create_bins, nBins)
>> Error: not a vector
>> 
>> On Thu, Oct 29, 2015 at 8:28 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>> 
>> > You are jumping the gun (your other email did get through) and you are
>> > posting using HTML (which does not come through on the list). Some time
>> > (re)reading the Posting Guide mentioned at the bottom of all emails on this
>> > list seems to be in order.
>> >
>> > The error is actually quite clear. You should return a vector from your
>> > function, not a data frame.
>> > ---------------------------------------------------------------------------
>> > Jeff Newmiller                        The     .....       .....  Go Live...
>> > DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>> >                                       Live:   OO#.. Dead: OO#..  Playing
>> > Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> > /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> > ---------------------------------------------------------------------------
>> > Sent from my phone. Please excuse my brevity.
>> >
>> > On October 29, 2015 4:55:19 PM MST, Axel Urbiz <axel.urbiz at gmail.com> wrote:
>> > >Hello,
>> > >
>> > >Sorry, resending this question as the prior was not sent properly.
>> > >
>> > >I’m using the plyr package below to add a variable named "bin" to my
>> > >original data frame "df" with the user-defined function "create_bins".
>> > >I'd like to get similar results using dplyr instead, but I am failing to do so.
>> > >
>> > >set.seed(4)
>> > >df <- data.frame(pred = rnorm(100), models = gl(2, 50, 100, labels = c("model1", "model2")))
>> > >
>> > >
>> > >### Using plyr (works fine)
>> > >create_bins <- function(x, nBins)
>> > >{
>> > >  Breaks <- unique(quantile(x$pred, probs = seq(0, 1, 1/nBins)))
>> > >  dfB <- data.frame(pred = x$pred,
>> > >                    bin = cut(x$pred, breaks = Breaks, include.lowest = TRUE))
>> > >  dfB
>> > >}
>> > >
>> > >nBins = 10
>> > >res_plyr <- plyr::ddply(df, plyr::.(models), create_bins, nBins)
>> > >head(res_plyr)
>> > >
>> > >### Using dplyr (fails)
>> > >
>> > >by_group <- dplyr::group_by(df, models)
>> > >res_dplyr <- dplyr::summarize(by_group, create_bins, nBins)
>> > >Error: not a vector
>> > >
>> > >
>> > >Any help would be much appreciated.
>> > >
>> > >Best,
>> > >Axel.
>> > >
>> > >       [[alternative HTML version deleted]]
>> > >
>> > >______________________________________________
>> > >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > >https://stat.ethz.ch/mailman/listinfo/r-help
>> > >PLEASE do read the posting guide
>> > >http://www.R-project.org/posting-guide.html
>> > >and provide commented, minimal, self-contained, reproducible code.
>> >
>> >
>> 
>> 
> 
> 
