[R] dplyr: producing a good old data frame

John Posner john.posner at MJBIOSTAT.COM
Mon Feb 23 20:54:24 CET 2015


I'm using the dplyr package to perform one-row-at-a-time processing of a data frame:

> rnd6 = function() sample(1:300, 6)
> frm = data.frame(AA=rnd6(), BB=rnd6(), CC=rnd6())

> frm
   AA  BB  CC
1 123  50  45
2  12  30 231
3 127 147 100
4 133  32 129
5  66 235  71
6  38 264 261

The interface is nice and straightforward:

> library(dplyr)
> dplyr_result = frm %>% rowwise() %>% do(MM=max(as.numeric(.)))

I've gotten used to the fact that dplyr_result is not a good old "vanilla" data frame. The as.data.frame() function *seems* to do the trick:

> dplyr_result_2 = as.data.frame(dplyr_result)
> dplyr_result_2
   MM
1 123
2 231
3 147
4 133
5 235
6 264
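Inspecting the structure shows what's really going on (a sketch; MM comes back as a list-column rather than an atomic numeric vector, which as.data.frame() preserves):

    > class(dplyr_result_2$MM)
    [1] "list"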

... but there's trouble ahead:

> mean(dplyr_result_2$MM)
[1] NA
Warning message:
In mean.default(dplyr_result_2$MM) :
  argument is not numeric or logical: returning NA

I need to enlist unlist() to get me to my destination:

> mean(unlist(dplyr_result_2$MM))
[1] 188.8333

[NOTE: dplyr's as_data_frame() function does a better job than as.data.frame() of indicating that I was headed for trouble. ]
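For what it's worth, one way to sidestep the list-column entirely is to compute the row max inside mutate() instead of do() -- mutate() returns an atomic column (a sketch, assuming rowwise() + mutate() behaves this way in the dplyr version at hand):

    > dplyr_result_3 = frm %>% rowwise() %>% mutate(MM = max(AA, BB, CC))
    > mean(dplyr_result_3$MM)
    [1] 188.8333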

By contrast, the plyr package's adply() function *does* produce a vanilla data frame:

> library(plyr)
> plyr_result = adply(frm, .margins=1, function(onerowfrm) max(as.numeric(onerowfrm[1,])))
> mean(plyr_result$V1)
[1] 188.8333
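For comparison, base R's apply() over rows also yields a plain numeric vector directly, with no coercion step at all (works here because every column of frm is numeric, so the implicit matrix conversion is harmless):

    > base_result = apply(frm, 1, max)
    > mean(base_result)
    [1] 188.8333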

Is there a good reason for dplyr to require the extra processing? My (naïve?) recommendation would be for as_data_frame() to produce a vanilla data frame.

Tx,
John
