[R] dplyr: producing a good old data frame
John Posner
john.posner at MJBIOSTAT.COM
Mon Feb 23 20:54:24 CET 2015
I'm using the dplyr package to perform one-row-at-a-time processing of a data frame:
> rnd6 = function() sample(1:300, 6)
> frm = data.frame(AA=rnd6(), BB=rnd6(), CC=rnd6())
> frm
   AA  BB  CC
1 123  50  45
2  12  30 231
3 127 147 100
4 133  32 129
5  66 235  71
6  38 264 261
The interface is nice and straightforward:
> library(dplyr)
> dplyr_result = frm %>% rowwise() %>% do(MM=max(as.numeric(.)))
I've gotten used to the fact that dplyr_result is not a good old "vanilla" data frame. The as.data.frame() function *seems* to do the trick:
> dplyr_result_2 = as.data.frame(dplyr_result)
> dplyr_result_2
   MM
1 123
2 231
3 147
4 133
5 235
6 264
... but there's trouble ahead:
> mean(dplyr_result_2$MM)
[1] NA
Warning message:
In mean.default(dplyr_result_2$MM) :
argument is not numeric or logical: returning NA
I need to enlist unlist() to get me to my destination:
> mean(unlist(dplyr_result_2$MM))
[1] 188.8333
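A minimal sketch of why unlist() is needed here: when do() is called with a named argument, the resulting column is a list with one element per row rather than a numeric vector. The snippet below rebuilds such a column by hand in base R (the name res and the use of I(list(...)) are illustrative assumptions, not what dplyr does internally), using the row maxima printed above:

```r
# Hypothetical reconstruction (base R only): a data frame whose MM column
# is a list with one element per row, using the row maxima shown above.
res <- data.frame(MM = I(list(123, 231, 147, 133, 235, 264)))

is.list(res$MM)     # TRUE: each cell is a separate list element
is.numeric(res$MM)  # FALSE: this is why mean() returns NA with a warning

# Flatten the list into an ordinary numeric vector and store it back.
res$MM <- unlist(res$MM)
is.numeric(res$MM)  # TRUE
mean(res$MM)
```

Running str() on the data frame before the unlist() step is another quick way to spot a list column that prints like a numeric one.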
[NOTE: dplyr's as_data_frame() function does a better job than as.data.frame() of indicating that I was headed for trouble. ]
By contrast, the plyr package's adply() function *does* produce a vanilla data frame:
> library(plyr)
> plyr_result = adply(frm, .margins=1, function(onerowfrm) max(as.numeric(onerowfrm[1,])))
> mean(plyr_result$V1)
[1] 188.8333
Is there a good reason for dplyr to require the extra processing? My (naïve?) recommendation would be to have as_data_frame() produce a vanilla data frame.
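For comparison, here is a sketch of one possible way to stay within dplyr and still end up with a numeric column: name the columns explicitly in mutate() instead of going through do(). This is only an illustration, not necessarily the recommended idiom. It assumes a dplyr version in which ungroup() accepts a rowwise data frame, and the name res2 is made up for the example. mutate() is qualified with dplyr:: because plyr, loaded afterwards, also exports a mutate().

```r
library(dplyr)

# The data frame printed earlier in this message, typed in by hand.
frm <- data.frame(AA = c(123, 12, 127, 133, 66, 38),
                  BB = c(50, 30, 147, 32, 235, 264),
                  CC = c(45, 231, 100, 129, 71, 261))

# rowwise() + mutate() evaluates max() once per row; because each result
# is a single number, MM is stored as a plain numeric column.
res2 <- frm %>%
  rowwise() %>%
  dplyr::mutate(MM = max(AA, BB, CC)) %>%
  ungroup() %>%
  as.data.frame()

is.numeric(res2$MM)  # no unlist() needed
mean(res2$MM)
```

Unlike the do() version, this requires the column names to be spelled out, so it is less general than max(as.numeric(.)).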
Tx,
John