[R] [Rd] split() is slow on data.frame (PR#14123)
Peng Yu
pengyu.ut at gmail.com
Thu Dec 10 04:38:33 CET 2009
Sorry. I sent this to r-help by mistake. Could somebody help delete it
from the archive?
On Wed, Dec 9, 2009 at 9:29 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
> I made a version for matrices, because it is more efficient to split
> each column of a matrix directly than to convert the matrix to a
> data.frame and then call split() on the data.frame. Note that the
> versions for a matrix and for a data.frame differ slightly. Could
> somebody add this to R as well?
>
> split.matrix<-function(x,f) {
>   #print('processing matrix')
>   v=lapply(
>     1:dim(x)[[2]]
>     , function(i) {
>       base:::split.default(x[,i],f)#the difference is here
>     }
>   )
>
>   w=lapply(
>     seq(along=v[[1]])
>     , function(i) {
>       result=do.call(
>         cbind
>         , lapply(v,
>           function(vj) {
>             vj[[i]]
>           }
>         )
>       )
>       colnames(result)=colnames(x)
>       return(result)
>     }
>   )
>   names(w)=names(v[[1]])
>   return(w)
> }
>
>
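For comparison, here is a minimal sketch of a different way to split a matrix by rows: split the row indices once and subset the matrix per group, rather than splitting column by column. This is not the split.matrix shown above, and `split_matrix_rows` is a hypothetical name, not a base R function. It is not benchmarked here, so no speed claim is implied.

```r
# Hypothetical helper (not part of base R): split the rows of a matrix by f
# by splitting the row indices once and subsetting the matrix per group.
split_matrix_rows <- function(x, f) {
  stopifnot(is.matrix(x), length(f) == nrow(x))
  idx <- split(seq_len(nrow(x)), f)
  # drop = FALSE keeps each piece a matrix even when a group has one row
  lapply(idx, function(ind) x[ind, , drop = FALSE])
}

# Small illustrative input (sizes chosen only for the example)
set.seed(0)
x <- matrix(rnorm(20), ncol = 4, dimnames = list(NULL, paste0("c", 1:4)))
f <- c("a", "b", "a", "b", "a")
parts <- split_matrix_rows(x, f)
```

Each element of `parts` is a matrix holding the rows of one group, with the column names of `x` carried over by the subsetting itself.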
> On Wed, Dec 9, 2009 at 5:44 PM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
>> On Wed, 9 Dec 2009, William Dunlap wrote:
>>
>>> Here are some differences between the current and proposed
>>> split.data.frame.
>>
>> Adding 'drop=FALSE' fixes this case. See the inline correction below.
>>
>> Chuck
>>
>>>
>>> > d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),
>>>     Named=c(one=1,two=2,three=3,four=4,five=5),
>>>     row.names=as.character(1001:1005))
>>> > group<-c("A","B","A","A","B")
>>> > split.data.frame(d,group)
>>> $A
>>>      Matrix.1 Matrix.2 Named
>>> 1001        1        6     1
>>> 1003        3        8     3
>>> 1004        4        9     4
>>>
>>> $B
>>>      Matrix.1 Matrix.2 Named
>>> 1002        2        7     2
>>> 1005        5       10     5
>>>
>>> > mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
>>> [1] "processing data.frame"
>>> $A
>>>      Matrix Named
>>> [1,]      1     1
>>> [2,]      3     3
>>> [3,]      4     4
>>>
>>> $B
>>>      Matrix Named
>>> [1,]      2     2
>>> [2,]      5     5
>>>
>>>
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>>
>>>> -----Original Message-----
>>>> From: r-devel-bounces at r-project.org
>>>> [mailto:r-devel-bounces at r-project.org] On Behalf Of
>>>> pengyu.ut at gmail.com
>>>> Sent: Wednesday, December 09, 2009 2:10 PM
>>>> To: r-devel at stat.math.ethz.ch
>>>> Cc: R-bugs at r-project.org
>>>> Subject: [Rd] split() is slow on data.frame (PR#14123)
>>>>
>>>> Please see the following code for a runtime comparison between
>>>> split() and mysplit.data.frame() (they do the same thing
>>>> semantically). mysplit.data.frame() is a fix for split() in terms of
>>>> performance. Could somebody include this fix (with possible checks
>>>> for corner cases) in a future version of R and let me know once the
>>>> fix has been included?
>>>>
>>>> m=300000
>>>> n=6
>>>> k=30000
>>>>
>>>> set.seed(0)
>>>> x=replicate(n,rnorm(m))
>>>> f=sample(1:k, size=m, replace=TRUE)
>>>>
>>>> mysplit.data.frame<-function(x,f) {
>>>>   print('processing data.frame')
>>>>   v=lapply(
>>>>     1:dim(x)[[2]]
>>>>     , function(i) {
>>>>       split(x[,i],f)
>>
>> Change to:
>>
>> split(x[,i,drop=FALSE],f)
>>
>>
>>>>     }
>>>>   )
>>>>
>>>>   w=lapply(
>>>>     seq(along=v[[1]])
>>>>     , function(i) {
>>>>       result=do.call(
>>>>         cbind
>>>>         , lapply(v,
>>>>           function(vj) {
>>>>             vj[[i]]
>>>>           }
>>>>         )
>>>>       )
>>>>       colnames(result)=colnames(x)
>>>>       return(result)
>>>>     }
>>>>   )
>>>>   names(w)=names(v[[1]])
>>>>   return(w)
>>>> }
>>>>
>>>> system.time(split(as.data.frame(x),f))
>>>> system.time(mysplit.data.frame(as.data.frame(x),f))
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>
>>
>> Charles C. Berry                            (858) 534-2098
>>                                             Dept of Family/Preventive Medicine
>> E mailto:cberry at tajo.ucsd.edu             UC San Diego
>> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>>
>>
>>
>
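The inline correction quoted above replaces `x[,i]` with `x[,i,drop=FALSE]`. Here is a minimal sketch of what that changes, using the example data from William Dunlap's message: without `drop=FALSE` a single column comes back without the data.frame wrapper, while with it the result stays a one-column data.frame, so the row names survive the split. The variable names `as_vector`, `as_frame`, and `pieces` are illustrative only.

```r
# Example data from the quoted message: a matrix column plus a named column.
d <- data.frame(Matrix = I(matrix(1:10, ncol = 2)),
                Named = c(one = 1, two = 2, three = 3, four = 4, five = 5),
                row.names = as.character(1001:1005))
group <- c("A", "B", "A", "A", "B")

# Default extraction drops the data.frame wrapper and with it the row names.
as_vector <- d[, "Named"]

# drop = FALSE keeps a one-column data.frame together with its row names.
as_frame <- d[, "Named", drop = FALSE]

# Splitting the one-column data.frame dispatches to split.data.frame,
# so each piece keeps the row names of the rows it contains.
pieces <- split(as_frame, group)
print(rownames(pieces$A))
```

The same reasoning applies to the `Matrix` column: with `drop = FALSE` it stays a two-column matrix inside a data.frame instead of being handed to split() as a bare object.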