[Rd] Why is there no c.factor?

Matthew Dowle mdowle at mdowle.plus.com
Fri Feb 5 20:17:27 CET 2010


> concat() doesn't get a lot of use
How do you know?  Maybe it's used a lot but the users had no need to tell you 
what they were using. The exact opposite might in fact be the case, i.e. 
because concat is so good in S-PLUS, you just never hear of problems with it 
from the users. That might be a very good sign.

> perhaps that model would work well for a concatenation function in R
I'd be happy to test it. I'm a bit concerned about performance though, given 
what you said about repeated recursive calls and dispatch. Could you run 
the following test in S-PLUS please and post back the timing?  If this small 
100MB example is fine, then we could proceed to a 64-bit 10GB test. This is 
quite nippy at the moment in R (1.1sec). I'd be happy with a better way as 
long as speed isn't compromised.

set.seed(1)
# union set of 676 levels
L = as.vector(outer(LETTERS,LETTERS,paste,sep=""))
# create 100 factors
F = lapply(1:100, function(i) {
   # each factor 1MB large (262144 integers), plus small amount for the levels
   f = sample(1:100, 1*1024^2 / 4, replace=TRUE)
   # pick 100 levels from the union set
   levels(f) = sample(L,100)
   class(f) = "factor"
   f
})

> head(F[[1]])
[1] RT DM CO JV BG KU
100 Levels: YC FO PN IL CB CY HQ ...
> head(F[[2]])
[1] RK PD FE SG SJ CQ
100 Levels: JV FV DX NL XB ND CY QQ ...
>

With c.factor from data.table, as posted, placed in .GlobalEnv

> system.time(G <- do.call("c",F))
   user  system elapsed
   0.81    0.32    1.12
> head(G)
[1] RT DM CO JV BG KU        # looks right, comparing to F[[1]] above
676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> G[262145:262150]
[1] RK PD FE SG SJ CQ          # looks right, comparing to F[[2]] above
676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> identical(as.character(G),as.character(unlist(F)))
[1] TRUE

So I guess this would be compared to the following in S-PLUS?

system.time(G <- do.call("concat", F))

or maybe it's just the following:

system.time(G <- concat(F))

I don't have S-PLUS so I can't test that myself.
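
For anyone without the earlier post to hand, the idea behind the c.factor 
timed above is roughly the following. This is only a sketch and not the 
data.table code as posted: the name c_factor_sketch is made up, and the 
sorted union of levels is an assumption based on the AA ... ZZ ordering in 
the output above. The point is that the integer codes are remapped once per 
factor via match(), with no pairwise recursion.

```r
# Hypothetical sketch, not the c.factor posted earlier.
c_factor_sketch <- function(...) {
  args <- list(...)
  # Sorted union of every argument's levels (assumed ordering).
  newlevels <- sort(unique(unlist(lapply(args, levels), use.names = FALSE)))
  # For each factor, map old level positions to positions in the union,
  # then index that map by the factor's integer codes.
  codes <- unlist(lapply(args, function(f) {
    match(levels(f), newlevels)[as.integer(f)]
  }), use.names = FALSE)
  structure(codes, levels = newlevels, class = "factor")
}

a <- factor(c("x", "y"))
b <- factor(c("z", "x"))
g <- c_factor_sketch(a, b)
```

With the list F from the test above, it would be called as 
do.call(c_factor_sketch, F).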


"William Dunlap" <wdunlap at tibco.com> wrote in message 
news:77EB52C6DD32BA4D87471DCD70C8D7000275B4CA at NA-PA-VBE03.na.tibco.com...
> -----Original Message-----
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Peter Dalgaard
> Sent: Friday, February 05, 2010 7:41 AM
> To: Hadley Wickham
> Cc: John Fox; r-devel at r-project.org; Thomas Lumley
> Subject: Re: [Rd] Why is there no c.factor?
>
> Hadley Wickham wrote:
> > On Thu, Feb 4, 2010 at 12:03 PM, Hadley Wickham <hadley at rice.edu> wrote:
> >>> I'd propose the following: If the sets of levels of all arguments are the
> >>> same, then c.factor() would return a factor with the common set of levels;
> >>> if the sets of levels differ, then, as Hadley suggests, the level-set of the
> >>> result would be the union of sets of levels of the arguments, but a warning
> >>> would be issued.
> >> I like this compromise (as long as there was an argument to suppress
> >> the warning)
> >
> > If I provided code to do this, along with the warnings for ordered
> > factors and using the optimisation suggested by Matthew, is there any
> > member of R core who would be interested in sponsoring it?
> >
> > Hadley
> >
>
> Messing with c() is a bit unattractive (I'm not too happy with the other
> c methods either; normally c() strips attributes and reduces to the base
> class, and those obviously do not), but a more general concat() function
> has been suggested a number of times. With a suitable range of methods,
> this could also be used to reimplement rbind.data.frame (which,
> incidentally, already contains a method for concatenating factors, with
> several ugly warts!)

Yes, c() should have been put on the deprecated list a couple
of decades ago, since people expect it to do too many
incompatible things.  And factor should have been a virtual
class, with subclasses "FixedLevels" (e.g., Sex) or "AdHocLevels"
(e.g., FamilyName), so c() and [()<- could do the appropriate
thing in either case.

Back to reality, S+ has a concat(...) function, whose comments say:
# This function works like c() except that names of arguments are
# ignored.  That is, it concatenates its arguments into a single
# S vector object, without considering the names of the arguments,
# in the order that the arguments are given.
#
# To make this function work for new classes, it is only necessary
# to make methods for the concat.two function, which concatenates
# two vectors; recursion will take care of the rest.
concat() is not generic but it repeatedly calls concat.two(x,y), an
SV4-generic that dispatches on the classes of x and y.  Thus you
can easily predict the class of concat(x,y,z), although it may not
be the same as the class of concat(z,y,x), given suitably bizarre
methods for concat.two().
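
To make that dispatch model concrete, here is a rough R sketch of the same 
shape. It is not the S+ source: the names concat_two and concat_sketch are 
made up for illustration, an S4 generic stands in for the SV4-generic, and 
the factor method (union of levels, in order of first appearance) is an 
assumption.

```r
library(methods)

# Hypothetical two-argument generic, dispatching on the classes of x and y.
setGeneric("concat_two", function(x, y) standardGeneric("concat_two"))

# Fallback: plain c() for anything without a more specific method.
setMethod("concat_two", signature("ANY", "ANY"), function(x, y) c(x, y))

# Factors: keep the union of both level sets.
setMethod("concat_two", signature("factor", "factor"), function(x, y) {
  lev <- union(levels(x), levels(y))
  factor(c(as.character(x), as.character(y)), levels = lev)
})

# Non-generic driver: ignore argument names and fold left to right.
concat_sketch <- function(...) Reduce(concat_two, unname(list(...)))

g <- concat_sketch(factor("a"), factor("b"), factor("a"))
n <- concat_sketch(1, 2, 3)
```

Because the fold runs left to right, concat_sketch(x, y, z) is 
concat_two(concat_two(x, y), z), which is why the class of the result can 
depend on argument order. Each step also rebuilds the accumulated vector, so 
the cost for long argument lists is worth measuring.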

concat() doesn't get a lot of use but I think the idea is sound.
Perhaps that model would work well for a concatenation function in R.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

>
> -- 
>    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
>


