[Rd] Why is there no c.factor?

Fri Feb 5 20:43:38 CET 2010

> From: r-devel-bounces at r-project.org 
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Matthew Dowle
> Sent: Friday, February 05, 2010 11:17 AM
> To: r-devel at stat.math.ethz.ch
> Subject: Re: [Rd] Why is there no c.factor?
> 
> 
> > concat() doesn't get a lot of use
> How do you know?  Maybe it's used a lot but the users had no need to
> tell you what they were using. The exact opposite might in fact be the
> case, i.e. because concat is so good in splus, you just never hear of
> problems with it from the users. That might be a very good sign.

We don't use concat in many of our functions.
It tends to be used only where c fails.  It
is slower than c(), in part because it is an SV4
generic while c is a .Internal (the fastest S+
interface to C code).  concat() is also written
entirely in S code, with calls to heavyweights like
sapply.  Writing it in C would speed it up a lot.

  > sys.time(for(i in 1:10000)c(1,2))
  [1] 0.27 0.27
  > sys.time(for(i in 1:10000)concat(1,2))
  [1] 20.29 20.29
  > sys.time(for(i in 1:10000)concat.two(1,2))
  [1] 0.52 0.52

The last just calls the default method of concat.two,
which is a call to c().
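
For readers who want to try the same design in R, here is a minimal
sketch. It is not the S+ source: the names concat_two and concat, and the
factor method, are assumptions made up for illustration. A non-generic
concat() folds a two-argument S4 generic over its arguments, and the
default method of that generic is just c().

```r
library(methods)

# Hypothetical two-argument generic; dispatches on the classes of x and y.
setGeneric("concat_two", function(x, y) standardGeneric("concat_two"))

# Default method: fall back to c().
setMethod("concat_two", signature("ANY", "ANY"), function(x, y) c(x, y))

# Illustrative factor method (an assumption, not taken from S+):
# the result's levels are the union of the two level sets.
setMethod("concat_two", signature("factor", "factor"), function(x, y) {
  lev <- union(levels(x), levels(y))
  factor(c(as.character(x), as.character(y)), levels = lev)
})

# concat() is not generic. It ignores argument names and repeatedly
# calls concat_two() on the running result and the next argument.
concat <- function(...) {
  args <- unname(list(...))
  if (length(args) == 0L) return(NULL)
  Reduce(concat_two, args)
}
```

With this layout a new class only needs concat_two methods. The price is
one S4 dispatch per argument, which is the overhead the timings above
point at.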

> 
> > perhaps that model would work well for a concatenation function in R
> I'd be happy to test it. I'm a bit concerned about performance though,
> given what you said about repeated recursive calls, and dispatch.
> Could you run the following test in s-plus please and post back the
> timing? If this small 100MB example was fine, then we could proceed to
> a 64bit 10GB test. This is quite nippy at the moment in R (1.1sec).
> I'd be happy with a better way as long as speed wasn't compromised.
> 
> set.seed(1)
> # union set of 676 levels
> L = as.vector(outer(LETTERS,LETTERS,paste,sep=""))
> # create 100 factors
> F = lapply(1:100, function(i) {
>    # each factor 1MB large (262144 integers), plus small amount for the levels
>    f = sample(1:100, 1*1024^2 / 4, replace=TRUE)
>    # pick 100 levels from the union set
>    levels(f) = sample(L,100)
>    class(f) = "factor"
>    f
> })
> 
> > head(F[[1]])
> [1] RT DM CO JV BG KU
> 100 Levels: YC FO PN IL CB CY HQ ...
> > head(F[[2]])
> [1] RK PD FE SG SJ CQ
> 100 Levels: JV FV DX NL XB ND CY QQ ...
> >
> 
> With c.factor from data.table, as posted, placed in .GlobalEnv
> 
> > system.time(G <- do.call("c",F))
>    user  system elapsed
>    0.81    0.32    1.12
> > head(G)
> [1] RT DM CO JV BG KU        # looks right, comparing to F[[1]] above
> 676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU
> AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> > G[262145:262150]
> [1] RK PD FE SG SJ CQ        # looks right, comparing to F[[2]] above
> 676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU
> AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> > identical(as.character(G),as.character(unlist(F)))
> [1] TRUE
> 
> So I guess this would be compared to the following in splus?
> 
> system.time(G <- do.call("concat", F))
> 
> or maybe it's just the following:
> 
> system.time(G <- concat(F))
> 
> I don't have splus so I can't test that myself.
> 
> 
> "William Dunlap" <wdunlap at tibco.com> wrote in message
> news:77EB52C6DD32BA4D87471DCD70C8D7000275B4CA at NA-PA-VBE03.na.tibco.com...
> > -----Original Message-----
> > From: r-devel-bounces at r-project.org
> > [mailto:r-devel-bounces at r-project.org] On Behalf Of Peter Dalgaard
> > Sent: Friday, February 05, 2010 7:41 AM
> > To: Hadley Wickham
> > Cc: John Fox; r-devel at r-project.org; Thomas Lumley
> > Subject: Re: [Rd] Why is there no c.factor?
> >
> > Hadley Wickham wrote:
> > > On Thu, Feb 4, 2010 at 12:03 PM, Hadley Wickham <hadley at rice.edu> wrote:
> > >>> I'd propose the following: If the sets of levels of all arguments
> > >>> are the same, then c.factor() would return a factor with the common
> > >>> set of levels; if the sets of levels differ, then, as Hadley
> > >>> suggests, the level-set of the result would be the union of sets of
> > >>> levels of the arguments, but a warning would be issued.
> > >> I like this compromise (as long as there was an argument to suppress
> > >> the warning)
> > >
> > > If I provided code to do this, along with the warnings for ordered
> > > factors and using the optimisation suggested by Matthew, is there any
> > > member of R core who would be interested in sponsoring it?
> > >
> > > Hadley
> > >
> >
> > Messing with c() is a bit unattractive (I'm not too happy with the
> > other c methods either; normally c() strips attributes and reduces to
> > the base class, and those obviously do not), but a more general
> > concat() function has been suggested a number of times. With a
> > suitable range of methods, this could also be used to reimplement
> > rbind.data.frame (which, incidentally, already contains a method for
> > concatenating factors, with several ugly warts!)
> 
> Yes, c() should have been put on the deprecated list a couple
> of decades ago, since people expect it to do too many
> incompatible things.  And factor should have been a virtual
> class, with subclasses "FixedLevels" (e.g., Sex) or "AdHocLevels"
> (e.g., FamilyName), so c() and [()<- could do the appropriate
> thing in either case.
> 
> Back to reality, S+ has a concat(...) function, whose comments say
> # This function works like c() except that names of arguments are
> # ignored.  That is, it concatenates its arguments into a single
> # S vector object, without considering the names of the arguments,
> # in the order that the arguments are given.
> #
> # To make this function work for new classes, it is only necessary
> # to make methods for the concat.two function, which concatenates
> # two vectors; recursion will take care of the rest.
> concat() is not generic but it repeatedly calls concat.two(x,y), an
> SV4-generic that dispatches on the classes of x and y.  Thus you
> can easily predict the class of concat(x,y,z), although it may not
> be the same as the class of concat(z,y,x), given suitably bizarre
> methods for concat.two().
> 
> concat() doesn't get a lot of use but I think the idea is sound.
> Perhaps that model would work well for a concatenation function in R.
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
> 
> >
> > -- 
> >    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
> >   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
> >  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> > ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
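
The compromise quoted in this thread (return the common level set when
all arguments agree, otherwise use the union of the level sets and issue
a warning that can be suppressed) could be sketched in R roughly as
follows. This is an illustration only: c_factor and its warn argument are
hypothetical names, this is not the c.factor from data.table that was
timed above, and it leaves out the ordered-factor warnings mentioned in
the thread. Levels come out in order of first appearance, not sorted.

```r
# Hypothetical helper illustrating the union-of-levels proposal.
c_factor <- function(..., warn = TRUE) {
  fs <- list(...)
  stopifnot(length(fs) > 0L, all(vapply(fs, is.factor, logical(1))))

  levs <- lapply(fs, levels)
  all_levels <- unique(unlist(levs))

  # Warn only when at least one argument's level set differs from the union.
  same <- all(vapply(levs, function(l) setequal(l, all_levels), logical(1)))
  if (!same && warn)
    warning("factors have different level sets; using their union")

  # Remap each factor's integer codes onto the combined level set,
  # so only the (short) level vectors are matched as strings.
  codes <- unlist(lapply(fs, function(f)
    match(levels(f), all_levels)[as.integer(f)]))

  factor(all_levels[codes], levels = all_levels)
}
```

For example, c_factor(factor(c("a","b")), factor(c("b","c")), warn = FALSE)
gives a factor with values a b b c and levels a b c.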