[Rd] Why is there no c.factor?
William Dunlap
wdunlap at tibco.com
Fri Feb 5 20:43:38 CET 2010
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Matthew Dowle
> Sent: Friday, February 05, 2010 11:17 AM
> To: r-devel at stat.math.ethz.ch
> Subject: Re: [Rd] Why is there no c.factor?
>
>
> > concat() doesn't get a lot of use
> How do you know? Maybe it's used a lot but the users had no need to tell you
> what they were using. The exact opposite might in fact be the case, i.e.
> because concat is so good in splus, you just never hear of problems with it
> from the users. That might be a very good sign.
We don't use concat() in many of our functions.
It tends to be used only where c() fails. It
is slower than c(), in part because it is an SV4
generic while c() is a .Internal (the fastest S+
interface to C code). concat() is also written
entirely in S code, with calls to heavyweights like
sapply(). Writing it in C would speed it up a lot.
> sys.time(for(i in 1:10000)c(1,2))
[1] 0.27 0.27
> sys.time(for(i in 1:10000)concat(1,2))
[1] 20.29 20.29
> sys.time(for(i in 1:10000)concat.two(1,2))
[1] 0.52 0.52
The last just calls the default method of concat.two,
which is a call to c().
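
For readers without S+, here is a minimal R sketch of the concat()/concat.two() model described in the quoted message further down. It is not the S+ source: the S4 generic, the factor method (union of levels in order of first appearance), and the use of Reduce() for the pairwise recursion are all assumptions made for illustration.

```r
library(methods)

# concat.two: S4 generic that dispatches on the classes of both arguments.
setGeneric("concat.two", function(x, y) standardGeneric("concat.two"))

# Default method: fall back to c().
setMethod("concat.two", signature("ANY", "ANY"), function(x, y) c(x, y))

# Illustrative factor method (an assumption, not S+ behaviour):
# the result's levels are the union of both level sets.
setMethod("concat.two", signature("factor", "factor"), function(x, y) {
    lev <- union(levels(x), levels(y))
    factor(c(as.character(x), as.character(y)), levels = lev)
})

# concat: not generic itself; it drops argument names and folds
# concat.two over the arguments from left to right.
concat <- function(...) {
    args <- unname(list(...))
    if (length(args) == 0L) return(NULL)
    Reduce(concat.two, args)
}

x <- factor(c("a", "b"))
y <- factor(c("c", "a"))
z <- concat(x, y)          # factor with the union of the two level sets
v <- concat(a = 1, b = 2)  # argument names are ignored
```

Supporting a new pair of classes only needs another setMethod() call on concat.two; concat() itself stays unchanged. Every call still pays for S4 dispatch and an interpreted fold, which is the overhead the timings above are measuring in S+.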
>
> > perhaps that model would work well for a concatenation function in R
> I'd be happy to test it. I'm a bit concerned about performance though given
> what you said about repeated recursive calls, and dispatch. Could you run
> the following test in s-plus please and post back the timing? If this small
> 100MB example was fine, then we could proceed to a 64bit 10GB test. This is
> quite nippy at the moment in R (1.1sec). I'd be happy with a better way as
> long as speed wasn't compromised.
>
> set.seed(1)
> L = as.vector(outer(LETTERS,LETTERS,paste,sep=""))  # union set of 676 levels
> F = lapply(1:100, function(i) {                     # create 100 factors
>     # each factor 1MB large (262144 integers), plus small amount for the levels
>     f = sample(1:100, 1*1024^2 / 4, replace=TRUE)
>     levels(f) = sample(L,100)                       # pick 100 levels from the union set
>     class(f) = "factor"
>     f
> })
>
> > head(F[[1]])
> [1] RT DM CO JV BG KU
> 100 Levels: YC FO PN IL CB CY HQ ...
> > head(F[[2]])
> [1] RK PD FE SG SJ CQ
> 100 Levels: JV FV DX NL XB ND CY QQ ...
> >
>
> With c.factor from data.table, as posted, placed in .GlobalEnv
>
> > system.time(G <- do.call("c",F))
> user system elapsed
> 0.81 0.32 1.12
> > head(G)
> [1] RT DM CO JV BG KU    # looks right, comparing to F[[1]] above
> 676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> > G[262145:262150]
> [1] RK PD FE SG SJ CQ    # looks right, comparing to F[[2]] above
> 676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
> > identical(as.character(G),as.character(unlist(F)))
> [1] TRUE
>
> So I guess this would be compared to the following in splus?
>
> system.time(G <- do.call("concat", F))
>
> or maybe it's just the following:
>
> system.time(G <- concat(F))
>
> I don't have splus so I can't test that myself.
>
>
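
To make the comparison concrete, here is a minimal R sketch of the behaviour proposed in the quoted messages further down: union of the level sets, with a warning when the sets differ and an argument to suppress it. It is not the c.factor from data.table referred to above; the function name combine_factors and the warn argument are hypothetical, and ordered factors are not treated specially.

```r
# Hypothetical helper; the name and the `warn` argument are illustrative only.
combine_factors <- function(..., warn = TRUE) {
    fs <- list(...)
    if (length(fs) == 0L) return(factor())
    stopifnot(all(vapply(fs, is.factor, logical(1))))

    # Union of all level sets, in order of first appearance.
    levs <- lapply(fs, levels)
    all_lev <- unique(unlist(levs, use.names = FALSE))

    # Warn only when the arguments do not all share the same set of levels.
    same <- all(vapply(levs, function(l) setequal(l, all_lev), logical(1)))
    if (!same && warn)
        warning("factors have different level sets; using their union")

    # Remap the integer codes through a small per-factor lookup table,
    # so no full-length character vector is ever built.
    codes <- unlist(lapply(fs, function(f) {
        match(levels(f), all_lev)[as.integer(f)]
    }), use.names = FALSE)

    structure(codes, levels = all_lev, class = "factor")
}

x <- factor(c("a", "b"))
y <- factor(c("c", "a"))
z <- combine_factors(x, y, warn = FALSE)
```

The remapping step touches only the integer codes plus one lookup of at most length(levels(f)) entries per factor, which is why this style of concatenation can stay fast on large inputs.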
> "William Dunlap" <wdunlap at tibco.com> wrote in message
> news:77EB52C6DD32BA4D87471DCD70C8D7000275B4CA at NA-PA-VBE03.na.tibco.com...
> > -----Original Message-----
> > From: r-devel-bounces at r-project.org
> > [mailto:r-devel-bounces at r-project.org] On Behalf Of Peter Dalgaard
> > Sent: Friday, February 05, 2010 7:41 AM
> > To: Hadley Wickham
> > Cc: John Fox; r-devel at r-project.org; Thomas Lumley
> > Subject: Re: [Rd] Why is there no c.factor?
> >
> > Hadley Wickham wrote:
> > > On Thu, Feb 4, 2010 at 12:03 PM, Hadley Wickham <hadley at rice.edu> wrote:
> > >>> I'd propose the following: If the sets of levels of all arguments are the
> > >>> same, then c.factor() would return a factor with the common set of levels;
> > >>> if the sets of levels differ, then, as Hadley suggests, the level-set of the
> > >>> result would be the union of sets of levels of the arguments, but a warning
> > >>> would be issued.
> > >> I like this compromise (as long as there was an argument to suppress
> > >> the warning)
> > >
> > > If I provided code to do this, along with the warnings for ordered
> > > factors and using the optimisation suggested by Matthew, is there any
> > > member of R core who would be interested in sponsoring it?
> > >
> > > Hadley
> > >
> >
> > Messing with c() is a bit unattractive (I'm not too happy with the other
> > c methods either; normally c() strips attributes and reduces to the base
> > class, and those obviously do not), but a more general concat() function
> > has been suggested a number of times. With a suitable range of methods,
> > this could also be used to reimplement rbind.data.frame (which,
> > incidentally, already contains a method for concatenating factors, with
> > several ugly warts!)
>
> Yes, c() should have been put on the deprecated list a couple
> of decades ago, since people expect it to do too many
> incompatible things. And factor should have been a virtual
> class, with subclasses "FixedLevels" (e.g., Sex) or "AdHocLevels"
> (e.g., FamilyName), so c() and [()<- could do the appropriate
> thing in either case.
>
> Back to reality, S+ has a concat(...) function, whose comments say
> # This function works like c() except that names of arguments are
> # ignored. That is, it concatenates its arguments into a single
> # S vector object, without considering the names of the arguments,
> # in the order that the arguments are given.
> #
> # To make this function work for new classes, it is only necessary
> # to make methods for the concat.two function, which concatenates
> # two vectors; recursion will take care of the rest.
> concat() is not generic but it repeatedly calls concat.two(x,y), an
> SV4-generic that dispatches on the classes of x and y. Thus you
> can easily predict the class of concat(x,y,z), although it may not
> be the same as the class of concat(z,y,x), given suitably bizarre
> methods for concat.two().
>
> concat() doesn't get a lot of use but I think the idea is sound.
> Perhaps that model would work well for a concatenation function in R.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> >
> > --
> >    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
> >   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
> >  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> > ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>