[R] Basic question on concatenating factors
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sun Nov 23 08:36:10 CET 2008
On Sun, 23 Nov 2008, jim holtman wrote:
> You are right. union uses 'unique(c(x,y))', and I am not sure whether
> 'unique' preserves the order, but the help page seems to indicate that
> "an element is omitted if it is identical to any previous element";
> this might mean that the order is preserved.
It says:
    'unique' returns a vector, data frame or array like 'x' but with
    duplicate elements/rows removed.
Although it is a generic function, it is hard to see how that can be
interpreted to allow the order to be changed.
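As a small illustration of that reading, the sketch below uses two made-up character vectors (x and y are assumptions, not taken from the thread) to show what unique() and union() return in current R: each value is kept at the position of its first occurrence. It shows the behaviour of the implementation, not a guarantee beyond what the help page states.

```r
# Illustrative vectors (assumed for this example only).
x <- c("b", "a", "c", "a", "b")
y <- c("d", "a", "e")

# unique() drops later duplicates and keeps first occurrences in place.
u <- unique(x)

# union() combines both vectors and removes duplicates the same way,
# so the elements of x come first, in their original order.
un <- union(x, y)

print(u)
print(un)
```

Nothing here is sorted: "b" stays ahead of "a" in both results because it appears first in x.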
The claim that union would be more efficiently implemented via sorting is
made with no evidence: do look up a basic computer science textbook for
this kind of thing, as well as how R actually does it. (Also 'efficient'
was not defined: both speed and memory usage are potentially measures of
efficiency.) But for example
> x <- rnorm(1e7)
> system.time(unique(x))
   user  system elapsed
  2.258   0.261   2.523
> system.time(sort(x))
   user  system elapsed
  4.102   0.112   4.231
> system.time(sort(x, method="quick"))
   user  system elapsed
  1.928   0.109   2.047
will indicate that unique() is comparable in speed to sorting.
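For the function quoted below, the practical upshot can be sketched as follows. This is a minimal sketch, not the poster's code: concat_factors is a hypothetical name, and the example factors are made up. It uses the setdiff() form of the level merge suggested in the quoted reply and compares the result with the union() form.

```r
# Hypothetical helper, adapted from the c.Factor function quoted below.
concat_factors <- function(x, y) {
  # Levels of x first, then any levels of y not already present.
  newlevels <- c(levels(x), setdiff(levels(y), levels(x)))
  # Position of each level of y within the merged level set.
  m <- match(levels(y), newlevels)
  # Combine the integer codes, remapping the codes of y.
  codes <- c(as.integer(x), m[as.integer(y)])
  factor(newlevels[codes], levels = newlevels)
}

# Example factors (assumed for illustration only).
x <- factor(c("b", "a"), levels = c("b", "a"))
y <- factor(c("c", "a", "c"))

z <- concat_factors(x, y)

# With the current implementation of union(), both ways of merging
# the levels give the same result.
same_levels <- identical(union(levels(x), levels(y)), levels(z))

print(z)
print(same_levels)
```

The setdiff() form makes the ordering explicit in the code itself, which may be preferable if one does not want to rely on how union() is implemented.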
>
> On Sat, Nov 22, 2008 at 11:43 PM, Stavros Macrakis
> <macrakis at alum.mit.edu> wrote:
>> On Sat, Nov 22, 2008 at 10:20 AM, jim holtman <jholtman at gmail.com> wrote:
>>> c.Factor <-
>>> function (x, y)
>>> {
>>> newlevels = union(levels(x), levels(y))
>>> m = match(levels(y), newlevels)
>>> ans = c(unclass(x), m[unclass(y)])
>>> levels(ans) = newlevels
>>> class(ans) = "factor"
>>> ans
>>> }
>>
>> This algorithm depends crucially on union preserving the order of the
>> elements of its arguments. As far as I can tell, the spec of union
>> does not require this. If union were to (for example) sort its
>> arguments then merge them (generally a more efficient algorithm), this
>> function would no longer work.
>>
>> Fortunately, the fix is simple. Instead of union, use:
>>
>> newlevels <- c(levels(x), setdiff(levels(y), levels(x)))
>>
>> which is guaranteed to preserve the order of levels(x).
>>
>> -s
>>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595