[R] Basic question on concatenating factors
ehud cohen
ehudco.list at gmail.com
Sun Nov 23 17:13:06 CET 2008
Thank you all for the help and enlightening comments.
EC
On Sun, Nov 23, 2008 at 4:40 PM, jim holtman <jholtman at gmail.com> wrote:
> You do have to read a little further on the help page to see that
> duplicates are removed only if they appear after, not before, an
> identical element in the vector, which means the order is preserved:
> "Note that unlike the Unix command uniq this omits duplicated and not
> just repeated elements/rows. That is, an element is omitted if it is
> identical to any previous element and not just if it is the same as
> the immediately previous one. "
> This does make it clear that the original order is preserved since it
> is succeeding elements that are removed. So from this, I assume that
> the use of
> unique(c(x,y))
>
> does preserve the original ordering of the elements.
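
The first-occurrence behaviour described above can be checked directly. This is only a small sketch: the vectors `x` and `y` are made-up values, not data from the thread. Note that the two vectors are combined with c() before the call, since a second positional argument to unique() is taken as `incomparables` rather than as more data.

```r
# Hypothetical example vectors; the values are made up for illustration.
x <- c("b", "a", "c", "a")
y <- c("d", "b", "e")

# unique() drops an element only if an identical one occurred earlier,
# so the surviving elements stay in first-occurrence order.
u <- unique(c(x, y))
print(u)

# For plain character vectors, union() gives the same result.
print(identical(u, union(x, y)))
```
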
> On Sun, Nov 23, 2008 at 2:36 AM, Prof Brian Ripley
> <ripley at stats.ox.ac.uk> wrote:
> > On Sun, 23 Nov 2008, jim holtman wrote:
> >
> >> You are right. union uses 'unique(c(x,y))' and I am not sure if
> >> 'unique' preserves the order, but the help page seems to indicate that
> >> "an element is omitted if it is identical to any previous element ";
> >> this might mean that the order is preserved.
> >
> > It says
> >
> > 'unique' returns a vector, data frame or array like 'x' but with
> > duplicate elements/rows removed.
> >
> > Although it is a generic function, it is hard to see how that can be
> > interpreted to allow the order to be changed.
> >
> > The claim that union would be more efficiently implemented via sorting is
> > made with no evidence: do look up a basic computer science textbook for this
> > kind of thing, as well as how R actually does it. (Also 'efficient' was not
> > defined: both speed and memory usage are potentially measures of
> > efficiency.) But for example
> >
> >> x <- rnorm(1e7)
> >> system.time(unique(x))
> >
> > user system elapsed
> > 2.258 0.261 2.523
> >> system.time(sort(x))
> >
> > user system elapsed
> > 4.102 0.112 4.231
> >> system.time(sort(x, method="quick"))
> >
> > user system elapsed
> > 1.928 0.109 2.047
> >
> > will indicate that unique() is comparable in speed to sorting.
> >
> >> On Sat, Nov 22, 2008 at 11:43 PM, Stavros Macrakis
> >> <macrakis at alum.mit.edu> wrote:
> >>>
> >>> On Sat, Nov 22, 2008 at 10:20 AM, jim holtman <jholtman at gmail.com> wrote:
> >>>>
> >>>> c.Factor <-
> >>>>   function (x, y)
> >>>> {
> >>>>     newlevels = union(levels(x), levels(y))
> >>>>     m = match(levels(y), newlevels)
> >>>>     ans = c(unclass(x), m[unclass(y)])
> >>>>     levels(ans) = newlevels
> >>>>     class(ans) = "factor"
> >>>>     ans
> >>>> }
> >>>
> >>> This algorithm depends crucially on union preserving the order of the
> >>> elements of its arguments. As far as I can tell, the spec of union
> >>> does not require this. If union were to (for example) sort its
> >>> arguments then merge them (generally a more efficient algorithm), this
> >>> function would no longer work.
> >>>
> >>> Fortunately, the fix is simple. Instead of union, use:
> >>>
> >>> newlevels <- c(levels(x), setdiff(levels(y), levels(x)))
> >>>
> >>> which is guaranteed to preserve the order of levels(x).
> >>>
> >>> -s
> >>>
> >>
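
For reference, the function quoted above with the setdiff-based level construction might look roughly like the sketch below. The name `concat_factors` and the example factors are hypothetical, chosen only for illustration; the logic follows the c.Factor code from the thread, with the levels built so that their order does not depend on how union() is implemented.

```r
# Sketch only: concat_factors is a hypothetical name, not from the thread.
concat_factors <- function(x, y) {
  # Levels of x first, in their existing order, then levels found only in y.
  newlevels <- c(levels(x), setdiff(levels(y), levels(x)))
  # Position of each level of y within the combined level set.
  m <- match(levels(y), newlevels)
  # Codes of x are unchanged; codes of y are remapped through m.
  codes <- c(as.integer(x), m[as.integer(y)])
  structure(codes, levels = newlevels, class = "factor")
}

# Made-up example factors with overlapping, differently ordered levels.
x <- factor(c("low", "high", "low"), levels = c("low", "high"))
y <- factor(c("mid", "high"), levels = c("high", "mid"))
z <- concat_factors(x, y)
print(levels(z))
print(as.character(z))
```

Here the level "high" keeps the code it had in `x`, and "mid" is appended after the existing levels.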
