[R] Basic question on concatenating factors
ehud cohen
ehudco.list at gmail.com
Sun Nov 23 17:13:06 CET 2008
Thank you all for the help and enlightening comments.
EC
On Sun, Nov 23, 2008 at 4:40 PM, jim holtman <jholtman at gmail.com> wrote:
>
> You do have to read a little further on the help page to confirm that
> a duplicate is removed when it appears after, not before, its first
> occurrence in the vector, which is what preserves the order:
>
> "Note that unlike the Unix command uniq this omits duplicated and not
> just repeated elements/rows. That is, an element is omitted if it is
> identical to any previous element and not just if it is the same as
> the immediately previous one."
>
> This does make it clear that the original order is preserved since it
> is succeeding elements that are removed. So from this, I assume that
> the use of
>
> unique(c(x,y))
>
> does preserve the original ordering of the elements.
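
The point about ordering can be checked directly. The following is only a minimal sketch: the vectors x and y are made-up examples, chosen so that sorting would give a different order, and the comparison with union() relies on the statement in this thread that union uses unique(c(x,y)).

```r
# Hypothetical example vectors; sorted order would be a, b, c, d.
x <- c("b", "a", "c")
y <- c("c", "d", "a")

# unique() keeps the first occurrence of each value and drops later ones,
# so the surviving elements stay in their original relative order.
u <- unique(c(x, y))
print(u)

# For these inputs, union() gives the same result.
print(identical(union(x, y), u))
```

Here "c" and "a" from y are dropped because they already occur in x, and "d" is appended after the elements of x.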
>
>
>
> On Sun, Nov 23, 2008 at 2:36 AM, Prof Brian Ripley
> <ripley at stats.ox.ac.uk> wrote:
> > On Sun, 23 Nov 2008, jim holtman wrote:
> >
> >> You are right. union uses 'unique(c(x,y))' and I am not sure if
> >> 'unique' preserves the order, but the help page seems to indicate that
> >> "an element is omitted if it is identical to any previous element";
> >> this might mean that the order is preserved.
> >
> > It says
> >
> > 'unique' returns a vector, data frame or array like 'x' but with
> > duplicate elements/rows removed.
> >
> > Although it is a generic function, it is hard to see how that can be
> > interpreted to allow the order to be changed.
> >
> > The claim that union would be more efficiently implemented via sorting is
> > made with no evidence: do look up a basic computer science textbook for this
> > kind of thing, as well as how R actually does it. (Also 'efficient' was not
> > defined: both speed and memory usage are potentially measures of
> > efficiency.) But for example
> >
> >> x <- rnorm(1e7)
> >> system.time(unique(x))
> >
> > user system elapsed
> > 2.258 0.261 2.523
> >>
> >> system.time(sort(x))
> >
> > user system elapsed
> > 4.102 0.112 4.231
> >>
> >> system.time(sort(x, method="quick"))
> >
> > user system elapsed
> > 1.928 0.109 2.047
> >
> > will indicate that unique() is comparable in speed to sorting.
> >
> >
> >>
> >> On Sat, Nov 22, 2008 at 11:43 PM, Stavros Macrakis
> >> <macrakis at alum.mit.edu> wrote:
> >>>
> >>> On Sat, Nov 22, 2008 at 10:20 AM, jim holtman <jholtman at gmail.com> wrote:
> >>>>
> >>>> c.Factor <-
> >>>> function (x, y)
> >>>> {
> >>>>     newlevels = union(levels(x), levels(y))
> >>>>     m = match(levels(y), newlevels)
> >>>>     ans = c(unclass(x), m[unclass(y)])
> >>>>     levels(ans) = newlevels
> >>>>     class(ans) = "factor"
> >>>>     ans
> >>>> }
> >>>
> >>> This algorithm depends crucially on union preserving the order of the
> >>> elements of its arguments. As far as I can tell, the spec of union
> >>> does not require this. If union were to (for example) sort its
> >>> arguments then merge them (generally a more efficient algorithm), this
> >>> function would no longer work.
> >>>
> >>> Fortunately, the fix is simple. Instead of union, use:
> >>>
> >>> newlevels <- c(levels(x),setdiff(levels(y),levels(x)))
> >>>
> >>> which is guaranteed to preserve the order of levels(x).
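
For reference, the quoted function with this suggestion applied might look like the sketch below. The name concat_factors and the example factors f1 and f2 are hypothetical and do not appear in the thread; the body otherwise follows the quoted c.Factor.

```r
# Hypothetical name; a sketch of the quoted function with the setdiff() fix.
concat_factors <- function(x, y) {
    # Levels of x first, followed by the levels of y not already in x.
    newlevels <- c(levels(x), setdiff(levels(y), levels(x)))
    # Position of each level of y within the combined levels.
    m <- match(levels(y), newlevels)
    # Integer codes: those of x are unchanged, those of y are remapped via m.
    ans <- c(unclass(x), m[unclass(y)])
    levels(ans) <- newlevels
    class(ans) <- "factor"
    ans
}

# Made-up factors whose levels are deliberately not in sorted order.
f1 <- factor(c("b", "a", "b"), levels = c("b", "a"))
f2 <- factor(c("c", "a"), levels = c("c", "a"))

res <- concat_factors(f1, f2)
print(res)
print(levels(res))
```

With these inputs the combined levels are b, a, c, so the integer codes of f1 remain valid without any remapping.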
> >>>
> >>> -s
> >>>
> >>
> >>
> >>
> >> --
> >> Jim Holtman
> >> Cincinnati, OH
> >> +1 513 646 9390
> >>
> >> What is the problem that you are trying to solve?
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> > --
> > Brian D. Ripley, ripley at stats.ox.ac.uk
> > Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> > University of Oxford, Tel: +44 1865 272861 (self)
> > 1 South Parks Road, +44 1865 272866 (PA)
> > Oxford OX1 3TG, UK Fax: +44 1865 272595
> >
>