[Rd] ecdf with lots of ties is inefficient (PR#7292)

Prof Brian Ripley ripley at stats.ox.ac.uk
Sun Oct 17 09:24:04 CEST 2004


This seems a _very_ unusual use of ecdf -- what are you using it for that
a sample of size 10,000 would not do equally well?

If you have a need for a more efficient version of ecdf, please develop 
one and submit a patch.  I don't think it would be hard as ecdf does

    x <- sort(x)
    n <- length(x)
    rval <- approxfun(x, (1:n)/n, method = "constant", yleft = 0,
                      yright = 1, f = 0, ties = "ordered")

_but_ it might be hard to recognize the situation you are in without much
computation.  Something along the lines of

    n <- length(x)
    vals <- sort(unique(x))          # distinct values, ascending
    y <- tabulate(match(x, vals))    # count of each distinct value
    rval <- approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
                      yright = 1, f = 0, ties = "ordered")

should work better for you and may be a little slower if there are no ties,
but will use more memory.
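
A minimal sketch for checking that the two constructions give the same
step function. The wrapper names ecdf_sorted and ecdf_tabulated, the test
vector and the evaluation grid q are assumptions made for illustration;
none of them is part of R.

```r
# Hypothetical wrappers around the two snippets above (not part of R).
ecdf_sorted <- function(x) {
    x <- sort(x)
    n <- length(x)
    # One knot per observation, ties included.
    approxfun(x, (1:n)/n, method = "constant", yleft = 0,
              yright = 1, f = 0, ties = "ordered")
}

ecdf_tabulated <- function(x) {
    n <- length(x)
    vals <- sort(unique(x))          # distinct values, ascending
    y <- tabulate(match(x, vals))    # count of each distinct value
    # One knot per distinct value.
    approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
              yright = 1, f = 0, ties = "ordered")
}

set.seed(1)
x <- sample(1:200, 1e5, replace = TRUE)   # many ties, few distinct values

F1 <- ecdf_sorted(x)
F2 <- ecdf_tabulated(x)

# Evaluate both at the data values, between them, and outside the range.
q <- seq(0, 201, by = 0.5)
stopifnot(isTRUE(all.equal(F1(q), F2(q))))

# Number of knots stored in each closure.
c(sorted = length(environment(F1)$x), tabulated = length(environment(F2)$x))
```

The tabulated version keeps only the distinct values in the returned
function, which is where any saving for later evaluation or plotting
would come from; building it still needs a pass over the full vector.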

A quick play suggests that the real problem is not with ecdf (at least not
for me with x <- sample(1:200, 2e7, replace=TRUE)), but with plotting the
result.  Please investigate what might be a reasonable compromise.
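
One possible compromise, sketched under assumptions: compute the step
corners from the distinct values only and draw them with type = "s", so
the number of line segments depends on the number of distinct values
rather than on length(x). The helper name ecdf_steps, the left extension
by 1, and the file name mentioned in the comment are hypothetical; this
is not how the existing plot method for ecdf objects works.

```r
# Hypothetical helper (not part of R): corner points of the ECDF,
# one per distinct value plus a starting point at height 0.
ecdf_steps <- function(x) {
    n <- length(x)
    vals <- sort(unique(x))
    Fn <- cumsum(tabulate(match(x, vals))) / n
    # Start one unit left of the smallest value (an arbitrary choice)
    # so the initial flat part at 0 is visible.
    list(x = c(vals[1] - 1, vals), y = c(0, Fn))
}

set.seed(1)
x <- sample(1:200, 1e5, replace = TRUE)
s <- ecdf_steps(x)

# Null PDF device so the sketch writes no file; use e.g.
# postscript("ecdf.ps") instead to produce PostScript output.
pdf(NULL)
# type = "s" moves horizontally first, then vertically, which matches
# a right-continuous step function.
plot(s$x, s$y, type = "s", xlab = "x", ylab = "Fn(x)",
     main = "ECDF from distinct values")
dev.off()
```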

On Sun, 17 Oct 2004 martin at gsc.riken.jp wrote:

> Full_Name: Martin Frith
> Version: R-2.0.0
> OS: linux-gnu
> Submission from: (NULL) (134.160.83.73)
> 
> 
> I have large vectors containing 100,000 to 20,000,000 numbers. However,
> they only contain a few hundred *distinct* numbers (e.g. positive
> integers < 200). When I do ecdf(v), it either runs out of memory, or it
> succeeds, but when I plot the ecdf with postscript, the output is
> unnecessarily bloated because the same lines get redrawn many times. The
> complexity of ecdf should depend on how many distinct numbers there are,
> not how many total numbers.
> 
> This is my first bug report, so forgive me if I've done something stupid!

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


