[Rd] ecdf with lots of ties is inefficient (PR#7292)
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sun Oct 17 09:24:04 CEST 2004
This seems a _very_ unusual use of ecdf -- what are you using it for that
a sample of size 10,000 would not do equally well?
If you have a need for a more efficient version of ecdf, please develop
one and submit a patch. I don't think it would be hard as ecdf does
     x <- sort(x)
     rval <- approxfun(x, (1:n)/n, method = "constant", yleft = 0,
                       yright = 1, f = 0, ties = "ordered")
_but_ it might be hard to recognize the situation you are in without much
computation. Something along the lines of
     vals <- sort(unique(x))
     y <- tabulate(match(x, vals))
     rval <- approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
                       yright = 1, f = 0, ties = "ordered")
should work better for you and may be a little slower if there are no ties,
but will use more memory.
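For anyone wanting to try this, the fragment above can be wrapped up as a
self-contained sketch. The function name ecdf_ties and the test data are
made up for illustration, it assumes x has no NAs, and it is only meant to
check that the two constructions agree, not to be a finished patch:

```r
# Hypothetical helper (not part of R): build the step function from the
# distinct values of x only, as suggested above. Assumes x has no NAs.
ecdf_ties <- function(x) {
    n <- length(x)
    vals <- sort(unique(x))
    # number of observations equal to each distinct value
    y <- tabulate(match(x, vals), nbins = length(vals))
    approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
              yright = 1, f = 0, ties = "ordered")
}

set.seed(1)
x <- sample(1:200, 1e5, replace = TRUE)
Fn <- ecdf_ties(x)

# Compare against the standard ecdf at a few points, including ones
# outside the data range and between distinct values.
q <- c(0, 1, 50.5, 200, 250)
agree <- isTRUE(all.equal(Fn(q), ecdf(x)(q)))
```

The interpolation table here has one entry per distinct value rather than
one per observation, which is where any saving would come from.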
A quick play suggests that the real problem is not with ecdf (at least not
for me with x <- sample(1:200, 2e7, replace=TRUE)), but with plotting the
result. Please investigate what might be a reasonable compromise.
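As one possible starting point for the plotting side (an untested
suggestion, not something measured here), the step function can be built
directly from the distinct values with stepfun, so that the plot has one
segment per distinct value. The data and file name below are illustrative
only:

```r
set.seed(1)
x <- sample(1:200, 1e5, replace = TRUE)

vals <- sort(unique(x))
p <- cumsum(tabulate(match(x, vals), nbins = length(vals))) / length(x)

# stepfun wants one more y value than x values: the height to the left
# of the first knot (0 here), then the height from each knot onwards.
sf <- stepfun(vals, c(0, p))

# Write to a temporary PostScript file; the knots are the distinct
# values only, so each step is drawn once.
psfile <- tempfile(fileext = ".ps")
postscript(psfile)
plot(sf, do.points = FALSE, verticals = TRUE)
dev.off()
```

Whether this is a reasonable compromise for plot.ecdf itself would still
need investigating.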
On Sun, 17 Oct 2004 martin at gsc.riken.jp wrote:
> Full_Name: Martin Frith
> Version: R-2.0.0
> OS: linux-gnu
> Submission from: (NULL) (134.160.83.73)
>
>
> I have large vectors containing 100,000 to 20,000,000 numbers. However,
> they only contain a few hundred *distinct* numbers (e.g. positive
> integers < 200). When I do ecdf(v), it either runs out of memory, or it
> succeeds, but when I plot the ecdf with postscript, the output is
> unnecessarily bloated because the same lines get redrawn many times. The
> complexity of ecdf should depend on how many distinct numbers there are,
> not how many total numbers.
>
> This is my first bug report, so forgive me if I've done something stupid!
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595