[Rd] 'partial' in sort() inefficient?

Deepayan Sarkar deepayan.sarkar at gmail.com
Fri Nov 25 21:25:50 CET 2005


I often need to work with large vectors whose distribution I want to
summarize by Q-Q plots.  Since the vectors are large, I use a subset
of quantiles, e.g.

quantile(x, probs = ppoints(1000))

Unfortunately, this seemed to be taking too long for large x (much
longer than a plain 'sort').  I initially thought that 'quantile' was
doing something sophisticated (which I don't really need with a large
data set), so I decided to write something simple myself.  I did, but
then I noticed that 'quantile' was doing essentially the same thing,
except that it was calling 'sort' with a non-null 'partial' argument.
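
The simple version I have in mind is roughly the following (only a
sketch, assuming the default type-7 interpolation and no NAs; the
function name is just for illustration):

simpleQuantile <- function(x, probs)
{
     x <- sort(x)                      # plain full sort, no 'partial'
     n <- length(x)
     index <- 1 + (n - 1) * probs      # same interpolation as 'quantile'
     lo <- floor(index)
     hi <- ceiling(index)
     x[lo] + (index - lo) * (x[hi] - x[lo])
}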


?sort says:

     If 'partial' is not 'NULL', it is taken to contain indices of
     elements of 'x' which are to be placed in their correct positions
     by partial sorting.  After the sort, the values specified in
     'partial' are in their correct position in the sorted array.  Any
     values smaller than these values are guaranteed to have a smaller
     index in the sorted array and any values which are greater are
     guaranteed to have a bigger index in the sorted array.  This is
     included for efficiency, and many of the options are not available
     for partial sorting.
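
(A small example of that guarantee, not from the help page: only the
requested position is guaranteed to hold the value a full sort would
put there, and the rest of the result need not be ordered.)

x <- c(5, 2, 9, 1, 7)
sort(x, partial = 3)   # position 3 holds 5, as in the full sort 1 2 5 7 9;
                       # the other positions may be only partially ordered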

However, rather than improving efficiency, this seems to slow things
down considerably (I haven't checked memory use).  Consider the
following code, which imitates what 'quantile' does by default:


probs <- ppoints(1000)

for (i in seq(5000, 50000, 5000))
{
     x <- rnorm(i)
     n <- length(x)
     index <- 1 + (n - 1) * probs
     lo <- floor(index)
     hi <- ceiling(index)
     keep <- as.integer(unique(c(lo, hi)))
     cat(system.time(y1 <- sort(x, partial = keep))[1])
     cat("\t")
     cat(system.time(y2 <- sort(x))[1])
     cat("\t")
     cat(round(max(abs( y1 - y2 )), digits = 3))
     cat("\t")
     cat(max(abs( y1[keep] - y2[keep] )))
     cat("\n")
}


The first two columns in the output are timings for 'sort' with and
without 'partial'; the last two columns are just a rough check that
partial sorting is doing what it claims.  With R-2.2.0 (patched):


0.78    0       0.031   0
1.64    0.01    0.565   0
2.59    0.01    0.646   0
3.44    0.01    0.487   0
4.4     0.01    0.569   0
5.26    0.01    0.642   0
6.29    0.01    1.071   0
7.18    0.02    0.566   0
8.18    0.02    1.094   0
9.01    0.03    0.89    0
> version
         _
platform i686-pc-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
status   Patched
major    2
minor    2.0
year     2005
month    10
day      16
svn rev  35911
language R


This also holds on (a faster machine running) R-devel:


0.39    0       0.176   0
0.85    0.01    0.62    0
1.32    0       0.193   0
1.8     0       0.949   0
2.29    0       0.73    0
2.77    0       1.185   0
3.25    0.01    0.813   0
3.75    0.01    1.171   0
4.21    0.01    0.827   0
4.74    0.01    0.372   0
> version
         _
platform x86_64-unknown-linux-gnu
arch     x86_64
os       linux-gnu
system   x86_64, linux-gnu
status   Under development (unstable)
major    2
minor    3.0
year     2005
month    11
day      25
svn rev  36468
language R


Speed improves when the number of 'partial' indices is small, but even
for 10 indices plain 'sort' is faster.
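
For example, a quick check with just 10 indices, along these lines
(timings are of course machine-dependent):

x <- rnorm(50000)
keep <- as.integer(seq(1, length(x), length.out = 10))
system.time(y1 <- sort(x, partial = keep))   # partial sort, 10 indices
system.time(y2 <- sort(x))                   # plain full sort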

Am I missing something?  I haven't checked whether NAs make a
difference or whether there might be memory usage issues, but even so,
could 'quantile' at least have an option to disable partial sorting?

Deepayan


