[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
bill at insightful.com
Sat Jul 29 02:31:44 CEST 2006
On Fri, 28 Jul 2006, Kevin B. Hendricks wrote:
> Hi Bill,
> > Splus8.0 has something like what you are talking about
> > that provides a fast way to compute
> > sapply(split(xVector, integerGroupCode), summaryFunction)
> > for some common summary functions. The 'integerGroupCode'
> > is typically the codes from a factor, but you could compute
> > it in other ways. It needs to be a "small" integer in
> > the range 1:ngroups (like the 'bin' argument to tabulate).
> > Like tabulate(), which is called from table(), these are
> > meant to be called from other functions that can set up
> > appropriate group codes. E.g., groupSums or rowSums or
> > fancier things could be based on this.
> > They do not insist you sort the input in any way. That
> > would really only be useful for group medians and we haven't
> > written that one yet.
> The sort is also useful for recoding each group into subgroups based
> on some other numeric vector. This is the problem I run into trying
> to build portfolios that can be used as benchmarks for long term
> stock returns.
> Another issue I have is that to recode a long character string that I
> use as a sort key for accessing a subgroup of the data in the
> data.frame to a set of small integers is not fast. I can make a fast
> implementation if the data is sorted by the key, but without the
> sort, just converting my sort keys to the required small integer
> codes would be expensive for very long vectors since my small integer
> codes would have to reflect the order of the data (ie. be increasing
> subportfolio numbers).
True, but the underlying grouped summary code
shouldn't require you to do the sorting. If
codes <- match(char, sort(unique(char)))
is too slow then you could try sorting the
data set by th 'char' column and doing
codes <- cumsum(char[-1] != char[-length(char)])
(loop over columns before doing cumsum if there
are several columns).
If that isn't fast enough, then you could sort
in the C code as well, but I think there could
be lots of cases where that is slower.
I've used this code for out of core applications,
where I definitely do not want to sort the whole
> More specifically, I am now converting all of my SAS code to R code
> and the problem is I have lots of snippets of SAS that do the
> following ...
> PROC SORT;
> BY MDSIZ FSIZ;
> /* WRITE OUT THE MIN SIZE CUTOFF VALUES */
> PROC UNIVARIATE NOPRINT;
> VAR FSIZ;
> BY MDSIZ;
> OUTPUT OUT=TMPS1 MIN=XMIN;
> where my sort key MDSIZ is a character string that is the
> concatenation of the month ending date MD and the size portfolio of a
> particular firm (SIZ) and I want to find the cutoff points (the mins)
> for each of the portfolios for every month end date across all traded
> > The typical prototype is
> >> igroupSums
> > function(x, group = NULL, na.rm = F, weights = NULL, ngroups = if
> > (is.null(
> > group)) 1 else max(as.integer(group), na.rm = T))
> > and the currently supported summary functions are
> > mean : igroupMeans
> > sum : igroupSums
> > prod : igroupProds
> > min : igroupMins
> > max : igroupMaxs
> > range : igroupRanges
> > any : igroupAnys
> > all : igroupAlls
> SAS is similar in that is also has a specific list of functions you
> can request including all of the basic stats from a PROC univariate
> including higher moment stuff (skewness, kurtosis, robust
> statistics, and even statistical test results for each coded
> subgroup, and the nice thing is all combinations can be done with one
> But to do that SAS does require the presorting, but it does run
> really fast for even super long vectors with lots of sort keys.
> Similarly the next snippet of code, will take the file and resort it
> by the portfolio key and then the market to book ratio (MTB) for all
> trading firms for all monthly periods since 1980. It will then
> split each size portfolio for each month ending date into 5 equal
> portfolios based on market to book ratios (thus the need for the
> sort). SAS returns a coded integer vector PMTB (made up of 1s to 5
> with 1s's for the smallest MTB and 5 for the largest MTB) repeated
> for each subgroup of MDSIZ. PMTB matches the original vector in
> length and therefore fits right into the data frame.
> /* SPLIT INTO Market to Book QUINTILES BY MDSIZ */
> PROC SORT;
> BY MDSIZ MTB;
> PROC RANK GROUPS=5 OUT=TMPS0;
> VAR MTB;
> RANKS PMTB;
> BY MDSIZ;
> The problem of assigning elements of a long data vector to portfolios
> and sub portfolios based on the values of specific data columns which
> must be calculated at each step and are not fixed or hardcoded is one
> that finance can run into (and therefore I run into it).
> So by sorting I could handle the need for "small integer" recoding
> and the small integers would have meaning (i.e. higher values will
> represent larger MTB firms, etc).
> That just leaves the problem of calculating stats on short sequences
> of of a longer integer.
> > They are fast:
> >> x<-runif(2e6)
> >> i<-rep(1:1e6, 2)
> >> sys.time(sx <- igroupSums(x,i))
> >  0.66 0.67
> >> length(sx)
> >  1000000
> > On that machine R takes 44 seconds to go the lapply/split
> > route:
> >> unix.time(unlist(lapply(split(x,i), sum)))
> >  43.24 0.78 44.11 0.00 0.00
> Yes! That is exactly what I need.
> Are there plans for adding something like that to R?
bill at insightful dot com
"All statements in this message represent the opinions of the author and do
not necessarily reflect Insightful Corporation policy or position."
More information about the R-devel