[R] Function to "lump" factors together?

Tue Oct 18 09:51:13 CEST 2011

On Oct 18, 2011, at 05:36 , David Winsemius wrote:

> 
> On Oct 17, 2011, at 9:45 PM, David Wolfskill wrote:
> 
>> Sorry about the odd terminology, but I suspect that my intent might be
>> completely missed had I used "aggregate" or "classify" (each of which
>> appears to have some rather special meanings in statistical analysis and
>> modeling).
>> 
>> I have some data about software builds; one of the characteristics of
>> each is the name of the branch.
>> 
>> A colleague has generated some fairly interesting graphs from the data,
>> but he's treating each unique branch as if it were a separate factor.
>> 
>> Last I checked, I had 276 unique branches, but these could be
>> aggregated, classified, or "lumped" into about 8 - 10 categories; I
>> believe it would be useful and helpful for me to be able to do precisely
>> that.
>> 
>> A facility that could work for this purpose (and that we use in our
>> "continuous build" driver) is the Bourne shell "case" statement.  Such a
>> construct might look like:
>> 
>> 	case $branch in
>> 	trunk)    factor="trunk"; continue;;
>> 	IB*)      factor="IB"; continue;;
>> 	DEV*)     factor="DEV"; continue;;
>> 	PVT*)     factor="PVT"; continue;;
>> 	RELEASE*) factor="RELEASE"; continue;;
>> 	*)        factor="UNK"; continue;;
>> 	esac
>> 
>> Which would assign one of 6 values to "factor" depending on the value of
>> "branch" -- using "UNK" as a default if nothing else matched.
>> 
>> Mind, the patterns there are "Shell Patterns" ("globs"), not regular
>> expressions.
>> 
>> I've looked at R functions match(), pmatch(), charmatch(), and switch();
>> while each looks as if it might be coercible to get the result I want,
>> it also looks to require iteration over the thousands of entries I have
>> -- as well as using the functions in question in a fairly "unnatural"
>> way.
>> 
>> I could also write my own function that iterates over the entries,
>> generating factors from the branch names -- but I can't help but think
>> that what I'm trying to do can't be so uncommon that someone hasn't
>> already written a function to do what I'm trying to do.  And I'd really
>> rather avoid "re-inventing the wheel," here.
> 
> Here's a loopless lumping of random letters with an "other" value. There are better ways, but my efforts with match and switch came to naught. "pmatch" returns a numeric vector that selects the group.
> 
> > x <- sample(letters[1:10], 50, replace =TRUE)
> > c("abc","abc","abc","def","def","def","ghi","ghi","ghi", "j")[pmatch(x, letters[1:10], duplicates.ok=TRUE, nomatch=10)]

pmatch() should work for alternatives of the FOO* style, yes. For a full, vectorized version of the "case", I'd expect that one way to go is a nesting of ifelse() calls:

x2 <- ifelse ( grepl(p1,x), s1 ,
   ifelse ( grepl(p2,x), s2 ,
...
         ifelse ( grepl(pn,x),sn ,"UNK")
    )
  ) 

(The patterns p1, ..., pn are regexps, but glob2rx() exists to convert shell globs into them.) Also notice that it may be easier to work on the level set of the original factor rather than converting the whole thing to character and back:

x <- levels(f)
x2 <- ifelse(....)
f2 <- f
levels(f2) <- x2 
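
For concreteness, here is a minimal sketch of the nested ifelse() idea applied to the levels, using the prefixes from the shell "case" in the original question. The sample branch names and the factor f are invented for illustration; only the prefixes come from the question.

```r
# Hypothetical sample data; these branch names are invented for illustration.
f <- factor(c("trunk", "IB_1.2", "DEV_foo", "PVT_x", "RELEASE_3", "misc", "IB_1.3"))

x <- levels(f)  # work on the distinct levels only, not on every observation

# glob2rx() (in package utils) turns a shell glob into an anchored regexp.
# The first matching test wins, as in the shell "case".
x2 <- ifelse(grepl(glob2rx("trunk"),    x), "trunk",
      ifelse(grepl(glob2rx("IB*"),      x), "IB",
      ifelse(grepl(glob2rx("DEV*"),     x), "DEV",
      ifelse(grepl(glob2rx("PVT*"),     x), "PVT",
      ifelse(grepl(glob2rx("RELEASE*"), x), "RELEASE", "UNK")))))

f2 <- f
levels(f2) <- x2  # old levels that receive the same new label are merged
table(f2)
```

Assigning a vector with repeated labels to levels() is what does the lumping: all old levels mapped to "IB" collapse into a single level.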

It should also be possible to think up a version that avoids the nesting of ifelse. Something like

m <- sapply(patterns, grepl, x) 
first <- .... 
x2 <- replacements[first]

The "...." is the tricky bit: get the index of the first TRUE element in each row of a logical matrix (sapply() gives one column per pattern and one row per element of x). It can be done straightforwardly with apply() and match(), but any more efficient variants escape me just now.
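
A minimal sketch of how the missing step might be filled in, under some assumptions: the vectors x, patterns and replacements below are invented for illustration, and a catch-all glob "*" is appended so that every element matches at least one pattern. Note that sapply() returns one row per element of x and one column per pattern, so the index wanted is that of the first TRUE in each row. The max.col() variant is only a suggestion for avoiding apply(), not something benchmarked here.

```r
# Hypothetical inputs, invented for illustration.
x            <- c("trunk", "IB_1.2", "DEV_foo", "misc", "RELEASE_3")
patterns     <- glob2rx(c("trunk", "IB*", "DEV*", "PVT*", "RELEASE*", "*"))
replacements <- c("trunk", "IB", "DEV", "PVT", "RELEASE", "UNK")

# Logical matrix: rows correspond to elements of x, columns to patterns.
m <- sapply(patterns, grepl, x)

# Straightforward version: position of the first TRUE in each row.
first <- apply(m, 1, function(r) match(TRUE, r))

# Possible alternative without apply(): max.col() on a 0/1 matrix.
# This relies on every row having at least one TRUE (hence the "*" catch-all);
# a row with no TRUE would wrongly come back as 1.
first2 <- max.col(m + 0, ties.method = "first")

x2 <- replacements[first]
x2
```

As with the nested ifelse(), x could be levels(f) rather than the full character vector, so that the matching is done once per distinct branch name.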

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com