[R] Function to "lump" factors together?

Tue Oct 18 05:36:32 CEST 2011

On Oct 17, 2011, at 9:45 PM, David Wolfskill wrote:

> Sorry about the odd terminology, but I suspect that my intent might be
> completely missed had I used "aggregate" or "classify" (each of which
> appears to have some rather special meanings in statistical analysis
> and modeling).
>
> I have some data about software builds; one of the characteristics of
> each is the name of the branch.
>
> A colleague has generated some fairly interesting graphs from the data,
> but he's treating each unique branch as if it were a separate factor.
>
> Last I checked, I had 276 unique branches, but these could be
> aggregated, classified, or "lumped" into about 8 - 10 categories; I
> believe it would be useful and helpful for me to be able to do
> precisely that.
>
> A facility that could work for this purpose (one that we use in our
> "continuous build" driver) is the Bourne shell "case" statement.
> Such a construct might look like:
>
> 	case "$branch" in
> 	trunk)    factor="trunk"; continue;;
> 	IB*)      factor="IB"; continue;;
> 	DEV*)     factor="DEV"; continue;;
> 	PVT*)     factor="PVT"; continue;;
> 	RELEASE*) factor="RELEASE"; continue;;
> 	*)        factor="UNK"; continue;;
> 	esac
>
> Which would assign one of 6 values to "factor" depending on the
> value of "branch" -- using "UNK" as a default if nothing else matched.
>
> Mind, the patterns there are "Shell Patterns" ("globs"), not regular
> expressions.
>
> I've looked at R functions match(), pmatch(), charmatch(), and
> switch(); while each looks as if it might be coercible to get the
> result I want, it also looks to require iteration over the thousands
> of entries I have -- as well as using the functions in question in a
> fairly "unnatural" way.
>
> I could also write my own function that iterates over the entries,
> generating factors from the branch names -- but I can't help but think
> that what I'm trying to do can't be so uncommon that someone hasn't
> already written a function to do what I'm trying to do.  And I'd
> really rather avoid "re-inventing the wheel," here.

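Here's a loopless lumping of random letters with an "other" value.
There are probably better ways, but my efforts with match and switch
came to naught. pmatch() returns an integer vector of positions, which
is then used to index the vector of group labels; nomatch=10 sends
anything unmatched to the tenth label.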

 > x <- sample(letters[1:10], 50, replace = TRUE)
 > c("abc","abc","abc","def","def","def","ghi","ghi","ghi","j")[
 +   pmatch(x, letters[1:10], duplicates.ok = TRUE, nomatch = 10)]
 [1] "ghi" "ghi" "ghi" "ghi" "ghi" "def" "def" "ghi" "def" "abc" "abc" "j"   "def" "def" "ghi"
[16] "abc" "j"   "def" "ghi" "abc" "ghi" "abc" "abc" "abc" "abc" "abc" "abc" "ghi" "def" "abc"
[31] "ghi" "def" "ghi" "def" "abc" "ghi" "ghi" "j"   "abc" "def" "abc" "ghi" "abc" "def" "def"
[46] "def" "j"   "ghi" "def" "def"
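
For comparison, the same lookup can probably be written with match() and
an explicit "other" label. This is only a sketch: the names keys, groups
and idx, the extra letters "k" and "l", and the seed are all made up for
illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Letters "k" and "l" are deliberately left without a group
x <- sample(letters[1:12], 20, replace = TRUE)

keys   <- letters[1:10]
# One label per key, plus a final catch-all label
groups <- c("abc", "abc", "abc", "def", "def", "def",
            "ghi", "ghi", "ghi", "j", "other")

# match() gives the position of each x in keys; anything unmatched
# is pointed at the last element of groups, i.e. "other"
idx <- match(x, keys, nomatch = length(groups))
v   <- groups[idx]

table(v)
```

Giving the catch-all its own slot keeps unmatched values from being
folded into a real group.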

Classifying 5 million letters in about a second:

 > x <- sample(letters[1:10], 5000000, replace = TRUE)
 > system.time( v <-
 +   c("abc","abc","abc","def","def","def","ghi","ghi","ghi","j")[
 +     pmatch(x, letters[1:10], duplicates.ok = TRUE, nomatch = 10)] )
    user  system elapsed
   0.858   0.208   1.062
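
For the branch names in the original question, one possible sketch is to
loop over the handful of glob patterns rather than over the entries, with
grepl() doing the vectorized work. The branch names below are invented,
and the pattern list is just a transcription of the shell "case" example;
glob2rx() from the utils package turns a glob into a regular expression.

```r
# Hypothetical branch names, made up for illustration
branch <- c("trunk", "IB_2011_10", "DEV_foo", "PVT_bar",
            "RELEASE_1.2", "scratch", "IB_x")

# Glob patterns in priority order, named by the lump they map to
globs <- c(trunk = "trunk", IB = "IB*", DEV = "DEV*",
           PVT = "PVT*", RELEASE = "RELEASE*")

# Start with the default, then overwrite; going through the patterns
# in reverse order lets the earliest pattern win, as in "case"
lump <- rep("UNK", length(branch))
for (nm in rev(names(globs))) {
  lump[grepl(utils::glob2rx(globs[[nm]]), branch)] <- nm
}

lumped <- factor(lump, levels = c(names(globs), "UNK"))
table(lumped)
```

The loop runs once per pattern (five times here), so the cost per pattern
is one vectorized grepl() call over all the entries.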

The same strategy (indexing a vector of labels by a computed position)
can be used with findInterval() when the values are numeric.
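
A minimal sketch of the findInterval() version, with made-up numbers (the
durations, breaks and labels are assumptions, not anything from the
original data). It assumes every value is at least breaks[1]; smaller
values would get position 0 and be silently dropped by the indexing.

```r
# Hypothetical numeric data, e.g. build durations in minutes
dur <- c(3, 12, 45, 7, 120, 30)

# Left-closed intervals: [0,10), [10,30), [30,60), [60,Inf)
breaks <- c(0, 10, 30, 60)
labels <- c("short", "medium", "long", "very long")

# findInterval() returns the interval number for each value,
# which then indexes the label vector
size <- labels[findInterval(dur, breaks)]
size
# [1] "short"     "medium"    "long"      "short"     "very long" "long"
```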

-- 

David Winsemius, MD
Heritage Laboratories
West Hartford, CT