[R] Trying to make code more efficient

Mon Jun 13 20:42:27 CEST 2011

On 6/9/2011 12:27 PM, Abraham Mathew wrote:
> I have a repetitive task in R and I'm trying to find a more efficient way to
> perform the following task.
>
>
> lst<- list(roots = c("car insurance", "auto insurance"),
>               roots2 = c("insurance"), prefix = c("cheap", "budget"),
>               prefix2 = c("low cost"), suffix = c("quote", "quotes"),
>               suffix2 = c("rate", "rates"), suffix3 = c("comparison"),
>               state = c(state), inscompany = c(inscompany), city=c(city),
>               cityst = c(cityst), agency=c(agency))

This is not reproducible since we don't have state, inscompany, etc.

> myone<- function(x, y) {
>            m1<- do.call(paste, expand.grid(lst[[x]], lst[[y]]))
>            mydf<- data.frame(keyword=c(m1))
>        }

Your indentation threw me off for a while before I realized that mytwo is 
not nested inside myone.

>        mytwo<- function(x, y, z){
>            m2<- do.call(paste, expand.grid(lst[[x]], lst[[y]], lst[[z]]))
>            mydf2<- data.frame(keyword=c(m2))
>        }

Any time you have many sequentially numbered variables, that is a sign 
you should probably be using a list (or possibly a vector).
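
As a small sketch of that idea: the numbered variables can become elements of a single list. The name `results` is made up for illustration, and `lst` is cut down to a few of the components that are fully specified in the original post.

```r
# Toy version of lst with only some of the fully specified components.
lst <- list(roots = c("car insurance", "auto insurance"),
            prefix = c("cheap", "budget"),
            suffix = c("quote", "quotes"),
            suffix2 = c("rate", "rates"))

mytwo <- function(x, y, z) {
    m2 <- do.call(paste, expand.grid(lst[[x]], lst[[y]], lst[[z]]))
    data.frame(keyword = m2)
}

# One list element per call instead of separate variables d1, d2, ...
results <- list(
    mytwo("prefix", "roots", "suffix"),
    mytwo("prefix", "roots", "suffix2")
)

length(results)     # number of data frames held in the list
nrow(results[[1]])  # 2 prefixes * 2 roots * 2 suffixes
```

A list like this can be looped over, extended, or combined later without having to know how many numbered variables exist.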

>        d1 = mytwo("prefix", "roots", "suffix")
>        d2 = mytwo("prefix", "roots", "suffix2")
>        d3 = mytwo("prefix", "roots", "suffix3")
>        d4 = mytwo("prefix2", "roots", "suffix")
>        d5 = mytwo("prefix2", "roots", "suffix2")
>        d6 = mytwo("prefix2", "roots", "suffix3")
>        d7 = mytwo("prefix", "roots2", "suffix")
>        d8 = mytwo("prefix", "roots2", "suffix2")
>        d9 = mytwo("prefix", "roots2", "suffix3")
>        d10 = mytwo("prefix2", "roots2", "suffix")
>        d11 = mytwo("prefix2", "roots2", "suffix2")
>        d12 = mytwo("prefix2", "roots2", "suffix3")

Well, these first 12 can be generated using mlply (from plyr) and 
another expand.grid to get all the combinations.

library(plyr)  # provides mlply()

d <- mlply(.data = expand.grid(x = c("prefix", "prefix2"),
                               y = c("roots", "roots2"),
                               z = c("suffix", "suffix2", "suffix3"),
                               stringsAsFactors = FALSE),
           .fun = mytwo,
           .expand = FALSE)
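
For comparison, roughly the same list can be built in base R without plyr. This is only a sketch and not the mlply call itself: `grid` is a name made up here, `lst` is reduced to the components that are fully given in the original post, and the result is a plain list without the extra attributes that mlply attaches.

```r
# Only the components of lst that are reproducible from the post.
lst <- list(roots = c("car insurance", "auto insurance"),
            roots2 = c("insurance"), prefix = c("cheap", "budget"),
            prefix2 = c("low cost"), suffix = c("quote", "quotes"),
            suffix2 = c("rate", "rates"), suffix3 = c("comparison"))

mytwo <- function(x, y, z) {
    m2 <- do.call(paste, expand.grid(lst[[x]], lst[[y]], lst[[z]]))
    data.frame(keyword = m2)
}

# Every combination of one prefix set, one roots set, and one suffix set.
grid <- expand.grid(x = c("prefix", "prefix2"),
                    y = c("roots", "roots2"),
                    z = c("suffix", "suffix2", "suffix3"),
                    stringsAsFactors = FALSE)

# Call mytwo once per row of the grid.
d <- lapply(seq_len(nrow(grid)), function(i) {
    mytwo(grid$x[i], grid$y[i], grid$z[i])
})

length(d)  # one data frame per row of grid
```

The data frames come out in expand.grid order (first column varying fastest), so the second element corresponds to d4 in the original numbering rather than d2.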

>        d13 = myone("prefix", "roots")
>        d14 = myone("prefix2", "roots")
>        d15 = myone("prefix", "roots2")
>        d16 = myone("prefix2", "roots2")

This is another full cross of two sets whose elements are fed as 
arguments to a function, so the same approach as before applies.

>        d17 = myone("roots", "suffix")
>        d18 = myone("roots", "suffix2")
>        d19 = myone("roots", "suffix3")
>        d20 = myone("roots2", "suffix")
>        d21 = myone("roots2", "suffix2")
>        d22 = myone("roots2", "suffix3")

Trying to see bigger patterns.  There is the set prefix/prefix2 (call it 
set P), roots/roots2 (call it set R), and suffix/suffix2/suffix3 (call it 
set S).  Pick two or three of these sets and, keeping them in order, 
send all the crosses of the sets as arguments to a function (that takes 
an appropriate number of arguments).

In fact, myone and mytwo could (probably) be replaced with

my <- function(...) {
	data.frame(keyword=c(do.call(paste, do.call(expand.grid, lst[c(...)]))))
}
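
A quick way to check the "(probably)" is to compare the keywords that `my` produces against those from myone and mytwo. The sketch below assumes a cut-down `lst` containing only components that are fully given in the original post; the comparison is on the keyword strings, not on every attribute of the data frames.

```r
lst <- list(roots = c("car insurance", "auto insurance"),
            roots2 = c("insurance"), prefix = c("cheap", "budget"),
            prefix2 = c("low cost"), suffix = c("quote", "quotes"),
            suffix2 = c("rate", "rates"), suffix3 = c("comparison"))

myone <- function(x, y) {
    m1 <- do.call(paste, expand.grid(lst[[x]], lst[[y]]))
    data.frame(keyword = c(m1))
}

mytwo <- function(x, y, z) {
    m2 <- do.call(paste, expand.grid(lst[[x]], lst[[y]], lst[[z]]))
    data.frame(keyword = c(m2))
}

# Generalized version: any number of component names.
my <- function(...) {
    data.frame(keyword = c(do.call(paste, do.call(expand.grid, lst[c(...)]))))
}

# Compare keyword strings from the old and new functions.
same_two <- identical(as.character(my("prefix", "roots")$keyword),
                      as.character(myone("prefix", "roots")$keyword))
same_three <- identical(as.character(my("prefix2", "roots", "suffix2")$keyword),
                        as.character(mytwo("prefix2", "roots", "suffix2")$keyword))
same_two
same_three
```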

>        d23 = myone("state", "roots")
>        d24 = myone("city", "roots")
>        d25 = myone("cityst", "roots")
>        d26 = myone("inscompany", "roots")
>        d27 = myone("state", "roots2")
>        d28 = myone("city", "roots2")
>        d29 = myone("cityst", "roots2")
>        d30 = myone("inscompany", "roots2")

OK, need to broaden the pattern.  Another set is 
state/city/cityst/inscompany (call it set I).  In terms of ordering, 
it comes before roots/roots2 (set R).

>        d31 = mytwo("state", "roots", "suffix")
>        d32 = mytwo("city", "roots", "suffix")
>        d33 = mytwo("cityst", "roots", "suffix")
>        d34 = mytwo("inscompany", "roots", "suffix")
>        d35 = mytwo("state", "roots", "suffix2")
>        d36 = mytwo("city", "roots", "suffix2")
>        d37 = mytwo("cityst", "roots", "suffix2")
>        d38 = mytwo("inscompany", "roots", "suffix2")
>        d39 = mytwo("state", "roots", "suffix3")
>        d40 = mytwo("city", "roots", "suffix3")
>        d41 = mytwo("cityst", "roots", "suffix3")
>        d42 = mytwo("inscompany", "roots", "suffix3")
>        d43 = mytwo("state", "roots2", "suffix")
>        d44 = mytwo("city", "roots2", "suffix")
>        d45 = mytwo("cityst", "roots2", "suffix")
>        d46 = mytwo("inscompany", "roots2", "suffix")
>        d47 = mytwo("state", "roots2", "suffix2")
>        d48 = mytwo("city", "roots2", "suffix2")
>        d49 = mytwo("cityst", "roots2", "suffix2")
>        d50 = mytwo("inscompany", "roots2", "suffix2")
>        d51 = mytwo("state", "roots2", "suffix3")
>        d52 = mytwo("city", "roots2", "suffix3")
>        d53 = mytwo("cityst", "roots2", "suffix3")
>        d54 = mytwo("inscompany", "roots2", "suffix3")

This is a three-way cross of I/R/S.

>        d55 = mytwo("prefix", "state", "roots")
>        d56 = mytwo("prefix", "city", "roots")
>        d57 = mytwo("prefix", "cityst", "roots")
>        d58 = mytwo("prefix", "inscompany", "roots")
>        d59 = mytwo("prefix2", "state", "roots")
>        d60 = mytwo("prefix2", "city", "roots")
>        d61 = mytwo("prefix2", "cityst", "roots")
>        d62 = mytwo("prefix2", "inscompany", "roots")
>        d63 = mytwo("prefix", "state", "roots2")
>        d64 = mytwo("prefix", "city", "roots2")
>        d65 = mytwo("prefix", "cityst", "roots2")
>        d66 = mytwo("prefix", "inscompany", "roots2")
>        d67 = mytwo("prefix2", "state", "roots2")
>        d68 = mytwo("prefix2", "city", "roots2")
>        d69 = mytwo("prefix2", "cityst", "roots2")
>        d70 = mytwo("prefix2", "inscompany", "roots2")

This is a three-way cross of P/I/R.

>        d71 = mytwo("prefix", "inscompany", "suffix")
>        d72 = mytwo("prefix", "inscompany", "suffix2")
>        d73 = mytwo("prefix", "inscompany", "suffix3")
>        d74 = mytwo("prefix2", "inscompany", "suffix")
>        d75 = mytwo("prefix2", "inscompany", "suffix2")
>        d76 = mytwo("prefix2", "inscompany", "suffix3")

This doesn't follow the pattern; it is just inscompany rather than all 
of I (crossed with P and S).  Is it just incomplete and should be all of 
P/I/S?

How about:

library(plyr)  # provides mlply()

lst <- list(roots = c("car insurance", "auto insurance"),
              roots2 = c("insurance"), prefix = c("cheap", "budget"),
              prefix2 = c("low cost"), suffix = c("quote", "quotes"),
              suffix2 = c("rate", "rates"), suffix3 = c("comparison"),
              state = c("state"), inscompany = c("inscompany"),
              city=c("city"),
              cityst = c("cityst"), agency=c("agency"))

my <- function(...) {
	data.frame(keyword=c(do.call(paste, do.call(expand.grid, lst[c(...)]))))
}

setP <- c("prefix", "prefix2")
setI <- c("state", "city", "cityst", "inscompany")
setR <- c("roots", "roots2")
setS <- c("suffix", "suffix2", "suffix3")

d <- c(
	mlply(expand.grid(setP, setR, setS, stringsAsFactors = FALSE), my,
		.expand=FALSE),
	mlply(expand.grid(setP, setR, stringsAsFactors = FALSE), my,
		.expand=FALSE),
	mlply(expand.grid(setR, setS, stringsAsFactors = FALSE), my,
		.expand=FALSE),
	mlply(expand.grid(setI, setR, stringsAsFactors = FALSE), my,
		.expand=FALSE),
	mlply(expand.grid(setI, setR, setS, stringsAsFactors = FALSE), my,
		.expand=FALSE),
	mlply(expand.grid(setP, setI, setR, stringsAsFactors = FALSE), my,
		.expand=FALSE),
	mlply(expand.grid(setP, "inscompany", setS, stringsAsFactors = FALSE),
		my, .expand=FALSE)
)

Of course, you could keep going with the abstraction:

mymlply <- function(...) {
	mlply(expand.grid(..., stringsAsFactors = FALSE), my, .expand=FALSE)
}

sets <- list(setP, setI, setR, setS)

d <- c(
	mlply(t(combn(4,2)), function(...) {mymlply(sets[c(...)])}, .expand=FALSE),
	mlply(t(combn(4,3)), function(...) {mymlply(sets[c(...)])}, .expand=FALSE)
)
d <- unlist(d, recursive=FALSE)

This takes every selection of 2 or 3 of the 4 sets, expands each 
selection to all of its crosses, looks up each cross in lst, and builds 
the corresponding data frame (120 data frames in all).
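
The figure of 120 can be checked with a little arithmetic in base R. Each selection of sets contributes the product of the set sizes, so summing those products over all selections of 2 and of 3 sets gives the total. The names `sizes` and `count` are made up for this sketch.

```r
# Number of component names in each set: P, I, R, S.
sizes <- c(P = 2, I = 4, R = 2, S = 3)

# For every choice of k sets, multiply their sizes, then add the products up.
count <- function(k) sum(combn(sizes, k, FUN = prod))

count(2)             # pairs of sets: 8 + 4 + 6 + 8 + 12 + 6 = 44
count(3)             # triples of sets: 16 + 24 + 12 + 24 = 76
count(2) + count(3)  # 120
```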

Now, I don't know why you would want it this way, necessarily.  I am 
guessing the keyword columns of the innermost data frames should be 
character, not factor.  But it is (close to) what you asked for.
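
If character columns are what is wanted, one possible variant passes stringsAsFactors = FALSE explicitly. This is a sketch only: the name `my_chr` is made up, and `lst` is again cut down to components that are fully given in the original post. From R 4.0.0 onward data.frame() no longer converts strings to factors by default, but being explicit gives the same result on older versions such as 2.13.

```r
lst <- list(roots = c("car insurance", "auto insurance"),
            roots2 = c("insurance"), prefix = c("cheap", "budget"),
            prefix2 = c("low cost"), suffix = c("quote", "quotes"),
            suffix2 = c("rate", "rates"), suffix3 = c("comparison"))

# Same idea as my(), but keeping strings as character throughout.
my_chr <- function(...) {
    grid <- expand.grid(lst[c(...)], stringsAsFactors = FALSE)
    data.frame(keyword = do.call(paste, grid), stringsAsFactors = FALSE)
}

kw <- my_chr("prefix", "roots")
class(kw$keyword)  # character rather than factor
nrow(kw)           # 2 prefixes * 2 roots
```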

>
> Obviously, this code gets rather repetitive, even with the function, and I
> was wondering if there's a shortcut that I should consider to simplify the
> process.
>
> Thanks,
>
> I'm running R 2.13 on Ubuntu 10.10
>

-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University