# [R] problem in applying function in data subset (with a level) - using plyr or other alternative are also welcome

Dennis Murphy djmuser at gmail.com
Sat Sep 3 10:07:26 CEST 2011

```Hi:

I tried to figure out what you were doing...some of it I think I
grasped, other parts not so much.

On Fri, Sep 2, 2011 at 8:18 PM, Maya Joshi <maya.d.joshi at gmail.com> wrote:
> Dear R experts.
>
> I might be missing something obvious. I have been trying to fix this problem

Let's start by reducing some of your code:

ped <- rep(1:3, c(4, 3, 3))
y <- rnorm(10, 8, 2)
# This replaces all of your sample() statements, and is equivalent:
smat <- matrix(sample(1:3, 120, replace = TRUE), ncol = 12)
colnames(smat) <- c('M1a', 'M1b', 'M1aP1', 'M1bP2',
'M2a', 'M2b', 'M2aP1', 'M2bP2',
'M3a', 'M3b', 'M3aP1', 'M3bP2')
mydf <- as.data.frame(cbind(ped, y, smat))

>
> #data
> ped <- c(rep(1, 4), rep(2, 3), rep(3, 3))
> y <- rnorm(10, 8, 2)
>
> # variable set 1
> M1a <- sample (c(1, 2,3), 10, replace= T)
> M1b <- sample (c(1, 2,3), 10, replace= T)
> M1aP1 <- sample (c(1, 2,3), 10, replace= T)
> M1bP2 <- sample (c(1, 2,3), 10, replace= T)
>
> # variable set 2
> M2a <- sample (c(1, 2,3), 10, replace= T)
> M2b <- sample (c(1, 2,3), 10, replace= T)
> M2aP1 <- sample (c(1, 2,3), 10, replace= T)
> M2bP2 <- sample (c(1, 2,3), 10, replace= T)
>
> # variable set 3
> M3a <- sample (c(1, 2,3), 10, replace= T)
> M3b <- sample (c(1, 2,3), 10, replace= T)
> M3aP1 <- sample (c(1, 2,3), 10, replace= T)
> M3bP2 <- sample (c(1, 2,3), 10, replace= T)
>
> mydf <- data.frame (ped, M1a,M1b,M1aP1,M1bP2, M2a,M2b,M2aP1,M2bP2,
> M3a,M3b,M3aP1,M3bP2, y)
>
> # functions and further calculations
>
> mmat <- matrix
> (c("M1a","M2a","M3a","M1b","M2b","M3b","M1aP1","M2aP1","M3aP1",
> "M1bP2","M2bP2","M3bP2"), ncol = 4)

As far as I can tell, you want to compare a with aP1 and b with bP1
for all three M values. Given the way your data are arranged, each row
of the matrix smat contains all 12 values. apply() on index 1
generally takes a row vector as its input source. Your function to
pass to apply should respect that. My version of myfun() takes the row
vector as input, divides it into two subgroups of six, uses the
ifelse() function to do the comparisons and reshapes it into a 3 x 2
matrix, and then finally takes the row sums of the matrix, which is
the return value from the function.

myfun <- function(x) {
# Indices of the input vector to be compared
u <- c(1, 2, 5, 6, 9, 10)
v <- c(3, 4, 7, 8, 11, 12)
ot <- matrix(ifelse(x[u] == x[v], 1, -1), ncol = 2, byrow = TRUE)
rowSums(ot)
}

# Apply myfun() to the matrix of samples (smat)
qt <- t(apply(smat, 1, myfun))
colnames(qt) <- paste('M', 1L:3L, sep = '')
mydf2 <- data.frame(ped, y, as.data.frame(qt), Msum = rowSums(qt))

I wasn't sure what you wanted to do with M1-M3, so I added a row sum
variable just in case.
>
> # first function
> myfun <- function(x) {
> x<- as.vector(x)
> ot1 <- ifelse(mydf[x[1]] == mydf[x[3]], 1, -1)
> ot2 <- ifelse(mydf[x[2]] == mydf[x[4]], 1, -1)
> qt <- ot1 + ot2
> return(qt)
> }
> qt <- apply(mmat, 1, myfun)

# Mean deviation vector for y - OK.
> ydv <- c((y - mean(y))^2)
> qtd <- data.frame(ped, ydv, qt)
>

Here's the point where things get more confusing. I'm not sure what
you intend from sumD; as it stands, it will, in each row, multiply y
by the values in qt, which will generate an n x 3 matrix, n
representing the number of rows in the sub-data frame indexed by ped.
sumD will then sum the entire n x 3 matrix. Is that what you wanted?

More generally, when you write a function to pass to ddply(), the
input argument should be a data frame and the output should be a data
frame, especially if you intend to return more than one value. Part of
the problem with your function is that the variables inside the body
have no reference to the input data frame. Moreover, your mean squared
deviation function always uses 4 as the denominator rather than the
number of rows of the input data frame. (We'll ignore the bias in the
estimator since we're concentrating on the code.)

I have absolutely no idea if the following is what you intended, but
I'm showing this to illustrate how to create a function for input into
ddply() that can be applied groupwise.

myfun2 <- function(d) {
# input argument d is a data frame
# vector of squared mean deviations of y
ydv <- with(d, (y - mean(y))^2)
vydv <- mean(ydv)
# multiply the squared deviations vector by the sum
# of M1-M3 and then add the cross-products
sumD <- sum(ydv * d\$Msum)
# or equivalently, as an inner product:
# sumD <- ydv %*% d\$Msum
# Return the ratio as a data frame
data.frame(Rt = vydv / sumD)
}

ddply(mydf2, .(ped), myfun2)

## My result:
> ddply(mydf2, .(ped), myfun2)
ped          Rt
1   1  0.35787388
2   2 -0.17739049
3   3 -0.09958257

I'm reasonably sure that myfun() is OK, but myfun2() contains more
than a little guesswork. Assuming it's wrong, what the code can tell
you is how to pass the variables in from the input data frame and to
output a data frame when called within ddply().

HTH,
Dennis

> # second function
> myfun2 <- function(dataframe) {
> vydv <- sum(ydv)*0.25
> sumD <- sum(ydv * qt)
> Rt <- vydv / sumD
> return(Rt)
> }
>
> # using plyr
> require(plyr)
> dfsumd1 <- ddply(mydf,.(mydf\$ped),myfun2)
>
> Here are 2 issues:
> (1) The output just one, I need the output for all three set of variables
> (as listed above)
>
> (2)  all three values of dfsumd is returning to same for all level of ped:
> 1,2, 3
> Means that the function is applied to whole dataset but only replicated in
> output !!!
>
> I tried with plyr not being lazy but due to my limited R knowledge, If you
> have a different suggestion, you are welcome too.
>
>
> Maya
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help