[R] by (tapply) and for loop differences

Peter Dalgaard p.dalgaard at biostat.ku.dk
Tue Jul 5 12:37:11 CEST 2005


"Bashir Saghir (Aztek Global)" <Saghir.Bashir at ucb-group.com> writes:

> I am getting a difference in results when running some analysis using by and
> tapply compare to using a for loop. I've tried searching the web but had no
> luck with the keywords I used.
>  
> I've attached a simple example below to illustrates my problem. I get a
> difference in the mean of yvar, diff and the p-value using tapply & by
> compared to a for loop. I cannot see what I am doing wrong. Can anyone help?
> 
> > # Simulate some data - I'll do 2 simulations...
> > 
> > xvar = rnorm(40, 20, 5)
> > yvar = rnorm(40, 22, 2)
> > num = factor(rep(1:2, each=20))
> > sdat = data.frame(cbind(num, xvar, yvar))
> > 
> > # Define a function to do a simple t test and return some values...
> > 
> > kindtest = function(varx, vary){
> +    res = t.test(varx, vary)
> +    x.mn = res$estimate[1]
> +    y.mn = res$estimate[2]
> +    diff = y.mn-x.mn
> +    pval = res$p.value
> +    cat("Mean xvar =", x.mn, " Mean yvar =", y.mn)
> +    cat(" diff =", diff, "  p-value=", pval, "\n\n")
> +    list(x.mn=x.mn, y.mn=y.mn, diff=diff, pval=pval)
> + }
> 
> ## Results from by and tapply
> 
> > attach(sdat)
> >   bres = by(xvar, num, kindtest, yvar)  
> Mean xvar = 19.8904  Mean yvar = 21.97729 diff = 2.086891   p-value=
> 0.06222805 
> Mean xvar = 19.88329  Mean yvar = 21.97729 diff = 2.093996   p-value=
> 0.05245329 
> 
> >   tres = tapply(xvar, num, kindtest, yvar)
> Mean xvar = 19.8904  Mean yvar = 21.97729 diff = 2.086891   p-value=
> 0.06222805 
> Mean xvar = 19.88329  Mean yvar = 21.97729 diff = 2.093996   p-value=
> 0.05245329 
> 
> > detach(sdat,1)
> 
> ## Results from for
> 
> > for(i in 1:2) {
> +   subdat= subset(sdat, num==i)
> +   kindtest(subdat$xvar, subdat$yvar)
> + }
> Mean xvar = 19.8904  Mean yvar = 21.98615 diff = 2.095746   p-value=
> 0.07319223 
> Mean xvar = 19.88329  Mean yvar = 21.96843 diff = 2.085141   p-value=
> 0.05850057 
> 

The fact that the by/tapply approach is giving you the same Mean yvar
for both groups should be a dead giveaway....

Stick print(varx) and print(vary) into kindtest, and you'll see the
point. You are passing yvar *without* subsetting (and since the t.test
isn't paired, it can hardly be expected to complain that x and y
differ in length...).

This is probably closer to the mark:

  by(sdat, num, with, kindtest(xvar, yvar))

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907




More information about the R-help mailing list