[R] debugging R code and dealing with dependencies

Uwe Ligges ligges at statistik.tu-dortmund.de
Thu Dec 25 21:31:21 CET 2014


This is a rather detailed analysis, thanks, but I think it should be 
send to the maintainer of the "RFGLS" package (CCing).

Best,
Uwe Ligges


On 25.12.2014 10:04, Mike Miller wrote:
> I just wanted to put this out there.  It's just some of my observations
> about things that happen with R, or happened in this particular
> investigation.  There were definitely some lessons for me in this, and
> maybe that will be true of someone else.  The main thing I picked up is
> that it is good to put plenty of checks into our code -- if we expect
> input of a certain type or class, then I should either coerce input into
> that structure or test the input and throw an error.  If the function
> works very differently for different kinds of input, this should be
> documented.  The more people are doing this, the better things will go
> for everyone.
>
>
> I was working with a CRAN package called RFGLS...
>
> http://cran.r-project.org/web/packages/RFGLS/index.html
>
> ...and I was getting an error.  After a few rounds of testing I realized
> that the error was caused by a FAMID variable that was of character type.
>
> The problem seemed to be that gls.batch() expected FAMID to be integers,
> but the default ought to be character type because family and individual
> IDs in nearly all genetic-analysis software are character strings (they
> might even be people's names).
>
> This was the error:
>
> Error in sum(blocksize) : invalid 'type' (character) of argument
> Calls: gls.batch -> bdsmatrix
>
> To figure out more about it, I spent a bunch of time to go from CMD
> BATCH mode to an interactive session so that I could look at
> traceback().  That got me this additional info:
>
>> traceback()
> 2: bdsmatrix(sizelist, lme.out$sigma at blocks, dimnames = list(id, id))
>
> bdsmatrix() is from a package on which RFGLS depends:
>
> http://cran.r-project.org/web/packages/bdsmatrix/index.html
>
> The problem is that RFGLS's gls.batch() function is sending something to
> bdsmatrix's bdsmatrix() that it can't handle.  So I look at the code for
> bdsmatrix() and I see this:
>
>       if (any(blocksize <= 0))
>           stop("Block sizes must be >0")
>       if (any(as.integer(blocksize) != blocksize))
>           stop("Block sizes must be integers")
>       n1 <- as.integer(sum(blocksize))
>
> The condition any(as.integer(blocksize) != blocksize)) fails (is TRUE)
> only if blocksize contains one or more noninteger numeric values.  It
> doesn't fail if blocksize is character or logical if the character
> strings are integers.  Example:
>
>> 4=="4"
> [1] TRUE
>
> That's an interesting feature of R, but I guess that's how it works.
> Also this:
>
>> 1=="1"
> [1] TRUE
>> 1==TRUE
> [1] TRUE
>> "1"==TRUE
> [1] FALSE
>
> bdsmatrix() has no test that blocksize is numeric, so it fails when
> sum(blocksize) cannot sum character strings.
>
> Next I had to figure out where RFGLS's gls.batch() is going wrong in
> producing sizelist.  It is created in a number of steps, but I
> identified this line as especially suspicious:
>
> test.dat$famsize[test.dat$FTYPE!=6]=ave(test.dat$FAMID[test.dat$FTYPE!=6],test.dat$FAMID[test.dat$FTYPE!=6],FUN=length)
>
>
> famsize was later converted to sizelist, and this line also includes
> FAMID, so this is likely where the problem originates.  Of course this
> is the big problem with debugging -- it's hard to find the source of an
> error that occurs far downstream in another function from a different
> package. I see that ave() is used, so I have to understand ave().
>
> William Dunlap provided some guidance:
>
> "ave() uses its first argument, 'x', to set the length of its output and
> to make an initial guess at the type of its output.  The return value of
> FUN can alter the type, but only in an 'upward' direction where
> logical<integer<numeric<complex<character<list.  (This is the same rule
> that x[i]<-newvalue uses.)"
>
> In other words, if x is of character type, the output cannot be of
> integer or numeric type even if the output of FUN is always of integer
> or numeric type.  Looking at the ave() code, I can understand that choice:
>
> function (x, ..., FUN = mean)
> {
>      if (missing(...))
>          x[] <- FUN(x)
>      else {
>          g <- interaction(...)
>          split(x, g) <- lapply(split(x, g), FUN)
>      }
>      x
> }
>
> If the factor is missing an element, then the corresponding element of X
> is not changed in the output:
>
>> fact <- gl(2,2)
>> fact[3] <- NA
>> fact
> [1] 1    1    <NA> 2
> Levels: 1 2
>> ave(1:4, fact)
> [1] 1.5 1.5 3.0 4.0
>
> That's a reasonable plan, but it isn't the documented functioning of
> ave().  From the document...
>
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ave.html
>
> ...you get next to nothing about what the function actually does.  It
> does say that x is "a numeric," but the function does not throw an error
> when x is not numeric.  So if someone writes code expecting numeric x,
> but a user provides a non-numeric x, there may be trouble.
>
> I suspect that the programmer saw that the code worked in her examples
> and she went on to other things.  I can't blame the documentation for
> that, but it is possible that if it said something about the relation
> between the type of the input and the type of the output she might have
> written it differently.  In addition, I probably would have caught it
> sooner and I would have understood the problem.
>
> This is how I'll recommend they fix the bug in the code (thanks to those
> of you who helped with this):
>
> temp.vec <- as.character( test.dat$FAMID[ test.dat$FTYPE != 6 ] )
> test.dat$famsize[ test.dat$FTYPE != 6 ] <- as.vector( table( temp.vec )[
> temp.vec ] )
> rm(temp.vec)
>
> I think we should force FAMID to be character from the beginning, though.
>
> Best,
> Mike
>
> FYI -- RFGLS code that fails in RFGLS version 1.1:
>
> library(RFGLS)
>
> data(pheno)
> data(geno)
> data(map)
> data(pedigree)
> data(rescovmtx)
>   #comment out the following two lines and it will run correctly
> pedigree$FAMID <- as.character(pedigree$FAMID)
> pheno$FAMID <- as.character(pheno$FAMID)
> minigwas <- gls.batch(phenfile=pheno,genfile=data.frame(t(geno)),
>    pedifile=pedigree,
>    snp.names=map[,2],input.mode=c(1,2,3),pediheader=FALSE,
>    pedicolname=c("FAMID","ID","PID","MID","SEX"),
>    phen="Zscore",covars="IsFemale",
>    outfile=NULL,col.names=TRUE,return.value=TRUE)
> str(minigwas)
>
> Mike
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list