[R] debugging R code and dealing with dependencies
Uwe Ligges
ligges at statistik.tu-dortmund.de
Thu Dec 25 21:31:21 CET 2014
This is a rather detailed analysis, thanks, but I think it should be
send to the maintainer of the "RFGLS" package (CCing).
Best,
Uwe Ligges
On 25.12.2014 10:04, Mike Miller wrote:
> I just wanted to put this out there. It's just some of my observations
> about things that happen with R, or happened in this particular
> investigation. There were definitely some lessons for me in this, and
> maybe that will be true of someone else. The main thing I picked up is
> that it is good to put plenty of checks into our code -- if we expect
> input of a certain type or class, then I should either coerce input into
> that structure or test the input and throw an error. If the function
> works very differently for different kinds of input, this should be
> documented. The more people are doing this, the better things will go
> for everyone.
>
>
> I was working with a CRAN package called RFGLS...
>
> http://cran.r-project.org/web/packages/RFGLS/index.html
>
> ...and I was getting an error. After a few rounds of testing I realized
> that the error was caused by a FAMID variable that was of character type.
>
> The problem seemed to be that gls.batch() expected FAMID to be integers,
> but the default ought to be character type because family and individual
> IDs in nearly all genetic-analysis software are character strings (they
> might even be people's names).
>
> This was the error:
>
> Error in sum(blocksize) : invalid 'type' (character) of argument
> Calls: gls.batch -> bdsmatrix
>
> To figure out more about it, I spent a bunch of time to go from CMD
> BATCH mode to an interactive session so that I could look at
> traceback(). That got me this additional info:
>
>> traceback()
> 2: bdsmatrix(sizelist, lme.out$sigma at blocks, dimnames = list(id, id))
>
> bdsmatrix() is from a package on which RFGLS depends:
>
> http://cran.r-project.org/web/packages/bdsmatrix/index.html
>
> The problem is that RFGLS's gls.batch() function is sending something to
> bdsmatrix's bdsmatrix() that it can't handle. So I look at the code for
> bdsmatrix() and I see this:
>
> if (any(blocksize <= 0))
> stop("Block sizes must be >0")
> if (any(as.integer(blocksize) != blocksize))
> stop("Block sizes must be integers")
> n1 <- as.integer(sum(blocksize))
>
> The condition any(as.integer(blocksize) != blocksize)) fails (is TRUE)
> only if blocksize contains one or more noninteger numeric values. It
> doesn't fail if blocksize is character or logical if the character
> strings are integers. Example:
>
>> 4=="4"
> [1] TRUE
>
> That's an interesting feature of R, but I guess that's how it works.
> Also this:
>
>> 1=="1"
> [1] TRUE
>> 1==TRUE
> [1] TRUE
>> "1"==TRUE
> [1] FALSE
>
> bdsmatrix() has no test that blocksize is numeric, so it fails when
> sum(blocksize) cannot sum character strings.
>
> Next I had to figure out where RFGLS's gls.batch() is going wrong in
> producing sizelist. It is created in a number of steps, but I
> identified this line as especially suspicious:
>
> test.dat$famsize[test.dat$FTYPE!=6]=ave(test.dat$FAMID[test.dat$FTYPE!=6],test.dat$FAMID[test.dat$FTYPE!=6],FUN=length)
>
>
> famsize was later converted to sizelist, and this line also includes
> FAMID, so this is likely where the problem originates. Of course this
> is the big problem with debugging -- it's hard to find the source of an
> error that occurs far downstream in another function from a different
> package. I see that ave() is used, so I have to understand ave().
>
> William Dunlap provided some guidance:
>
> "ave() uses its first argument, 'x', to set the length of its output and
> to make an initial guess at the type of its output. The return value of
> FUN can alter the type, but only in an 'upward' direction where
> logical<integer<numeric<complex<character<list. (This is the same rule
> that x[i]<-newvalue uses.)"
>
> In other words, if x is of character type, the output cannot be of
> integer or numeric type even if the output of FUN is always of integer
> or numeric type. Looking at the ave() code, I can understand that choice:
>
> function (x, ..., FUN = mean)
> {
> if (missing(...))
> x[] <- FUN(x)
> else {
> g <- interaction(...)
> split(x, g) <- lapply(split(x, g), FUN)
> }
> x
> }
>
> If the factor is missing an element, then the corresponding element of X
> is not changed in the output:
>
>> fact <- gl(2,2)
>> fact[3] <- NA
>> fact
> [1] 1 1 <NA> 2
> Levels: 1 2
>> ave(1:4, fact)
> [1] 1.5 1.5 3.0 4.0
>
> That's a reasonable plan, but it isn't the documented functioning of
> ave(). From the document...
>
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ave.html
>
> ...you get next to nothing about what the function actually does. It
> does say that x is "a numeric," but the function does not throw an error
> when x is not numeric. So if someone writes code expecting numeric x,
> but a user provides a non-numeric x, there may be trouble.
>
> I suspect that the programmer saw that the code worked in her examples
> and she went on to other things. I can't blame the documentation for
> that, but it is possible that if it said something about the relation
> between the type of the input and the type of the output she might have
> written it differently. In addition, I probably would have caught it
> sooner and I would have understood the problem.
>
> This is how I'll recommend they fix the bug in the code (thanks to those
> of you who helped with this):
>
> temp.vec <- as.character( test.dat$FAMID[ test.dat$FTYPE != 6 ] )
> test.dat$famsize[ test.dat$FTYPE != 6 ] <- as.vector( table( temp.vec )[
> temp.vec ] )
> rm(temp.vec)
>
> I think we should force FAMID to be character from the beginning, though.
>
> Best,
> Mike
>
> FYI -- RFGLS code that fails in RFGLS version 1.1:
>
> library(RFGLS)
>
> data(pheno)
> data(geno)
> data(map)
> data(pedigree)
> data(rescovmtx)
> #comment out the following two lines and it will run correctly
> pedigree$FAMID <- as.character(pedigree$FAMID)
> pheno$FAMID <- as.character(pheno$FAMID)
> minigwas <- gls.batch(phenfile=pheno,genfile=data.frame(t(geno)),
> pedifile=pedigree,
> snp.names=map[,2],input.mode=c(1,2,3),pediheader=FALSE,
> pedicolname=c("FAMID","ID","PID","MID","SEX"),
> phen="Zscore",covars="IsFemale",
> outfile=NULL,col.names=TRUE,return.value=TRUE)
> str(minigwas)
>
> Mike
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list