[R] debugging R code and dealing with dependencies
Mike Miller
mbmiller+l at gmail.com
Thu Dec 25 21:45:04 CET 2014
Thanks, but I was already in touch with Rob Kirkpatrick about it. We all
work together at U Minnesota, or did until Rob went to VCU.
Mike
On Thu, 25 Dec 2014, Uwe Ligges wrote:
> This is a rather detailed analysis, thanks, but I think it should be
> send to the maintainer of the "RFGLS" package (CCing).
>
> Best,
> Uwe Ligges
>
>
> On 25.12.2014 10:04, Mike Miller wrote:
>> I just wanted to put this out there. It's just some of my observations
>> about things that happen with R, or happened in this particular
>> investigation. There were definitely some lessons for me in this, and
>> maybe that will be true of someone else. The main thing I picked up is
>> that it is good to put plenty of checks into our code -- if we expect
>> input of a certain type or class, then I should either coerce input into
>> that structure or test the input and throw an error. If the function
>> works very differently for different kinds of input, this should be
>> documented. The more people are doing this, the better things will go
>> for everyone.
>>
>>
>> I was working with a CRAN package called RFGLS...
>>
>> http://cran.r-project.org/web/packages/RFGLS/index.html
>>
>> ...and I was getting an error. After a few rounds of testing I realized
>> that the error was caused by a FAMID variable that was of character type.
>>
>> The problem seemed to be that gls.batch() expected FAMID to be integers,
>> but the default ought to be character type because family and individual
>> IDs in nearly all genetic-analysis software are character strings (they
>> might even be people's names).
>>
>> This was the error:
>>
>> Error in sum(blocksize) : invalid 'type' (character) of argument
>> Calls: gls.batch -> bdsmatrix
>>
>> To figure out more about it, I spent a bunch of time to go from CMD
>> BATCH mode to an interactive session so that I could look at
>> traceback(). That got me this additional info:
>>
>>> traceback()
>> 2: bdsmatrix(sizelist, lme.out$sigma at blocks, dimnames = list(id, id))
>>
>> bdsmatrix() is from a package on which RFGLS depends:
>>
>> http://cran.r-project.org/web/packages/bdsmatrix/index.html
>>
>> The problem is that RFGLS's gls.batch() function is sending something to
>> bdsmatrix's bdsmatrix() that it can't handle. So I look at the code for
>> bdsmatrix() and I see this:
>>
>> if (any(blocksize <= 0))
>> stop("Block sizes must be >0")
>> if (any(as.integer(blocksize) != blocksize))
>> stop("Block sizes must be integers")
>> n1 <- as.integer(sum(blocksize))
>>
>> The condition any(as.integer(blocksize) != blocksize)) fails (is TRUE)
>> only if blocksize contains one or more noninteger numeric values. It
>> doesn't fail if blocksize is character or logical if the character
>> strings are integers. Example:
>>
>>> 4=="4"
>> [1] TRUE
>>
>> That's an interesting feature of R, but I guess that's how it works.
>> Also this:
>>
>>> 1=="1"
>> [1] TRUE
>>> 1==TRUE
>> [1] TRUE
>>> "1"==TRUE
>> [1] FALSE
>>
>> bdsmatrix() has no test that blocksize is numeric, so it fails when
>> sum(blocksize) cannot sum character strings.
>>
>> Next I had to figure out where RFGLS's gls.batch() is going wrong in
>> producing sizelist. It is created in a number of steps, but I
>> identified this line as especially suspicious:
>>
>> test.dat$famsize[test.dat$FTYPE!=6]=ave(test.dat$FAMID[test.dat$FTYPE!=6],test.dat$FAMID[test.dat$FTYPE!=6],FUN=length)
>>
>>
>> famsize was later converted to sizelist, and this line also includes
>> FAMID, so this is likely where the problem originates. Of course this
>> is the big problem with debugging -- it's hard to find the source of an
>> error that occurs far downstream in another function from a different
>> package. I see that ave() is used, so I have to understand ave().
>>
>> William Dunlap provided some guidance:
>>
>> "ave() uses its first argument, 'x', to set the length of its output and
>> to make an initial guess at the type of its output. The return value of
>> FUN can alter the type, but only in an 'upward' direction where
>> logical<integer<numeric<complex<character<list. (This is the same rule
>> that x[i]<-newvalue uses.)"
>>
>> In other words, if x is of character type, the output cannot be of
>> integer or numeric type even if the output of FUN is always of integer
>> or numeric type. Looking at the ave() code, I can understand that choice:
>>
>> function (x, ..., FUN = mean)
>> {
>> if (missing(...))
>> x[] <- FUN(x)
>> else {
>> g <- interaction(...)
>> split(x, g) <- lapply(split(x, g), FUN)
>> }
>> x
>> }
>>
>> If the factor is missing an element, then the corresponding element of X
>> is not changed in the output:
>>
>>> fact <- gl(2,2)
>>> fact[3] <- NA
>>> fact
>> [1] 1 1 <NA> 2
>> Levels: 1 2
>>> ave(1:4, fact)
>> [1] 1.5 1.5 3.0 4.0
>>
>> That's a reasonable plan, but it isn't the documented functioning of
>> ave(). From the document...
>>
>> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ave.html
>>
>> ...you get next to nothing about what the function actually does. It
>> does say that x is "a numeric," but the function does not throw an error
>> when x is not numeric. So if someone writes code expecting numeric x,
>> but a user provides a non-numeric x, there may be trouble.
>>
>> I suspect that the programmer saw that the code worked in her examples
>> and she went on to other things. I can't blame the documentation for
>> that, but it is possible that if it said something about the relation
>> between the type of the input and the type of the output she might have
>> written it differently. In addition, I probably would have caught it
>> sooner and I would have understood the problem.
>>
>> This is how I'll recommend they fix the bug in the code (thanks to those
>> of you who helped with this):
>>
>> temp.vec <- as.character( test.dat$FAMID[ test.dat$FTYPE != 6 ] )
>> test.dat$famsize[ test.dat$FTYPE != 6 ] <- as.vector( table( temp.vec )[
>> temp.vec ] )
>> rm(temp.vec)
>>
>> I think we should force FAMID to be character from the beginning, though.
>>
>> Best,
>> Mike
>>
>> FYI -- RFGLS code that fails in RFGLS version 1.1:
>>
>> library(RFGLS)
>>
>> data(pheno)
>> data(geno)
>> data(map)
>> data(pedigree)
>> data(rescovmtx)
>> #comment out the following two lines and it will run correctly
>> pedigree$FAMID <- as.character(pedigree$FAMID)
>> pheno$FAMID <- as.character(pheno$FAMID)
>> minigwas <- gls.batch(phenfile=pheno,genfile=data.frame(t(geno)),
>> pedifile=pedigree,
>> snp.names=map[,2],input.mode=c(1,2,3),pediheader=FALSE,
>> pedicolname=c("FAMID","ID","PID","MID","SEX"),
>> phen="Zscore",covars="IsFemale",
>> outfile=NULL,col.names=TRUE,return.value=TRUE)
>> str(minigwas)
>>
>> Mike
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list