[R] debugging R code and dealing with dependencies

Mike Miller mbmiller+l at gmail.com
Thu Dec 25 21:45:04 CET 2014


Thanks, but I was already in touch with Rob Kirkpatrick about it.  We all 
work together at U Minnesota, or did until Rob went to VCU.

Mike


On Thu, 25 Dec 2014, Uwe Ligges wrote:

> This is a rather detailed analysis, thanks, but I think it should be 
> send to the maintainer of the "RFGLS" package (CCing).
>
> Best,
> Uwe Ligges
>
>
> On 25.12.2014 10:04, Mike Miller wrote:
>> I just wanted to put this out there.  It's just some of my observations
>> about things that happen with R, or happened in this particular
>> investigation.  There were definitely some lessons for me in this, and
>> maybe that will be true of someone else.  The main thing I picked up is
>> that it is good to put plenty of checks into our code -- if we expect
>> input of a certain type or class, then I should either coerce input into
>> that structure or test the input and throw an error.  If the function
>> works very differently for different kinds of input, this should be
>> documented.  The more people are doing this, the better things will go
>> for everyone.
>> 
>> 
>> I was working with a CRAN package called RFGLS...
>> 
>> http://cran.r-project.org/web/packages/RFGLS/index.html
>> 
>> ...and I was getting an error.  After a few rounds of testing I realized
>> that the error was caused by a FAMID variable that was of character type.
>> 
>> The problem seemed to be that gls.batch() expected FAMID to be integers,
>> but the default ought to be character type because family and individual
>> IDs in nearly all genetic-analysis software are character strings (they
>> might even be people's names).
>> 
>> This was the error:
>> 
>> Error in sum(blocksize) : invalid 'type' (character) of argument
>> Calls: gls.batch -> bdsmatrix
>> 
>> To figure out more about it, I spent a bunch of time to go from CMD
>> BATCH mode to an interactive session so that I could look at
>> traceback().  That got me this additional info:
>> 
>>> traceback()
>> 2: bdsmatrix(sizelist, lme.out$sigma at blocks, dimnames = list(id, id))
>> 
>> bdsmatrix() is from a package on which RFGLS depends:
>> 
>> http://cran.r-project.org/web/packages/bdsmatrix/index.html
>> 
>> The problem is that RFGLS's gls.batch() function is sending something to
>> bdsmatrix's bdsmatrix() that it can't handle.  So I look at the code for
>> bdsmatrix() and I see this:
>>
>>       if (any(blocksize <= 0))
>>           stop("Block sizes must be >0")
>>       if (any(as.integer(blocksize) != blocksize))
>>           stop("Block sizes must be integers")
>>       n1 <- as.integer(sum(blocksize))
>> 
>> The condition any(as.integer(blocksize) != blocksize)) fails (is TRUE)
>> only if blocksize contains one or more noninteger numeric values.  It
>> doesn't fail if blocksize is character or logical if the character
>> strings are integers.  Example:
>> 
>>> 4=="4"
>> [1] TRUE
>> 
>> That's an interesting feature of R, but I guess that's how it works.
>> Also this:
>> 
>>> 1=="1"
>> [1] TRUE
>>> 1==TRUE
>> [1] TRUE
>>> "1"==TRUE
>> [1] FALSE
>> 
>> bdsmatrix() has no test that blocksize is numeric, so it fails when
>> sum(blocksize) cannot sum character strings.
>> 
>> Next I had to figure out where RFGLS's gls.batch() is going wrong in
>> producing sizelist.  It is created in a number of steps, but I
>> identified this line as especially suspicious:
>> 
>> test.dat$famsize[test.dat$FTYPE!=6]=ave(test.dat$FAMID[test.dat$FTYPE!=6],test.dat$FAMID[test.dat$FTYPE!=6],FUN=length)
>> 
>> 
>> famsize was later converted to sizelist, and this line also includes
>> FAMID, so this is likely where the problem originates.  Of course this
>> is the big problem with debugging -- it's hard to find the source of an
>> error that occurs far downstream in another function from a different
>> package. I see that ave() is used, so I have to understand ave().
>> 
>> William Dunlap provided some guidance:
>> 
>> "ave() uses its first argument, 'x', to set the length of its output and
>> to make an initial guess at the type of its output.  The return value of
>> FUN can alter the type, but only in an 'upward' direction where
>> logical<integer<numeric<complex<character<list.  (This is the same rule
>> that x[i]<-newvalue uses.)"
>> 
>> In other words, if x is of character type, the output cannot be of
>> integer or numeric type even if the output of FUN is always of integer
>> or numeric type.  Looking at the ave() code, I can understand that choice:
>> 
>> function (x, ..., FUN = mean)
>> {
>>      if (missing(...))
>>          x[] <- FUN(x)
>>      else {
>>          g <- interaction(...)
>>          split(x, g) <- lapply(split(x, g), FUN)
>>      }
>>      x
>> }
>> 
>> If the factor is missing an element, then the corresponding element of X
>> is not changed in the output:
>> 
>>> fact <- gl(2,2)
>>> fact[3] <- NA
>>> fact
>> [1] 1    1    <NA> 2
>> Levels: 1 2
>>> ave(1:4, fact)
>> [1] 1.5 1.5 3.0 4.0
>> 
>> That's a reasonable plan, but it isn't the documented functioning of
>> ave().  From the document...
>> 
>> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ave.html
>> 
>> ...you get next to nothing about what the function actually does.  It
>> does say that x is "a numeric," but the function does not throw an error
>> when x is not numeric.  So if someone writes code expecting numeric x,
>> but a user provides a non-numeric x, there may be trouble.
>> 
>> I suspect that the programmer saw that the code worked in her examples
>> and she went on to other things.  I can't blame the documentation for
>> that, but it is possible that if it said something about the relation
>> between the type of the input and the type of the output she might have
>> written it differently.  In addition, I probably would have caught it
>> sooner and I would have understood the problem.
>> 
>> This is how I'll recommend they fix the bug in the code (thanks to those
>> of you who helped with this):
>> 
>> temp.vec <- as.character( test.dat$FAMID[ test.dat$FTYPE != 6 ] )
>> test.dat$famsize[ test.dat$FTYPE != 6 ] <- as.vector( table( temp.vec )[
>> temp.vec ] )
>> rm(temp.vec)
>> 
>> I think we should force FAMID to be character from the beginning, though.
>> 
>> Best,
>> Mike
>> 
>> FYI -- RFGLS code that fails in RFGLS version 1.1:
>> 
>> library(RFGLS)
>> 
>> data(pheno)
>> data(geno)
>> data(map)
>> data(pedigree)
>> data(rescovmtx)
>>   #comment out the following two lines and it will run correctly
>> pedigree$FAMID <- as.character(pedigree$FAMID)
>> pheno$FAMID <- as.character(pheno$FAMID)
>> minigwas <- gls.batch(phenfile=pheno,genfile=data.frame(t(geno)),
>>    pedifile=pedigree,
>>    snp.names=map[,2],input.mode=c(1,2,3),pediheader=FALSE,
>>    pedicolname=c("FAMID","ID","PID","MID","SEX"),
>>    phen="Zscore",covars="IsFemale",
>>    outfile=NULL,col.names=TRUE,return.value=TRUE)
>> str(minigwas)
>> 
>> Mike
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list