[BioC] -which() pitfall in subsetting exressions

Sun May 26 15:59:45 CEST 2013

[bioc-devel added and subject header changed]

Dear List & Developers
Mani's post reminded me of the pitfall posed by using expressions such as

    x = x[ -which(someExpression) ]

which tends to have unintended consequences in particular if all elements of 'someExpression' are FALSE. Compare:

x=1:10
> x[ -which(x==17) ]
integer(0)
> x[ x!=17 ]
 [1]  1  2  3  4  5  6  7  8  9 10

A quick search in Bioconductor source code found this idiom in multiple packages, including:

IRanges Biostrings beadarray BeadDataPackR BioNet affycoretools annotate copynumber CRImage DiffBind flowStats geNetClassifier GGtools maigesPack maskBAD MineICA MmPalateMiRNA Mulcom nem phyloseq plgem RBGL girafe Ringo RIPSeeker rnaSeqMap SomatiCA Starr VanillaICE

I cannot tell whether this poses an actual bug threat in any of these cases, but developers might want to double-check.

	Best wishes
	Wolfgang

On May 24, 2013, at 5:08 pm, "Narayanan, Manikandan (NIH/NIAID) [E]" <manikandan.narayanan at nih.gov> wrote:

> Great, thank you Alejandro - I will check out the code and let you know how it goes. 
> 
> In the mean time, I also found a small typo in the DEXSeqHTML function. The line 
>  results[which(is.na(results$pvalue)), ][, c("pvalue", "padjust")] = 1
> breaks with an error if there are no NA pvalues, and can be patched by changing to:
>    results[which(is.na(results$pvalue)),c("pvalue", "padjust")] = 1 
> 
> While you are at it, it would also be useful if extraCols is indexed by geneIDs as rownames - that way we can send in extra info for as many 
> genes in any order and it would be properly added to genetable for display in the DEXSeqHTML report. In detail, if you could change:
>  genetable <- cbind(extraCols, genetable)
> to 
>  genetable <- cbind(extraCols[match(genetable$geneID, rownames(extraCols)),], genetable]) 
> or something similar and change the documentation appropriatley, that would be quite helpful to users too!
> 
> 
> Thanks, 
> Mani
> 
> 
> 
> ________________________________________
> From: Alejandro Reyes [alejandro.reyes at embl.de]
> Sent: Friday, May 24, 2013 8:19 AM
> To: Narayanan, Manikandan (NIH/NIAID) [E]
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] DEXSeq on two-exon genes: how to specify a formula without redundant terms
> 
> Dear Manikandan Narayanan,
> 
> Sorry for the delayed reply!
> 
> Thanks for your e-mail! It has led us to spot two minor bugs and
> therefore modified the DEXSeq code in two aspects:
> 
> Firstly, redundant terms were removed in DEXSeq in a very naive way:  a
> lm.fit model was first fitted and the columns from the model matrix with
> NA coefficients were removed from the model matrix, now a proper
> function has been written for this purpose (with basically the code to
> what you also proposed). Secondly, our "if" statement that you mentioned
> was wrongly located within the function, which was causing your error
> message. It has been relocated and now it seems to work for its purpose
> ( I tested with your model frame and formula).
> 
> You will find the changes in the last version of the svn (1.7.2). Let me
> know if it works for you or if you have additional questions.
> 
> Best regards,
> Alejandro Reyes
> 
> 
>> Hi,
>>   Would someone be kind enough to clarify the two questions raised about DEXSeq in my earlier email below?
>> 
>>   Would attaching any further information help? I am working with DEXSeq 1.6.0, and here are the model frame and model matrices if it makes it easier. Thanks!
>> 
>>> mf
>>         sample exon sizeFactor condition   batch dispersion count
>> 1 Untr_biorep1 E001  1.0517803      Untr biorep1 0.13523549    25
>> 2 Untr_biorep1 E002  1.0517803      Untr biorep1 0.01656906   708
>> 3  LPS_biorep1 E001  1.0010474       LPS biorep1 0.13523549    20
>> 4  LPS_biorep1 E002  1.0010474       LPS biorep1 0.01656906   442
>> 5 Untr_biorep2 E001  0.8920483      Untr biorep2 0.13523549    26
>> 6 Untr_biorep2 E002  0.8920483      Untr biorep2 0.01656906   947
>> 7  LPS_biorep2 E001  1.1090401       LPS biorep2 0.13523549    25
>> 8  LPS_biorep2 E002  1.1090401       LPS biorep2 0.01656906   884
>> 
>>> mm
>>   (Intercept) sampleLPS_biorep2 sampleUntr_biorep1 sampleUntr_biorep2
>> 1           1                 0                  1                  0
>> 2           1                 0                  1                  0
>> 3           1                 0                  0                  0
>> 4           1                 0                  0                  0
>> 5           1                 0                  0                  1
>> 6           1                 0                  0                  1
>> 7           1                 1                  0                  0
>> 8           1                 1                  0                  0
>>   conditionUntr batchbiorep2 exonE002 conditionUntr:exonE002
>> 1             1            0        0                      0
>> 2             1            0        1                      1
>> 3             0            0        0                      0
>> 4             0            0        1                      0
>> 5             1            1        0                      0
>> 6             1            1        1                      1
>> 7             0            1        0                      0
>> 8             0            1        1                      0
>>   batchbiorep2:exonE002
>> 1                     0
>> 2                     0
>> 3                     0
>> 4                     0
>> 5                     0
>> 6                     1
>> 7                     0
>> 8                     1
>> attr(,"assign")
>> [1] 0 1 1 1 2 3 4 5 6
>> attr(,"contrasts")
>> attr(,"contrasts")$sample
>> [1] "contr.treatment"
>> 
>> attr(,"contrasts")$condition
>> [1] "contr.treatment"
>> 
>> attr(,"contrasts")$batch
>> [1] "contr.treatment"
>> 
>> attr(,"contrasts")$exon
>> [1] "contr.treatment"
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> From: Narayanan, Manikandan (NIH/NIAID) [E]
>> Sent: Thursday, May 16, 2013 11:14 AM
>> To: bioconductor at r-project.org
>> Subject: DEXSeq on two-exon genes: how to specify a formula without redundant terms
>> 
>> Hi DEXSeq users/developers,
>>   I have used DEXSeq successfuly for genes with many exons and really like the diagnostic/visualization plots that come with it. Recently though, for genes with two testable exons, I am getting the "Underdetermined model; cannot estimate dispersions." error.
>> 
>>   I figure this is due to redundant terms in my formula as shown in PS below. So my questions are:
>> 
>> 1) Is there a way to specify the formula  count ~ sample + (condition + batch) * exon so that redundant terms 'condition + batch' are removed?
>> 
>> 2) If not, is it safe to change ncol(mm) to qr(mm)$rank (i.e., rank of model matrix to remove redundant terms) in this piece of code in estimateExonDispersionsForModelFrame:
>>     if (nrow(mm) <= ncol(mm))
>>         stop("Underdetermined model; cannot estimate dispersions. Maybe replicates have not been properly specified.")
>> 
>> Would changing the code this way violate any assumptions of the DEXSeq model?
>> 
>> 
>> Thank you,
>> Mani
>> 
>> 
>> PS: # condition + batch terms are redundant as sample term is already present!
>>> formulaDispersion
>> count ~ sample + (condition + batch) * exon
>> 
>>> design(ecs)
>>                  condition       batch
>> Untr_biorep1      Untr     biorep1
>> LPS_biorep1        LPS     biorep1
>> Untr_biorep2      Untr     biorep2
>> LPS_biorep2        LPS     biorep2
>> 
>>> colnames(model.matrix(formulaDispersion, mf))
>> [1] "(Intercept)"            "sampleLPS_biorep2"      "sampleUntr_biorep1"
>> [4] "sampleUntr_biorep2"     "conditionUntr"          "batchbiorep2"
>> [7] "exonE002"               "conditionUntr:exonE002" "batchbiorep2:exonE002"
>> 
>>      [[alternative HTML version deleted]]
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor