[BioC] -which() pitfall in subsetting exressions
Wolfgang Huber
whuber at embl.de
Sun May 26 15:59:45 CEST 2013
[bioc-devel added and subject header changed]
Dear List & Developers
Mani's post reminded me of the pitfall posed by using expressions such as
x = x[ -which(someExpression) ]
which tends to have unintended consequences in particular if all elements of 'someExpression' are FALSE. Compare:
x=1:10
> x[ -which(x==17) ]
integer(0)
> x[ x!=17 ]
[1] 1 2 3 4 5 6 7 8 9 10
A quick search in Bioconductor source code found this idiom in multiple packages, including:
IRanges Biostrings beadarray BeadDataPackR BioNet affycoretools annotate copynumber CRImage DiffBind flowStats geNetClassifier GGtools maigesPack maskBAD MineICA MmPalateMiRNA Mulcom nem phyloseq plgem RBGL girafe Ringo RIPSeeker rnaSeqMap SomatiCA Starr VanillaICE
I cannot tell whether this poses an actual bug threat in any of these cases, but developers might want to double-check.
Best wishes
Wolfgang
On May 24, 2013, at 5:08 pm, "Narayanan, Manikandan (NIH/NIAID) [E]" <manikandan.narayanan at nih.gov> wrote:
> Great, thank you Alejandro - I will check out the code and let you know how it goes.
>
> In the mean time, I also found a small typo in the DEXSeqHTML function. The line
> results[which(is.na(results$pvalue)), ][, c("pvalue", "padjust")] = 1
> breaks with an error if there are no NA pvalues, and can be patched by changing to:
> results[which(is.na(results$pvalue)),c("pvalue", "padjust")] = 1
>
> While you are at it, it would also be useful if extraCols is indexed by geneIDs as rownames - that way we can send in extra info for as many
> genes in any order and it would be properly added to genetable for display in the DEXSeqHTML report. In detail, if you could change:
> genetable <- cbind(extraCols, genetable)
> to
> genetable <- cbind(extraCols[match(genetable$geneID, rownames(extraCols)),], genetable])
> or something similar and change the documentation appropriatley, that would be quite helpful to users too!
>
>
> Thanks,
> Mani
>
>
>
> ________________________________________
> From: Alejandro Reyes [alejandro.reyes at embl.de]
> Sent: Friday, May 24, 2013 8:19 AM
> To: Narayanan, Manikandan (NIH/NIAID) [E]
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] DEXSeq on two-exon genes: how to specify a formula without redundant terms
>
> Dear Manikandan Narayanan,
>
> Sorry for the delayed reply!
>
> Thanks for your e-mail! It has led us to spot two minor bugs and
> therefore modified the DEXSeq code in two aspects:
>
> Firstly, redundant terms were removed in DEXSeq in a very naive way: a
> lm.fit model was first fitted and the columns from the model matrix with
> NA coefficients were removed from the model matrix, now a proper
> function has been written for this purpose (with basically the code to
> what you also proposed). Secondly, our "if" statement that you mentioned
> was wrongly located within the function, which was causing your error
> message. It has been relocated and now it seems to work for its purpose
> ( I tested with your model frame and formula).
>
> You will find the changes in the last version of the svn (1.7.2). Let me
> know if it works for you or if you have additional questions.
>
> Best regards,
> Alejandro Reyes
>
>
>> Hi,
>> Would someone be kind enough to clarify the two questions raised about DEXSeq in my earlier email below?
>>
>> Would attaching any further information help? I am working with DEXSeq 1.6.0, and here are the model frame and model matrices if it makes it easier. Thanks!
>>
>>> mf
>> sample exon sizeFactor condition batch dispersion count
>> 1 Untr_biorep1 E001 1.0517803 Untr biorep1 0.13523549 25
>> 2 Untr_biorep1 E002 1.0517803 Untr biorep1 0.01656906 708
>> 3 LPS_biorep1 E001 1.0010474 LPS biorep1 0.13523549 20
>> 4 LPS_biorep1 E002 1.0010474 LPS biorep1 0.01656906 442
>> 5 Untr_biorep2 E001 0.8920483 Untr biorep2 0.13523549 26
>> 6 Untr_biorep2 E002 0.8920483 Untr biorep2 0.01656906 947
>> 7 LPS_biorep2 E001 1.1090401 LPS biorep2 0.13523549 25
>> 8 LPS_biorep2 E002 1.1090401 LPS biorep2 0.01656906 884
>>
>>> mm
>> (Intercept) sampleLPS_biorep2 sampleUntr_biorep1 sampleUntr_biorep2
>> 1 1 0 1 0
>> 2 1 0 1 0
>> 3 1 0 0 0
>> 4 1 0 0 0
>> 5 1 0 0 1
>> 6 1 0 0 1
>> 7 1 1 0 0
>> 8 1 1 0 0
>> conditionUntr batchbiorep2 exonE002 conditionUntr:exonE002
>> 1 1 0 0 0
>> 2 1 0 1 1
>> 3 0 0 0 0
>> 4 0 0 1 0
>> 5 1 1 0 0
>> 6 1 1 1 1
>> 7 0 1 0 0
>> 8 0 1 1 0
>> batchbiorep2:exonE002
>> 1 0
>> 2 0
>> 3 0
>> 4 0
>> 5 0
>> 6 1
>> 7 0
>> 8 1
>> attr(,"assign")
>> [1] 0 1 1 1 2 3 4 5 6
>> attr(,"contrasts")
>> attr(,"contrasts")$sample
>> [1] "contr.treatment"
>>
>> attr(,"contrasts")$condition
>> [1] "contr.treatment"
>>
>> attr(,"contrasts")$batch
>> [1] "contr.treatment"
>>
>> attr(,"contrasts")$exon
>> [1] "contr.treatment"
>>
>>
>>
>>
>>
>>
>>
>> From: Narayanan, Manikandan (NIH/NIAID) [E]
>> Sent: Thursday, May 16, 2013 11:14 AM
>> To: bioconductor at r-project.org
>> Subject: DEXSeq on two-exon genes: how to specify a formula without redundant terms
>>
>> Hi DEXSeq users/developers,
>> I have used DEXSeq successfuly for genes with many exons and really like the diagnostic/visualization plots that come with it. Recently though, for genes with two testable exons, I am getting the "Underdetermined model; cannot estimate dispersions." error.
>>
>> I figure this is due to redundant terms in my formula as shown in PS below. So my questions are:
>>
>> 1) Is there a way to specify the formula count ~ sample + (condition + batch) * exon so that redundant terms 'condition + batch' are removed?
>>
>> 2) If not, is it safe to change ncol(mm) to qr(mm)$rank (i.e., rank of model matrix to remove redundant terms) in this piece of code in estimateExonDispersionsForModelFrame:
>> if (nrow(mm) <= ncol(mm))
>> stop("Underdetermined model; cannot estimate dispersions. Maybe replicates have not been properly specified.")
>>
>> Would changing the code this way violate any assumptions of the DEXSeq model?
>>
>>
>> Thank you,
>> Mani
>>
>>
>> PS: # condition + batch terms are redundant as sample term is already present!
>>> formulaDispersion
>> count ~ sample + (condition + batch) * exon
>>
>>> design(ecs)
>> condition batch
>> Untr_biorep1 Untr biorep1
>> LPS_biorep1 LPS biorep1
>> Untr_biorep2 Untr biorep2
>> LPS_biorep2 LPS biorep2
>>
>>> colnames(model.matrix(formulaDispersion, mf))
>> [1] "(Intercept)" "sampleLPS_biorep2" "sampleUntr_biorep1"
>> [4] "sampleUntr_biorep2" "conditionUntr" "batchbiorep2"
>> [7] "exonE002" "conditionUntr:exonE002" "batchbiorep2:exonE002"
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list