[Bioc-devel] little tiny bug in CDSID annotations from predictCoding()
Valerie Obenchain
vobencha at fredhutch.org
Tue Mar 17 19:59:41 CET 2015
The second mapping in .localCoordinates() in AllUtils.R was using a
listed 'to' instead of unlisted 'to'. This means the mapping was at the
level of the outer list elements vs the inner. We want to map/overlap
with an unlisted object because each range can have a different CDSID.
This second mapping provided the index for retrieving the CDSIDs which
is why you saw a difference between the two.
This was a bug for predictCoding only.
Valerie
On 03/17/2015 11:39 AM, Robert Castelo wrote:
> Thanks! Do you have an explanation for the apparent disagreement in
> CDSID annotations that i described below the bug, between
> predictCoding() and locateVariants()?
>
> robert.
>
>
> -------- Original message --------
> From: Valerie Obenchain
> Date:17/03/2015 19:18 (GMT+01:00)
> To: bioc-devel at r-project.org
> Subject: Re: [Bioc-devel] little tiny bug in CDSID annotations from
> predictCoding()
>
> Hi Robert,
>
> Thanks for reporting the typo and bug. Now fixed in 1.13.41.
>
> Valerie
>
> On 03/17/2015 10:58 AM, Robert Castelo wrote:
> > in my message below, the line that it says:
> >
> > head(loc_all$CDSID)
> >
> > it should say
> >
> > head(coding2$CDSID)
> >
> > cheers,
> >
> > robert.
> >
> > ==========================================
> > hi,
> >
> > there is a little tiny bug in the current devel version of
> > VariantAnnotation::predictCoding(), and more concretely within
> > VariantAnnotation:::.localCoordinates(), that precludes the correct
> > annotation of the CDSID column:
> >
> > library(VariantAnnotation)
> > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> > library(BSgenome.Hsapiens.UCSC.hg19)
> >
> > vcf <- readVcf(system.file("extdata", "CEUtrio.vcf.bgz",
> > package="VariantFiltering"), genome="hg19")
> > seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
> > vcf <- dropSeqlevels(vcf, "chrM")
> > coding1 <- predictCoding(vcf, txdb, Hsapiens)
> > head(coding1$CDSID)
> > IntegerList of length 6
> > [[1]] integer(0)
> > [[2]] integer(0)
> > [[3]] integer(0)
> > [[4]] integer(0)
> > [[5]] integer(0)
> > [[6]] integer(0)
> > table(elementLengths(coding1$CDSID))
> >
> > 0
> > 6038
> >
> > my sessionInfo() is at the end of the message.
> >
> > here is the patch, just replacing 'map2$trancriptHits' by
> > 'map2$transcriptsHits':
> >
> > --- R/AllUtilities.R (revision 100756)
> > +++ R/AllUtilities.R (working copy)
> > @@ -284,7 +284,7 @@
> > cdsid <- IntegerList(integer(0))
> > map2 <- mapToTranscripts(unname(from)[xHits], to,
> > ignore.strand=ignore.strand)
> > - cds <- mcols(unlist(to,
> use.names=FALSE))$cds_id[map2$trancriptsHits]
> > + cds <- mcols(unlist(to,
> use.names=FALSE))$cds_id[map2$transcriptsHits]
> > if (length(cds)) {
> > cdslst <- unique(splitAsList(cds, map2$xHits))
> > cdsid <- cdslst
> >
> > with this fix then things seem to work again:
> >
> > coding1 <- predictCoding(vcf, txdb, Hsapiens)
> >> head(coding1$CDSID)
> > IntegerList of length 6
> > [["1"]] 21771
> > [["2"]] 21771
> > [["3"]] 21771
> > [["4"]] 21771
> > [["5"]] 21428
> > [["6"]] 21428
> > table(elementLengths(coding1$CDSID))
> >
> > 1 2 3 4 5 6 7 8 9 10 12 13 14
> 16 19
> > 873 1229 1024 993 615 524 324 168 82 21 12 15 42
> 76 40
> >
> > while investigating this bug i used VariantAnnotation::locateVariants()
> > which also annotates the CDSID column, and it seemed to be working.
> > however, i noticed that both, predictCoding() and locateVariants(), do
> > not give an identical annotation for the CDSID column in coding variants:
> >
> > coding2 <- locateVariants(vcf, txdb, CodingVariants())
> > head(loc_all$CDSID)
> > IntegerList of length 6
> > [["1"]] 210777
> > [["2"]] 210777
> > [["3"]] 210777
> > [["4"]] 210778
> > [["5"]] 208140
> > [["6"]] 208141
> > table(elementLengths(coding2$CDSID))
> >
> > 1 2 3 4
> > 4987 901 138 12
> >
> > in principle, it seems that both are annotating valid CDSID keys:
> >
> > allcdsinfo <- select(txdb, keys=keys(txdb, keytype="CDSID"),
> > columns="CDSID", keytype="CDSID")
> > sum(!as.character(unlist(coding1$CDSID, use.names=FALSE)) %in%
> > allcdsinfo$CDSID)
> > [1] 0
> > sum(!as.character(unlist(coding2$CDSID, use.names=FALSE)) %in%
> > allcdsinfo$CDSID)
> > [1] 0
> >
> > but predictCoding() annotates CDSID values that are not present in
> > locateVariants() annotations and viceversa:
> >
> > sum(!as.character(unlist(coding1$CDSID, use.names=FALSE)) %in%
> > as.character(unlist(coding2$CDSID, use.names=FALSE)))
> > [1] 24057
> > sum(!as.character(unlist(coding2$CDSID, use.names=FALSE)) %in%
> > as.character(unlist(coding1$CDSID, use.names=FALSE)))
> > [1] 7251
> >
> > length(unique(intersect(as.character(unlist(coding2$CDSID,
> > use.names=FALSE)), as.character(unlist(coding1$CDSID,
> use.names=FALSE)))))
> > [1] 0
> >
> > should not both annotate the same CDSID values on coding variants?
> >
> >
> > thanks!
> > robert.
> > ps: sessionInfo()
> > R Under development (unstable) (2014-10-14 r66765)
> > Platform: x86_64-unknown-linux-gnu (64-bit)
> >
> > locale:
> > [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C
> > [3] LC_TIME=en_US.UTF8 LC_COLLATE=en_US.UTF8
> > [5] LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8
> > [7] LC_PAPER=en_US.UTF8 LC_NAME=C
> > [9] LC_ADDRESS=C LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] stats4 parallel stats graphics grDevices
> > [6] utils datasets methods base
> >
> > other attached packages:
> > [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.1
> > [2] GenomicFeatures_1.19.31
> > [3] AnnotationDbi_1.29.17
> > [4] Biobase_2.27.2
> > [5] BSgenome.Hsapiens.UCSC.hg19_1.4.0
> > [6] BSgenome_1.35.17
> > [7] rtracklayer_1.27.8
> > [8] VariantAnnotation_1.13.40
> > [9] Rsamtools_1.19.44
> > [10] Biostrings_2.35.11
> > [11] XVector_0.7.4
> > [12] GenomicRanges_1.19.46
> > [13] GenomeInfoDb_1.3.13
> > [14] IRanges_2.1.43
> > [15] S4Vectors_0.5.22
> > [16] BiocGenerics_0.13.6
> > [17] vimcom_1.0-0
> > [18] setwidth_1.0-3
> > [19] colorout_1.0-3
> >
> > loaded via a namespace (and not attached):
> > [1] BiocParallel_1.1.15 biomaRt_2.23.5
> > [3] bitops_1.0-6 DBI_0.3.1
> > [5] GenomicAlignments_1.3.31 RCurl_1.95-4.5
> > [7] RSQLite_1.0.0 tools_3.2.0
> > [9] XML_3.98-1.1 zlibbioc_1.13.2
> >
> > _______________________________________________
> > Bioc-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109
Email: vobencha at fredhutch.org
Phone: (206) 667-3158
More information about the Bioc-devel
mailing list