[Bioc-devel] little tiny bug in CDSID annotations from predictCoding()

Robert Castelo robert.castelo at upf.edu
Tue Mar 17 17:04:27 CET 2015


hi,

there is a little tiny bug in the current devel version of 
VariantAnnotation::predictCoding(), and more concretely within
VariantAnnotation:::.localCoordinates(), that precludes the correct 
annotation of the CDSID column:

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)

vcf <- readVcf(system.file("extdata", "CEUtrio.vcf.bgz", 
package="VariantFiltering"), genome="hg19")
seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
vcf <- dropSeqlevels(vcf, "chrM")
coding1 <- predictCoding(vcf, txdb, Hsapiens)
head(coding1$CDSID)
IntegerList of length 6
[[1]] integer(0)
[[2]] integer(0)
[[3]] integer(0)
[[4]] integer(0)
[[5]] integer(0)
[[6]] integer(0)
table(elementLengths(coding1$CDSID))

    0
6038

my sessionInfo() is at the end of the message.

here is the patch, just replacing 'map2$trancriptHits' by 
'map2$transcriptsHits':

--- R/AllUtilities.R	(revision 100756)
+++ R/AllUtilities.R	(working copy)
@@ -284,7 +284,7 @@
      cdsid <- IntegerList(integer(0))
      map2 <- mapToTranscripts(unname(from)[xHits], to,
                               ignore.strand=ignore.strand)
-    cds <- mcols(unlist(to, use.names=FALSE))$cds_id[map2$trancriptsHits]
+    cds <- mcols(unlist(to, use.names=FALSE))$cds_id[map2$transcriptsHits]
      if (length(cds)) {
          cdslst <- unique(splitAsList(cds, map2$xHits))
          cdsid <- cdslst

with this fix then things seem to work again:

coding1 <- predictCoding(vcf, txdb, Hsapiens)
 > head(coding1$CDSID)
IntegerList of length 6
[["1"]] 21771
[["2"]] 21771
[["3"]] 21771
[["4"]] 21771
[["5"]] 21428
[["6"]] 21428
table(elementLengths(coding1$CDSID))

    1    2    3    4    5    6    7    8    9   10   12   13   14   16   19
  873 1229 1024  993  615  524  324  168   82   21   12   15   42   76   40

while investigating this bug i used VariantAnnotation::locateVariants() 
which also annotates the CDSID column, and it seemed to be working. 
however, i noticed that both, predictCoding() and locateVariants(), do 
not give an identical annotation for the CDSID column in coding variants:

coding2 <- locateVariants(vcf, txdb, CodingVariants())
head(loc_all$CDSID)
IntegerList of length 6
[["1"]] 210777
[["2"]] 210777
[["3"]] 210777
[["4"]] 210778
[["5"]] 208140
[["6"]] 208141
table(elementLengths(coding2$CDSID))

    1    2    3    4
4987  901  138   12

in principle, it seems that both are annotating valid CDSID keys:

allcdsinfo <- select(txdb, keys=keys(txdb, keytype="CDSID"), 
columns="CDSID", keytype="CDSID")
sum(!as.character(unlist(coding1$CDSID, use.names=FALSE)) %in% 
allcdsinfo$CDSID)
[1] 0
sum(!as.character(unlist(coding2$CDSID, use.names=FALSE)) %in% 
allcdsinfo$CDSID)
[1] 0

but predictCoding() annotates CDSID values that are not present in 
locateVariants() annotations and viceversa:

sum(!as.character(unlist(coding1$CDSID, use.names=FALSE)) %in% 
as.character(unlist(coding2$CDSID, use.names=FALSE)))
[1] 24057
sum(!as.character(unlist(coding2$CDSID, use.names=FALSE)) %in% 
as.character(unlist(coding1$CDSID, use.names=FALSE)))
[1] 7251

length(unique(intersect(as.character(unlist(coding2$CDSID, 
use.names=FALSE)), as.character(unlist(coding1$CDSID, use.names=FALSE)))))
[1] 0

should not both annotate the same CDSID values on coding variants?


thanks!
robert.
ps: sessionInfo()
R Under development (unstable) (2014-10-14 r66765)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF8        LC_COLLATE=en_US.UTF8
  [5] LC_MONETARY=en_US.UTF8    LC_MESSAGES=en_US.UTF8
  [7] LC_PAPER=en_US.UTF8       LC_NAME=C
  [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices
[6] utils     datasets  methods   base

other attached packages:
  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.1
  [2] GenomicFeatures_1.19.31
  [3] AnnotationDbi_1.29.17
  [4] Biobase_2.27.2
  [5] BSgenome.Hsapiens.UCSC.hg19_1.4.0
  [6] BSgenome_1.35.17
  [7] rtracklayer_1.27.8
  [8] VariantAnnotation_1.13.40
  [9] Rsamtools_1.19.44
[10] Biostrings_2.35.11
[11] XVector_0.7.4
[12] GenomicRanges_1.19.46
[13] GenomeInfoDb_1.3.13
[14] IRanges_2.1.43
[15] S4Vectors_0.5.22
[16] BiocGenerics_0.13.6
[17] vimcom_1.0-0
[18] setwidth_1.0-3
[19] colorout_1.0-3

loaded via a namespace (and not attached):
  [1] BiocParallel_1.1.15      biomaRt_2.23.5
  [3] bitops_1.0-6             DBI_0.3.1
  [5] GenomicAlignments_1.3.31 RCurl_1.95-4.5
  [7] RSQLite_1.0.0            tools_3.2.0
  [9] XML_3.98-1.1             zlibbioc_1.13.2



More information about the Bioc-devel mailing list