[Bioc-devel] strandless introns with VariantAnnotation::locateVariants()

Robert Castelo robert.castelo at upf.edu
Thu Jun 11 12:26:04 CEST 2015


hi Valerie,

thanks for the fix, it seems that it required a more complex update than 
i initially thought looking at the code. actually, i found out that this 
update is breaking with the sample VCF file from the VariantFiltering 
package. i guess this requires some further fix in VariantAnnotation. 
The problem shows up in both, devel and release, please find at the 
bottom both sessionInfo() outputs.

thanks!!
robert.

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

fl <- system.file("extdata", "CEUtrio.vcf.bgz", package="VariantFiltering")

vcf <- readVcf(fl, genome="hg19")

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)

## this is necessary to avoid the chrM length glitch between
## b37 (GATK bundle 2.8 reference genome) and hg19 versions
commonlev <- intersect(seqlevels(vcf), seqlevels(txdb))
commonlev <- commonlev[seqlengths(vcf)[commonlev] == 
seqlengths(txdb)[commonlev]]
vcf <- keepSeqlevels(vcf, commonlev)

loc_intron <- locateVariants(vcf, txdb, region=IntronVariants())
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
   subscript contains NAs or out-of-bounds indices
traceback()
27: stop("subscript contains NAs or out-of-bounds indices")
26: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
25: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
24: normalizeSingleBracketSubscript(i, x, as.NSBS = TRUE)
23: extractROWS(x, i)
22: extractROWS(x, i)
21: ss[txHits]
20: ss[txHits]
19: is(strand, "Rle")
18: newGRanges("GRanges", seqnames = seqnames, ranges = ranges, strand = 
strand,
         mcols = DataFrame(...), seqlengths = seqlengths, seqinfo = seqinfo)
17: GRanges(seqnames = seqnames(query)[xHits], ranges = 
IRanges(ranges(query)[xHits]),
         strand = ss[txHits], LOCATION = .location(length(xHits),
             vtype), LOCSTART = start(map), LOCEND = end(map), QUERYID = 
xHits,
         TXID = txid, CDSID = cdsid, GENEID = NA_character_, PRECEDEID = 
CharacterList(character(0)),
         FOLLOWID = CharacterList(character(0)))
16: .makeResult(query, subject, "intron", ignore.strand = ignore.strand,
         asHits = asHits)
15: .local(query, subject, region, ...)
14: locateVariants(query, cache[["intbytx"]], region, ..., ignore.strand 
= ignore.strand,
         asHits = asHits)
13: locateVariants(query, cache[["intbytx"]], region, ..., ignore.strand 
= ignore.strand,
         asHits = asHits)
12: eval(expr, envir, enclos)
11: eval(call, parent.frame())
10: callGeneric(query, cache[["intbytx"]], region, ..., ignore.strand = 
ignore.strand,
         asHits = asHits)
9: .local(query, subject, region, ...)
8: locateVariants(rowRanges(query), subject, region, ..., cache = cache,
        ignore.strand = ignore.strand, asHits = asHits)
7: locateVariants(rowRanges(query), subject, region, ..., cache = cache,
        ignore.strand = ignore.strand, asHits = asHits)
6: eval(expr, envir, enclos)
5: eval(call, parent.frame())
4: callGeneric(rowRanges(query), subject, region, ..., cache = cache,
        ignore.strand = ignore.strand, asHits = asHits)
3: .local(query, subject, region, ...)
2: locateVariants(vcf, txdb, region = IntronVariants())
1: locateVariants(vcf, txdb, region = IntronVariants())
R Under development (unstable) (2015-04-28 r68268)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora release 12 (Constantine)

locale:
  [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C 
LC_TIME=en_US.UTF8
  [4] LC_COLLATE=en_US.UTF8     LC_MONETARY=en_US.UTF8 
LC_MESSAGES=en_US.UTF8
  [7] LC_PAPER=en_US.UTF8       LC_NAME=C                 LC_ADDRESS=C 

[10] LC_TELEPHONE=C            LC_MEASUREMENT=en_US.UTF8 
LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
  methods   base

other attached packages:
  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3 GenomicFeatures_1.21.12 

  [3] AnnotationDbi_1.31.14                   VariantAnnotation_1.15.11 

  [5] Rsamtools_1.21.8                        Biostrings_2.37.2 

  [7] XVector_0.9.1                           SummarizedExperiment_0.1.5 

  [9] Biobase_2.29.1                          GenomicRanges_1.21.14 

[11] GenomeInfoDb_1.5.7                      IRanges_2.3.10 

[13] S4Vectors_0.7.3                         BiocGenerics_0.15.2 

[15] vimcom_1.0-0                            setwidth_1.0-3 

[17] colorout_1.0-3

loaded via a namespace (and not attached):
  [1] GenomicAlignments_1.5.9 zlibbioc_1.15.0 
BiocParallel_1.3.22     BSgenome_1.37.1
  [5] tools_3.3.0             DBI_0.3.1               lambda.r_1.1.7 
       futile.logger_1.4.1
  [9] rtracklayer_1.29.7      futile.options_1.0.0    bitops_1.0-6 
       RCurl_1.95-4.6
[13] biomaRt_2.25.1          RSQLite_1.0.0           XML_3.98-1.2


*** SESSION INFORMATION FOR RELEASE ******

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora release 12 (Constantine)

locale:
  [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C 
LC_TIME=en_US.UTF8
  [4] LC_COLLATE=en_US.UTF8     LC_MONETARY=en_US.UTF8 
LC_MESSAGES=en_US.UTF8
  [7] LC_PAPER=en_US.UTF8       LC_NAME=C                 LC_ADDRESS=C 

[10] LC_TELEPHONE=C            LC_MEASUREMENT=en_US.UTF8 
LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
  methods   base

other attached packages:
  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2 GenomicFeatures_1.20.1 

  [3] AnnotationDbi_1.30.1                    Biobase_2.28.0 

  [5] VariantAnnotation_1.14.2                Rsamtools_1.20.4 

  [7] Biostrings_2.36.1                       XVector_0.8.0 

  [9] GenomicRanges_1.20.5                    GenomeInfoDb_1.4.0 

[11] IRanges_2.2.4                           S4Vectors_0.6.0 

[13] BiocGenerics_0.14.0                     vimcom_1.2-3 

[15] setwidth_1.0-3                          colorout_1.1-0 


loaded via a namespace (and not attached):
  [1] zlibbioc_1.14.0         GenomicAlignments_1.4.1 BiocParallel_1.2.2 
      BSgenome_1.36.0
  [5] tools_3.2.0             DBI_0.3.1               lambda.r_1.1.7 
       futile.logger_1.4.1
  [9] rtracklayer_1.28.4      futile.options_1.0.0    bitops_1.0-6 
       RCurl_1.95-4.6
[13] biomaRt_2.24.0          RSQLite_1.0.0           XML_3.98-1.2


On 06/11/2015 01:55 AM, Valerie Obenchain wrote:
> Thanks for the report.
>
> Now fixed in release (1.14.2) and devel (1.3.25).
>
> Valerie
>
>
>
> On 06/09/2015 09:44 AM, Robert Castelo wrote:
>> hi,
>>
>> currently, the annotation of variants in intronic regions by
>> VariantAnnotation and the locateVariants() function does not assign
>> strand to annotations in introns:
>>
>> library(VariantAnnotation)
>> example(locateVariants)
>> loc_all[loc_all$LOCATION == "intron"]
>> GRanges object with 2 ranges and 9 metadata columns:
>> seqnames ranges strand | LOCATION LOCSTART LOCEND
>> QUERYID TXID CDSID GENEID
>> <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
>> <integer> <character> <IntegerList> <character>
>> chr1 [13302, 13302] * | intron 948 948
>> 3 2 100287102
>> chr1 [13327, 13327] * | intron 973 973
>> 4 2 100287102
>> PRECEDEID FOLLOWID
>> <CharacterList> <CharacterList>
>>
>>
>> -------
>> seqinfo: 1 sequence from hg19 genome; no seqlengths
>>
>>
>> however, introns are stranded, so I would suggest to include the strand
>> information in variants annotated to intronic regions. After a quick
>> look to the source code I believe the relevant line is within the
>> private function .makeResult(), concretely at:
>>
>> strand=strand(query)[xHits],
>>
>> where is taking the strand of the query (the variant) while I guess it
>> should be the subject (the annotation).
>>
>>
>> cheers,
>>
>> robert.
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>

-- 
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550



More information about the Bioc-devel mailing list