[Bioc-devel] strandless introns with VariantAnnotation::locateVariants()

Thu Jun 11 18:27:48 CEST 2015

Valerie, thanks for fixing this quickly, I guess that if 
intronsByTranscript() returns introns with different strand within a 
single transcript is rather an annotation error than a biological 
feature, so it is good to prompt a warning about it.

robert.

On 6/11/15 5:51 PM, Valerie Obenchain wrote:
> This issue is that 'subject' is a GRangesList which has multiple 
> strands per list element (so a simple strand(subject) can't be used to 
> construct a GRanges). Additionally, there could be different strands 
> in a single list element. I've added a check for this and a warning is 
> thrown and strand is set to '*' if different strands are found in the 
> same list element.
>
> The fix has been checked into release and devel.
>
> Valerie
>
>
> On 06/11/2015 03:26 AM, Robert Castelo wrote:
>> hi Valerie,
>>
>> thanks for the fix, it seems that it required a more complex update than
>> i initially thought looking at the code. actually, i found out that this
>> update is breaking with the sample VCF file from the VariantFiltering
>> package. i guess this requires some further fix in VariantAnnotation.
>> The problem shows up in both, devel and release, please find at the
>> bottom both sessionInfo() outputs.
>>
>> thanks!!
>> robert.
>>
>> library(VariantAnnotation)
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>
>> fl <- system.file("extdata", "CEUtrio.vcf.bgz", 
>> package="VariantFiltering")
>>
>> vcf <- readVcf(fl, genome="hg19")
>>
>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>> seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
>>
>> ## this is necessary to avoid the chrM length glitch between
>> ## b37 (GATK bundle 2.8 reference genome) and hg19 versions
>> commonlev <- intersect(seqlevels(vcf), seqlevels(txdb))
>> commonlev <- commonlev[seqlengths(vcf)[commonlev] ==
>> seqlengths(txdb)[commonlev]]
>> vcf <- keepSeqlevels(vcf, commonlev)
>>
>> loc_intron <- locateVariants(vcf, txdb, region=IntronVariants())
>> Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
>>    subscript contains NAs or out-of-bounds indices
>> traceback()
>> 27: stop("subscript contains NAs or out-of-bounds indices")
>> 26: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
>> 25: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
>> 24: normalizeSingleBracketSubscript(i, x, as.NSBS = TRUE)
>> 23: extractROWS(x, i)
>> 22: extractROWS(x, i)
>> 21: ss[txHits]
>> 20: ss[txHits]
>> 19: is(strand, "Rle")
>> 18: newGRanges("GRanges", seqnames = seqnames, ranges = ranges, strand =
>> strand,
>>          mcols = DataFrame(...), seqlengths = seqlengths, seqinfo =
>> seqinfo)
>> 17: GRanges(seqnames = seqnames(query)[xHits], ranges =
>> IRanges(ranges(query)[xHits]),
>>          strand = ss[txHits], LOCATION = .location(length(xHits),
>>              vtype), LOCSTART = start(map), LOCEND = end(map), QUERYID =
>> xHits,
>>          TXID = txid, CDSID = cdsid, GENEID = NA_character_, PRECEDEID =
>> CharacterList(character(0)),
>>          FOLLOWID = CharacterList(character(0)))
>> 16: .makeResult(query, subject, "intron", ignore.strand = ignore.strand,
>>          asHits = asHits)
>> 15: .local(query, subject, region, ...)
>> 14: locateVariants(query, cache[["intbytx"]], region, ..., ignore.strand
>> = ignore.strand,
>>          asHits = asHits)
>> 13: locateVariants(query, cache[["intbytx"]], region, ..., ignore.strand
>> = ignore.strand,
>>          asHits = asHits)
>> 12: eval(expr, envir, enclos)
>> 11: eval(call, parent.frame())
>> 10: callGeneric(query, cache[["intbytx"]], region, ..., ignore.strand =
>> ignore.strand,
>>          asHits = asHits)
>> 9: .local(query, subject, region, ...)
>> 8: locateVariants(rowRanges(query), subject, region, ..., cache = cache,
>>         ignore.strand = ignore.strand, asHits = asHits)
>> 7: locateVariants(rowRanges(query), subject, region, ..., cache = cache,
>>         ignore.strand = ignore.strand, asHits = asHits)
>> 6: eval(expr, envir, enclos)
>> 5: eval(call, parent.frame())
>> 4: callGeneric(rowRanges(query), subject, region, ..., cache = cache,
>>         ignore.strand = ignore.strand, asHits = asHits)
>> 3: .local(query, subject, region, ...)
>> 2: locateVariants(vcf, txdb, region = IntronVariants())
>> 1: locateVariants(vcf, txdb, region = IntronVariants())
>> R Under development (unstable) (2015-04-28 r68268)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> Running under: Fedora release 12 (Constantine)
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C LC_TIME=en_US.UTF8
>>   [4] LC_COLLATE=en_US.UTF8     LC_MONETARY=en_US.UTF8
>> LC_MESSAGES=en_US.UTF8
>>   [7] LC_PAPER=en_US.UTF8       LC_NAME=C LC_ADDRESS=C
>> [10] LC_TELEPHONE=C            LC_MEASUREMENT=en_US.UTF8
>> LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats4    parallel  stats     graphics  grDevices utils datasets
>>   methods   base
>>
>> other attached packages:
>>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3 GenomicFeatures_1.21.12
>>   [3] AnnotationDbi_1.31.14 VariantAnnotation_1.15.11
>>   [5] Rsamtools_1.21.8                        Biostrings_2.37.2
>>   [7] XVector_0.9.1 SummarizedExperiment_0.1.5
>>   [9] Biobase_2.29.1 GenomicRanges_1.21.14
>> [11] GenomeInfoDb_1.5.7                      IRanges_2.3.10
>> [13] S4Vectors_0.7.3                         BiocGenerics_0.15.2
>> [15] vimcom_1.0-0                            setwidth_1.0-3
>> [17] colorout_1.0-3
>>
>> loaded via a namespace (and not attached):
>>   [1] GenomicAlignments_1.5.9 zlibbioc_1.15.0 BiocParallel_1.3.22
>> BSgenome_1.37.1
>>   [5] tools_3.3.0             DBI_0.3.1 lambda.r_1.1.7
>>        futile.logger_1.4.1
>>   [9] rtracklayer_1.29.7      futile.options_1.0.0 bitops_1.0-6
>> RCurl_1.95-4.6
>> [13] biomaRt_2.25.1          RSQLite_1.0.0 XML_3.98-1.2
>>
>>
>> *** SESSION INFORMATION FOR RELEASE ******
>>
>> sessionInfo()
>> R version 3.2.0 (2015-04-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> Running under: Fedora release 12 (Constantine)
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C LC_TIME=en_US.UTF8
>>   [4] LC_COLLATE=en_US.UTF8     LC_MONETARY=en_US.UTF8
>> LC_MESSAGES=en_US.UTF8
>>   [7] LC_PAPER=en_US.UTF8       LC_NAME=C LC_ADDRESS=C
>> [10] LC_TELEPHONE=C            LC_MEASUREMENT=en_US.UTF8
>> LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats4    parallel  stats     graphics  grDevices utils datasets
>>   methods   base
>>
>> other attached packages:
>>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2 GenomicFeatures_1.20.1
>>   [3] AnnotationDbi_1.30.1                    Biobase_2.28.0
>>   [5] VariantAnnotation_1.14.2                Rsamtools_1.20.4
>>   [7] Biostrings_2.36.1                       XVector_0.8.0
>>   [9] GenomicRanges_1.20.5                    GenomeInfoDb_1.4.0
>> [11] IRanges_2.2.4                           S4Vectors_0.6.0
>> [13] BiocGenerics_0.14.0                     vimcom_1.2-3
>> [15] setwidth_1.0-3                          colorout_1.1-0
>>
>> loaded via a namespace (and not attached):
>>   [1] zlibbioc_1.14.0         GenomicAlignments_1.4.1 BiocParallel_1.2.2
>>       BSgenome_1.36.0
>>   [5] tools_3.2.0             DBI_0.3.1 lambda.r_1.1.7
>>        futile.logger_1.4.1
>>   [9] rtracklayer_1.28.4      futile.options_1.0.0 bitops_1.0-6
>> RCurl_1.95-4.6
>> [13] biomaRt_2.24.0          RSQLite_1.0.0 XML_3.98-1.2
>>
>>
>> On 06/11/2015 01:55 AM, Valerie Obenchain wrote:
>>> Thanks for the report.
>>>
>>> Now fixed in release (1.14.2) and devel (1.3.25).
>>>
>>> Valerie
>>>
>>>
>>>
>>> On 06/09/2015 09:44 AM, Robert Castelo wrote:
>>>> hi,
>>>>
>>>> currently, the annotation of variants in intronic regions by
>>>> VariantAnnotation and the locateVariants() function does not assign
>>>> strand to annotations in introns:
>>>>
>>>> library(VariantAnnotation)
>>>> example(locateVariants)
>>>> loc_all[loc_all$LOCATION == "intron"]
>>>> GRanges object with 2 ranges and 9 metadata columns:
>>>> seqnames ranges strand | LOCATION LOCSTART LOCEND
>>>> QUERYID TXID CDSID GENEID
>>>> <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
>>>> <integer> <character> <IntegerList> <character>
>>>> chr1 [13302, 13302] * | intron 948 948
>>>> 3 2 100287102
>>>> chr1 [13327, 13327] * | intron 973 973
>>>> 4 2 100287102
>>>> PRECEDEID FOLLOWID
>>>> <CharacterList> <CharacterList>
>>>>
>>>>
>>>> -------
>>>> seqinfo: 1 sequence from hg19 genome; no seqlengths
>>>>
>>>>
>>>> however, introns are stranded, so I would suggest to include the 
>>>> strand
>>>> information in variants annotated to intronic regions. After a quick
>>>> look to the source code I believe the relevant line is within the
>>>> private function .makeResult(), concretely at:
>>>>
>>>> strand=strand(query)[xHits],
>>>>
>>>> where is taking the strand of the query (the variant) while I guess it
>>>> should be the subject (the annotation).
>>>>
>>>>
>>>> cheers,
>>>>
>>>> robert.
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>
>
>