[BioC] Complete variant toolbox: gmapR/VariantTools/VariantAnnotation - revived
Robert Castelo
robert.castelo at upf.edu
Tue Jul 29 05:41:47 CEST 2014
Excellent! Thanks for fixing it, and again for this new feature that
enables calculating cDNA positions.
best regards,
robert.
On 7/28/14 11:30 PM, Valerie Obenchain wrote:
> OK. We'll try it again. See 1.17.29.
>
> It looks like I did not introduce this with ignore.strand. The bug was
> deeper and could be handled in either .orderElementsByTranscription()
> or the pair of .listCumsum* functions. I chose to add an arg to
> .orderElementsByTranscription() that controls whether negative strand
> ranges are ordered large-to-small (3' to 5') or small-to-large (5' to
> 3').
>
> Valerie
>
>
> On 07/28/2014 01:23 PM, Robert Castelo wrote:
>> hi Valerie,
>>
>> i'm still not getting the right answer, i've looked at the unit test you
>> added and i'm afraid it does not recreate the situation i described
>> below, which involved one transcript with two exons, while the unit test
>> forms two transcripts of one exon. so as it is now, doing:
>>
>> mapCoords(gr, grl, ignore.strand=TRUE)
>> > map
>> GRanges with 2 ranges and 2 metadata columns:
>> seqnames ranges strand | queryHits subjectHits
>> <Rle> <IRanges> <Rle> | <integer> <integer>
>> [1] chrA [1797, 1797] - | 1 2
>> [2] chrA [1797, 1797] - | 2 2
>>
>> you get the right answer, but doing:
>>
>> mapCoords(gr, GRangesList(unlist(grl)), ignore.strand=TRUE)
>> > map
>> GRanges with 2 ranges and 2 metadata columns:
>> seqnames ranges strand | queryHits subjectHits
>> <Rle> <IRanges> <Rle> | <integer> <integer>
>> [1] chrA [2036, 2036] - | 1 1
>> [2] chrA [2036, 2036] - | 2 1
>>
>> you get the previous wrong answer. you'll find my sessionInfo() below.
>>
>> cheers,
>> robert.
>> ps: sessionInfo()
>> R version 3.1.1 (2014-07-10)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets methods
>> base
>>
>> other attached packages:
>> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0 GenomicFeatures_1.17.13
>> AnnotationDbi_1.27.9
>> [4] Biobase_2.25.0 VariantAnnotation_1.11.19 Rsamtools_1.17.31
>> [7] Biostrings_2.33.13 XVector_0.5.7 GenomicRanges_1.17.28
>> [10] GenomeInfoDb_1.1.17 IRanges_1.99.23 S4Vectors_0.1.2
>> [13] BiocGenerics_0.11.3 vimcom_0.9-93 setwidth_1.0-3
>> [16] colorout_1.0-2
>>
>> loaded via a namespace (and not attached):
>> [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.7.9
>> biomaRt_2.21.1
>> [5] bitops_1.0-6 brew_1.0-6 BSgenome_1.33.8
>> checkmate_1.2
>> [9] codetools_0.2-8 DBI_0.2-7 digest_0.6.4
>> fail_1.2
>> [13] foreach_1.4.2 GenomicAlignments_1.1.22
>> iterators_1.0.7 Rcpp_0.11.2
>> [17] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.25.13
>> sendmailR_1.1-2
>> [21] stats4_3.1.1 stringr_0.6.2 tools_3.1.1
>> XML_3.98-1.1
>> [25] zlibbioc_1.11.1
>>
>>
>> On 7/28/14 1:25 PM, Valerie Obenchain wrote:
>>> Hi Robert,
>> >
>> > Now fixed in GenomicRanges 1.17.28.
>> >
>> > I introduced this bug when trying to be too clever with
>> > ignore.strand. Thanks for catching it and, as always, providing a
>> > clear reproducible example.
>> >
>> > Valerie
>> >
>> > On 07/24/2014 08:24 AM, Robert Castelo wrote:
>> >> Dear Valerie,
>> >>
>> >> sorry for my long delay in reacting to your message below. This
>> >> addition to VariantAnnotation by Michael and you is great. I just
>> >> would like to make a comment and report what looks to me as a bug
>> >> related to the computation of cDNA positions, just in case you have
>> >> suggestions and comments, specially with respect to what at the
>> >> moment seems like a bug to me.
>> >>
>> >> the way in which mapCoords() works is by finding all possible
>> >> overlaps of each GRanges entry, corresponding to a variant. This is
>> >> strictly not necessary once you have gone through locateVariants()
>> >> and/or predictCoding() since these functions already find all
>> >> possible overlaps of each variant to different transcripts and
>> >> regions. So, to get the corresponding annotation of cDNA positions
>> >> I had to write the following function which is slightly more
>> >> elaborated from what you suggest below:
>> >>
>> >> ## this function assumes that locateVariants() or predictCoding()
>> >> ## have been called so that there is a TXID metadata column
>> >> .cDNAloc <- function(gr, txdb) { if (!"TXID" %in%
>> >> colnames(mcols(gr))) stop("cDNApos: metadata column 'TXID' not
>> >> found in input GRanges object.")
>> >>
>> >> exonsbytx <- exonsBy(txdb, by="tx") map <- mapCoords(gr, exonsbytx,
>> >> eltHits=TRUE) eolap <- map$eltHits qolap <- map$queryHits txids <-
>> >> rep(names(exonsbytx), elementLengths(exonsbytx))[eolap] res <-
>> >> gr[qolap] mcols(res) <- append(values(res),
>> >> DataFrame(cDNALOC=ranges(map), TXID2=as.integer(txids))) ##
>> >> restrict results to cDNA positions of query TXID res <-
>> >> res[res$TXID == res$TXID2]
>> >>
>> >> start(res$cDNALOC) }
>> >>
>> >> i'm going to run it with a toy example of a variant that overlaps
>> >> a coding region of 7 transcripts in the positive strand and the
>> >> three UTR region of one transcript in the negative strand.
>> >>
>> >> library(VariantAnnotation)
>> >> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> >>
>> >> gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349),
>> >> width=c(1, 1)), strand=c("+", "-")) gr$TXID <- c(72816L, 73335L)
>> >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene loc <- locateVariants(gr,
>> >> txdb, AllVariants()) loc$cDNALOC <- .cDNAloc(loc, txdb)
>> >>
>> >> now let's look to the cDNA position of the transcript in the
>> >> negative strand (the last entry of the resulting GRanges object):
>> >>
>> >> GRanges with 1 range and 2 metadata columns: seqnames
>> >> ranges strand | TXID cDNALOC <Rle> <IRanges>
>> >> <Rle> | <integer> <integer> [1] chr21 [43522349, 43522349]
>> >> - | 73335 2036
>> >>
>> >> and the exonic structure of this transcript
>> >>
>> >> exonsbytx <- exonsBy(txdb, by="tx") exonsbytx[["73335"]] GRanges
>> >> with 2 ranges and 3 metadata columns: seqnames ranges
>> >> strand | exon_id exon_name exon_rank <Rle> <IRanges>
>> >> <Rle> | <integer> <character> <integer> [1] chr21 [43528406,
>> >> 43528644] - | 260152 <NA> 1 [2] chr21
>> >> [43522244, 43524145] - | 260151 <NA> 2
>> >>
>> >> from this structure one would expect that the reported cDNA
>> >> position would be:
>> >>
>> >> 43524145 - 43522349 + 1 = 1797
>> >>
>> >> instead of 2036.
>> >>
>> >> after investigating a bit where this value may come from, it seems
>> >> that while the strand is being taken into account for the relative
>> >> position within the exon, the widths of upstream exons is wrongly
>> >> selected (in this case there are no upstream exons):
>> >>
>> >> width(exonsbytx[["73335"]]) [1] 239 1902
>> >>
>> >> 239 + 1796 + 1 = 2036
>> >>
>> >>
>> >> so, my guess is that this must be something to be fixed in
>> >> mapCoords().
>> >>
>> >>
>> >> best regards!!
>> >>
>> >> robert.
>> >>
>> >>
>> >> On 07/13/2014 07:26 AM, Valerie Obenchain wrote:
>> >>> Hi,
>> >>>
>> >>> Following up on this thread:
>> >>>
>> >>>
>> https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html
>> >>>
>> >>>
>> >>>
>> These changes are available in VariantAnnotation 1.11.15:
>>>>>
>> >>> 1) LOCSTART, LOCEND
>> >>>
>> >>> locateVariants() has 2 new output columns, LOCSTART and LOCEND.
>> >>> These are LOCATION-centric coordinates and can be different for
>> >>> each row so I thought these names were more descriptive than
>> >>> REFLOCS (discussed in thread). We have 2 values (start/end)
>> >>> instead of a single column of IRanges() because we can't make an
>> >>> IRanges() with missing values. Technically 'missing' ranges are
>> >>> represented by zero-width ranges but we still need a position;
>> >>> there is no position because there was no overlap.
>> >>>
>> >>> 2) mapCoords(), pmapCoords()
>> >>>
>> >>> These functions are courtesy of Michael. mapCoords() maps ranges
>> >>> onto another set of coordinates. You can map to cds-centric,
>> >>> exon-centric or any other type of coordinate. See ?mapCoords in
>> >>> both GenomicRanges and GenomicAlignments.
>> >>>
>> >>>
>> >>> In the previous thread we discussed added cDNA locations to
>> >>> predictCoding(). I've decided against this because it adds the
>> >>> additional overhead of the exonsBy() extraction and a
>> >>> findOverlaps() call. Not all users want the cDNA locations and
>> >>> those that do can now easily get them with mapCoords().
>> >>>
>> >>> ## The usual predictCoding setup:
>> >>> library(BSgenome.Hsapiens.UCSC.hg19)
>> >>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <-
>> >>> TxDb.Hsapiens.UCSC.hg19.knownGene fl <- system.file("extdata",
>> >>> "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl,
>> >>> "hg19") vcf <- renameSeqlevels(vcf, "chr22") coding <-
>> >>> predictCoding(vcf, txdb, Hsapiens)
>> >>>
>> >>> ## Exon-centric or cDNA locations: exonsbytx <- exonsBy(txdb,
>> >>> "tx") cDNA <- mapCoords(coding[!duplicated(ranges(coding))],
>> >>> exonsbytx) coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)]
>> >>>
>> >>> Let me know if you run into problems or if the docs need more
>> >>> detail.
>> >>>
>> >>> Valerie
>> >>>
>> >>
>> >
>>
>>
>
More information about the Bioconductor
mailing list