[BioC] Complete variant toolbox: gmapR/VariantTools/VariantAnnotation - revived
Valerie Obenchain
vobencha at fhcrc.org
Tue Jul 29 05:30:44 CEST 2014
OK. We'll try it again. See 1.17.29.
It looks like I did not introduce this with ignore.strand. The bug was
deeper and could be handled in either .orderElementsByTranscription() or
the pair of .listCumsum* functions. I chose to add an arg to
.orderElementsByTranscription() that controls whether negative strand
ranges are ordered large-to-small (3' to 5') or small-to-large (5' to 3').
Valerie
On 07/28/2014 01:23 PM, Robert Castelo wrote:
> hi Valerie,
>
> i'm still not getting the right answer, i've looked at the unit test you
> added and i'm afraid it does not recreate the situation i described
> below, which involved one transcript with two exons, while the unit test
> forms two transcripts of one exon. so as it is now, doing:
>
> mapCoords(gr, grl, ignore.strand=TRUE)
> > map
> GRanges with 2 ranges and 2 metadata columns:
> seqnames ranges strand | queryHits subjectHits
> <Rle> <IRanges> <Rle> | <integer> <integer>
> [1] chrA [1797, 1797] - | 1 2
> [2] chrA [1797, 1797] - | 2 2
>
> you get the right answer, but doing:
>
> mapCoords(gr, GRangesList(unlist(grl)), ignore.strand=TRUE)
> > map
> GRanges with 2 ranges and 2 metadata columns:
> seqnames ranges strand | queryHits subjectHits
> <Rle> <IRanges> <Rle> | <integer> <integer>
> [1] chrA [2036, 2036] - | 1 1
> [2] chrA [2036, 2036] - | 2 1
>
> you get the previous wrong answer. you'll find my sessionInfo() below.
>
> cheers,
> robert.
> ps: sessionInfo()
> R version 3.1.1 (2014-07-10)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> base
>
> other attached packages:
> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0 GenomicFeatures_1.17.13
> AnnotationDbi_1.27.9
> [4] Biobase_2.25.0 VariantAnnotation_1.11.19 Rsamtools_1.17.31
> [7] Biostrings_2.33.13 XVector_0.5.7 GenomicRanges_1.17.28
> [10] GenomeInfoDb_1.1.17 IRanges_1.99.23 S4Vectors_0.1.2
> [13] BiocGenerics_0.11.3 vimcom_0.9-93 setwidth_1.0-3
> [16] colorout_1.0-2
>
> loaded via a namespace (and not attached):
> [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.7.9
> biomaRt_2.21.1
> [5] bitops_1.0-6 brew_1.0-6 BSgenome_1.33.8
> checkmate_1.2
> [9] codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2
> [13] foreach_1.4.2 GenomicAlignments_1.1.22
> iterators_1.0.7 Rcpp_0.11.2
> [17] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.25.13
> sendmailR_1.1-2
> [21] stats4_3.1.1 stringr_0.6.2 tools_3.1.1
> XML_3.98-1.1
> [25] zlibbioc_1.11.1
>
>
> On 7/28/14 1:25 PM, Valerie Obenchain wrote:
>> Hi Robert,
> >
> > Now fixed in GenomicRanges 1.17.28.
> >
> > I introduced this bug when trying to be too clever with
> > ignore.strand. Thanks for catching it and, as always, providing a
> > clear reproducible example.
> >
> > Valerie
> >
> > On 07/24/2014 08:24 AM, Robert Castelo wrote:
> >> Dear Valerie,
> >>
> >> sorry for my long delay in reacting to your message below. This
> >> addition to VariantAnnotation by Michael and you is great. I just
> >> would like to make a comment and report what looks to me as a bug
> >> related to the computation of cDNA positions, just in case you have
> >> suggestions and comments, specially with respect to what at the
> >> moment seems like a bug to me.
> >>
> >> the way in which mapCoords() works is by finding all possible
> >> overlaps of each GRanges entry, corresponding to a variant. This is
> >> strictly not necessary once you have gone through locateVariants()
> >> and/or predictCoding() since these functions already find all
> >> possible overlaps of each variant to different transcripts and
> >> regions. So, to get the corresponding annotation of cDNA positions
> >> I had to write the following function which is slightly more
> >> elaborated from what you suggest below:
> >>
> >> ## this function assumes that locateVariants() or predictCoding()
> >> ## have been called so that there is a TXID metadata column
> >> .cDNAloc <- function(gr, txdb) { if (!"TXID" %in%
> >> colnames(mcols(gr))) stop("cDNApos: metadata column 'TXID' not
> >> found in input GRanges object.")
> >>
> >> exonsbytx <- exonsBy(txdb, by="tx") map <- mapCoords(gr, exonsbytx,
> >> eltHits=TRUE) eolap <- map$eltHits qolap <- map$queryHits txids <-
> >> rep(names(exonsbytx), elementLengths(exonsbytx))[eolap] res <-
> >> gr[qolap] mcols(res) <- append(values(res),
> >> DataFrame(cDNALOC=ranges(map), TXID2=as.integer(txids))) ##
> >> restrict results to cDNA positions of query TXID res <-
> >> res[res$TXID == res$TXID2]
> >>
> >> start(res$cDNALOC) }
> >>
> >> i'm going to run it with a toy example of a variant that overlaps
> >> a coding region of 7 transcripts in the positive strand and the
> >> three UTR region of one transcript in the negative strand.
> >>
> >> library(VariantAnnotation)
> >> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> >>
> >> gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349),
> >> width=c(1, 1)), strand=c("+", "-")) gr$TXID <- c(72816L, 73335L)
> >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene loc <- locateVariants(gr,
> >> txdb, AllVariants()) loc$cDNALOC <- .cDNAloc(loc, txdb)
> >>
> >> now let's look to the cDNA position of the transcript in the
> >> negative strand (the last entry of the resulting GRanges object):
> >>
> >> GRanges with 1 range and 2 metadata columns: seqnames
> >> ranges strand | TXID cDNALOC <Rle> <IRanges>
> >> <Rle> | <integer> <integer> [1] chr21 [43522349, 43522349]
> >> - | 73335 2036
> >>
> >> and the exonic structure of this transcript
> >>
> >> exonsbytx <- exonsBy(txdb, by="tx") exonsbytx[["73335"]] GRanges
> >> with 2 ranges and 3 metadata columns: seqnames ranges
> >> strand | exon_id exon_name exon_rank <Rle> <IRanges>
> >> <Rle> | <integer> <character> <integer> [1] chr21 [43528406,
> >> 43528644] - | 260152 <NA> 1 [2] chr21
> >> [43522244, 43524145] - | 260151 <NA> 2
> >>
> >> from this structure one would expect that the reported cDNA
> >> position would be:
> >>
> >> 43524145 - 43522349 + 1 = 1797
> >>
> >> instead of 2036.
> >>
> >> after investigating a bit where this value may come from, it seems
> >> that while the strand is being taken into account for the relative
> >> position within the exon, the widths of upstream exons is wrongly
> >> selected (in this case there are no upstream exons):
> >>
> >> width(exonsbytx[["73335"]]) [1] 239 1902
> >>
> >> 239 + 1796 + 1 = 2036
> >>
> >>
> >> so, my guess is that this must be something to be fixed in
> >> mapCoords().
> >>
> >>
> >> best regards!!
> >>
> >> robert.
> >>
> >>
> >> On 07/13/2014 07:26 AM, Valerie Obenchain wrote:
> >>> Hi,
> >>>
> >>> Following up on this thread:
> >>>
> >>> https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html
> >>>
> >>>
> >>>
> These changes are available in VariantAnnotation 1.11.15:
>>>>
> >>> 1) LOCSTART, LOCEND
> >>>
> >>> locateVariants() has 2 new output columns, LOCSTART and LOCEND.
> >>> These are LOCATION-centric coordinates and can be different for
> >>> each row so I thought these names were more descriptive than
> >>> REFLOCS (discussed in thread). We have 2 values (start/end)
> >>> instead of a single column of IRanges() because we can't make an
> >>> IRanges() with missing values. Technically 'missing' ranges are
> >>> represented by zero-width ranges but we still need a position;
> >>> there is no position because there was no overlap.
> >>>
> >>> 2) mapCoords(), pmapCoords()
> >>>
> >>> These functions are courtesy of Michael. mapCoords() maps ranges
> >>> onto another set of coordinates. You can map to cds-centric,
> >>> exon-centric or any other type of coordinate. See ?mapCoords in
> >>> both GenomicRanges and GenomicAlignments.
> >>>
> >>>
> >>> In the previous thread we discussed added cDNA locations to
> >>> predictCoding(). I've decided against this because it adds the
> >>> additional overhead of the exonsBy() extraction and a
> >>> findOverlaps() call. Not all users want the cDNA locations and
> >>> those that do can now easily get them with mapCoords().
> >>>
> >>> ## The usual predictCoding setup:
> >>> library(BSgenome.Hsapiens.UCSC.hg19)
> >>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <-
> >>> TxDb.Hsapiens.UCSC.hg19.knownGene fl <- system.file("extdata",
> >>> "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl,
> >>> "hg19") vcf <- renameSeqlevels(vcf, "chr22") coding <-
> >>> predictCoding(vcf, txdb, Hsapiens)
> >>>
> >>> ## Exon-centric or cDNA locations: exonsbytx <- exonsBy(txdb,
> >>> "tx") cDNA <- mapCoords(coding[!duplicated(ranges(coding))],
> >>> exonsbytx) coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)]
> >>>
> >>> Let me know if you run into problems or if the docs need more
> >>> detail.
> >>>
> >>> Valerie
> >>>
> >>
> >
>
>
More information about the Bioconductor
mailing list