[BioC] Complete variant toolbox: gmapR/VariantTools/VariantAnnotation - revived

Tue Jul 29 05:41:47 CEST 2014

Excellent! Thanks for fixing it, and again for this new feature that 
enables calculating cDNA positions.

best regards,

robert.

On 7/28/14 11:30 PM, Valerie Obenchain wrote:
> OK. We'll try it again. See 1.17.29.
>
> It looks like I did not introduce this with ignore.strand. The bug was 
> deeper and could be handled in either .orderElementsByTranscription() 
> or the pair of .listCumsum* functions. I chose to add an arg to 
> .orderElementsByTranscription() that controls whether negative strand 
> ranges are ordered large-to-small (3' to 5') or small-to-large (5' to 
> 3').
>
> Valerie
>
>
> On 07/28/2014 01:23 PM, Robert Castelo wrote:
>> hi Valerie,
>>
>> i'm still not getting the right answer, i've looked at the unit test you
>> added and i'm afraid it does not recreate the situation i described
>> below, which involved one transcript with two exons, while the unit test
>> forms two transcripts of one exon. so as it is now, doing:
>>
>>   mapCoords(gr, grl, ignore.strand=TRUE)
>>  > map
>> GRanges with 2 ranges and 2 metadata columns:
>>        seqnames       ranges strand | queryHits subjectHits
>>           <Rle>    <IRanges>  <Rle> | <integer>   <integer>
>>    [1]     chrA [1797, 1797]      - |         1           2
>>    [2]     chrA [1797, 1797]      - |         2           2
>>
>> you get the right answer, but doing:
>>
>> mapCoords(gr, GRangesList(unlist(grl)), ignore.strand=TRUE)
>>  > map
>> GRanges with 2 ranges and 2 metadata columns:
>>        seqnames       ranges strand | queryHits subjectHits
>>           <Rle>    <IRanges>  <Rle> | <integer>   <integer>
>>    [1]     chrA [2036, 2036]      - |         1           1
>>    [2]     chrA [2036, 2036]      - |         2           1
>>
>> you get the previous wrong answer. you'll find my sessionInfo() below.
>>
>> cheers,
>> robert.
>> ps: sessionInfo()
>> R version 3.1.1 (2014-07-10)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets methods
>> base
>>
>> other attached packages:
>>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0 GenomicFeatures_1.17.13
>> AnnotationDbi_1.27.9
>>   [4] Biobase_2.25.0 VariantAnnotation_1.11.19 Rsamtools_1.17.31
>>   [7] Biostrings_2.33.13 XVector_0.5.7 GenomicRanges_1.17.28
>> [10] GenomeInfoDb_1.1.17 IRanges_1.99.23 S4Vectors_0.1.2
>> [13] BiocGenerics_0.11.3 vimcom_0.9-93 setwidth_1.0-3
>> [16] colorout_1.0-2
>>
>> loaded via a namespace (and not attached):
>>   [1] BatchJobs_1.3            BBmisc_1.7 BiocParallel_0.7.9
>> biomaRt_2.21.1
>>   [5] bitops_1.0-6             brew_1.0-6 BSgenome_1.33.8
>> checkmate_1.2
>>   [9] codetools_0.2-8          DBI_0.2-7 digest_0.6.4             
>> fail_1.2
>> [13] foreach_1.4.2            GenomicAlignments_1.1.22
>> iterators_1.0.7          Rcpp_0.11.2
>> [17] RCurl_1.95-4.1           RSQLite_0.11.4 rtracklayer_1.25.13
>> sendmailR_1.1-2
>> [21] stats4_3.1.1             stringr_0.6.2 tools_3.1.1
>> XML_3.98-1.1
>> [25] zlibbioc_1.11.1
>>
>>
>> On 7/28/14 1:25 PM, Valerie Obenchain wrote:
>>> Hi Robert,
>>  >
>>  > Now fixed in GenomicRanges 1.17.28.
>>  >
>>  > I introduced this bug when trying to be too clever with
>>  > ignore.strand. Thanks for catching it and, as always, providing a
>>  > clear reproducible example.
>>  >
>>  > Valerie
>>  >
>>  > On 07/24/2014 08:24 AM, Robert Castelo wrote:
>>  >> Dear Valerie,
>>  >>
>>  >> sorry for my long delay in reacting to your message below. This
>>  >> addition to VariantAnnotation by Michael and you is great. I just
>>  >> would like to make a comment and report what looks to me as a bug
>>  >> related to the computation of cDNA positions, just in case you have
>>  >> suggestions and comments, specially with respect to what at the
>>  >> moment seems like a bug to me.
>>  >>
>>  >> the way in which mapCoords() works is by finding all possible
>>  >> overlaps of each GRanges entry, corresponding to a variant. This is
>>  >> strictly not necessary once you have gone through locateVariants()
>>  >> and/or predictCoding() since these functions already find all
>>  >> possible overlaps of each variant to different transcripts and
>>  >> regions. So, to get the corresponding annotation of cDNA positions
>>  >> I had to write the following function which is slightly more
>>  >> elaborated from what you suggest below:
>>  >>
>>  >> ## this function assumes that locateVariants() or predictCoding()
>>  >> ## have been called so that there is a TXID metadata column
>>  >> .cDNAloc <- function(gr, txdb) { if (!"TXID" %in%
>>  >> colnames(mcols(gr))) stop("cDNApos: metadata column 'TXID' not
>>  >> found in input GRanges object.")
>>  >>
>>  >> exonsbytx <- exonsBy(txdb, by="tx") map <- mapCoords(gr, exonsbytx,
>>  >> eltHits=TRUE) eolap <- map$eltHits qolap <- map$queryHits txids <-
>>  >> rep(names(exonsbytx), elementLengths(exonsbytx))[eolap] res <-
>>  >> gr[qolap] mcols(res) <- append(values(res),
>>  >> DataFrame(cDNALOC=ranges(map), TXID2=as.integer(txids))) ##
>>  >> restrict results to cDNA positions of query TXID res <-
>>  >> res[res$TXID == res$TXID2]
>>  >>
>>  >> start(res$cDNALOC) }
>>  >>
>>  >> i'm going to run it with a toy example of a variant that overlaps
>>  >> a coding region of 7 transcripts in the positive strand and the
>>  >> three UTR region of one transcript in the negative strand.
>>  >>
>>  >> library(VariantAnnotation)
>>  >> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>  >>
>>  >> gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349),
>>  >> width=c(1, 1)), strand=c("+", "-")) gr$TXID <- c(72816L, 73335L)
>>  >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene loc <- locateVariants(gr,
>>  >> txdb, AllVariants()) loc$cDNALOC <- .cDNAloc(loc, txdb)
>>  >>
>>  >> now let's look to the cDNA position of the transcript in the
>>  >> negative strand (the last entry of the resulting GRanges object):
>>  >>
>>  >> GRanges with 1 range and 2 metadata columns: seqnames
>>  >> ranges strand |      TXID   cDNALOC <Rle>            <IRanges>
>>  >> <Rle> | <integer> <integer> [1] chr21 [43522349, 43522349]
>>  >> - |     73335      2036
>>  >>
>>  >> and the exonic structure of this transcript
>>  >>
>>  >> exonsbytx <- exonsBy(txdb, by="tx") exonsbytx[["73335"]] GRanges
>>  >> with 2 ranges and 3 metadata columns: seqnames               ranges
>>  >> strand |   exon_id   exon_name exon_rank <Rle>            <IRanges>
>>  >> <Rle> | <integer> <character> <integer> [1]    chr21 [43528406,
>>  >> 43528644]      - |    260152        <NA>     1 [2]    chr21
>>  >> [43522244, 43524145]      - |    260151 <NA> 2
>>  >>
>>  >> from this structure one would expect that the reported cDNA
>>  >> position would be:
>>  >>
>>  >> 43524145 - 43522349 + 1 = 1797
>>  >>
>>  >> instead of 2036.
>>  >>
>>  >> after investigating a bit where this value may come from, it seems
>>  >> that while the strand is being taken into account for the relative
>>  >> position within the exon, the widths of upstream exons is wrongly
>>  >> selected (in this case there are no upstream exons):
>>  >>
>>  >> width(exonsbytx[["73335"]]) [1]  239 1902
>>  >>
>>  >> 239 + 1796 + 1 = 2036
>>  >>
>>  >>
>>  >> so, my guess is that this must be something to be fixed in
>>  >> mapCoords().
>>  >>
>>  >>
>>  >> best regards!!
>>  >>
>>  >> robert.
>>  >>
>>  >>
>>  >> On 07/13/2014 07:26 AM, Valerie Obenchain wrote:
>>  >>> Hi,
>>  >>>
>>  >>> Following up on this thread:
>>  >>>
>>  >>> 
>> https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html
>>  >>>
>>  >>>
>>  >>>
>> These changes are available in VariantAnnotation 1.11.15:
>>>>>
>>  >>> 1) LOCSTART, LOCEND
>>  >>>
>>  >>> locateVariants() has 2 new output columns, LOCSTART and LOCEND.
>>  >>> These are LOCATION-centric coordinates and can be different for
>>  >>> each row so I thought these names were more descriptive than
>>  >>> REFLOCS (discussed in thread). We have 2 values (start/end)
>>  >>> instead of a single column of IRanges() because we can't make an
>>  >>> IRanges() with missing values. Technically 'missing' ranges are
>>  >>> represented by zero-width ranges but we still need a position;
>>  >>> there is no position because there was no overlap.
>>  >>>
>>  >>> 2) mapCoords(), pmapCoords()
>>  >>>
>>  >>> These functions are courtesy of Michael. mapCoords() maps ranges
>>  >>> onto another set of coordinates. You can map to cds-centric,
>>  >>> exon-centric or any other type of coordinate. See ?mapCoords in
>>  >>> both GenomicRanges and GenomicAlignments.
>>  >>>
>>  >>>
>>  >>> In the previous thread we discussed added cDNA locations to
>>  >>> predictCoding(). I've decided against this because it adds the
>>  >>> additional overhead of the exonsBy() extraction and a
>>  >>> findOverlaps() call. Not all users want the cDNA locations and
>>  >>> those that do can now easily get them with mapCoords().
>>  >>>
>>  >>> ## The usual predictCoding setup:
>>  >>> library(BSgenome.Hsapiens.UCSC.hg19)
>>  >>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <-
>>  >>> TxDb.Hsapiens.UCSC.hg19.knownGene fl <- system.file("extdata",
>>  >>> "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl,
>>  >>> "hg19") vcf <- renameSeqlevels(vcf, "chr22") coding <-
>>  >>> predictCoding(vcf, txdb, Hsapiens)
>>  >>>
>>  >>> ## Exon-centric or cDNA locations: exonsbytx <- exonsBy(txdb,
>>  >>> "tx") cDNA <- mapCoords(coding[!duplicated(ranges(coding))],
>>  >>> exonsbytx) coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)]
>>  >>>
>>  >>> Let me know if you run into problems or if the docs need more
>>  >>> detail.
>>  >>>
>>  >>> Valerie
>>  >>>
>>  >>
>>  >
>>
>>
>