[BioC] TranscriptDb and genomicranges questions

Wed Aug 11 22:27:11 CEST 2010

Hi Vincent,

Because of the semantic of the "gaps" method for IRanges/IRangesList
object, you won't get the gaps that are at the 5' and 3' ends of the
chromosomes when using the IRanges/IRangesList trick.

An easier way to extract the gaps between transcripts without
considering their strand is to artificially "project" all the
transcripts on the + strand:

   txdb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene")
   tr <- transcripts(txdb)
   strand(tr) <- "+"

Then:

   > gaps(tr)[1:5]
   GRanges with 5 ranges and 0 elementMetadata values
       seqnames         ranges strand |
          <Rle>      <IRanges>  <Rle> |
   [1]     chr1 [    1,  1872]      + |
   [2]     chr1 [ 3534,  4273]      + |
   [3]     chr1 [19670, 20228]      + |
   [4]     chr1 [20367, 24416]      + |
   [5]     chr1 [25945, 42911]      + |

   seqlengths
           chr1   chr1_random         chr10 ...   chrX_random          chrY
        247249719       1663265     135374737 ...       1719168 
57772954

Note the gap at the 5' end of chr1.

Of course now your 'tr' object is sort of "broken" so you need to
remake it if you want to use it for downstream analyses where the
strand actually matters.

Cheers,
H.

On 07/16/2010 03:47 PM, Patrick Aboyoun wrote:
> Vincent,
> I am cc'ing the Bioconductor mailing list with the response to your
> e-mail so others can access it.
>
> The gaps,GRanges-method is strand specific, so when you pass a GRanges
> object containing the transcript bounds into gaps, you will get the gaps
> on the positive strand, the negative strand, and the both strand
> category. This is why you found a gap between transcripts on the
> positive strand, while there was a transcript at some of those same
> positions on the negative strand. If you want a strandless
> representation of the transcripts, I recommend using the IRanges and
> IRangesList classes. Here is code that achieves what I think you are
> looking for:
>
>
>   >  library(GenomicFeatures)
>   >  txdb<- makeTranscriptDbFromUCSC(genome = "hg18", tablename = "ensGene")
>   >  tx<- transcripts(txdb)
>   >  unstrndTx<- split(ranges(tx), seqnames(tx))
>   >  unstrndGaps<- unlist(gaps(unstrndTx))
>   >  intAlt<- GRanges(seqnames = factor(names(unstrndGaps),
> names(seqlengths(tx))),
> +                   ranges = unname(unstrndGaps),
> +                   seqlengths = seqlengths(tx))
>
>   >  intAlt[1:3]
> GRanges with 3 ranges and 0 elementMetadata values
>       seqnames         ranges strand |
> <Rle>  <IRanges>  <Rle>  |
> [1]     chr1 [ 3534,  4273]      * |
> [2]     chr1 [19670, 20228]      * |
> [3]     chr1 [20367, 24416]      * |
>
> seqlengths
>             chr1   chr1_random         chr10 ...   chrX_random          chrY
>        247249719       1663265     135374737 ...       1719168      57772954
>
>   >  sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-07-01 r52425)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] GenomicFeatures_1.1.6 GenomicRanges_1.1.16  IRanges_1.7.11
>
> loaded via a namespace (and not attached):
>    [1] Biobase_2.9.0      biomaRt_2.5.1      Biostrings_2.17.24
> BSgenome_1.17.5
>    [5] DBI_0.2-5          RCurl_1.4-2        RSQLite_0.9-1
> rtracklayer_1.9.3
>    [9] tools_2.12.0       XML_3.1-0
>
>
>
>
>
>
> On 7/16/10 6:23 AM, Vincent Detours wrote:
>> Dear Patrick,
>>
>> I am learning your packages genomicranges and transcriptdb, which
>> I find impressively efficients. Here are two questions and perhaps a
>> request.
>>
>> In transcriptdb, I get intergenic regions with:
>> -----
>>> txdb<- makeTranscriptDbFromUCSC(genome = "hg18", tablename = "ensGene")
>>> tr<- transcripts(txdb)
>>> int<- gaps(tr)
>>> int[1:3]
>> GRanges with 3 ranges and 0 elementMetadata values
>>      seqnames         ranges strand |
>> <Rle>  <IRanges>  <Rle>  |
>> [1]     chr1 [    1,  1872]      + |
>> [2]     chr1 [ 3534, 20228]      + |
>> [3]     chr1 [20367, 42911]      + |
>> ...
>> -----
>> Now I type chr1:3,534-20,228 in the UCSC browser, hg18, and I see
>> that there is a transcript within this interval:
>> http://genome.ucsc.edu/cgi-bin/hgc?hgsid=165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632
>> <http://genome.ucsc.edu/cgi-bin/hgc?hgsid=165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632>
>> This transcript is indeed in the tr object
>> -------
>>> which(elementMetadata(tr)[,"tx_name"]=="ENST00000326632")
>> [1] 3880
>>> tr[3880,]
>> GRanges with 1 range and 2 elementMetadata values
>>      seqnames        ranges strand |     tx_id         tx_name
>> <Rle>  <IRanges>  <Rle>  |<integer>  <character>
>> [1]     chr1 [4274, 19669]      - |      1731 ENST00000326632
>>>
>> -------
>> Is there a bug in 'gaps', or am I missing something about how it works?
>>
>> I am using the development version of the packages.
>>
>> Thank you for your help and great software.
>>
>> Vincent Detours, Ph. D.
>> http://homepages.ulb.ac.be/~vdetours/
>> <http://homepages.ulb.ac.be/%7Evdetours/>
>>
>> IRIBHM - Université Libre de Bruxelles
>> Bldg C, room C.4.116
>> ULB, Campus Erasme, CP602
>> 808 route de Lennik
>> B-1070 Brussels
>> Belgium
>>
>> Phone: +32-2-555 4220
>> Fax: +32-2-555 4655
>> Skype: vdetours
>> E-mail: vdetours at ulb.ac.be<http://ulb.ac.be>
>>
>>
>>
>
>
> 	[[alternative HTML version deleted]]
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319