[BioC] TranscriptDb and genomicranges questions
Hervé Pagès
hpages at fhcrc.org
Wed Aug 11 22:27:11 CEST 2010
Hi Vincent,
Because of the semantic of the "gaps" method for IRanges/IRangesList
object, you won't get the gaps that are at the 5' and 3' ends of the
chromosomes when using the IRanges/IRangesList trick.
An easier way to extract the gaps between transcripts without
considering their strand is to artificially "project" all the
transcripts on the + strand:
txdb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene")
tr <- transcripts(txdb)
strand(tr) <- "+"
Then:
> gaps(tr)[1:5]
GRanges with 5 ranges and 0 elementMetadata values
seqnames ranges strand |
<Rle> <IRanges> <Rle> |
[1] chr1 [ 1, 1872] + |
[2] chr1 [ 3534, 4273] + |
[3] chr1 [19670, 20228] + |
[4] chr1 [20367, 24416] + |
[5] chr1 [25945, 42911] + |
seqlengths
chr1 chr1_random chr10 ... chrX_random chrY
247249719 1663265 135374737 ... 1719168
57772954
Note the gap at the 5' end of chr1.
Of course now your 'tr' object is sort of "broken" so you need to
remake it if you want to use it for downstream analyses where the
strand actually matters.
Cheers,
H.
On 07/16/2010 03:47 PM, Patrick Aboyoun wrote:
> Vincent,
> I am cc'ing the Bioconductor mailing list with the response to your
> e-mail so others can access it.
>
> The gaps,GRanges-method is strand specific, so when you pass a GRanges
> object containing the transcript bounds into gaps, you will get the gaps
> on the positive strand, the negative strand, and the both strand
> category. This is why you found a gap between transcripts on the
> positive strand, while there was a transcript at some of those same
> positions on the negative strand. If you want a strandless
> representation of the transcripts, I recommend using the IRanges and
> IRangesList classes. Here is code that achieves what I think you are
> looking for:
>
>
> > library(GenomicFeatures)
> > txdb<- makeTranscriptDbFromUCSC(genome = "hg18", tablename = "ensGene")
> > tx<- transcripts(txdb)
> > unstrndTx<- split(ranges(tx), seqnames(tx))
> > unstrndGaps<- unlist(gaps(unstrndTx))
> > intAlt<- GRanges(seqnames = factor(names(unstrndGaps),
> names(seqlengths(tx))),
> + ranges = unname(unstrndGaps),
> + seqlengths = seqlengths(tx))
>
> > intAlt[1:3]
> GRanges with 3 ranges and 0 elementMetadata values
> seqnames ranges strand |
> <Rle> <IRanges> <Rle> |
> [1] chr1 [ 3534, 4273] * |
> [2] chr1 [19670, 20228] * |
> [3] chr1 [20367, 24416] * |
>
> seqlengths
> chr1 chr1_random chr10 ... chrX_random chrY
> 247249719 1663265 135374737 ... 1719168 57772954
>
> > sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-07-01 r52425)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] GenomicFeatures_1.1.6 GenomicRanges_1.1.16 IRanges_1.7.11
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.9.0 biomaRt_2.5.1 Biostrings_2.17.24
> BSgenome_1.17.5
> [5] DBI_0.2-5 RCurl_1.4-2 RSQLite_0.9-1
> rtracklayer_1.9.3
> [9] tools_2.12.0 XML_3.1-0
>
>
>
>
>
>
> On 7/16/10 6:23 AM, Vincent Detours wrote:
>> Dear Patrick,
>>
>> I am learning your packages genomicranges and transcriptdb, which
>> I find impressively efficients. Here are two questions and perhaps a
>> request.
>>
>> In transcriptdb, I get intergenic regions with:
>> -----
>>> txdb<- makeTranscriptDbFromUCSC(genome = "hg18", tablename = "ensGene")
>>> tr<- transcripts(txdb)
>>> int<- gaps(tr)
>>> int[1:3]
>> GRanges with 3 ranges and 0 elementMetadata values
>> seqnames ranges strand |
>> <Rle> <IRanges> <Rle> |
>> [1] chr1 [ 1, 1872] + |
>> [2] chr1 [ 3534, 20228] + |
>> [3] chr1 [20367, 42911] + |
>> ...
>> -----
>> Now I type chr1:3,534-20,228 in the UCSC browser, hg18, and I see
>> that there is a transcript within this interval:
>> http://genome.ucsc.edu/cgi-bin/hgc?hgsid=165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632
>> <http://genome.ucsc.edu/cgi-bin/hgc?hgsid=165451654&o=4273&t=19669&g=ensGene&i=ENST00000326632>
>> This transcript is indeed in the tr object
>> -------
>>> which(elementMetadata(tr)[,"tx_name"]=="ENST00000326632")
>> [1] 3880
>>> tr[3880,]
>> GRanges with 1 range and 2 elementMetadata values
>> seqnames ranges strand | tx_id tx_name
>> <Rle> <IRanges> <Rle> |<integer> <character>
>> [1] chr1 [4274, 19669] - | 1731 ENST00000326632
>>>
>> -------
>> Is there a bug in 'gaps', or am I missing something about how it works?
>>
>> I am using the development version of the packages.
>>
>> Thank you for your help and great software.
>>
>> Vincent Detours, Ph. D.
>> http://homepages.ulb.ac.be/~vdetours/
>> <http://homepages.ulb.ac.be/%7Evdetours/>
>>
>> IRIBHM - Université Libre de Bruxelles
>> Bldg C, room C.4.116
>> ULB, Campus Erasme, CP602
>> 808 route de Lennik
>> B-1070 Brussels
>> Belgium
>>
>> Phone: +32-2-555 4220
>> Fax: +32-2-555 4655
>> Skype: vdetours
>> E-mail: vdetours at ulb.ac.be<http://ulb.ac.be>
>>
>>
>>
>
>
> [[alternative HTML version deleted]]
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list