[BioC] bug in GenomicRanges gaps?
Blanchette, Marco
MAB at stowers.org
Fri Sep 6 03:07:34 CEST 2013
though... sorry about the genes.noStrand... was working on a different example before posting... bad pasting...
And yes, I was trying to find the gaps between genic regions regardless of there strand. I was doing the same operations as yours until I fumble onto gaps() which seemed to encapsulate nicely in one function the concept of finding gaps between ranges (right??) and if you look at the operation returned, it uses the seqinfo to compute the right and left gaps in relation to the entire chromosme. However, it returns extra ranges that conceptually are not gaps (ie span the full chromosome)...
But thanks for the tidbits Malcolm.
--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.
Kansas City, MO 64110
Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018
On Sep 5, 2013, at 7:53 PM, "Cook, Malcolm" <MEC at stowers.org>
wrote:
> Marco,
>
> genes.noStrand is unset when you first reference it, so it is hard to know what you are trying to do.
>
> Does your definition of intergenic want to ignore strand (i.e. can the same locus be genic on one strand and integenic on another?)
>
> I'm guessing the problem you are witnessing is that the strand _is_ not '*' when you thought you had set it to '*'. Check this!
>
> Also, does your definition of intergenic want to include non-genic segments that are the very beginning/end of a chromosome (and therefore only have a gene on one-side and thus arguably are not truly inter-genic). If so, your approach is not going to include these chromosome-terminal regions. In which case, you might try:
>
> genic.gr<-unlist(transcriptsBy(tr.db,'gene'),use.names=FALSE)
> strand(genic.gr)<-'*'
> genic.gr<-reduce(genic.gr)
> chrom.gr<-as(seqinfo(tr.db),'GRanges')
> intergenic.gr<-setdiff(chrom.gr,genic.gr)
>
> stopifnot(0==length(findOverlaps(intergenic.gr,ignoreSelf=TRUE)))
>
> ~ malcolm_cook at stowers.org
>
> ________________________________________
> From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] on behalf of Blanchette, Marco
> Sent: Thursday, September 05, 2013 7:14 PM
> To: BioConductor
> Subject: [BioC] bug in GenomicRanges gaps?
>
> Hi,
>
> Is this a bug of normal behavior... (I suspect it's a bug). I am trying to compute the intergenic regions using GenomicFeatures txdb and GenomicRanges gaps() and the GRanges return contains entry spanning all chromosomes on both strand... On a mock GRanges, I don't get this weird behavior (See below). Any suggestions?
>
> thanks
>
>> library(GenomicFeatures)
> ### Downloading the txdb from UCSC
>> txdb <- makeTranscriptDbFromUCSC(genome="dm3",
> tablename="ensGene")
>
> ### Getting the transcripts for each genes
>> allTx <- transcriptsBy(txdb, by='gene')
>
> ### Geting the GRanges for each genes
>> genes <- unlist(range(allTx))
>
> ### Removing the strand to perform the reduce operation
>> strand(genes) <- '*'
>
> ### Now, let's find the gaps
>> genic <- reduce(genes.noStrand)
>
> ### Because of teh reduce, there should be no overlaping genic regions
>> length(findOverlaps(genic,ignoreSelf=TRUE))
> [1] 0
>
> ### Finding the intergenic regions
>> intergenic <- gaps(genic)
>
> ### The difference of non-overlaping ranges should return non-overlaping gaps, but...
>> length(findOverlaps(intergenic,ignoreSelf=TRUE))
> [1] 46300
>
> ### What's funky is that I am getting gaps spanning the full lenght of each chromosomes
> ### Why???
>> chrs.width <- seqlengths(seqinfo(intergenic))
>> intergenic[width(intergenic) == gene2chrsLength]
> GRanges with 30 ranges and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] chr2L [1, 23011544] +
> [2] chr2L [1, 23011544] -
> [3] chr2R [1, 21146708] +
> [4] chr2R [1, 21146708] -
> [5] chr3L [1, 24543557] +
> ... ... ... ...
> [26] chrXHet [1, 204112] -
> [27] chrYHet [1, 347038] +
> [28] chrYHet [1, 347038] -
> [29] chrUextra [1, 29004656] +
> [30] chrUextra [1, 29004656] -
> ---
> seqlengths:
> chr2L chr2R chr3L chr3R ... chrXHet chrYHet chrUextra
> 23011544 21146708 24543557 27905053 ... 204112 347038 29004656
>
>
> ### On Fake data
>> genes <- GRanges(seqnames=c(rep("2L",3),rep("3L",3)),
> ranges=IRanges(start=rep(c(20,200,350),2),end=rep(c(80,275,450),2)))
>> length(findOverlaps(intergenic,ignoreSelf=TRUE))
> [1] 0
>> intergenic
> GRanges with 6 ranges and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] 2L [ 1, 19] *
> [2] 2L [ 81, 199] *
> [3] 2L [276, 349] *
> [4] 3L [ 1, 19] *
> [5] 3L [ 81, 199] *
> [6] 3L [276, 349] *
> ---
> seqlengths:
> 2L 3L
> NA NA
>
> --
> Marco Blanchette, Ph.D.
> Assistant Investigator
> Stowers Institute for Medical Research
> 1000 East 50th St.
>
> Kansas City, MO 64110
>
> Tel: 816-926-4071
> Cell: 816-726-8419
> Fax: 816-926-2018
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list