[Bioc-sig-seq] countOverlaps throws error on GRangesList and GappedAlignments
Hervé Pagès
hpages at fhcrc.org
Mon Apr 11 21:42:31 CEST 2011
Hi Michael,
On 11-04-11 09:31 AM, Michael Lawrence wrote:
> I've noticed some functions just issuing a warning when the SeqInfo objects
> have a different set of chromosomes. In the OP's code, this is issuing an
> error. Has this changed in devel? Is everything now a warning? I would
> recommend consistency here, and a warning seems best, stating the
> assumptions that were made.
The way this is handled has been completely revisited in BioC 2.8.
Now checking whether 2 objects with sequence information (like GRanges,
GRangesList, GappedAlignments, BSgenome) are compatible or not is the
responsibility of the "merge" method for Seqinfo objects (you can use
the seqinfo() getter to extract the Seqinfo slot from a GRanges,
GRangesList, GappedAlignments or BSgenome object).
Here first I show how merging 2 Seqinfo objects works (the end user
never does this directly but this is what happens behind the scene
when s/he tries to compare or combine 2 objects that have a Seqinfo
slot):
library(GenomicRanges)
x <- Seqinfo(seqnames=c("chr1", "chr2", "chr3", "chrM"),
seqlengths=c(100, 200, NA, 15),
isCircular=c(NA, FALSE, FALSE, TRUE))
y <- Seqinfo(seqnames=c("chr3", "chr4", "chrM"),
seqlengths=c(300, NA, 15))
Then:
> x
Seqinfo of length 4
seqnames seqlengths isCircular
chr1 100 NA
chr2 200 FALSE
chr3 NA FALSE
chrM 15 TRUE
> y
Seqinfo of length 3
seqnames seqlengths isCircular
chr3 300 NA
chr4 NA NA
chrM 15 NA
> merge(x, y)
Seqinfo of length 5
seqnames seqlengths isCircular
chr1 100 NA
chr2 200 FALSE
chr3 300 FALSE
chrM 15 TRUE
chr4 NA NA
Warning message:
In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:
- in 'x': chr1, chr2
- in 'y': chr4
Make sure to always combine/compare objects based on the same reference
genome (use suppressWarnings() to suppress this warning).
Now if I modify 'y' in a way that *contradicts* what 'x' says about
circularity of chr3 and chrM:
> isCircular(y)[c("chr3", "chrM")] <- c(TRUE, FALSE)
> y
Seqinfo of length 3
seqnames seqlengths isCircular
chr3 300 TRUE
chr4 NA NA
chrM 15 FALSE
Then I get an error:
> merge(x, y)
Error in mergeNamedAtomicVectors(isCircular(x), isCircular(y), what =
c("sequence", :
sequences chr3, chrM have incompatible circularity flags:
- in 'x': FALSE, TRUE
- in 'y': TRUE, FALSE
In addition: Warning message:
In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:
- in 'x': chr1, chr2
- in 'y': chr4
Make sure to always combine/compare objects based on the same reference
genome (use suppressWarnings() to suppress this warning).
This is because merge() has no way to reconcile the information stored
in 'x' and 'y'.
Now from an end user perspective, findOverlaps(), countOverlaps(), c(),
union(), intersect(), punion(), pintersect(), etc... on 2 GRanges
objects behave consistently with the above (they just try to merge
seqinfo(gr1) and seqinfo(gr2) before they do anything else):
gr1 <- GRanges(Rle(factor(levels=seqlevels(x))))
seqinfo(gr1) <- x
gr2 <- GRanges(Rle(factor(levels=seqlevels(y))))
seqinfo(gr2) <- y
Then:
> gr1
GRanges with 0 ranges and 0 elementMetadata values
seqnames ranges strand |
seqlengths
chr1 chr2 chr3 chrM
100 200 NA 15
> gr2
GRanges with 0 ranges and 0 elementMetadata values
seqnames ranges strand |
seqlengths
chr3 chr4 chrM
300 NA 15
> countOverlaps(gr1, gr2)
Warning message:
In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:
- in 'x': chr1, chr2
- in 'y': chr4
Make sure to always combine/compare objects based on the same reference
genome (use suppressWarnings() to suppress this warning).
Error in queryHits(findOverlaps(query, subject, maxgap = maxgap,
minoverlap = minoverlap, :
error in evaluating the argument 'x' in selecting a method for
function 'queryHits': Error in mergeNamedAtomicVectors(isCircular(x),
isCircular(y), what = c("sequence", :
sequences chr3, chrM have incompatible circularity flags:
- in 'x': FALSE, TRUE
- in 'y': TRUE, FALSE
Hope this clarifies.
H.
> sessionInfo()
R version 2.13.0 RC (2011-04-05 r55310)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicRanges_1.3.38 IRanges_1.9.31
>
> Thanks,
> Michael
>
> On Mon, Apr 11, 2011 at 6:53 AM, Valerie Obenchain<vobencha at fhcrc.org>wrote:
>
>> Hi Lax,
>>
>>
>> On 04/10/11 08:13, Lakshmanan Iyer wrote:
>>
>>> Hello,
>>> I created a GRangesList of 3'UTRs and GappedAlignments of a BAM file using
>>> the following code:
>>> _____________________
>>> library(GenomicRanges);
>>> library(Rsamtools);
>>> library(leeBamViews)
>>> library ("GenomicFeatures")
>>> txdb<- makeTranscriptDbFromUCSC(genome="mm9", tablename="knownGene")
>>> #
>>> # It remembers the transcript ids. so can be used later
>>> p3UTRs<- threeUTRsByTranscript(txdb, use.names=TRUE)
>>> # read the bam alignment
>>> aligns<- readGappedAlignments ("Dend_accepted_hits_sorted.bam")
>>>
>>> --------------------------------------
>>> However, when I try to get the countOverlaps, I get the following error:
>>>
>>>
>>>> c<- countOverlaps (p3UTRs, aligns)
>>>>
>>>>
>>> Error in queryHits(findOverlaps(query, subject, maxgap = maxgap, type =
>>> type)) :
>>> error in evaluating the argument 'x' in selecting a method for function
>>> 'queryHits'
>>>
>>> I suspect that this happens as the sequence names, "chrUn_random
>>> chrX_random
>>> chrY_random" etc which are present in the GRangeslist, p3UTRs are not
>>> present in the GappedAlignement objects.
>>>
>>> My questions is:
>>> 1. Can I make countOverlaps to choose only those intervals in which there
>>> are entries in both the objects?
>>>
>>>
>> No. The general approach is to clean the subject and query before calling
>> countOverlaps.
>>
>> 2. How do I subjset GRangeList (p3UTRs) to include entries from specific
>>> list of chromosomes of interest only?
>>>
>>>
>> To retain specific chromosomes, you can subset with seqnames then reset the
>> seqlevels.
>>
>> In R-2.13/BioC 2.8, using the grl object from the GRangesList man page :
>>
>>> grl
>> GRangesList of length 3
>> $gr1
>> GRanges with 1 range and 2 elementMetadata values
>> seqnames ranges strand | score GC
>> <Rle> <IRanges> <Rle> |<integer> <numeric>
>> [1] chr2 [3, 6] + | 5 0.45
>>
>> $gr2
>> GRanges with 2 ranges and 2 elementMetadata values
>> seqnames ranges strand | score GC
>> <Rle> <IRanges> <Rle> |<integer> <numeric>
>> [1] chr1 [ 7, 9] + | 3 0.3
>> [2] chr1 [13, 15] - | 4 0.5
>>
>> $gr3
>> GRanges with 2 ranges and 2 elementMetadata values
>> seqnames ranges strand | score GC
>> <Rle> <IRanges> <Rle> |<integer> <numeric>
>> [1] chr1 [1, 3] - | 6 0.4
>> [2] chr2 [4, 9] - | 2 0.1
>>
>>
>> seqlengths
>> chr2 chr1
>> NA NA
>>
>> grlsub<- grl[seqnames(grl) == "chr1", ]
>> seqlevels(grlsub)<- seqlevels(grlsub)[seqlevels(grlsub) == "chr1"]
>>
>> You can remove the empty list elements with,
>>
>> grlsub<- grlsub[elementLengths(grlsub)>0]
>>
>>
>>> grlsub
>> GRangesList of length 2
>> $gr2
>> GRanges with 2 ranges and 2 elementMetadata values
>> seqnames ranges strand | score GC
>> <Rle> <IRanges> <Rle> |<integer> <numeric>
>> [1] chr1 [ 7, 9] + | 3 0.3
>> [2] chr1 [13, 15] - | 4 0.5
>>
>> $gr3
>> GRanges with 1 range and 2 elementMetadata values
>> seqnames ranges strand | score GC
>> <Rle> <IRanges> <Rle> |<integer> <numeric>
>> [1] chr1 [1, 3] - | 6 0.4
>>
>>
>> seqlengths
>> chr1
>> NA
>>
>>
>> Valerie
>>
>>
>>> sessionInfo()
>> R version 2.13.0 beta (2011-03-30 r55199)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] GenomicRanges_1.3.37 IRanges_1.9.29
>>
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.13.0
>>
>>
>>
>>
>>
>> -best
>>> -Lax
>>>
>>>
>>>> sessionInfo()
>>>>
>>>>
>>> R version 2.12.0 (2010-10-15)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] GenomicFeatures_1.2.1 leeBamViews_0.99.11 BSgenome_1.18.2
>>> [4] Biobase_2.10.0 Rsamtools_1.2.0 Biostrings_2.18.0
>>> [7] GenomicRanges_1.2.3 IRanges_1.8.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] biomaRt_2.6.0 DBI_0.2-5 RCurl_1.4-3
>>> RSQLite_0.9-2
>>> [5] rtracklayer_1.10.5 tools_2.12.0 XML_3.2-0
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-sig-sequencing
mailing list