[BioC] function precede() not working with GRanges

Wed Jan 12 03:35:33 CET 2011

Hi Michael,

On 01/05/2011 11:30 AM, Michael Lawrence wrote:
> On Wed, Jan 5, 2011 at 10:44 AM, Martin Morgan<mtmorgan at fhcrc.org>  wrote:
>
>> On 01/04/2011 01:15 PM, Michael Lawrence wrote:
>>>
>>>
>>> On Tue, Jan 4, 2011 at 11:52 AM, Martin Morgan<mtmorgan at fhcrc.org
>>> <mailto:mtmorgan at fhcrc.org>>  wrote:
>>>
>>>      On 01/04/2011 11:34 AM, Jeremiah Degenhardt wrote:
>>>      >  Hello Bioconductor,
>>>      >
>>>      >  I am trying to use the function precede() with two GRanges objects
>>>      to find
>>>      >  the closest element from gr1 to an element in gr2. precede() shows
>>>      that it
>>>      >  should work with a GRanges object, however, when I try to use it
>>>      it returns
>>>      >  only a vector of "NA".
>>>      >
>>>      >  Pasted below is a simplified bit of code to reproduce the error and
>> my
>>>      >  session info.
>>>      >
>>>      >  Additionally, it would be even nicer if the function nearest() were
>>>      >  implemented for a GRanges object.
>>>      >
>>>      >  Thanks in advance for the help
>>>
>>>      Hi Jeremiah --
>>>
>>>      Provide strand information for at least one of the reads, e.g.,
>>>
>>>      >  gr1<- GRanges("chr1", IRanges(1,10), "-")
>>>      >  gr2<- GRanges("chr1", IRanges(20,30))
>>>      >  precede(gr2, gr1)
>>>      [1] 1
>>>
>>>      or do the comparison on IRanges with
>>>
>>>      >  precede(ranges(gr1), ranges(gr2))
>>>      [1] 1
>>>
>>>
>>> This does not work in general, because it discards the chromosome.
>>> GRanges is often treated as a flat RangesList/RangedData, with the
>>> strand set to "*".
>>>
>>> Jeremiah and I think it would be reasonable to treat a '*' range as a
>>> '+' range when comparing to another '*' range. That is, assume strand is
>>> unknown/irrelevant and just go by the starts and ends as if it were a
>>> Ranges object, except considering the sequence.
>>
>> Was wondering about use case details, e.g., are all query&  subject '*'?
>>
>
> Well it turns out that in our use case, they weren't all supposed to be "*",
> but there are many cases where one does not have strand information. For
> example, one might have two sets of peaks, for different TF's, and want to
> find those where one precedes the other (and then see if they are upstream
> of a positive strand gene, then do the same with follow and negative strand
> genes). As was mentioned, one could coerce the strand to "+"/"-" in that
> case to achieve the desired effect, but it's not clear why that is necessary
> and whether it even makes sense (peaks do not have a direction).

How general is the "peaks do not have a direction" statement?
In my experience (RNA-seq experiments), peaks are "stranded features"
i.e. they belong to a particular strand.

>
> One thing that often confuses me with GenomicRanges is the dual meaning of
> strand. It seems to indicate both a direction and a coordinate. It makes
> sense to me to have resize() and precede() consider strand as a direction.
> But findOverlaps() treats strands as coordinates such that items on
> different strands do not overlap. This makes less sense.

Why? IMO a positive read (i.e. a read that was aligned to the plus
strand) shouldn't hit features (genes/transcripts/exons) that are
located on the minus strand. If for whatever reason this is not what
the user wants, then s/he can always set the strand of the query and
subject to '*' but I wouldn't say this is the typical usecase.

> Another good
> example is gaps() which will yield sequence-long ranges on the strands not
> represented in the original object. This is very often undesirable. But of
> course my perspective is limited.

I agree that the behaviour of gaps() on GRanges is awkward. You not only
get those sequence-long ranges on the strands not represented in the
original object but you also get them for sequences not represented at
all (as long as they are listed in levels(seqnames(x))). But if gaps()
wasn't doing this, gaps() wouldn't be reversible anymore (i.e.
'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind of an
important property for gaps(). In particular, gaps(x) would give
the same result whether 'x' as a sequence-long range on one strand
or not (very unlikely to happen with real data though).

Maybe an extra arg to gaps() would help control this? Suggestions are
welcome. I'll make the change. Thanks!

H.

>
> Michael
>
> Martin
>>
>>> Obviously, one could often just explicitly coerce the strand to '+', but
>>> the user is not going to expect that such a coercion is required.
>>> Especially people like us who are too lazy to figure out how to find the
>>> documentation on an S4 method.
>>>
>>> Michael
>>>
>>>      nearest will be implemented.
>>>
>>>      Martin
>>>
>>>      >  Jeremiah
>>>      >
>>>      >>  gr1<- GRanges("chr1", IRanges(1,10))
>>>      >>  gr2<- GRanges("chr1", IRanges(20,30))
>>>      >>  precede(gr2, gr1)
>>>      >  [1] NA
>>>      >>
>>>      >
>>>      >>  sessionInfo()
>>>      >  R version 2.13.0 Under development (unstable) (2010-12-07 r53809)
>>>      >  Platform: x86_64-unknown-linux-gnu (64-bit)
>>>      >
>>>      >  locale:
>>>      >   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>      >   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>      >   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>      >   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>      >   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>      >  [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>      >
>>>      >  attached base packages:
>>>      >  [1] grid      stats     graphics  grDevices utils     datasets
>>>       methods
>>>      >  [8] base
>>>      >
>>>      >  other attached packages:
>>>      >   [1] motifRG_0.0.2       rtracklayer_1.11.8  RCurl_1.5-0
>>>      >   [4] bitops_1.0-4.1      seqLogo_1.17.0      chipseq_1.1.2
>>>      >   [7] ShortRead_1.9.8     Rsamtools_1.3.11    lattice_0.18-3
>>>      >  [10] BSgenome_1.19.2     Biostrings_2.19.2   GenomicRanges_1.3.5
>>>      >  [13] IRanges_1.9.16      multicore_0.1-3
>>>      >
>>>      >  loaded via a namespace (and not attached):
>>>      >  [1] Biobase_2.11.6 hwriter_1.3    tools_2.13.0   XML_3.2-0
>>>      >
>>>      >        [[alternative HTML version deleted]]
>>>      >
>>>      >  _______________________________________________
>>>      >  Bioconductor mailing list
>>>      >  Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>      >  https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>      >  Search the archives:
>>>      http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>>      --
>>>      Computational Biology
>>>      Fred Hutchinson Cancer Research Center
>>>      1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>
>>>      Location: M1-B861
>>>      Telephone: 206 667-2793
>>>
>>>      _______________________________________________
>>>      Bioconductor mailing list
>>>      Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>      https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>      Search the archives:
>>>      http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319