[BioC] Using precede()/follow() to find two ranges

Thu Aug 22 06:57:46 CEST 2013

On 08/21/2013 01:46 PM, Cook, Malcolm wrote:
> Well, there are many ways to skin this kind of cat (apologies to the internet).
>
> Indeed, there are many cats that are slight variations on this cat.
>
> re: "since I'm looking for regulatory elements I am not interested in this type of associations anyhow"
>
> I think that you should probably be handling this else-how.  If what you mean is that there should be some minimal distance between your cgs and your genes, then, you should probably explicitly make this a requirement, and not add it as an afterthought when faced with the possibilities.
>
> That said, if you are wed to the exact to the exact results you specified, and you want explicitly to exclude overlaps with your cgs, I think you will find that you will want to compose something along these lines:
>
>
> 1)      use precede and follow each twice to generate your two candidate neighbors

Not sure if this is what Malcolm meant, or if I'm being a little simplistic, but 
maybe with

     subj = GRanges("A", IRanges(seq(10, 50, 10), width=1), "+")
     query = GRanges("A", IRanges(seq(31, 51, 10), width=1), "+")

The first range that query follows is

     f1 = follow(query, subj)

and the second  is

     f2 = follow(subj[f1], subj)

giving the indexes into subj as

     > f1
     [1] 3 4 5
     > f2
     [1] 2 3 4

I guess there are problems when there is no following range (so need to check 
for NA's before doing the second follow?

Another strategy might simply

     sort(subj)

and then f2 = f1 - 1 (except when seqnames(subj)[f1] != seqnames(subj)[f2]). 
Probably strand will confusing things!

Martin

>
> 2)      use the range or these candidates and search using findOverlaps within them to see if there are others that were skipped
>
> Cheers,
>
> Malcolm
>
> From: d r [mailto:dolevrahat at gmail.com]
> Sent: Wednesday, August 21, 2013 2:57 PM
> To: Cook, Malcolm
> Cc: Steve Lianoglou; bioconductor at r-project.org list
> Subject: Re: [BioC] Using precede()/follow() to find two ranges
>
> Hi Malcolm
> Thanks for your thoughtfulness.
> For the specific problem I'm working on I intend to consider only the TSSs of the genes, which I will compare against illumina 450k probes, which are also single nucleotide ranges.
> Therefore, I do not think that overlapping ranges should be an issue here. Of course, it is theoretically possible that a probe may overlap a TSS, but since I'm looking for regulatory elements I am not interested in this type of associations anyhow.
> So, any insight you may be able to offer on this problem will be highly appreciated.
>
> However, I think that for my particular problem overlapping ranges should not be an issue.
>
> On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm <MEC at stowers.org<mailto:MEC at stowers.org>> wrote:
> Dolev,
>
> Before chiming in on the problem as it is currently framed, I want to make sure of something.
>
> The documentation for precede and follow read 'Overlapping ranges are excluded.'.
>
> For instance if one of your cg ranges were to overlap one of your gene ranges, any method based on precede/follow will not discover this association.
>
> Is this really desirable for your application?
>
> -Malcolm
>
>
>   >-----Original Message-----
>   >From: bioconductor-bounces at r-project.org<mailto:bioconductor-bounces at r-project.org> [mailto:bioconductor-bounces at r-project.org<mailto:bioconductor-bounces at r-project.org>] On Behalf Of d r
>   >Sent: Wednesday, August 21, 2013 12:15 PM
>   >To: Steve Lianoglou
>   >Cc: bioconductor at r-project.org<mailto:bioconductor at r-project.org> list
>   >Subject: Re: [BioC] Using precede()/follow() to find two ranges
>   >
>   >Hi Steve and all
>   >
>   >Thanks for your suggestion. I was in fact thinking in this direction.
>   >However, by completly discarding the first set of hits before querying the
>   >second time I run the risk of missing ranges that may be potential second
>   >hits beacuse they were discarded after being a first hit for other ranges.
>   >For examle, if I have these two GRanges:
>   >
>   >gr1:
>   >
>   >GRanges with 2 ranges and 1 metadata column:
>   >
>   >       seqnames                 ranges strand |      names
>   >
>   >          <Rle>              <IRanges>  <Rle> |   <factor>
>   >
>   >   [1]     chr6 [125284212, 125284212]      * | cg00991794
>   >
>   >   [2]     chr6 [150465049, 150465049]      * | cg02250071
>   >
>   >
>   >
>   >
>   >
>   >gr2:
>   >
>   >GRanges with 4 ranges and 1 metadata column:
>   >
>   >       seqnames                 ranges strand |      gene
>   >
>   >          <Rle>              <IRanges>  <Rle> |  <factor>
>   >
>   >   [1]     chr6 [126284212, 124284212]      + |      PGM3
>   >
>   >   [2]     chr6 [160920998, 150920998]      + |   PLEKHG1
>   >
>   >   [3]     chr6 [ 83903012,  83903012]      + |      PGM3
>   >
>   >   [4]     chr6 [190102159, 170102159]       + |    WDR27
>   >
>   >The first hits will be gr2[1] and gr2[2], which will both be discarded.
>   >
>   >Now if I call precede() again the hits I will get will be gr2[3] and
>   >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for
>   >gr1[1].
>   >
>   >
>   >I was thinking that applying a function that will do what Steve
>   >suggested might do the trick if it can run on one range of gr1 at a
>   >time without modifying gr2
>   >
>   >something along the lines of:
>   >
>   >two_hits_apply<-function(gr1,gr2)
>   >{
>   >
>   >p1<-precede(gr1,gr2)
>   >
>   >gr2.less<-gr2[-p1]
>   >
>   >p2<-precede(gr1,gr2.less)
>   >
>   >hits<-c(p1,p2)
>   >
>   >hits
>   >
>   >}
>   >
>   >now all I need is a way to apply a functuon over two GRanegs object.
>   >If I get it right, I will need somehting like mapply(), but that can
>   >actuaaly work on GRanges.
>   >
>   >Is such a function exists, or alternativly, is there a way to do this
>   >with the convential apply functions that I miss?
>   >
>   >Many thanks in advance
>   >Dolev
>   >
>   >
>   >
>   >
>   >
>   >
>   >
>   >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou
>   ><lianoglou.steve at gene.com<mailto:lianoglou.steve at gene.com>>wrote:
>   >
>   >> Hi,
>   >>
>   >> On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat at gmail.com<mailto:dolevrahat at gmail.com>> wrote:
>   >> > Hello
>   >> >
>   >> > I am looking for a way to find the two preceiding/folliwng ranges in one
>   >> > GRanges object to each range in a second GRanges object.
>   >> >
>   >> > In other words, I am looking for a variation on precede() or follow()
>   >> that
>   >> > will return for each range in x two ranges in subject instead of one: the
>   >> > nearest preceidng(following) range and the second nearest preceding
>   >> > (following) range.
>   >> >
>   >> > Is there a way to do this? (sorry for not giving any more constructive
>   >> > suggestions, I know relativily  little on how GRanges works any therefore
>   >> > am at a loss as to how to proceed)
>   >>
>   >> A first draft way of doing that would be to simply use the results
>   >> from the first precede to remove those elements from the second call?
>   >>
>   >> For instance, if you have two ranges you are querying, gr1 and gr2:
>   >>
>   >> To get the immediately preceding ranges:
>   >>
>   >> R> p1 <- precede(gr1, gr2)
>   >>
>   >> Then to get the ones immediately after that:
>   >>
>   >> R> gr2.less <- gr2[-p1]
>   >> R> p2 <- precede(gr1, gr2.less)
>   >>
>   >> Then you can see who is who with `gr2[p1]` and the who-who is who with
>   >> `gr2.less[p2]` ... that should get you pretty close -- will likely
>   >> have to handle edge cases, for instance when there are no preceding
>   >> ranges, I think you get an NA (if I recall) so think about what you
>   >> want to do with those.
>   >>
>   >> -steve
>   >>
>   >> --
>   >> Steve Lianoglou
>   >> Computational Biologist
>   >> Bioinformatics and Computational Biology
>   >> Genentech
>   >>
>   >
>   >      [[alternative HTML version deleted]]
>   >
>   >_______________________________________________
>   >Bioconductor mailing list
>   >Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>   >https://stat.ethz.ch/mailman/listinfo/bioconductor
>   >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793