[BioC] Using precede()/follow() to find two ranges
Martin Morgan
mtmorgan at fhcrc.org
Thu Aug 22 06:57:46 CEST 2013
On 08/21/2013 01:46 PM, Cook, Malcolm wrote:
> Well, there are many ways to skin this kind of cat (apologies to the internet).
>
> Indeed, there are many cats that are slight variations on this cat.
>
> re: "since I'm looking for regulatory elements I am not interested in this type of associations anyhow"
>
> I think that you should probably be handling this else-how. If what you mean is that there should be some minimal distance between your cgs and your genes, then, you should probably explicitly make this a requirement, and not add it as an afterthought when faced with the possibilities.
>
> That said, if you are wed to the exact to the exact results you specified, and you want explicitly to exclude overlaps with your cgs, I think you will find that you will want to compose something along these lines:
>
>
> 1) use precede and follow each twice to generate your two candidate neighbors
Not sure if this is what Malcolm meant, or if I'm being a little simplistic, but
maybe with
subj = GRanges("A", IRanges(seq(10, 50, 10), width=1), "+")
query = GRanges("A", IRanges(seq(31, 51, 10), width=1), "+")
The first range that query follows is
f1 = follow(query, subj)
and the second is
f2 = follow(subj[f1], subj)
giving the indexes into subj as
> f1
[1] 3 4 5
> f2
[1] 2 3 4
I guess there are problems when there is no following range (so need to check
for NA's before doing the second follow?
Another strategy might simply
sort(subj)
and then f2 = f1 - 1 (except when seqnames(subj)[f1] != seqnames(subj)[f2]).
Probably strand will confusing things!
Martin
>
> 2) use the range or these candidates and search using findOverlaps within them to see if there are others that were skipped
>
> Cheers,
>
> Malcolm
>
> From: d r [mailto:dolevrahat at gmail.com]
> Sent: Wednesday, August 21, 2013 2:57 PM
> To: Cook, Malcolm
> Cc: Steve Lianoglou; bioconductor at r-project.org list
> Subject: Re: [BioC] Using precede()/follow() to find two ranges
>
> Hi Malcolm
> Thanks for your thoughtfulness.
> For the specific problem I'm working on I intend to consider only the TSSs of the genes, which I will compare against illumina 450k probes, which are also single nucleotide ranges.
> Therefore, I do not think that overlapping ranges should be an issue here. Of course, it is theoretically possible that a probe may overlap a TSS, but since I'm looking for regulatory elements I am not interested in this type of associations anyhow.
> So, any insight you may be able to offer on this problem will be highly appreciated.
>
> However, I think that for my particular problem overlapping ranges should not be an issue.
>
> On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm <MEC at stowers.org<mailto:MEC at stowers.org>> wrote:
> Dolev,
>
> Before chiming in on the problem as it is currently framed, I want to make sure of something.
>
> The documentation for precede and follow read 'Overlapping ranges are excluded.'.
>
> For instance if one of your cg ranges were to overlap one of your gene ranges, any method based on precede/follow will not discover this association.
>
> Is this really desirable for your application?
>
> -Malcolm
>
>
> >-----Original Message-----
> >From: bioconductor-bounces at r-project.org<mailto:bioconductor-bounces at r-project.org> [mailto:bioconductor-bounces at r-project.org<mailto:bioconductor-bounces at r-project.org>] On Behalf Of d r
> >Sent: Wednesday, August 21, 2013 12:15 PM
> >To: Steve Lianoglou
> >Cc: bioconductor at r-project.org<mailto:bioconductor at r-project.org> list
> >Subject: Re: [BioC] Using precede()/follow() to find two ranges
> >
> >Hi Steve and all
> >
> >Thanks for your suggestion. I was in fact thinking in this direction.
> >However, by completly discarding the first set of hits before querying the
> >second time I run the risk of missing ranges that may be potential second
> >hits beacuse they were discarded after being a first hit for other ranges.
> >For examle, if I have these two GRanges:
> >
> >gr1:
> >
> >GRanges with 2 ranges and 1 metadata column:
> >
> > seqnames ranges strand | names
> >
> > <Rle> <IRanges> <Rle> | <factor>
> >
> > [1] chr6 [125284212, 125284212] * | cg00991794
> >
> > [2] chr6 [150465049, 150465049] * | cg02250071
> >
> >
> >
> >
> >
> >gr2:
> >
> >GRanges with 4 ranges and 1 metadata column:
> >
> > seqnames ranges strand | gene
> >
> > <Rle> <IRanges> <Rle> | <factor>
> >
> > [1] chr6 [126284212, 124284212] + | PGM3
> >
> > [2] chr6 [160920998, 150920998] + | PLEKHG1
> >
> > [3] chr6 [ 83903012, 83903012] + | PGM3
> >
> > [4] chr6 [190102159, 170102159] + | WDR27
> >
> >The first hits will be gr2[1] and gr2[2], which will both be discarded.
> >
> >Now if I call precede() again the hits I will get will be gr2[3] and
> >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for
> >gr1[1].
> >
> >
> >I was thinking that applying a function that will do what Steve
> >suggested might do the trick if it can run on one range of gr1 at a
> >time without modifying gr2
> >
> >something along the lines of:
> >
> >two_hits_apply<-function(gr1,gr2)
> >{
> >
> >p1<-precede(gr1,gr2)
> >
> >gr2.less<-gr2[-p1]
> >
> >p2<-precede(gr1,gr2.less)
> >
> >hits<-c(p1,p2)
> >
> >hits
> >
> >}
> >
> >now all I need is a way to apply a functuon over two GRanegs object.
> >If I get it right, I will need somehting like mapply(), but that can
> >actuaaly work on GRanges.
> >
> >Is such a function exists, or alternativly, is there a way to do this
> >with the convential apply functions that I miss?
> >
> >Many thanks in advance
> >Dolev
> >
> >
> >
> >
> >
> >
> >
> >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou
> ><lianoglou.steve at gene.com<mailto:lianoglou.steve at gene.com>>wrote:
> >
> >> Hi,
> >>
> >> On Wed, Aug 21, 2013 at 7:25 AM, d r <dolevrahat at gmail.com<mailto:dolevrahat at gmail.com>> wrote:
> >> > Hello
> >> >
> >> > I am looking for a way to find the two preceiding/folliwng ranges in one
> >> > GRanges object to each range in a second GRanges object.
> >> >
> >> > In other words, I am looking for a variation on precede() or follow()
> >> that
> >> > will return for each range in x two ranges in subject instead of one: the
> >> > nearest preceidng(following) range and the second nearest preceding
> >> > (following) range.
> >> >
> >> > Is there a way to do this? (sorry for not giving any more constructive
> >> > suggestions, I know relativily little on how GRanges works any therefore
> >> > am at a loss as to how to proceed)
> >>
> >> A first draft way of doing that would be to simply use the results
> >> from the first precede to remove those elements from the second call?
> >>
> >> For instance, if you have two ranges you are querying, gr1 and gr2:
> >>
> >> To get the immediately preceding ranges:
> >>
> >> R> p1 <- precede(gr1, gr2)
> >>
> >> Then to get the ones immediately after that:
> >>
> >> R> gr2.less <- gr2[-p1]
> >> R> p2 <- precede(gr1, gr2.less)
> >>
> >> Then you can see who is who with `gr2[p1]` and the who-who is who with
> >> `gr2.less[p2]` ... that should get you pretty close -- will likely
> >> have to handle edge cases, for instance when there are no preceding
> >> ranges, I think you get an NA (if I recall) so think about what you
> >> want to do with those.
> >>
> >> -steve
> >>
> >> --
> >> Steve Lianoglou
> >> Computational Biologist
> >> Bioinformatics and Computational Biology
> >> Genentech
> >>
> >
> > [[alternative HTML version deleted]]
> >
> >_______________________________________________
> >Bioconductor mailing list
> >Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
> >https://stat.ethz.ch/mailman/listinfo/bioconductor
> >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list