[BioC] Using precede()/follow() to find two ranges
Cook, Malcolm
MEC at stowers.org
Thu Aug 22 16:57:19 CEST 2013
Dolev,
If your requiremnts should change to include a maximum radius (say, 1000) around your query that you will consider subject hits, you might instead use and approach
ol.df<-within(as.data.frame(findOverlaps(resize(query,1000,use.names=FALSE),subj)),d<-distance(query[queryHits],subj[subjectHits]))
now, use your favorite R approach to filter ol.df based on your other criteria.
Just sayin,
Malcolm
>-----Original Message-----
>From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Cook, Malcolm
>Sent: Thursday, August 22, 2013 9:25 AM
>To: 'd r'; 'Martin Morgan'
>Cc: 'Steve Lianoglou'; 'bioconductor at r-project.org list'
>Subject: Re: [BioC] Using precede()/follow() to find two ranges
>
>Dolev,
>
>If there are no overlaps within the ranges of your subjects then approaches involving precede and follow will of course not be
>effected by overlaps between ranges in your subjects, since, well, there aren't any.
>
>Still be careful with approaches suing precede/follow alone since any of your 1 nt wide query ranges which are themselves exactly
>present in your 1 nt-wide subject ranges won't be found using precede/follow.
>
>~Malcolm
>
>
>From: d r [mailto:dolevrahat at gmail.com]
>Sent: Thursday, August 22, 2013 2:40 AM
>To: Martin Morgan
>Cc: Cook, Malcolm; Steve Lianoglou; bioconductor at r-project.org list
>Subject: Re: [BioC] Using precede()/follow() to find two ranges
>
>Thanks Martin, this sound like a useful approach. I will try to implement it.
>Malcolm - I am not sure that I understood you're comments.
>I do not "want explicitly to exclude overlaps with my cgs". Rather, given that my particular problem invloves intersecting single
>nuclotide ranges,
> I do not think that the problem of overlaps is likely to arise. Am I missing something here?
>
>Thanks agian for all of your for being so helpful
>Dolev
>
>On Thu, Aug 22, 2013 at 7:57 AM, Martin Morgan <mtmorgan at fhcrc.org<mailto:mtmorgan at fhcrc.org>> wrote:
>On 08/21/2013 01:46 PM, Cook, Malcolm wrote:
>Well, there are many ways to skin this kind of cat (apologies to the internet).
>
>Indeed, there are many cats that are slight variations on this cat.
>
>re: "since I'm looking for regulatory elements I am not interested in this type of associations anyhow"
>
>I think that you should probably be handling this else-how. If what you mean is that there should be some minimal distance between
>your cgs and your genes, then, you should probably explicitly make this a requirement, and not add it as an afterthought when faced
>with the possibilities.
>
>That said, if you are wed to the exact to the exact results you specified, and you want explicitly to exclude overlaps with your cgs, I
>think you will find that you will want to compose something along these lines:
>
>
>1) use precede and follow each twice to generate your two candidate neighbors
>
>Not sure if this is what Malcolm meant, or if I'm being a little simplistic, but maybe with
>
> subj = GRanges("A", IRanges(seq(10, 50, 10), width=1), "+")
> query = GRanges("A", IRanges(seq(31, 51, 10), width=1), "+")
>
>The first range that query follows is
>
> f1 = follow(query, subj)
>
>and the second is
>
> f2 = follow(subj[f1], subj)
>
>giving the indexes into subj as
>
> > f1
> [1] 3 4 5
> > f2
> [1] 2 3 4
>
>I guess there are problems when there is no following range (so need to check for NA's before doing the second follow?
>
>Another strategy might simply
>
> sort(subj)
>
>and then f2 = f1 - 1 (except when seqnames(subj)[f1] != seqnames(subj)[f2]). Probably strand will confusing things!
>
>Martin
>
>2) use the range or these candidates and search using findOverlaps within them to see if there are others that were skipped
>
>Cheers,
>
>Malcolm
>
>From: d r [mailto:dolevrahat at gmail.com<mailto:dolevrahat at gmail.com>]
>Sent: Wednesday, August 21, 2013 2:57 PM
>To: Cook, Malcolm
>Cc: Steve Lianoglou; bioconductor at r-project.org<mailto:bioconductor at r-project.org> list
>Subject: Re: [BioC] Using precede()/follow() to find two ranges
>
>Hi Malcolm
>Thanks for your thoughtfulness.
>For the specific problem I'm working on I intend to consider only the TSSs of the genes, which I will compare against illumina 450k
>probes, which are also single nucleotide ranges.
>Therefore, I do not think that overlapping ranges should be an issue here. Of course, it is theoretically possible that a probe may
>overlap a TSS, but since I'm looking for regulatory elements I am not interested in this type of associations anyhow.
>So, any insight you may be able to offer on this problem will be highly appreciated.
>
>However, I think that for my particular problem overlapping ranges should not be an issue.
>On Wed, Aug 21, 2013 at 8:59 PM, Cook, Malcolm
><MEC at stowers.org<mailto:MEC at stowers.org><mailto:MEC at stowers.org<mailto:MEC at stowers.org>>> wrote:
>Dolev,
>
>Before chiming in on the problem as it is currently framed, I want to make sure of something.
>
>The documentation for precede and follow read 'Overlapping ranges are excluded.'.
>
>For instance if one of your cg ranges were to overlap one of your gene ranges, any method based on precede/follow will not discover
>this association.
>
>Is this really desirable for your application?
>
>-Malcolm
>
>
> >-----Original Message-----
> >From: bioconductor-bounces at r-project.org<mailto:bioconductor-bounces at r-project.org><mailto:bioconductor-bounces at r-
>project.org<mailto:bioconductor-bounces at r-project.org>> [mailto:bioconductor-bounces at r-project.org<mailto:bioconductor-
>bounces at r-project.org><mailto:bioconductor-bounces at r-project.org<mailto:bioconductor-bounces at r-project.org>>] On Behalf Of
>d r
> >Sent: Wednesday, August 21, 2013 12:15 PM
> >To: Steve Lianoglou
> >Cc: bioconductor at r-project.org<mailto:bioconductor at r-project.org><mailto:bioconductor at r-project.org<mailto:bioconductor at r-
>project.org>> list
> >Subject: Re: [BioC] Using precede()/follow() to find two ranges
> >
> >Hi Steve and all
> >
> >Thanks for your suggestion. I was in fact thinking in this direction.
> >However, by completly discarding the first set of hits before querying the
> >second time I run the risk of missing ranges that may be potential second
> >hits beacuse they were discarded after being a first hit for other ranges.
> >For examle, if I have these two GRanges:
> >
> >gr1:
> >
> >GRanges with 2 ranges and 1 metadata column:
> >
> > seqnames ranges strand | names
> >
> > <Rle> <IRanges> <Rle> | <factor>
> >
> > [1] chr6 [125284212, 125284212] * | cg00991794
> >
> > [2] chr6 [150465049, 150465049] * | cg02250071
> >
> >
> >
> >
> >
> >gr2:
> >
> >GRanges with 4 ranges and 1 metadata column:
> >
> > seqnames ranges strand | gene
> >
> > <Rle> <IRanges> <Rle> | <factor>
> >
> > [1] chr6 [126284212, 124284212] + | PGM3
> >
> > [2] chr6 [160920998, 150920998] + | PLEKHG1
> >
> > [3] chr6 [ 83903012, 83903012] + | PGM3
> >
> > [4] chr6 [190102159, 170102159] + | WDR27
> >
> >The first hits will be gr2[1] and gr2[2], which will both be discarded.
> >
> >Now if I call precede() again the hits I will get will be gr2[3] and
> >gr[4], instead of getting again gr2[1] for gr1[2] and gr2[2] for
> >gr1[1].
> >
> >
> >I was thinking that applying a function that will do what Steve
> >suggested might do the trick if it can run on one range of gr1 at a
> >time without modifying gr2
> >
> >something along the lines of:
> >
> >two_hits_apply<-function(gr1,gr2)
> >{
> >
> >p1<-precede(gr1,gr2)
> >
> >gr2.less<-gr2[-p1]
> >
> >p2<-precede(gr1,gr2.less)
> >
> >hits<-c(p1,p2)
> >
> >hits
> >
> >}
> >
> >now all I need is a way to apply a functuon over two GRanegs object.
> >If I get it right, I will need somehting like mapply(), but that can
> >actuaaly work on GRanges.
> >
> >Is such a function exists, or alternativly, is there a way to do this
> >with the convential apply functions that I miss?
> >
> >Many thanks in advance
> >Dolev
> >
> >
> >
> >
> >
> >
> >
> >On Wed, Aug 21, 2013 at 7:10 PM, Steve Lianoglou
>
>><lianoglou.steve at gene.com<mailto:lianoglou.steve at gene.com><mailto:lianoglou.steve at gene.com<mailto:lianoglou.steve at gene.
>com>>>wrote:
> >
> >> Hi,
>
> >>
> >> On Wed, Aug 21, 2013 at 7:25 AM, d r
><dolevrahat at gmail.com<mailto:dolevrahat at gmail.com><mailto:dolevrahat at gmail.com<mailto:dolevrahat at gmail.com>>> wrote:
> >> > Hello
> >> >
> >> > I am looking for a way to find the two preceiding/folliwng ranges in one
> >> > GRanges object to each range in a second GRanges object.
> >> >
> >> > In other words, I am looking for a variation on precede() or follow()
> >> that
> >> > will return for each range in x two ranges in subject instead of one: the
> >> > nearest preceidng(following) range and the second nearest preceding
> >> > (following) range.
> >> >
> >> > Is there a way to do this? (sorry for not giving any more constructive
> >> > suggestions, I know relativily little on how GRanges works any therefore
> >> > am at a loss as to how to proceed)
> >>
> >> A first draft way of doing that would be to simply use the results
> >> from the first precede to remove those elements from the second call?
> >>
> >> For instance, if you have two ranges you are querying, gr1 and gr2:
> >>
> >> To get the immediately preceding ranges:
> >>
> >> R> p1 <- precede(gr1, gr2)
> >>
> >> Then to get the ones immediately after that:
> >>
> >> R> gr2.less <- gr2[-p1]
> >> R> p2 <- precede(gr1, gr2.less)
> >>
> >> Then you can see who is who with `gr2[p1]` and the who-who is who with
> >> `gr2.less[p2]` ... that should get you pretty close -- will likely
> >> have to handle edge cases, for instance when there are no preceding
> >> ranges, I think you get an NA (if I recall) so think about what you
> >> want to do with those.
> >>
> >> -steve
> >>
> >> --
> >> Steve Lianoglou
> >> Computational Biologist
> >> Bioinformatics and Computational Biology
> >> Genentech
> >>
> >
> > [[alternative HTML version deleted]]
> >
> >_______________________________________________
> >Bioconductor mailing list
> >Bioconductor at r-project.org<mailto:Bioconductor at r-project.org><mailto:Bioconductor at r-project.org<mailto:Bioconductor at r-
>project.org>>
>
> >https://stat.ethz.ch/mailman/listinfo/bioconductor
> >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> [[alternative HTML version deleted]]
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>--
>Computational Biology / Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N.
>PO Box 19024 Seattle, WA 98109
>
>Location: Arnold Building M1 B861
>Phone: (206) 667-2793<tel:%28206%29%20667-2793>
>
>
> [[alternative HTML version deleted]]
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list