[R] speed up subsetting with certain conditions
Duke
duke.lists at gmx.com
Thu Jan 13 00:44:18 CET 2011
On 1/12/11 6:12 PM, Martin Morgan wrote:
> The Bioconductor project has many tools for dealing with
> sequence-related data. With the data
>
> k <- read.table(textConnection(
> "chr1 3237546 3237547 rs52310428 0 +
> chr1 3237549 3237550 rs52097582 0 +
> chr2 4513326 4513327 rs29769280 0 +
> chr2 4513337 4513338 rs33286009 0 +"))
>
> f <- read.table(textConnection(
> "chr1 3213435 G C
> chr1 3237547 T C
> chr1 3237549 G T
> chr2 4513326 A G
> chr2 4513337 C G"))
>
> One might use the GenomicRanges package as
>
> library(GenomicRanges)
> kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
> fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
> olaps <- findOverlaps(fgr, kgr)
> idx <- countOverlaps(fgr, kgr) != 0
>
> resulting in
>
> > idx
> [1] FALSE TRUE TRUE TRUE TRUE
>
> This will be fast.
Thanks so much for your suggestion, Martin. I have Bioconductor installed,
but honestly I do not know everything it can do. Anyway, I am testing
GenomicRanges with my data now. I will report back when I get the result.
>
> One could write foundY with as.data.frame(fgr[idx]) (maybe a little
> editing) but likely one would want to stay in R / Bioc and do
> something more interesting...
>
So I suppose that means foundN <- as.data.frame(fgr[!idx]) and
foundY <- as.data.frame(fgr[idx]), as you suggested, but I don't really
understand your last comment :).
Thanks,
D.
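
For reference, here is a minimal sketch of that split, assuming the small k
and f tables from the quoted message and an installed GenomicRanges. The
names hits and rsid are made up for illustration and appear nowhere in the
thread. The last step is only one guess at what "staying in R / Bioc" might
look like: it uses the findOverlaps() result to attach the matching rs
identifier to each position that was found.

```r
library(GenomicRanges)

# Example tables copied from the quoted message.
k <- read.table(text = "chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +",
                colClasses = c("character", "integer", "integer",
                               "character", "numeric", "character"))

f <- read.table(text = "chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G",
                colClasses = c("character", "integer",
                               "character", "character"))

# Ranges for the known SNPs (k) and the single-base positions to test (f).
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names = V4), V6, score = V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width = 1), V3 = V3, V4 = V4))

# TRUE where a position in f falls inside at least one range in k.
idx <- countOverlaps(fgr, kgr) != 0

# Split into found / not found, as discussed above.
foundY <- as.data.frame(fgr[idx])
foundN <- as.data.frame(fgr[!idx])

# Hypothetical follow-up: one row per overlap, annotated with the rs ID
# of the overlapping range in k.
olaps <- findOverlaps(fgr, kgr)
hits <- as.data.frame(fgr[queryHits(olaps)])
hits$rsid <- names(kgr)[subjectHits(olaps)]
```

With this example data, foundY holds the four overlapping positions and
foundN the single position that matches nothing, in line with the idx
vector shown in the quoted message.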