[R] speed up subsetting with certain conditions

Thu Jan 13 00:44:18 CET 2011

On 1/12/11 6:12 PM, Martin Morgan wrote:
> The Bioconductor project has many tools for dealing with 
> sequence-related data. With the data
>
> k <- read.table(textConnection(
> "chr1    3237546    3237547    rs52310428    0    +
> chr1    3237549    3237550    rs52097582    0    +
> chr2    4513326    4513327    rs29769280    0    +
> chr2    4513337    4513338    rs33286009    0    +"))
>
> f <- read.table(textConnection(
> "chr1    3213435    G    C
> chr1    3237547    T    C
> chr1    3237549    G    T
> chr2    4513326    A    G
> chr2    4513337    C    G"))
>
> One might use the GenomicRanges package as
>
> library(GenomicRanges)
> kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
> fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
> olaps <- findOverlaps(fgr, kgr)
> idx <- countOverlaps(fgr, kgr) != 0
>
> resulting in
>
> > idx
> [1] FALSE  TRUE  TRUE  TRUE  TRUE
>
> This will be fast.

Thanks so much for your suggestion, Martin. I have Bioconductor installed, 
but I honestly do not know everything it can do. Anyway, I am testing 
GenomicRanges with my data now. I will report back when I get the result.

>
> One could write foundY with as.data.frame(fgr[idx]) (maybe a little 
> editing) but likely one would want to stay in R / Bioc and do 
> something more interesting...
>

I suppose that means foundN <- as.data.frame(fgr[!idx]) and 
foundY <- as.data.frame(fgr[idx]), as you suggested, but I don't really 
understand your last comment :).
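
For reference, a minimal sketch of that split on the toy data from the quoted message. It assumes a reasonably recent R and GenomicRanges. The final step, which copies the rs identifiers from kgr onto the matching positions in fgr, is only one guess at what "staying in R / Bioc" might look like, and the column name rsid is made up for illustration.

```r
library(GenomicRanges)

# Toy data from the quoted message, read via read.table(text = ...).
k <- read.table(text = "
chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +", stringsAsFactors = FALSE)

# colClasses keeps the allele columns as character (a bare "T" is not
# turned into a logical value).
f <- read.table(text = "
chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G",
  colClasses = c("character", "integer", "character", "character"))

kgr <- with(k, GRanges(V1, IRanges(V2, V3, names = V4), V6, score = V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width = 1), V3 = V3, V4 = V4))

# TRUE where a position in f falls inside at least one range in k.
idx <- countOverlaps(fgr, kgr) != 0

# The split into found / not found, as plain data frames.
foundY <- as.data.frame(fgr[idx])
foundN <- as.data.frame(fgr[!idx])

# Illustrative only: stay with GRanges and attach the overlapping rs id
# to each position; "rsid" is a hypothetical column name.
olaps <- findOverlaps(fgr, kgr)
rsid <- rep(NA_character_, length(fgr))
rsid[queryHits(olaps)] <- names(kgr)[subjectHits(olaps)]
mcols(fgr)$rsid <- rsid
```

With this data foundY should hold the four overlapping positions and foundN the single position at chr1:3213435, matching the idx printed above.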

Thanks,

D.