[R] range and intersection
Charles C. Berry
cberry at tajo.ucsd.edu
Sun Mar 14 07:45:09 CET 2010
On Sat, 13 Mar 2010, Adrian Johnson wrote:
> Hi:
>
> I have two large files (over 300K lines).
>
> file 1:
>
> Name X
> UK 199
> UK 230
> UK 139
> ......
> UAE 194
> UAE 94
>
>
>
>
> File 2:
>
> Name X Y
> UK 140 180
> UK 195 240
> UK 304 340
> ....
>
>
> I want to select X of File 1 and search if it falls in range of X and
> Y of File 2 and Print only those lines of File 1 that are in range of
> File 2 X and Y
I'd probably use findOverlaps() from the IRanges Bioconductor package.
If you want to do the UK search separately from the UAE search and so on,
the RangedData objects provided by IRanges are a nice, clean way to go.
Something like:
library(IRanges)
file1 <- read.table("File1", header=TRUE)
file2 <- read.table("File2", header=TRUE)
# one width-1 range per file1 row, grouped by Name
file1.rl <- RangedData( IRanges(start=file1$X, width=1), space = file1$Name )
# file2 ranges run from X to Y, so Y is the end, not the width
file2.rl <- RangedData( IRanges(start=file2$X, end=file2$Y),
                        space = file2$Name )
find.1.in.2 <- as.matrix( findOverlaps( file1.rl , file2.rl ) )
new.rl <- cbind( file1.rl[ find.1.in.2[,1], ],
                 file2X = start(file2.rl)[ find.1.in.2[,2] ],
                 file2Y = end(file2.rl)[ find.1.in.2[,2] ])
find.1.in.2 will be a matrix with one row for every match. The first
column will be the index of the row in file1.rl and the second that of
file2.rl.
new.rl will have one row for each match.
The order of the rows in the RangedData objects may not match the original
data frames, so beware.
For 300K rows, this would run pretty fast, I think.
(caveat: This is all untested code.)
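As a small worked illustration of the overlap idea, here is a sketch that uses plain IRanges objects split by Name instead of RangedData. The toy data frames and the helper match.one.name() are made up for this example and are not part of the original suggestion; it assumes that findOverlaps(), queryHits() and subjectHits() are available once IRanges is attached.

```r
library(IRanges)

# Toy data modelled on the question; the values are illustrative only.
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X    = c(199, 230, 139, 194, 94),
                    stringsAsFactors = FALSE)
file2 <- data.frame(Name = c("UK", "UK", "UK"),
                    X    = c(140, 195, 304),
                    Y    = c(180, 240, 340),
                    stringsAsFactors = FALSE)

# Hypothetical helper: file1 rows for one Name that fall inside a file2 range.
match.one.name <- function(nm) {
  f1 <- file1[file1$Name == nm, ]
  f2 <- file2[file2$Name == nm, ]
  if (nrow(f2) == 0) return(NULL)          # no ranges for this Name
  hits <- findOverlaps(IRanges(start = f1$X, width = 1),
                       IRanges(start = f2$X, end = f2$Y))
  # one output row per (file1 row, file2 range) match
  cbind(f1[queryHits(hits), ],
        file2X = f2$X[subjectHits(hits)],
        file2Y = f2$Y[subjectHits(hits)])
}

matched <- do.call(rbind, lapply(unique(file1$Name), match.one.name))
print(matched)
```

Working per Name on the original data frames keeps the row indices tied to file1, which avoids the reordering issue mentioned above.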
Otherwise, without the IRanges package something like
gt.x <- findInterval( file1$X, file2$X )
gt.y <- findInterval( file1$X, file2$Y )
is.in.interval <- gt.x == gt.y + 1
will work iff file2 is sorted by X (findInterval() requires sorted
breakpoints) and the intervals defined in file2 do not overlap one another.
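To see why the comparison works, here is a small sketch on made-up numbers (x1 and f2 are illustrative names, not the real files). For a value inside the k-th interval, k starts but only k-1 ends lie at or below it. Note that a value exactly equal to an interval's Y counts as outside with this test; the last line shows one possible way to make the right end inclusive, assuming integer-valued data.

```r
# Toy data; the f2 intervals are sorted by X and do not overlap.
x1 <- c(199, 230, 139, 150, 300, 180)
f2 <- data.frame(X = c(140, 195, 304), Y = c(180, 240, 340))

gt.x <- findInterval(x1, f2$X)   # number of interval starts <= x
gt.y <- findInterval(x1, f2$Y)   # number of interval ends   <= x
is.in.interval <- gt.x == gt.y + 1

print(data.frame(x = x1, gt.x, gt.y, is.in.interval))

# Inclusive right endpoint for integer-valued data: shift the ends by one.
in.closed <- gt.x == findInterval(x1, f2$Y + 1) + 1
```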
If you need to keep the 'Name's separate, you would need to roll this
into mapply().
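One way the per-Name version might look is sketched below. The toy data (including the UAE range) and the helper in.any.interval() are assumptions made for illustration; the sketch splits both files by Name, applies the findInterval() test group by group with mapply(), and puts the results back in file1 order.

```r
# Toy data; values are illustrative only.
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X    = c(199, 230, 139, 194, 94),
                    stringsAsFactors = FALSE)
file2 <- data.frame(Name = c("UK", "UK", "UK", "UAE"),
                    X    = c(140, 195, 304, 90),
                    Y    = c(180, 240, 340, 100),
                    stringsAsFactors = FALSE)

# Hypothetical helper: is each x inside one of the intervals [lo, hi)?
# Assumes the intervals within one group do not overlap.
in.any.interval <- function(x, lo, hi) {
  if (length(lo) == 0) return(rep(FALSE, length(x)))
  ord <- order(lo)               # findInterval() needs sorted breakpoints
  findInterval(x, lo[ord]) == findInterval(x, hi[ord]) + 1
}

x.by.name  <- split(file1$X, file1$Name)
nms        <- names(x.by.name)
# Use the same levels so the groups line up, even if a Name is absent in file2.
lo.by.name <- split(file2$X, factor(file2$Name, levels = nms))
hi.by.name <- split(file2$Y, factor(file2$Name, levels = nms))

hit.by.name <- mapply(in.any.interval, x.by.name, lo.by.name, hi.by.name,
                      SIMPLIFY = FALSE)

# Restore the original file1 row order and keep the matching lines.
is.in.interval <- unsplit(hit.by.name, file1$Name)
print(file1[is.in.interval, ])
```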
HTH,
Chuck
>
>
> How can it be done in R?
>
> thanks
> Adrian
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901