[R] range and intersection

Charles C. Berry cberry at tajo.ucsd.edu
Sun Mar 14 07:45:09 CET 2010


On Sat, 13 Mar 2010, Adrian Johnson wrote:

> Hi:
>
> I have two large files (over 300K lines).
>
> file 1:
>
> Name    X
> UK       199
> UK       230
> UK       139
> ......
> UAE    194
> UAE     94
>
>
>
>
> File 2:
>
> Name   X    Y
> UK    140   180
> UK    195    240
> UK    304    340
> ....
>
>
> I want to select X of File 1 and search if it falls in range of X and
> Y of File 2 and Print only those lines of File 1 that are in range of
> File 2 X and Y

Probably, I'd use findOverlaps() in the IRanges Bioconductor package.

If you want to do the UK search apart from the UAE search and so on, the 
use of RangedData objects provided by IRanges is a nice, clean way to go.

Something like:

library(IRanges)

file1 <- read.table("File1", header=TRUE)
file2 <- read.table("File2", header=TRUE)

file1.rl <- RangedData( IRanges(start=file1$X, width=1), space = file1$Name )
file2.rl <- RangedData( IRanges(start=file2$X, end=file2$Y),
 			space = file2$Name )

find.1.in.2 <- as.matrix( findOverlaps( file1.rl , file2.rl ) )

new.rl <- cbind( file1.rl[ find.1.in.2[,1], ],
 			file2X = start(file2.rl)[ find.1.in.2[,2] ],
 			file2Y = end(file2.rl)[ find.1.in.2[,2] ])

find.1.in.2 will be a matrix with one row for every match. The first 
column will be the index of the row in file1.rl and the second that of 
file2.rl.

new.rl will have one row for each match.

The order of the rows in the RangedData objects may not match the original 
data frames, so beware.

For 300K rows, this would run pretty fast, I think.

(caveat: This is all untested code.)
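
To check the match matrix on a small subset, a brute-force version in base R 
may help. This is only a sketch: the toy values are taken from the rows shown 
in the question, and the names hit, brute.1.in.2 and matched are made up for 
illustration. It builds a full logical matrix with one cell per pair of rows, 
so it is not something to run on the full 300K lines.

```r
# Toy data modelled on the example rows in the question (illustrative only).
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X    = c(199, 230, 139, 194, 94),
                    stringsAsFactors = FALSE)
file2 <- data.frame(Name = c("UK", "UK", "UK"),
                    X    = c(140, 195, 304),
                    Y    = c(180, 240, 340),
                    stringsAsFactors = FALSE)

# hit[i, j] is TRUE when row i of file1 lies in the closed range [X, Y]
# of row j of file2 and both rows carry the same Name.
hit <- outer(file1$X, file2$X, ">=") &
       outer(file1$X, file2$Y, "<=") &
       outer(file1$Name, file2$Name, "==")

# One row per match: column 1 indexes file1, column 2 indexes file2.
brute.1.in.2 <- which(hit, arr.ind = TRUE)
colnames(brute.1.in.2) <- c("file1.row", "file2.row")

# The matching file1 lines, with the file2 range each one fell into.
matched <- cbind(file1[brute.1.in.2[, 1], ],
                 file2X = file2$X[brute.1.in.2[, 2]],
                 file2Y = file2$Y[brute.1.in.2[, 2]])
```

With these toy rows only the UK values 199 and 230 match, both in the 
195 to 240 range.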

Otherwise, without the IRanges package something like


gt.x <- findInterval( file1$X, file2$X )
gt.y <- findInterval( file1$X, file2$Y )

is.in.interval <- gt.x == gt.y + 1

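will work iff file2 is sorted by X (findInterval() requires sorted break 
points) and the intervals defined in file2 do not overlap one another. Note 
that a value exactly equal to an interval's Y is not counted as inside.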
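
A small worked example of the findInterval() idea, as a sketch on made-up 
values (the vector x and the name is.in.closed are only for illustration). It 
also shows one way to count a value equal to Y as inside, using the left.open 
argument of findInterval(), which assumes R 3.3.0 or later.

```r
# file2 intervals, sorted by X and non-overlapping (illustrative values).
file2 <- data.frame(X = c(140, 195, 304), Y = c(180, 240, 340))
x <- c(199, 230, 139, 180, 194)

# Check the two assumptions the trick relies on.
stopifnot(!is.unsorted(file2$X),                        # sorted starts
          all(head(file2$Y, -1) < tail(file2$X, -1)))   # neighbours do not overlap

gt.x <- findInterval(x, file2$X)   # number of interval starts <= x
gt.y <- findInterval(x, file2$Y)   # number of interval ends   <= x

# x has passed one more start than end, so it sits inside an interval.
# The end point Y itself is excluded here (x = 180 gives FALSE).
is.in.interval <- gt.x == gt.y + 1

# Variant that treats Y as inside: count only the ends strictly below x.
# Assumes R >= 3.3.0 for the left.open argument.
gt.y.open <- findInterval(x, file2$Y, left.open = TRUE)
is.in.closed <- gt.x == gt.y.open + 1
```

Where the result is TRUE, gt.x is the row of file2 that holds the value.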

If you need to keep the 'Name's separate, you would need to roll this into 
mapply(), applying it to each Name's rows of file1 and file2 in turn.
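
One possible way to do that, again as an untested-on-real-data sketch: split 
both data frames by Name and let mapply() walk over the matching pieces. The 
helper in.range(), the toy values (including the UAE row in file2) and the 
other names are invented for illustration, and the left.open argument assumes 
R 3.3.0 or later. Names that occur in only one file are dropped.

```r
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X    = c(199, 230, 139, 194, 94),
                    stringsAsFactors = FALSE)
file2 <- data.frame(Name = c("UK", "UK", "UK", "UAE"),
                    X    = c(140, 195, 304, 90),
                    Y    = c(180, 240, 340, 100),
                    stringsAsFactors = FALSE)

# TRUE where x lies in one of the closed intervals [lo, hi].
# Assumes the intervals do not overlap; sorts them by start first.
in.range <- function(x, lo, hi) {
  ord <- order(lo)
  lo <- lo[ord]
  hi <- hi[ord]
  findInterval(x, lo) == findInterval(x, hi, left.open = TRUE) + 1
}

# Only Names present in both files can produce a match.
names.both <- intersect(unique(file1$Name), unique(file2$Name))
f1 <- split(file1, file1$Name)[names.both]
f2 <- split(file2, file2$Name)[names.both]

# For each Name, keep the file1 rows whose X falls in that Name's ranges.
kept <- mapply(function(a, b) a[in.range(a$X, b$X, b$Y), , drop = FALSE],
               f1, f2, SIMPLIFY = FALSE)
result <- do.call(rbind, kept)
```

Here result holds the file1 lines UK 199, UK 230 and UAE 94.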

HTH,

Chuck

>
>
> How can it be done in R?
>
> thanks
> Adrian

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901


