[R] range and intersection

Charles C. Berry cberry at tajo.ucsd.edu
Sun Mar 14 07:56:04 CET 2010


Typo corrected below.

On Sat, 13 Mar 2010, Charles C. Berry wrote:

> On Sat, 13 Mar 2010, Adrian Johnson wrote:
>
>>  Hi:
>>
>>  I have two large files (over 300K lines each).
>>
>>  file 1:
>>
>>  Name    X
>>  UK       199
>>  UK       230
>>  UK       139
>>  ......
>>  UAE    194
>>  UAE     94
>> 
>> 
>> 
>>
>>  File 2:
>>
>>  Name   X    Y
>>  UK    140   180
>>  UK    195    240
>>  UK    304    340
>>  ....
>> 
>>
>>  I want to take each X in File 1, check whether it falls in the range
>>  X to Y of any line of File 2, and print only those lines of File 1
>>  whose X falls inside such a range.
>
> Probably, I'd use findOverlaps() in the IRanges Bioconductor package.
>
> If you want to do the UK search apart from the UAE search and so on, the use 
> of RangedData objects provided by IRanges is a nice, clean way to go.
>
> Something like:
>
> library(IRanges)
>
> file1 <- read.table("File1", header=TRUE)
> file2 <- read.table("File2", header=TRUE)
>
> file1.rl <- RangedData( IRanges(start=file1$X, width=1), space = Name )
> file2.rl <- RangedData( IRanges(start=file2$X, width=file2$Y),
> 			 space = Name )
>

Correct the above to:

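# File 1 holds single points, so each range has width 1.
file1.rl <- RangedData( IRanges(start=file1$X, width=1),
			space = file1$Name )
# File 2 gives a start (X) and an end (Y), so use end= rather than width=.
file2.rl <- RangedData( IRanges(start=file2$X, end=file2$Y),
			space = file2$Name )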

Chuck


> find.1.in.2 <- as.matrix( findOverlaps( file1.rl , file2.rl ) )
>
> new.rl <- cbind( file1.rl[ find.1.in.2[,1], ],
> 			 file2X = start(file2.rl)[ find.1.in.2[,2] ],
> 			 file2Y = end(file2.rl)[ find.1.in.2[,2] ])
>
> find.1.in.2 will be a matrix with one row for every match. The first column 
> will be the index of the row in file1.rl and the second that of file2.rl.
>
> new.rl will have one row for each match.
>
> The order of the rows in the RangedData objects may not match the original 
> data frames, so beware.
>
> For 300K rows, this would run pretty fast, I think.
>
> (caveat: This is all untested code.)
>
> Otherwise, without the IRanges package something like
>
>
> gt.x <- findInterval( file1$X, file2$X )
> gt.y <- findInterval( file1$X, file2$Y )
>
> is.in.interval <- gt.x == gt.y + 1
>
> will work iff the intervals defined in file2 are sorted by X and do not
> overlap one another (findInterval() requires a sorted second argument).
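
A minimal base-R sketch of the findInterval() test, using only the UK rows shown in the question as toy data. It assumes the File 2 intervals do not overlap. Note that a point exactly equal to an interval's Y counts as outside, since findInterval() treats each interval as closed on the left and open on the right.

```r
# Toy data: the UK rows shown in the question.
file1 <- data.frame(Name = "UK", X = c(199, 230, 139))
file2 <- data.frame(Name = "UK", X = c(140, 195, 304), Y = c(180, 240, 340))

# findInterval() needs its second argument in non-decreasing order.
file2 <- file2[order(file2$X), ]

gt.x <- findInterval(file1$X, file2$X)  # number of interval starts <= X
gt.y <- findInterval(file1$X, file2$Y)  # number of interval ends   <= X

# A point is inside an interval when it has passed exactly one more
# start than end.
is.in.interval <- gt.x == gt.y + 1

file1[is.in.interval, ]  # keeps the rows with X = 199 and X = 230
```
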
>
> If you need to keep the 'Name's separate, you would need to roll this into 
> mapply().
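
One way the mapply() idea might look, as a sketch only. The helper name in.range is hypothetical, not something from the post. The toy data are the rows shown in the question; the File 2 rows for UAE are elided there, so UAE has no intervals here and none of its points match.

```r
# Hypothetical helper: the findInterval() test for the rows of one Name.
# Assumes the intervals (lo, hi) within a Name do not overlap.
in.range <- function(x, lo, hi) {
  if (length(lo) == 0) return(rep(FALSE, length(x)))  # no intervals for this Name
  ord <- order(lo)                                    # findInterval() needs sorted input
  findInterval(x, lo[ord]) == findInterval(x, hi[ord]) + 1
}

file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X = c(199, 230, 139, 194, 94))
file2 <- data.frame(Name = c("UK", "UK", "UK"),
                    X = c(140, 195, 304), Y = c(180, 240, 340))

# Use one common set of levels so the three split() results line up by Name.
nm <- sort(unique(as.character(file1$Name)))
f1 <- factor(file1$Name, levels = nm)
f2 <- factor(file2$Name, levels = nm)

hits <- mapply(in.range,
               split(file1$X, f1),
               split(file2$X, f2),
               split(file2$Y, f2),
               SIMPLIFY = FALSE)

# Put the per-Name results back into the original row order of file1.
is.in.interval <- unsplit(hits, f1)

file1[is.in.interval, ]
```

Splitting on a shared factor keeps Names that are missing from one file as empty groups instead of silently misaligning the arguments passed to mapply().
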
>
> HTH,
>
> Chuck
>
>> 
>>
>>  How can this be done in R?
>>
>>  thanks
>>  Adrian
>>
>> 
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu	            UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>
>
>
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901


