[R] range and intersection
Charles C. Berry
cberry at tajo.ucsd.edu
Sun Mar 14 07:56:04 CET 2010
Typo corrected below.
On Sat, 13 Mar 2010, Charles C. Berry wrote:
> On Sat, 13 Mar 2010, Adrian Johnson wrote:
>
>> Hi:
>>
>> I have two large files (over 300K lines).
>>
>> file 1:
>>
>> Name X
>> UK 199
>> UK 230
>> UK 139
>> ......
>> UAE 194
>> UAE 94
>>
>>
>>
>>
>> File 2:
>>
>> Name X Y
>> UK 140 180
>> UK 195 240
>> UK 304 340
>> ....
>>
>>
>> I want to take each X of File 1, check whether it falls in the range
>> from X to Y of File 2, and print only those lines of File 1 that fall
>> within a File 2 range.
>
> Probably, I'd use findOverlaps() in the IRanges Bioconductor package.
>
> If you want to do the UK search apart from the UAE search and so on, the use
> of RangedData objects provided by IRanges is a nice, clean way to go.
>
> Something like:
>
> library(IRanges)
>
> file1 <- read.table("File1", header=TRUE)
> file2 <- read.table("File2", header=TRUE)
>
> file1.rl <- RangedData( IRanges(start=file1$X, width=1), space = Name )
> file2.rl <- RangedData( IRanges(start=file2$X, end=file2$Y),
> space = Name )
>
Correct the above to:
file1.rl <- RangedData( IRanges(start=file1$X, width=1),
space = file1$Name )
file2.rl <- RangedData( IRanges(start=file2$X, end=file2$Y),
space = file2$Name )
Chuck
> find.1.in.2 <- as.matrix( findOverlaps( file1.rl , file2.rl ) )
>
> new.rl <- cbind( file1.rl[ find.1.in.2[,1], ],
> file2X = start(file2.rl)[ find.1.in.2[,2] ],
> file2Y = end(file2.rl)[ find.1.in.2[,2] ])
>
> find.1.in.2 will be a matrix with one row for every match. The first column
> will be the index of the row in file1.rl and the second that of file2.rl.
>
> new.rl will have one row for each match.
>
> The order of the rows in the RangedData objects may not match the original
> data frames, so beware.
>
> For 300K rows, this would run pretty fast, I think.
>
> (caveat: This is all untested code.)
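Since the code above is untested, a brute-force check in base R can help verify results on a small subset. This is only a sketch: the toy data are the UK rows from the question, and the names `pairs`, `matches`, and the suffixes `.pt`/`.lo` are made up for illustration. It pairs every File 1 row with every File 2 row of the same Name, so it is probably too memory-hungry for the full 300K-line files.

```r
# Toy data: the UK rows shown in the original question.
file1 <- data.frame(Name = "UK", X = c(199, 230, 139))
file2 <- data.frame(Name = "UK", X = c(140, 195, 304), Y = c(180, 240, 340))

# Pair every file1 row with every file2 row that shares its Name.
# The suffixes tell file1's X (X.pt) apart from file2's X (X.lo).
pairs <- merge(file1, file2, by = "Name", suffixes = c(".pt", ".lo"))

# Keep the pairs where the point lies within [X, Y] of the interval.
matches <- pairs[pairs$X.pt >= pairs$X.lo & pairs$X.pt <= pairs$Y, ]
matches
```

Unlike the findInterval() approach, this check does not care whether the File 2 intervals overlap one another.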
>
> Otherwise, without the IRanges package something like
>
>
> gt.x <- findInterval( file1$X, file2$X )
> gt.y <- findInterval( file1$X, file2$Y )
>
> is.in.interval <- gt.x == gt.y + 1
>
> will work iff the intervals defined in file2 do not overlap one another
> and file2 is sorted by X (findInterval() needs a sorted second argument).
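A small worked example of the findInterval() idea, using the UK rows from the question as toy data. It is a sketch under two assumptions that go beyond the original code: file2 is sorted by X first, and `left.open = TRUE` (available in R 3.3.0 and later) is used so that a point equal to Y counts as inside its interval.

```r
# Toy data: the UK rows shown in the original question.
file1 <- data.frame(Name = "UK", X = c(199, 230, 139))
file2 <- data.frame(Name = "UK", X = c(140, 195, 304), Y = c(180, 240, 340))

# findInterval() needs its second argument in non-decreasing order.
file2 <- file2[order(file2$X), ]

# gt.x: number of interval starts <= x.
# gt.y: number of interval ends < x (left.open = TRUE makes this strict).
gt.x <- findInterval(file1$X, file2$X)
gt.y <- findInterval(file1$X, file2$Y, left.open = TRUE)

# A point is inside an interval when exactly one more start than end
# has been passed.
is.in.interval <- gt.x == gt.y + 1

file1[is.in.interval, ]
```

With the default `left.open = FALSE`, a point exactly equal to Y would be treated as outside the interval, so the choice depends on whether the ranges are meant to include their upper end.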
>
> If you need to keep 'Name's separate, rolling this into mapply() would be
> needed.
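One way the mapply() step might look, as a sketch rather than a tested solution. The helper `in.range()` and the variable names are hypothetical, and the UAE interval in file2 is invented for illustration because the question elides those rows. The intervals are assumed not to overlap within a Name.

```r
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X = c(199, 230, 139, 194, 94))
# The UAE row below is made up; the original post does not show it.
file2 <- data.frame(Name = c("UK", "UK", "UK", "UAE"),
                    X = c(140, 195, 304, 90),
                    Y = c(180, 240, 340, 100))

# Hypothetical helper: TRUE for each point that falls in some [X, Y].
in.range <- function(pts, ivl) {
  ivl <- ivl[order(ivl$X), ]  # findInterval() needs sorted breakpoints
  findInterval(pts$X, ivl$X) ==
    findInterval(pts$X, ivl$Y, left.open = TRUE) + 1
}

# Only Names present in both files can produce matches.
common <- intersect(as.character(file1$Name), as.character(file2$Name))
f1 <- split(file1, file1$Name)[common]
f2 <- split(file2, file2$Name)[common]

# Run the check one Name at a time and stack the matching file1 rows.
hits <- mapply(function(pts, ivl) pts[in.range(pts, ivl), ],
               f1, f2, SIMPLIFY = FALSE)
result <- do.call(rbind, hits)
result
```

Splitting by Name keeps each findInterval() call small, and rows of file1 whose Name never appears in file2 are dropped up front.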
>
> HTH,
>
> Chuck
>
>>
>>
>> How can it be done in R?
>>
>> thanks
>> Adrian
>>
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901