[BioC] IRanges package: findOverlaps on blobs
Hervé Pagès
hpages at fhcrc.org
Wed Jun 8 22:56:42 CEST 2011
Hi Fahim,
On 11-05-31 10:23 AM, Fahim Mohammad wrote:
> Hi
> Thanks for reply
>
> RefSeqID targetName strand blockSizes
> queryStart targetStart
> XM_001065892.1 chr4 + 127,986,
> 0,127, 124513961,124514706,
> XM_578205.2 chr2 - 535,137,148,
> 0,535,672, 155875533,155879894,155895543,
> NM_012543.2 chr1 + 506,411,212,494,
> 0,506,917,1129, 96173572,96174920,96176574,96177991,
>
> Is each "block" for each RefSeqID treated individually, or do you want one
> overlap to count for all of them?
> *Ans: I am looking for the latter option and want one overlap to count for
> all of the blocks. For example, If I am given the following sequence with
> its alignment
> UnknownSeq targetName strand blockSizes
> queryStart targetStart
> XM_ABCD chr4 + 100, 200, 500
> 0,200, 1200 124513961,124514706, 124515900
>
> I am interested in finding which of the three RefSeqIDs above overlaps with
> the given unknown sequence. The obvious answer is the first refseqID
> (XM_001065892.1).
>
>
> Lastly -- how are these ranges defined?
> Is the first row supposed to be turned into:
> * The intervals on the target genome for each unknown sequence can be found
> using blockSizes and targetStart. For me, the "queryStart" field above is
> redundant and wont be used. The first row may be
> The intervals in the first row may be found using (targetStart + blockSizes
> -1)
> chr4 (124513961) -- (124513961 + 127 -1)
> chr4 (124514706) -- (124514706 + 986 -1)
So it looks like the first thing you might want to do is to import
your file into a GRangesList object. Which can be done with something
like:
library(GenomicRanges)
refseqs <- read.table("RefSeqs.txt", header=TRUE,
stringsAsFactors=FALSE)
starts <- strsplitAsListOfIntegerVectors(refseqs$targetStart)
widths <- strsplitAsListOfIntegerVectors(refseqs$blockSizes)
ranges <- IRanges(start=unlist(starts), width=unlist(widths))
seqnames <- Rle(factor(refseqs$targetName), elementLengths(starts))
strand <- Rle(strand(refseqs$strand), elementLengths(starts))
gr <- GRanges(seqnames, ranges, strand)
grl <- split(gr, rep.int(seq_len(length(starts)),
elementLengths(starts)))
names(grl) <- refseqs$RefSeqID
You should end up with something that looks like this:
> grl
GRangesList of length 3
$XM_001065892.1
GRanges with 2 ranges and 0 elementMetadata values
seqnames ranges strand |
<Rle> <IRanges> <Rle> |
[1] chr4 [124513961, 124514087] + |
[2] chr4 [124514706, 124515691] + |
$XM_578205.2
GRanges with 3 ranges and 0 elementMetadata values
seqnames ranges strand |
<Rle> <IRanges> <Rle> |
[1] chr2 [155875533, 155876067] - |
[2] chr2 [155879894, 155880030] - |
[3] chr2 [155895543, 155895690] - |
$NM_012543.2
GRanges with 4 ranges and 0 elementMetadata values
seqnames ranges strand |
<Rle> <IRanges> <Rle> |
[1] chr1 [96173572, 96174077] + |
[2] chr1 [96174920, 96175330] + |
[3] chr1 [96176574, 96176785] + |
[4] chr1 [96177991, 96178484] + |
seqlengths
chr1 chr2 chr4
NA NA NA
which you can then use for finding overlaps with any other
range-based object that you have.
Hope this helps.
Cheers,
H.
>
> Thanks again
> Fahim
>
>
> On Tue, May 31, 2011 at 12:05 PM, Steve Lianoglou<
> mailinglist.honeypot at gmail.com> wrote:
>
>> Hi,
>>
>> On Tue, May 31, 2011 at 11:08 AM, Fahim Mohammad<fahim.md at gmail.com>
>> wrote:
>>> Hello
>>> I have a file containing a list of aligned Refseq identifier in the
>>> following format. I want to convert this file into RangedData format in
>>> such a way that "findOverlaps" function be as efficient as possible. I
>> know
>>> how to do this when each of the identifiers has just a start and an end
>>> coordinate. IRanges package is very efficient in finding the overlapping
>>> intervals in such case. But when the structure is in the form of blobs
>> (as
>>> shown below in the last three fields), I am not sure how to convert this
>>> structure into RangedData format and how to subsequently call the
>>> "findOverlaps" function.
>>>
>>> RefSeqID targetName strand blockSizes
>>> queryStart targetStart
>>> XM_001065892.1 chr4 + 127,986,
>>> 0,127, 124513961,124514706,
>>> XM_578205.2 chr2 - 535,137,148,
>>> 0,535,672, 155875533,155879894,155895543,
>>> NM_012543.2 chr1 + 506,411,212,494,
>>> 0,506,917,1129, 96173572,96174920,96176574,96177991,
>>
>> I guess the answer lies in how you want overlaps to be calculated.
>>
>> Is each "block" for each RefSeqID treated individually, or do you want
>> one overlap to count for all of them?
>>
>> If the answer is the latter, you might consider putting the intervals
>> into a GRangesList object, where each element in the list is a GRanges
>> object that has all the ranges for the particular refseq id ... if you
>> want them all individually, then you just have to parse it into a
>> GRanges object.
>>
>> If you're asking *how* to parse the file, there are many ways. I'd
>> maybe use a read.table to get this thing into its respective columns,
>> then iterate over each row converting it into a GRanges object that
>> has as many ranges as "blocks" -- look at ?strsplit if you don't know
>> how to do that.
>>
>> Lastly -- how are these ranges defined?
>> Is the first row supposed to be turned into:
>>
>> chr4 (124513961) -- (124513961 + 0 + 127)
>> chr4 (124514706 + 127) -- (124514706 + 127 + 986)
>>
>> or?
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list