[BioC] IRanges package: findOverlaps on blobs

Wed Jun 8 22:56:42 CEST 2011

Hi Fahim,

On 11-05-31 10:23 AM, Fahim Mohammad wrote:
> Hi
> Thanks for reply
>
> RefSeqID               targetName    strand        blockSizes
> queryStart             targetStart
> XM_001065892.1    chr4                 +            127,986,
> 0,127,                  124513961,124514706,
> XM_578205.2         chr2                  -             535,137,148,
>     0,535,672,            155875533,155879894,155895543,
> NM_012543.2         chr1                 +             506,411,212,494,
> 0,506,917,1129,    96173572,96174920,96176574,96177991,
>
> Is each "block" for each RefSeqID treated individually, or do you want one
> overlap to count for all of them?
> *Ans: I am looking for the  latter option and want one overlap to count for
> all of the blocks.  For example, If I am given the following  sequence with
> its alignment
> UnknownSeq           targetName    strand        blockSizes
>                    queryStart                      targetStart
> XM_ABCD                chr4                 +            100, 200, 500
>            0,200, 1200                  124513961,124514706, 124515900
>
> I am interested in finding which of the three RefSeqIDs above overlaps with
> the given unknown sequence. The obvious answer  is the first refseqID
> (XM_001065892.1).
>
>
> Lastly -- how are these ranges defined?
> Is the first row supposed to be turned into:
> * The intervals on the target genome for each unknown sequence can be found
> using blockSizes and targetStart. For me, the "queryStart" field  above is
> redundant and wont be used. The first row may be
> The intervals in the first row may be found using (targetStart + blockSizes
> -1)
> chr4 (124513961) -- (124513961 + 127 -1)
> chr4 (124514706) -- (124514706 + 986 -1)

So it looks like the first thing you might want to do is to import
your file into a GRangesList object. Which can be done with something
like:

   library(GenomicRanges)
   refseqs <- read.table("RefSeqs.txt", header=TRUE,
                         stringsAsFactors=FALSE)
   starts <- strsplitAsListOfIntegerVectors(refseqs$targetStart)
   widths <- strsplitAsListOfIntegerVectors(refseqs$blockSizes)
   ranges <- IRanges(start=unlist(starts), width=unlist(widths))
   seqnames <- Rle(factor(refseqs$targetName), elementLengths(starts))
   strand <- Rle(strand(refseqs$strand), elementLengths(starts))
   gr <- GRanges(seqnames, ranges, strand)
   grl <- split(gr, rep.int(seq_len(length(starts)),
                            elementLengths(starts)))
   names(grl) <- refseqs$RefSeqID

You should end up with something that looks like this:

   > grl
   GRangesList of length 3
   $XM_001065892.1
   GRanges with 2 ranges and 0 elementMetadata values
       seqnames                 ranges strand |
          <Rle>              <IRanges>  <Rle> |
   [1]     chr4 [124513961, 124514087]      + |
   [2]     chr4 [124514706, 124515691]      + |

   $XM_578205.2
   GRanges with 3 ranges and 0 elementMetadata values
       seqnames                 ranges strand |
          <Rle>              <IRanges>  <Rle> |
   [1]     chr2 [155875533, 155876067]      - |
   [2]     chr2 [155879894, 155880030]      - |
   [3]     chr2 [155895543, 155895690]      - |

   $NM_012543.2
   GRanges with 4 ranges and 0 elementMetadata values
       seqnames               ranges strand |
          <Rle>            <IRanges>  <Rle> |
   [1]     chr1 [96173572, 96174077]      + |
   [2]     chr1 [96174920, 96175330]      + |
   [3]     chr1 [96176574, 96176785]      + |
   [4]     chr1 [96177991, 96178484]      + |

   seqlengths
    chr1 chr2 chr4
      NA   NA   NA

which you can then use for finding overlaps with any other
range-based object that you have.

Hope this helps.

Cheers,
H.

>
> Thanks again
> Fahim
>
>
> On Tue, May 31, 2011 at 12:05 PM, Steve Lianoglou<
> mailinglist.honeypot at gmail.com>  wrote:
>
>> Hi,
>>
>> On Tue, May 31, 2011 at 11:08 AM, Fahim Mohammad<fahim.md at gmail.com>
>> wrote:
>>> Hello
>>> I have a file containing a list of aligned Refseq identifier in the
>>> following format.  I want to convert this file into RangedData format in
>>> such a way that  "findOverlaps" function be as efficient as possible. I
>> know
>>> how to do this when each of the identifiers has just a start and an end
>>> coordinate. IRanges package is very efficient in finding the overlapping
>>> intervals in such case. But when the structure is in the form of blobs
>> (as
>>> shown below in the last three fields), I am not sure how to convert this
>>> structure into RangedData format and how to subsequently call the
>>> "findOverlaps" function.
>>>
>>> RefSeqID               targetName    strand        blockSizes
>>> queryStart             targetStart
>>> XM_001065892.1    chr4                 +            127,986,
>>> 0,127,                  124513961,124514706,
>>> XM_578205.2         chr2                  -             535,137,148,
>>>    0,535,672,            155875533,155879894,155895543,
>>> NM_012543.2         chr1                 +             506,411,212,494,
>>> 0,506,917,1129,    96173572,96174920,96176574,96177991,
>>
>> I guess the answer lies in how you want overlaps to be calculated.
>>
>> Is each "block" for each RefSeqID treated individually, or do you want
>> one overlap to count for all of them?
>>
>> If the answer is the latter, you might consider putting the intervals
>> into a GRangesList object, where each element in the list is a GRanges
>> object that has all the ranges for the particular refseq id ... if you
>> want them all individually, then you just have to parse it into a
>> GRanges object.
>>
>> If you're asking *how* to parse the file, there are many ways. I'd
>> maybe use a read.table to get this thing into its respective columns,
>> then iterate over each row converting it into a GRanges object that
>> has as many ranges as "blocks" -- look at ?strsplit if you don't know
>> how to do that.
>>
>> Lastly -- how are these ranges defined?
>> Is the first row supposed to be turned into:
>>
>> chr4 (124513961) -- (124513961 + 0 + 127)
>> chr4 (124514706 + 127) -- (124514706 + 127 + 986)
>>
>> or?
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>   | Memorial Sloan-Kettering Cancer Center
>>   | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319