[BioC] [GenomicRanges] subsetByOverlaps to keep info from both GRanges objects?

Wed Aug 21 23:32:14 CEST 2013

Hi Valerie,

I just tested your code with toy data and can't try it with real data
until next week, but it seems to be doing exactly what I need. That
call to CharacterList(split()) does some serious magic!

Thanks a lot! I would have never figured this out myself.

Best,

On 21 August 2013 16:39, Valerie Obenchain <vobencha at fhcrc.org> wrote:
> To handle multiple mappings you can use a CharacterList.
>
>> gr1 <- GRanges("chr1", IRanges(c(5, 10, 15), c(5, 10, 15),
>> names=paste0("rsid:", letters[1:3])), score=1:3)
>> gr2 <- GRanges("chr1", IRanges(c(4,6,45), c(8,20,50),
>> names=paste0("dnase:", letters[1:3])), score=4:6)
>> ranges <- subsetByOverlaps(gr2,gr1)
>> hits <- findOverlaps(gr2, gr1)
>> rsid <- CharacterList(split(names(gr1)[subjectHits(hits)],
>> queryHits(hits)))
>> snpscore <- CharacterList(split(gr1$score[subjectHits(hits)],
>> queryHits(hits)))
>> mcols(ranges) <- DataFrame(mcols(ranges), rsid, snpscore)
>
>
>> ranges
>> GRanges with 2 ranges and 3 metadata columns:
>>           seqnames    ranges strand |     score            rsid
>> snpscore
>>              <Rle> <IRanges>  <Rle> | <integer> <CharacterList>
>> <CharacterList>
>>   dnase:a     chr1   [4,  8]      * |         4          rsid:a
>> 1
>>   dnase:b     chr1   [6, 20]      * |         5   rsid:b,rsid:c
>> 2,3
>
>
>
> Valerie
>
>
>
>
> On 08/21/2013 04:10 AM, Enrico Ferrero wrote:
>>
>> Hi Valerie,
>>
>> Thanks very much for your reply.
>> You propose an interesting approach, which unfortunately doesn't work
>> for my specific case.
>>
>> In short, the problem is that multiple SNPs (single nucleotides) can
>> map  to a single DNase HS site (often very large regions).
>> I have edited your example a bit so that it's easier to reproduce the
>> issue:
>>
>> ##########
>>>
>>> gr1 <- GRanges("chr1", IRanges(c(5, 10, 15), c(5, 10, 15),
>>> names=paste0("rsid:", letters[1:3])), score=rep(0,3))
>>> gr1
>>
>> GRanges with 3 ranges and 1 metadata column:
>>           seqnames    ranges strand |     score
>>              <Rle> <IRanges>  <Rle> | <numeric>
>>    rsid:a     chr1  [ 5,  5]      * |         0
>>    rsid:b     chr1  [10, 10]      * |         0
>>    rsid:c     chr1  [15, 15]      * |         0
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>
>>> gr2 <- GRanges("chr1", IRanges(c(4,6,45), c(8,20,50),
>>> names=paste0("dnase:", letters[1:3])), score=rep(0,3))
>>> gr2
>>
>> GRanges with 3 ranges and 1 metadata column:
>>            seqnames    ranges strand |     score
>>               <Rle> <IRanges>  <Rle> | <numeric>
>>    dnase:a     chr1  [ 4,  8]      * |         0
>>    dnase:b     chr1  [ 6, 20]      * |         0
>>    dnase:c     chr1  [45, 50]      * |         0
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>
>>> ranges <- subsetByOverlaps(gr2,gr1)
>>> ranges
>>
>> GRanges with 2 ranges and 1 metadata column:
>>            seqnames    ranges strand |     score
>>               <Rle> <IRanges>  <Rle> | <numeric>
>>    dnase:a     chr1   [4,  8]      * |         0
>>    dnase:b     chr1   [6, 20]      * |         0
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>
>>> revRanges <- subsetByOverlaps(gr1,gr2)
>>> revRanges
>>
>> GRanges with 3 ranges and 1 metadata column:
>>           seqnames    ranges strand |     score
>>              <Rle> <IRanges>  <Rle> | <numeric>
>>    rsid:a     chr1  [ 5,  5]      * |         0
>>    rsid:b     chr1  [10, 10]      * |         0
>>    rsid:c     chr1  [15, 15]      * |         0
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>
>>> hits <- findOverlaps(gr2, gr1)
>>> idx <- unique(subjectHits(hits))
>>> values <- DataFrame(rsid=names(gr1)[idx])
>>> mcols(ranges) <- c(mcols(ranges), values)
>>> ranges
>>
>> GRanges with 2 ranges and 2 metadata columns:
>> Error in (function (..., row.names = NULL, check.rows = FALSE,
>> check.names = TRUE,  :
>>    arguments imply differing number of rows: 2, 3
>> ##########
>>
>> As you can see, rsid:a is in the dnase:a region, but both rsid:b and
>> rsid:c hit the dnase:b region. As a result, when you try to add the
>> SNPs IDs from gr1 as metacolumns of ranges, you create a GRanges
>> object with different number of rows.
>>
>> Is there any workaround for this?
>>
>> At the moment I'm creating two GRanges objects for each comparison,
>> one keeping the SNPs IDs and coordinates, and the other storing the
>> DNase regions hit by the SNPs, but this is less than ideal and
>> produces a lot of unnecessary output (R objects and exported BED
>> files). Ideally, I'd like to have a GRanges object where each row is a
>> DNase HS site hit by a specific SNP.
>>
>> Thank you.
>> Best,
>>
>> On 20 August 2013 17:57, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>>>
>>> Hi Enrico,
>>>
>>> Here is a toy example of two GRanges, one with snp locations and the
>>> other
>>> with dnase.
>>>
>>> gr1 <- GRanges("chr1", IRanges(1:5, width=5,
>>>                 names=paste0("rsid:", letters[1:5])), score=1:5)
>>>>
>>>> gr1
>>>
>>>
>>> GRanges with 5 ranges and 1 metadata column:
>>>           seqnames    ranges strand |     score
>>>              <Rle> <IRanges>  <Rle> | <integer>
>>>    rsid:a     chr1    [1, 5]      * |         1
>>>    rsid:b     chr1    [2, 6]      * |         2
>>>    rsid:c     chr1    [3, 7]      * |         3
>>>    rsid:d     chr1    [4, 8]      * |         4
>>>    rsid:e     chr1    [5, 9]      * |         5
>>>    ---
>>>    seqlengths:
>>>     chr1
>>>       NA
>>>
>>> gr2 <- GRanges("chr1", IRanges(8:12, width=5,
>>>                 names=paste0("dnase:", letters[1:5])), score=10:14)
>>>>
>>>> gr2
>>>
>>>
>>> GRanges with 5 ranges and 1 metadata column:
>>>            seqnames    ranges strand |     score
>>>               <Rle> <IRanges>  <Rle> | <integer>
>>>    dnase:a     chr1  [ 8, 12]      * |        10
>>>    dnase:b     chr1  [ 9, 13]      * |        11
>>>    dnase:c     chr1  [10, 14]      * |        12
>>>    dnase:d     chr1  [11, 15]      * |        13
>>>    dnase:e     chr1  [12, 16]      * |        14
>>>    ---
>>>    seqlengths:
>>>     chr1
>>>       NA
>>>
>>>
>>> ranges <- subsetByOverlaps(gr2, gr1)
>>>>
>>>> ranges
>>>
>>> GRanges with 2 ranges and 1 metadata column:
>>>
>>>            seqnames    ranges strand |     score
>>>               <Rle> <IRanges>  <Rle> | <integer>
>>>    dnase:a     chr1   [8, 12]      * |        10
>>>    dnase:b     chr1   [9, 13]      * |        11
>>>    ---
>>>    seqlengths:
>>>     chr1
>>>       NA
>>>
>>> The function called by subsetByOverlaps is findOverlaps (described on
>>> same
>>> man page as ?subsetByOverlaps). The output of findOverlaps is a 'Hits'
>>> object indicating which of the query and subject overlap.
>>>
>>> hits <- findOverlaps(gr2, gr1)
>>>>
>>>> hits
>>>
>>> Hits of length 3
>>> queryLength: 5
>>> subjectLength: 5
>>>    queryHits subjectHits
>>>     <integer>   <integer>
>>>   1         1           4
>>>   2         1           5
>>>   3         2           5
>>>
>>> We want meta information from gr1 (snps). In the call to findOverlaps gr1
>>> was the subject so we use the 'subjectHits' indices to subset the snp
>>> GRanges.
>>>
>>> idx <- unique(subjectHits(hits))
>>> values <- DataFrame(rsid=names(gr1)[idx], snpscore=gr1$score[idx])
>>>
>>> Add the metadata to the dnase ranges.
>>>
>>> mcols(ranges) <- c(mcols(ranges), values)
>>>>
>>>> ranges
>>>
>>> GRanges with 2 ranges and 3 metadata columns:
>>>            seqnames    ranges strand |     score        rsid  snpscore
>>>               <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
>>>    dnase:a     chr1   [8, 12]      * |        10      rsid:d         4
>>>    dnase:b     chr1   [9, 13]      * |        11      rsid:e         5
>>>    ---
>>>    seqlengths:
>>>     chr1
>>>       NA
>>>
>>>
>>> Valerie
>>>
>>>
>>> On 08/20/2013 03:51 AM, Enrico Ferrero wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I have two GRanges objects, the first one is a list of SNPs, the
>>>> second one are DNase hypersensitivity sites:
>>>>
>>>> ##########
>>>> library(GenomicRanges)
>>>> ...
>>>>
>>>>> snp
>>>>
>>>>
>>>> GRanges with 192 ranges and 1 metadata column:
>>>>                seqnames                 ranges strand   |     score
>>>>                   <Rle>              <IRanges>  <Rle>   | <integer>
>>>>       rs000001     chr1 [ 37967779,  37967780]      +   |         0
>>>>       rs000002     chr1 [165967416, 165967417]      -   |         0
>>>>       rs000003     chr1 [218860069, 218860070]      -   |         0
>>>>      rs000004     chr1 [ 17306673,  17306674]      -   |         0
>>>>      rs000005     chr1 [ 41293414,  41293415]      +   |         0
>>>>            ...      ...                    ...    ... ...       ...
>>>>      rs000188     chr8 [ 97522507,  97522508]      -   |         0
>>>>      rs000189     chr8 [ 15532582,  15532583]      +   |         0
>>>>      rs000190     chr8 [ 72270031,  72270032]      +   |         0
>>>>     rs000191     chr9 [126511086, 126511087]      +   |         0
>>>>     rs000192     chr9 [ 98231008,  98231009]      +   |         0
>>>>     ---
>>>>     seqlengths:
>>>>       chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 ... chr21
>>>> chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9
>>>>         NA    NA    NA    NA    NA    NA    NA    NA    NA ...    NA
>>>> NA    NA    NA    NA    NA    NA    NA    NA
>>>>
>>>>> dnase
>>>>
>>>>
>>>> GRanges with 145038 ranges and 1 metadata column:
>>>>              seqnames                 ranges strand   |     score
>>>>                 <Rle>              <IRanges>  <Rle>   | <integer>
>>>>          [1]     chr1       [ 10120,  10270]      *   |         0
>>>>          [2]     chr1       [237700, 237850]      *   |         0
>>>>          [3]     chr1       [521440, 521590]      *   |         0
>>>>          [4]     chr1       [565560, 565710]      *   |         0
>>>>          [5]     chr1       [565860, 566010]      *   |         0
>>>>          ...      ...                    ...    ... ...       ...
>>>>     [145034]     chrX [154543640, 154543790]      *   |         0
>>>>     [145035]     chrX [154560420, 154560570]      *   |         0
>>>>     [145036]     chrX [154563960, 154564110]      *   |         0
>>>>     [145037]     chrX [154842100, 154842250]      *   |         0
>>>>     [145038]     chrX [154862200, 154862350]      *   |         0
>>>>
>>>> ---
>>>> seqlengths:
>>>>     chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>>> chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX
>>>>    chrY
>>>>       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>>> NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>>>    NA
>>>> ##########
>>>>
>>>> I can use subsetByOverlaps() in both directions to compute the overlap
>>>> between them and return a GRanges object:
>>>>
>>>> ##########
>>>>>
>>>>>
>>>>> subsetByOverlaps(dnase, snp)
>>>>
>>>>
>>>> GRanges with 5 ranges and 1 metadata column:
>>>>         seqnames                 ranges strand |     score
>>>>            <Rle>              <IRanges>  <Rle> | <integer>
>>>>     [1]     chr1 [ 17306560,  17306710]      * |         0
>>>>     [2]     chr2 [169869820, 169869970]      * |         0
>>>>     [3]     chr4 [145506440, 145506590]      * |         0
>>>>     [4]     chr5 [ 15014080,  15014230]      * |         0
>>>>     [5]     chr5 [ 15117400,  15117550]      * |         0
>>>>     ---
>>>>     seqlengths:
>>>>       chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>>> chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX
>>>>    chrY
>>>>         NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>>>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>>>     NA
>>>>
>>>>> subsetByOverlaps(snp, dnase)
>>>>
>>>>
>>>> GRanges with 6 ranges and 1 metadata column:
>>>>                seqnames                 ranges strand |     score
>>>>                   <Rle>              <IRanges>  <Rle> | <integer>
>>>>      rs2235746     chr1 [ 17306671,  17306672]      - |         0
>>>>      rs4157777     chr2 [169869904, 169869904]      - |         0
>>>>      rs6858330     chr4 [145506558, 145506559]      + |         0
>>>>     rs13146741     chr4 [145506453, 145506454]      + |         0
>>>>        rs32847     chr5 [ 15117438,  15117439]      + |         0
>>>>      rs7341842     chr5 [ 15014184,  15014185]      + |         0
>>>>     ---
>>>>     seqlengths:
>>>>       chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>>> chr2 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9
>>>>         NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>>>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>>> ##########
>>>>
>>>> The first GRanges objects stores the DNase genomic locations
>>>> overlapping with the SNPs, while the second one contains the SNPs IDs
>>>> (as GRanges names) and genomic locations overlapping with the DNase
>>>> dataset.
>>>>
>>>> Now, what I actually need is a GRanges object that stores the SNPs IDs
>>>> and the DNase genomic locations. Is this possible?
>>>>
>>>> Thank you.
>>>> Best,
>>>>
>>>>
>>>
>>
>>
>>
>

-- 
Enrico Ferrero
Department of Genetics
Cambridge Systems Biology Centre
University of Cambridge