[BioC] [GenomicRanges] subsetByOverlaps to keep info from both GRanges objects?

Wed Aug 21 17:39:17 CEST 2013

To handle multiple mappings you can use a CharacterList.

> gr1 <- GRanges("chr1", IRanges(c(5, 10, 15), c(5, 10, 15), names=paste0("rsid:", letters[1:3])), score=1:3)
> gr2 <- GRanges("chr1", IRanges(c(4,6,45), c(8,20,50), names=paste0("dnase:", letters[1:3])), score=4:6)
> ranges <- subsetByOverlaps(gr2,gr1)
> hits <- findOverlaps(gr2, gr1)
> rsid <- CharacterList(split(names(gr1)[subjectHits(hits)], queryHits(hits)))
> snpscore <- CharacterList(split(gr1$score[subjectHits(hits)], queryHits(hits)))
> mcols(ranges) <- DataFrame(mcols(ranges), rsid, snpscore)

> ranges
> GRanges with 2 ranges and 3 metadata columns:
>           seqnames    ranges strand |     score            rsid        snpscore
>              <Rle> <IRanges>  <Rle> | <integer> <CharacterList> <CharacterList>
>   dnase:a     chr1   [4,  8]      * |         4          rsid:a               1
>   dnase:b     chr1   [6, 20]      * |         5   rsid:b,rsid:c             2,3

Valerie

On 08/21/2013 04:10 AM, Enrico Ferrero wrote:
> Hi Valerie,
>
> Thanks very much for your reply.
> You propose an interesting approach, which unfortunately doesn't work
> for my specific case.
>
> In short, the problem is that multiple SNPs (single nucleotides) can
> map  to a single DNase HS site (often very large regions).
> I have edited your example a bit so that it's easier to reproduce the issue:
>
> ##########
>> gr1 <- GRanges("chr1", IRanges(c(5, 10, 15), c(5, 10, 15), names=paste0("rsid:", letters[1:3])), score=rep(0,3))
>> gr1
> GRanges with 3 ranges and 1 metadata column:
>           seqnames    ranges strand |     score
>              <Rle> <IRanges>  <Rle> | <numeric>
>    rsid:a     chr1  [ 5,  5]      * |         0
>    rsid:b     chr1  [10, 10]      * |         0
>    rsid:c     chr1  [15, 15]      * |         0
>    ---
>    seqlengths:
>     chr1
>       NA
>
>> gr2 <- GRanges("chr1", IRanges(c(4,6,45), c(8,20,50), names=paste0("dnase:", letters[1:3])), score=rep(0,3))
>> gr2
> GRanges with 3 ranges and 1 metadata column:
>            seqnames    ranges strand |     score
>               <Rle> <IRanges>  <Rle> | <numeric>
>    dnase:a     chr1  [ 4,  8]      * |         0
>    dnase:b     chr1  [ 6, 20]      * |         0
>    dnase:c     chr1  [45, 50]      * |         0
>    ---
>    seqlengths:
>     chr1
>       NA
>
>> ranges <- subsetByOverlaps(gr2,gr1)
>> ranges
> GRanges with 2 ranges and 1 metadata column:
>            seqnames    ranges strand |     score
>               <Rle> <IRanges>  <Rle> | <numeric>
>    dnase:a     chr1   [4,  8]      * |         0
>    dnase:b     chr1   [6, 20]      * |         0
>    ---
>    seqlengths:
>     chr1
>       NA
>
>> revRanges <- subsetByOverlaps(gr1,gr2)
>> revRanges
> GRanges with 3 ranges and 1 metadata column:
>           seqnames    ranges strand |     score
>              <Rle> <IRanges>  <Rle> | <numeric>
>    rsid:a     chr1  [ 5,  5]      * |         0
>    rsid:b     chr1  [10, 10]      * |         0
>    rsid:c     chr1  [15, 15]      * |         0
>    ---
>    seqlengths:
>     chr1
>       NA
>
>> hits <- findOverlaps(gr2, gr1)
>> idx <- unique(subjectHits(hits))
>> values <- DataFrame(rsid=names(gr1)[idx])
>> mcols(ranges) <- c(mcols(ranges), values)
>> ranges
> GRanges with 2 ranges and 2 metadata columns:
> Error in (function (..., row.names = NULL, check.rows = FALSE,
> check.names = TRUE,  :
>    arguments imply differing number of rows: 2, 3
> ##########
>
> As you can see, rsid:a is in the dnase:a region, but both rsid:b and
> rsid:c hit the dnase:b region. As a result, when you try to add the
> SNPs IDs from gr1 as metacolumns of ranges, you create a GRanges
> object with different number of rows.
>
> Is there any workaround for this?
>
> At the moment I'm creating two GRanges objects for each comparison,
> one keeping the SNPs IDs and coordinates, and the other storing the
> DNase regions hit by the SNPs, but this is less than ideal and
> produces a lot of unnecessary output (R objects and exported BED
> files). Ideally, I'd like to have a GRanges object where each row is a
> DNase HS site hit by a specific SNP.
>
> Thank you.
> Best,
>
> On 20 August 2013 17:57, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>> Hi Enrico,
>>
>> Here is a toy example of two GRanges, one with snp locations and the other
>> with dnase.
>>
>> gr1 <- GRanges("chr1", IRanges(1:5, width=5,
>>                 names=paste0("rsid:", letters[1:5])), score=1:5)
>>> gr1
>>
>> GRanges with 5 ranges and 1 metadata column:
>>           seqnames    ranges strand |     score
>>              <Rle> <IRanges>  <Rle> | <integer>
>>    rsid:a     chr1    [1, 5]      * |         1
>>    rsid:b     chr1    [2, 6]      * |         2
>>    rsid:c     chr1    [3, 7]      * |         3
>>    rsid:d     chr1    [4, 8]      * |         4
>>    rsid:e     chr1    [5, 9]      * |         5
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>
>> gr2 <- GRanges("chr1", IRanges(8:12, width=5,
>>                 names=paste0("dnase:", letters[1:5])), score=10:14)
>>> gr2
>>
>> GRanges with 5 ranges and 1 metadata column:
>>            seqnames    ranges strand |     score
>>               <Rle> <IRanges>  <Rle> | <integer>
>>    dnase:a     chr1  [ 8, 12]      * |        10
>>    dnase:b     chr1  [ 9, 13]      * |        11
>>    dnase:c     chr1  [10, 14]      * |        12
>>    dnase:d     chr1  [11, 15]      * |        13
>>    dnase:e     chr1  [12, 16]      * |        14
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>
>>
>> ranges <- subsetByOverlaps(gr2, gr1)
>>> ranges
>> GRanges with 2 ranges and 1 metadata column:
>>
>>            seqnames    ranges strand |     score
>>               <Rle> <IRanges>  <Rle> | <integer>
>>    dnase:a     chr1   [8, 12]      * |        10
>>    dnase:b     chr1   [9, 13]      * |        11
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>
>> The function called by subsetByOverlaps is findOverlaps (described on same
>> man page as ?subsetByOverlaps). The output of findOverlaps is a 'Hits'
>> object indicating which of the query and subject overlap.
>>
>> hits <- findOverlaps(gr2, gr1)
>>> hits
>> Hits of length 3
>> queryLength: 5
>> subjectLength: 5
>>    queryHits subjectHits
>>     <integer>   <integer>
>>   1         1           4
>>   2         1           5
>>   3         2           5
>>
>> We want meta information from gr1 (snps). In the call to findOverlaps gr1
>> was the subject so we use the 'subjectHits' indices to subset the snp
>> GRanges.
>>
>> idx <- unique(subjectHits(hits))
>> values <- DataFrame(rsid=names(gr1)[idx], snpscore=gr1$score[idx])
>>
>> Add the metadata to the dnase ranges.
>>
>> mcols(ranges) <- c(mcols(ranges), values)
>>> ranges
>> GRanges with 2 ranges and 3 metadata columns:
>>            seqnames    ranges strand |     score        rsid  snpscore
>>               <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
>>    dnase:a     chr1   [8, 12]      * |        10      rsid:d         4
>>    dnase:b     chr1   [9, 13]      * |        11      rsid:e         5
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>
>>
>> Valerie
>>
>>
>> On 08/20/2013 03:51 AM, Enrico Ferrero wrote:
>>>
>>> Hi,
>>>
>>> I have two GRanges objects, the first one is a list of SNPs, the
>>> second one are DNase hypersensitivity sites:
>>>
>>> ##########
>>> library(GenomicRanges)
>>> ...
>>>
>>>> snp
>>>
>>> GRanges with 192 ranges and 1 metadata column:
>>>                seqnames                 ranges strand   |     score
>>>                   <Rle>              <IRanges>  <Rle>   | <integer>
>>>       rs000001     chr1 [ 37967779,  37967780]      +   |         0
>>>       rs000002     chr1 [165967416, 165967417]      -   |         0
>>>       rs000003     chr1 [218860069, 218860070]      -   |         0
>>>      rs000004     chr1 [ 17306673,  17306674]      -   |         0
>>>      rs000005     chr1 [ 41293414,  41293415]      +   |         0
>>>            ...      ...                    ...    ... ...       ...
>>>      rs000188     chr8 [ 97522507,  97522508]      -   |         0
>>>      rs000189     chr8 [ 15532582,  15532583]      +   |         0
>>>      rs000190     chr8 [ 72270031,  72270032]      +   |         0
>>>     rs000191     chr9 [126511086, 126511087]      +   |         0
>>>     rs000192     chr9 [ 98231008,  98231009]      +   |         0
>>>     ---
>>>     seqlengths:
>>>       chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 ... chr21
>>> chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9
>>>         NA    NA    NA    NA    NA    NA    NA    NA    NA ...    NA
>>> NA    NA    NA    NA    NA    NA    NA    NA
>>>
>>>> dnase
>>>
>>> GRanges with 145038 ranges and 1 metadata column:
>>>              seqnames                 ranges strand   |     score
>>>                 <Rle>              <IRanges>  <Rle>   | <integer>
>>>          [1]     chr1       [ 10120,  10270]      *   |         0
>>>          [2]     chr1       [237700, 237850]      *   |         0
>>>          [3]     chr1       [521440, 521590]      *   |         0
>>>          [4]     chr1       [565560, 565710]      *   |         0
>>>          [5]     chr1       [565860, 566010]      *   |         0
>>>          ...      ...                    ...    ... ...       ...
>>>     [145034]     chrX [154543640, 154543790]      *   |         0
>>>     [145035]     chrX [154560420, 154560570]      *   |         0
>>>     [145036]     chrX [154563960, 154564110]      *   |         0
>>>     [145037]     chrX [154842100, 154842250]      *   |         0
>>>     [145038]     chrX [154862200, 154862350]      *   |         0
>>>
>>> ---
>>> seqlengths:
>>>     chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>> chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX
>>>    chrY
>>>       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>> NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>>    NA
>>> ##########
>>>
>>> I can use subsetByOverlaps() in both directions to compute the overlap
>>> between them and return a GRanges object:
>>>
>>> ##########
>>>>
>>>> subsetByOverlaps(dnase, snp)
>>>
>>> GRanges with 5 ranges and 1 metadata column:
>>>         seqnames                 ranges strand |     score
>>>            <Rle>              <IRanges>  <Rle> | <integer>
>>>     [1]     chr1 [ 17306560,  17306710]      * |         0
>>>     [2]     chr2 [169869820, 169869970]      * |         0
>>>     [3]     chr4 [145506440, 145506590]      * |         0
>>>     [4]     chr5 [ 15014080,  15014230]      * |         0
>>>     [5]     chr5 [ 15117400,  15117550]      * |         0
>>>     ---
>>>     seqlengths:
>>>       chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>> chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX
>>>    chrY
>>>         NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>>     NA
>>>
>>>> subsetByOverlaps(snp, dnase)
>>>
>>> GRanges with 6 ranges and 1 metadata column:
>>>                seqnames                 ranges strand |     score
>>>                   <Rle>              <IRanges>  <Rle> | <integer>
>>>      rs2235746     chr1 [ 17306671,  17306672]      - |         0
>>>      rs4157777     chr2 [169869904, 169869904]      - |         0
>>>      rs6858330     chr4 [145506558, 145506559]      + |         0
>>>     rs13146741     chr4 [145506453, 145506454]      + |         0
>>>        rs32847     chr5 [ 15117438,  15117439]      + |         0
>>>      rs7341842     chr5 [ 15014184,  15014185]      + |         0
>>>     ---
>>>     seqlengths:
>>>       chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>> chr2 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9
>>>         NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>>    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>> ##########
>>>
>>> The first GRanges objects stores the DNase genomic locations
>>> overlapping with the SNPs, while the second one contains the SNPs IDs
>>> (as GRanges names) and genomic locations overlapping with the DNase
>>> dataset.
>>>
>>> Now, what I actually need is a GRanges object that stores the SNPs IDs
>>> and the DNase genomic locations. Is this possible?
>>>
>>> Thank you.
>>> Best,
>>>
>>>
>>
>
>
>