[BioC] [GenomicRanges] subsetByOverlaps to keep info from both GRanges objects?
Valerie Obenchain
vobencha at fhcrc.org
Wed Aug 21 17:39:17 CEST 2013
To handle multiple mappings you can use a CharacterList.
> gr1 <- GRanges("chr1", IRanges(c(5, 10, 15), c(5, 10, 15), names=paste0("rsid:", letters[1:3])), score=1:3)
> gr2 <- GRanges("chr1", IRanges(c(4,6,45), c(8,20,50), names=paste0("dnase:", letters[1:3])), score=4:6)
> ranges <- subsetByOverlaps(gr2,gr1)
> hits <- findOverlaps(gr2, gr1)
> rsid <- CharacterList(split(names(gr1)[subjectHits(hits)], queryHits(hits)))
> snpscore <- CharacterList(split(gr1$score[subjectHits(hits)], queryHits(hits)))
> mcols(ranges) <- DataFrame(mcols(ranges), rsid, snpscore)
> ranges
> GRanges with 2 ranges and 3 metadata columns:
> seqnames ranges strand | score rsid snpscore
> <Rle> <IRanges> <Rle> | <integer> <CharacterList> <CharacterList>
> dnase:a chr1 [4, 8] * | 4 rsid:a 1
> dnase:b chr1 [6, 20] * | 5 rsid:b,rsid:c 2,3
Valerie
On 08/21/2013 04:10 AM, Enrico Ferrero wrote:
> Hi Valerie,
>
> Thanks very much for your reply.
> You propose an interesting approach, which unfortunately doesn't work
> for my specific case.
>
> In short, the problem is that multiple SNPs (single nucleotides) can
> map to a single DNase HS site (often very large regions).
> I have edited your example a bit so that it's easier to reproduce the issue:
>
> ##########
>> gr1 <- GRanges("chr1", IRanges(c(5, 10, 15), c(5, 10, 15), names=paste0("rsid:", letters[1:3])), score=rep(0,3))
>> gr1
> GRanges with 3 ranges and 1 metadata column:
> seqnames ranges strand | score
> <Rle> <IRanges> <Rle> | <numeric>
> rsid:a chr1 [ 5, 5] * | 0
> rsid:b chr1 [10, 10] * | 0
> rsid:c chr1 [15, 15] * | 0
> ---
> seqlengths:
> chr1
> NA
>
>> gr2 <- GRanges("chr1", IRanges(c(4,6,45), c(8,20,50), names=paste0("dnase:", letters[1:3])), score=rep(0,3))
>> gr2
> GRanges with 3 ranges and 1 metadata column:
> seqnames ranges strand | score
> <Rle> <IRanges> <Rle> | <numeric>
> dnase:a chr1 [ 4, 8] * | 0
> dnase:b chr1 [ 6, 20] * | 0
> dnase:c chr1 [45, 50] * | 0
> ---
> seqlengths:
> chr1
> NA
>
>> ranges <- subsetByOverlaps(gr2,gr1)
>> ranges
> GRanges with 2 ranges and 1 metadata column:
> seqnames ranges strand | score
> <Rle> <IRanges> <Rle> | <numeric>
> dnase:a chr1 [4, 8] * | 0
> dnase:b chr1 [6, 20] * | 0
> ---
> seqlengths:
> chr1
> NA
>
>> revRanges <- subsetByOverlaps(gr1,gr2)
>> revRanges
> GRanges with 3 ranges and 1 metadata column:
> seqnames ranges strand | score
> <Rle> <IRanges> <Rle> | <numeric>
> rsid:a chr1 [ 5, 5] * | 0
> rsid:b chr1 [10, 10] * | 0
> rsid:c chr1 [15, 15] * | 0
> ---
> seqlengths:
> chr1
> NA
>
>> hits <- findOverlaps(gr2, gr1)
>> idx <- unique(subjectHits(hits))
>> values <- DataFrame(rsid=names(gr1)[idx])
>> mcols(ranges) <- c(mcols(ranges), values)
>> ranges
> GRanges with 2 ranges and 2 metadata columns:
> Error in (function (..., row.names = NULL, check.rows = FALSE,
> check.names = TRUE, :
> arguments imply differing number of rows: 2, 3
> ##########
>
> As you can see, rsid:a is in the dnase:a region, but both rsid:b and
> rsid:c hit the dnase:b region. As a result, when you try to add the
> SNPs IDs from gr1 as metacolumns of ranges, you create a GRanges
> object with different number of rows.
>
> Is there any workaround for this?
>
> At the moment I'm creating two GRanges objects for each comparison,
> one keeping the SNPs IDs and coordinates, and the other storing the
> DNase regions hit by the SNPs, but this is less than ideal and
> produces a lot of unnecessary output (R objects and exported BED
> files). Ideally, I'd like to have a GRanges object where each row is a
> DNase HS site hit by a specific SNP.
>
> Thank you.
> Best,
>
> On 20 August 2013 17:57, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>> Hi Enrico,
>>
>> Here is a toy example of two GRanges, one with snp locations and the other
>> with dnase.
>>
>> gr1 <- GRanges("chr1", IRanges(1:5, width=5,
>> names=paste0("rsid:", letters[1:5])), score=1:5)
>>> gr1
>>
>> GRanges with 5 ranges and 1 metadata column:
>> seqnames ranges strand | score
>> <Rle> <IRanges> <Rle> | <integer>
>> rsid:a chr1 [1, 5] * | 1
>> rsid:b chr1 [2, 6] * | 2
>> rsid:c chr1 [3, 7] * | 3
>> rsid:d chr1 [4, 8] * | 4
>> rsid:e chr1 [5, 9] * | 5
>> ---
>> seqlengths:
>> chr1
>> NA
>>
>> gr2 <- GRanges("chr1", IRanges(8:12, width=5,
>> names=paste0("dnase:", letters[1:5])), score=10:14)
>>> gr2
>>
>> GRanges with 5 ranges and 1 metadata column:
>> seqnames ranges strand | score
>> <Rle> <IRanges> <Rle> | <integer>
>> dnase:a chr1 [ 8, 12] * | 10
>> dnase:b chr1 [ 9, 13] * | 11
>> dnase:c chr1 [10, 14] * | 12
>> dnase:d chr1 [11, 15] * | 13
>> dnase:e chr1 [12, 16] * | 14
>> ---
>> seqlengths:
>> chr1
>> NA
>>
>>
>> ranges <- subsetByOverlaps(gr2, gr1)
>>> ranges
>> GRanges with 2 ranges and 1 metadata column:
>>
>> seqnames ranges strand | score
>> <Rle> <IRanges> <Rle> | <integer>
>> dnase:a chr1 [8, 12] * | 10
>> dnase:b chr1 [9, 13] * | 11
>> ---
>> seqlengths:
>> chr1
>> NA
>>
>> The function called by subsetByOverlaps is findOverlaps (described on same
>> man page as ?subsetByOverlaps). The output of findOverlaps is a 'Hits'
>> object indicating which of the query and subject overlap.
>>
>> hits <- findOverlaps(gr2, gr1)
>>> hits
>> Hits of length 3
>> queryLength: 5
>> subjectLength: 5
>> queryHits subjectHits
>> <integer> <integer>
>> 1 1 4
>> 2 1 5
>> 3 2 5
>>
>> We want meta information from gr1 (snps). In the call to findOverlaps gr1
>> was the subject so we use the 'subjectHits' indices to subset the snp
>> GRanges.
>>
>> idx <- unique(subjectHits(hits))
>> values <- DataFrame(rsid=names(gr1)[idx], snpscore=gr1$score[idx])
>>
>> Add the metadata to the dnase ranges.
>>
>> mcols(ranges) <- c(mcols(ranges), values)
>>> ranges
>> GRanges with 2 ranges and 3 metadata columns:
>> seqnames ranges strand | score rsid snpscore
>> <Rle> <IRanges> <Rle> | <integer> <character> <integer>
>> dnase:a chr1 [8, 12] * | 10 rsid:d 4
>> dnase:b chr1 [9, 13] * | 11 rsid:e 5
>> ---
>> seqlengths:
>> chr1
>> NA
>>
>>
>> Valerie
>>
>>
>> On 08/20/2013 03:51 AM, Enrico Ferrero wrote:
>>>
>>> Hi,
>>>
>>> I have two GRanges objects, the first one is a list of SNPs, the
>>> second one are DNase hypersensitivity sites:
>>>
>>> ##########
>>> library(GenomicRanges)
>>> ...
>>>
>>>> snp
>>>
>>> GRanges with 192 ranges and 1 metadata column:
>>> seqnames ranges strand | score
>>> <Rle> <IRanges> <Rle> | <integer>
>>> rs000001 chr1 [ 37967779, 37967780] + | 0
>>> rs000002 chr1 [165967416, 165967417] - | 0
>>> rs000003 chr1 [218860069, 218860070] - | 0
>>> rs000004 chr1 [ 17306673, 17306674] - | 0
>>> rs000005 chr1 [ 41293414, 41293415] + | 0
>>> ... ... ... ... ... ...
>>> rs000188 chr8 [ 97522507, 97522508] - | 0
>>> rs000189 chr8 [ 15532582, 15532583] + | 0
>>> rs000190 chr8 [ 72270031, 72270032] + | 0
>>> rs000191 chr9 [126511086, 126511087] + | 0
>>> rs000192 chr9 [ 98231008, 98231009] + | 0
>>> ---
>>> seqlengths:
>>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 ... chr21
>>> chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9
>>> NA NA NA NA NA NA NA NA NA ... NA
>>> NA NA NA NA NA NA NA NA
>>>
>>>> dnase
>>>
>>> GRanges with 145038 ranges and 1 metadata column:
>>> seqnames ranges strand | score
>>> <Rle> <IRanges> <Rle> | <integer>
>>> [1] chr1 [ 10120, 10270] * | 0
>>> [2] chr1 [237700, 237850] * | 0
>>> [3] chr1 [521440, 521590] * | 0
>>> [4] chr1 [565560, 565710] * | 0
>>> [5] chr1 [565860, 566010] * | 0
>>> ... ... ... ... ... ...
>>> [145034] chrX [154543640, 154543790] * | 0
>>> [145035] chrX [154560420, 154560570] * | 0
>>> [145036] chrX [154563960, 154564110] * | 0
>>> [145037] chrX [154842100, 154842250] * | 0
>>> [145038] chrX [154862200, 154862350] * | 0
>>>
>>> ---
>>> seqlengths:
>>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>> chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX
>>> chrY
>>> NA NA NA NA NA NA NA NA NA NA NA
>>> NA NA NA NA NA NA NA NA NA NA NA NA
>>> NA
>>> ##########
>>>
>>> I can use subsetByOverlaps() in both directions to compute the overlap
>>> between them and return a GRanges object:
>>>
>>> ##########
>>>>
>>>> subsetByOverlaps(dnase, snp)
>>>
>>> GRanges with 5 ranges and 1 metadata column:
>>> seqnames ranges strand | score
>>> <Rle> <IRanges> <Rle> | <integer>
>>> [1] chr1 [ 17306560, 17306710] * | 0
>>> [2] chr2 [169869820, 169869970] * | 0
>>> [3] chr4 [145506440, 145506590] * | 0
>>> [4] chr5 [ 15014080, 15014230] * | 0
>>> [5] chr5 [ 15117400, 15117550] * | 0
>>> ---
>>> seqlengths:
>>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>> chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX
>>> chrY
>>> NA NA NA NA NA NA NA NA NA NA NA
>>> NA NA NA NA NA NA NA NA NA NA NA NA
>>> NA
>>>
>>>> subsetByOverlaps(snp, dnase)
>>>
>>> GRanges with 6 ranges and 1 metadata column:
>>> seqnames ranges strand | score
>>> <Rle> <IRanges> <Rle> | <integer>
>>> rs2235746 chr1 [ 17306671, 17306672] - | 0
>>> rs4157777 chr2 [169869904, 169869904] - | 0
>>> rs6858330 chr4 [145506558, 145506559] + | 0
>>> rs13146741 chr4 [145506453, 145506454] + | 0
>>> rs32847 chr5 [ 15117438, 15117439] + | 0
>>> rs7341842 chr5 [ 15014184, 15014185] + | 0
>>> ---
>>> seqlengths:
>>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>> chr2 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9
>>> NA NA NA NA NA NA NA NA NA NA NA
>>> NA NA NA NA NA NA NA NA NA NA
>>> ##########
>>>
>>> The first GRanges objects stores the DNase genomic locations
>>> overlapping with the SNPs, while the second one contains the SNPs IDs
>>> (as GRanges names) and genomic locations overlapping with the DNase
>>> dataset.
>>>
>>> Now, what I actually need is a GRanges object that stores the SNPs IDs
>>> and the DNase genomic locations. Is this possible?
>>>
>>> Thank you.
>>> Best,
>>>
>>>
>>
>
>
>
More information about the Bioconductor
mailing list