[BioC] [GenomicRanges] subsetByOverlaps to keep info from both GRanges objects?
Enrico Ferrero
enricoferrero86 at gmail.com
Wed Aug 21 23:32:14 CEST 2013
Hi Valerie,
I just tested your code with toy data and can't try it with real data
until next week, but it seems to be doing exactly what I need. That
call to CharacterList(split()) does some serious magic!
Thanks a lot! I would have never figured this out myself.
Best,
On 21 August 2013 16:39, Valerie Obenchain <vobencha at fhcrc.org> wrote:
> To handle multiple mappings you can use a CharacterList.
>
>> gr1 <- GRanges("chr1", IRanges(c(5, 10, 15), c(5, 10, 15),
>> names=paste0("rsid:", letters[1:3])), score=1:3)
>> gr2 <- GRanges("chr1", IRanges(c(4,6,45), c(8,20,50),
>> names=paste0("dnase:", letters[1:3])), score=4:6)
>> ranges <- subsetByOverlaps(gr2,gr1)
>> hits <- findOverlaps(gr2, gr1)
>> rsid <- CharacterList(split(names(gr1)[subjectHits(hits)],
>> queryHits(hits)))
>> snpscore <- CharacterList(split(gr1$score[subjectHits(hits)],
>> queryHits(hits)))
>> mcols(ranges) <- DataFrame(mcols(ranges), rsid, snpscore)
>
>
>> ranges
>> GRanges with 2 ranges and 3 metadata columns:
>> seqnames ranges strand | score rsid
>> snpscore
>> <Rle> <IRanges> <Rle> | <integer> <CharacterList>
>> <CharacterList>
>> dnase:a chr1 [4, 8] * | 4 rsid:a
>> 1
>> dnase:b chr1 [6, 20] * | 5 rsid:b,rsid:c
>> 2,3
>
>
>
> Valerie
>
>
>
>
> On 08/21/2013 04:10 AM, Enrico Ferrero wrote:
>>
>> Hi Valerie,
>>
>> Thanks very much for your reply.
>> You propose an interesting approach, which unfortunately doesn't work
>> for my specific case.
>>
>> In short, the problem is that multiple SNPs (single nucleotides) can
>> map to a single DNase HS site (often very large regions).
>> I have edited your example a bit so that it's easier to reproduce the
>> issue:
>>
>> ##########
>>>
>>> gr1 <- GRanges("chr1", IRanges(c(5, 10, 15), c(5, 10, 15),
>>> names=paste0("rsid:", letters[1:3])), score=rep(0,3))
>>> gr1
>>
>> GRanges with 3 ranges and 1 metadata column:
>> seqnames ranges strand | score
>> <Rle> <IRanges> <Rle> | <numeric>
>> rsid:a chr1 [ 5, 5] * | 0
>> rsid:b chr1 [10, 10] * | 0
>> rsid:c chr1 [15, 15] * | 0
>> ---
>> seqlengths:
>> chr1
>> NA
>>
>>> gr2 <- GRanges("chr1", IRanges(c(4,6,45), c(8,20,50),
>>> names=paste0("dnase:", letters[1:3])), score=rep(0,3))
>>> gr2
>>
>> GRanges with 3 ranges and 1 metadata column:
>> seqnames ranges strand | score
>> <Rle> <IRanges> <Rle> | <numeric>
>> dnase:a chr1 [ 4, 8] * | 0
>> dnase:b chr1 [ 6, 20] * | 0
>> dnase:c chr1 [45, 50] * | 0
>> ---
>> seqlengths:
>> chr1
>> NA
>>
>>> ranges <- subsetByOverlaps(gr2,gr1)
>>> ranges
>>
>> GRanges with 2 ranges and 1 metadata column:
>> seqnames ranges strand | score
>> <Rle> <IRanges> <Rle> | <numeric>
>> dnase:a chr1 [4, 8] * | 0
>> dnase:b chr1 [6, 20] * | 0
>> ---
>> seqlengths:
>> chr1
>> NA
>>
>>> revRanges <- subsetByOverlaps(gr1,gr2)
>>> revRanges
>>
>> GRanges with 3 ranges and 1 metadata column:
>> seqnames ranges strand | score
>> <Rle> <IRanges> <Rle> | <numeric>
>> rsid:a chr1 [ 5, 5] * | 0
>> rsid:b chr1 [10, 10] * | 0
>> rsid:c chr1 [15, 15] * | 0
>> ---
>> seqlengths:
>> chr1
>> NA
>>
>>> hits <- findOverlaps(gr2, gr1)
>>> idx <- unique(subjectHits(hits))
>>> values <- DataFrame(rsid=names(gr1)[idx])
>>> mcols(ranges) <- c(mcols(ranges), values)
>>> ranges
>>
>> GRanges with 2 ranges and 2 metadata columns:
>> Error in (function (..., row.names = NULL, check.rows = FALSE,
>> check.names = TRUE, :
>> arguments imply differing number of rows: 2, 3
>> ##########
>>
>> As you can see, rsid:a is in the dnase:a region, but both rsid:b and
>> rsid:c hit the dnase:b region. As a result, when you try to add the
>> SNPs IDs from gr1 as metacolumns of ranges, you create a GRanges
>> object with different number of rows.
>>
>> Is there any workaround for this?
>>
>> At the moment I'm creating two GRanges objects for each comparison,
>> one keeping the SNPs IDs and coordinates, and the other storing the
>> DNase regions hit by the SNPs, but this is less than ideal and
>> produces a lot of unnecessary output (R objects and exported BED
>> files). Ideally, I'd like to have a GRanges object where each row is a
>> DNase HS site hit by a specific SNP.
>>
>> Thank you.
>> Best,
>>
>> On 20 August 2013 17:57, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>>>
>>> Hi Enrico,
>>>
>>> Here is a toy example of two GRanges, one with snp locations and the
>>> other
>>> with dnase.
>>>
>>> gr1 <- GRanges("chr1", IRanges(1:5, width=5,
>>> names=paste0("rsid:", letters[1:5])), score=1:5)
>>>>
>>>> gr1
>>>
>>>
>>> GRanges with 5 ranges and 1 metadata column:
>>> seqnames ranges strand | score
>>> <Rle> <IRanges> <Rle> | <integer>
>>> rsid:a chr1 [1, 5] * | 1
>>> rsid:b chr1 [2, 6] * | 2
>>> rsid:c chr1 [3, 7] * | 3
>>> rsid:d chr1 [4, 8] * | 4
>>> rsid:e chr1 [5, 9] * | 5
>>> ---
>>> seqlengths:
>>> chr1
>>> NA
>>>
>>> gr2 <- GRanges("chr1", IRanges(8:12, width=5,
>>> names=paste0("dnase:", letters[1:5])), score=10:14)
>>>>
>>>> gr2
>>>
>>>
>>> GRanges with 5 ranges and 1 metadata column:
>>> seqnames ranges strand | score
>>> <Rle> <IRanges> <Rle> | <integer>
>>> dnase:a chr1 [ 8, 12] * | 10
>>> dnase:b chr1 [ 9, 13] * | 11
>>> dnase:c chr1 [10, 14] * | 12
>>> dnase:d chr1 [11, 15] * | 13
>>> dnase:e chr1 [12, 16] * | 14
>>> ---
>>> seqlengths:
>>> chr1
>>> NA
>>>
>>>
>>> ranges <- subsetByOverlaps(gr2, gr1)
>>>>
>>>> ranges
>>>
>>> GRanges with 2 ranges and 1 metadata column:
>>>
>>> seqnames ranges strand | score
>>> <Rle> <IRanges> <Rle> | <integer>
>>> dnase:a chr1 [8, 12] * | 10
>>> dnase:b chr1 [9, 13] * | 11
>>> ---
>>> seqlengths:
>>> chr1
>>> NA
>>>
>>> The function called by subsetByOverlaps is findOverlaps (described on
>>> same
>>> man page as ?subsetByOverlaps). The output of findOverlaps is a 'Hits'
>>> object indicating which of the query and subject overlap.
>>>
>>> hits <- findOverlaps(gr2, gr1)
>>>>
>>>> hits
>>>
>>> Hits of length 3
>>> queryLength: 5
>>> subjectLength: 5
>>> queryHits subjectHits
>>> <integer> <integer>
>>> 1 1 4
>>> 2 1 5
>>> 3 2 5
>>>
>>> We want meta information from gr1 (snps). In the call to findOverlaps gr1
>>> was the subject so we use the 'subjectHits' indices to subset the snp
>>> GRanges.
>>>
>>> idx <- unique(subjectHits(hits))
>>> values <- DataFrame(rsid=names(gr1)[idx], snpscore=gr1$score[idx])
>>>
>>> Add the metadata to the dnase ranges.
>>>
>>> mcols(ranges) <- c(mcols(ranges), values)
>>>>
>>>> ranges
>>>
>>> GRanges with 2 ranges and 3 metadata columns:
>>> seqnames ranges strand | score rsid snpscore
>>> <Rle> <IRanges> <Rle> | <integer> <character> <integer>
>>> dnase:a chr1 [8, 12] * | 10 rsid:d 4
>>> dnase:b chr1 [9, 13] * | 11 rsid:e 5
>>> ---
>>> seqlengths:
>>> chr1
>>> NA
>>>
>>>
>>> Valerie
>>>
>>>
>>> On 08/20/2013 03:51 AM, Enrico Ferrero wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I have two GRanges objects, the first one is a list of SNPs, the
>>>> second one are DNase hypersensitivity sites:
>>>>
>>>> ##########
>>>> library(GenomicRanges)
>>>> ...
>>>>
>>>>> snp
>>>>
>>>>
>>>> GRanges with 192 ranges and 1 metadata column:
>>>> seqnames ranges strand | score
>>>> <Rle> <IRanges> <Rle> | <integer>
>>>> rs000001 chr1 [ 37967779, 37967780] + | 0
>>>> rs000002 chr1 [165967416, 165967417] - | 0
>>>> rs000003 chr1 [218860069, 218860070] - | 0
>>>> rs000004 chr1 [ 17306673, 17306674] - | 0
>>>> rs000005 chr1 [ 41293414, 41293415] + | 0
>>>> ... ... ... ... ... ...
>>>> rs000188 chr8 [ 97522507, 97522508] - | 0
>>>> rs000189 chr8 [ 15532582, 15532583] + | 0
>>>> rs000190 chr8 [ 72270031, 72270032] + | 0
>>>> rs000191 chr9 [126511086, 126511087] + | 0
>>>> rs000192 chr9 [ 98231008, 98231009] + | 0
>>>> ---
>>>> seqlengths:
>>>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 ... chr21
>>>> chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9
>>>> NA NA NA NA NA NA NA NA NA ... NA
>>>> NA NA NA NA NA NA NA NA
>>>>
>>>>> dnase
>>>>
>>>>
>>>> GRanges with 145038 ranges and 1 metadata column:
>>>> seqnames ranges strand | score
>>>> <Rle> <IRanges> <Rle> | <integer>
>>>> [1] chr1 [ 10120, 10270] * | 0
>>>> [2] chr1 [237700, 237850] * | 0
>>>> [3] chr1 [521440, 521590] * | 0
>>>> [4] chr1 [565560, 565710] * | 0
>>>> [5] chr1 [565860, 566010] * | 0
>>>> ... ... ... ... ... ...
>>>> [145034] chrX [154543640, 154543790] * | 0
>>>> [145035] chrX [154560420, 154560570] * | 0
>>>> [145036] chrX [154563960, 154564110] * | 0
>>>> [145037] chrX [154842100, 154842250] * | 0
>>>> [145038] chrX [154862200, 154862350] * | 0
>>>>
>>>> ---
>>>> seqlengths:
>>>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>>> chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX
>>>> chrY
>>>> NA NA NA NA NA NA NA NA NA NA NA
>>>> NA NA NA NA NA NA NA NA NA NA NA NA
>>>> NA
>>>> ##########
>>>>
>>>> I can use subsetByOverlaps() in both directions to compute the overlap
>>>> between them and return a GRanges object:
>>>>
>>>> ##########
>>>>>
>>>>>
>>>>> subsetByOverlaps(dnase, snp)
>>>>
>>>>
>>>> GRanges with 5 ranges and 1 metadata column:
>>>> seqnames ranges strand | score
>>>> <Rle> <IRanges> <Rle> | <integer>
>>>> [1] chr1 [ 17306560, 17306710] * | 0
>>>> [2] chr2 [169869820, 169869970] * | 0
>>>> [3] chr4 [145506440, 145506590] * | 0
>>>> [4] chr5 [ 15014080, 15014230] * | 0
>>>> [5] chr5 [ 15117400, 15117550] * | 0
>>>> ---
>>>> seqlengths:
>>>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>>> chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX
>>>> chrY
>>>> NA NA NA NA NA NA NA NA NA NA NA
>>>> NA NA NA NA NA NA NA NA NA NA NA NA
>>>> NA
>>>>
>>>>> subsetByOverlaps(snp, dnase)
>>>>
>>>>
>>>> GRanges with 6 ranges and 1 metadata column:
>>>> seqnames ranges strand | score
>>>> <Rle> <IRanges> <Rle> | <integer>
>>>> rs2235746 chr1 [ 17306671, 17306672] - | 0
>>>> rs4157777 chr2 [169869904, 169869904] - | 0
>>>> rs6858330 chr4 [145506558, 145506559] + | 0
>>>> rs13146741 chr4 [145506453, 145506454] + | 0
>>>> rs32847 chr5 [ 15117438, 15117439] + | 0
>>>> rs7341842 chr5 [ 15014184, 15014185] + | 0
>>>> ---
>>>> seqlengths:
>>>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>>>> chr2 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9
>>>> NA NA NA NA NA NA NA NA NA NA NA
>>>> NA NA NA NA NA NA NA NA NA NA
>>>> ##########
>>>>
>>>> The first GRanges objects stores the DNase genomic locations
>>>> overlapping with the SNPs, while the second one contains the SNPs IDs
>>>> (as GRanges names) and genomic locations overlapping with the DNase
>>>> dataset.
>>>>
>>>> Now, what I actually need is a GRanges object that stores the SNPs IDs
>>>> and the DNase genomic locations. Is this possible?
>>>>
>>>> Thank you.
>>>> Best,
>>>>
>>>>
>>>
>>
>>
>>
>
--
Enrico Ferrero
Department of Genetics
Cambridge Systems Biology Centre
University of Cambridge
More information about the Bioconductor
mailing list