[BioC] [GenomicRanges] subsetByOverlaps to keep info from both GRanges objects?
Enrico Ferrero
enricoferrero86 at gmail.com
Wed Aug 21 13:10:33 CEST 2013
Hi Valerie,
Thanks very much for your reply.
You propose an interesting approach, which unfortunately doesn't work
for my specific case.
In short, the problem is that multiple SNPs (single nucleotides) can
map to a single DNase HS site (often very large regions).
I have edited your example a bit so that it's easier to reproduce the issue:
##########
> gr1 <- GRanges("chr1", IRanges(c(5, 10, 15), c(5, 10, 15), names=paste0("rsid:", letters[1:3])), score=rep(0,3))
> gr1
GRanges with 3 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
rsid:a chr1 [ 5, 5] * | 0
rsid:b chr1 [10, 10] * | 0
rsid:c chr1 [15, 15] * | 0
---
seqlengths:
chr1
NA
> gr2 <- GRanges("chr1", IRanges(c(4,6,45), c(8,20,50), names=paste0("dnase:", letters[1:3])), score=rep(0,3))
> gr2
GRanges with 3 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
dnase:a chr1 [ 4, 8] * | 0
dnase:b chr1 [ 6, 20] * | 0
dnase:c chr1 [45, 50] * | 0
---
seqlengths:
chr1
NA
> ranges <- subsetByOverlaps(gr2,gr1)
> ranges
GRanges with 2 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
dnase:a chr1 [4, 8] * | 0
dnase:b chr1 [6, 20] * | 0
---
seqlengths:
chr1
NA
> revRanges <- subsetByOverlaps(gr1,gr2)
> revRanges
GRanges with 3 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
rsid:a chr1 [ 5, 5] * | 0
rsid:b chr1 [10, 10] * | 0
rsid:c chr1 [15, 15] * | 0
---
seqlengths:
chr1
NA
> hits <- findOverlaps(gr2, gr1)
> idx <- unique(subjectHits(hits))
> values <- DataFrame(rsid=names(gr1)[idx])
> mcols(ranges) <- c(mcols(ranges), values)
> ranges
GRanges with 2 ranges and 2 metadata columns:
Error in (function (..., row.names = NULL, check.rows = FALSE,
check.names = TRUE, :
arguments imply differing number of rows: 2, 3
##########
As you can see, rsid:a is in the dnase:a region, but both rsid:b and
rsid:c hit the dnase:b region. As a result, when you try to add the
SNPs IDs from gr1 as metacolumns of ranges, you create a GRanges
object with different number of rows.
Is there any workaround for this?
At the moment I'm creating two GRanges objects for each comparison,
one keeping the SNPs IDs and coordinates, and the other storing the
DNase regions hit by the SNPs, but this is less than ideal and
produces a lot of unnecessary output (R objects and exported BED
files). Ideally, I'd like to have a GRanges object where each row is a
DNase HS site hit by a specific SNP.
Thank you.
Best,
On 20 August 2013 17:57, Valerie Obenchain <vobencha at fhcrc.org> wrote:
> Hi Enrico,
>
> Here is a toy example of two GRanges, one with snp locations and the other
> with dnase.
>
> gr1 <- GRanges("chr1", IRanges(1:5, width=5,
> names=paste0("rsid:", letters[1:5])), score=1:5)
>> gr1
>
> GRanges with 5 ranges and 1 metadata column:
> seqnames ranges strand | score
> <Rle> <IRanges> <Rle> | <integer>
> rsid:a chr1 [1, 5] * | 1
> rsid:b chr1 [2, 6] * | 2
> rsid:c chr1 [3, 7] * | 3
> rsid:d chr1 [4, 8] * | 4
> rsid:e chr1 [5, 9] * | 5
> ---
> seqlengths:
> chr1
> NA
>
> gr2 <- GRanges("chr1", IRanges(8:12, width=5,
> names=paste0("dnase:", letters[1:5])), score=10:14)
>> gr2
>
> GRanges with 5 ranges and 1 metadata column:
> seqnames ranges strand | score
> <Rle> <IRanges> <Rle> | <integer>
> dnase:a chr1 [ 8, 12] * | 10
> dnase:b chr1 [ 9, 13] * | 11
> dnase:c chr1 [10, 14] * | 12
> dnase:d chr1 [11, 15] * | 13
> dnase:e chr1 [12, 16] * | 14
> ---
> seqlengths:
> chr1
> NA
>
>
> ranges <- subsetByOverlaps(gr2, gr1)
>> ranges
> GRanges with 2 ranges and 1 metadata column:
>
> seqnames ranges strand | score
> <Rle> <IRanges> <Rle> | <integer>
> dnase:a chr1 [8, 12] * | 10
> dnase:b chr1 [9, 13] * | 11
> ---
> seqlengths:
> chr1
> NA
>
> The function called by subsetByOverlaps is findOverlaps (described on same
> man page as ?subsetByOverlaps). The output of findOverlaps is a 'Hits'
> object indicating which of the query and subject overlap.
>
> hits <- findOverlaps(gr2, gr1)
>> hits
> Hits of length 3
> queryLength: 5
> subjectLength: 5
> queryHits subjectHits
> <integer> <integer>
> 1 1 4
> 2 1 5
> 3 2 5
>
> We want meta information from gr1 (snps). In the call to findOverlaps gr1
> was the subject so we use the 'subjectHits' indices to subset the snp
> GRanges.
>
> idx <- unique(subjectHits(hits))
> values <- DataFrame(rsid=names(gr1)[idx], snpscore=gr1$score[idx])
>
> Add the metadata to the dnase ranges.
>
> mcols(ranges) <- c(mcols(ranges), values)
>> ranges
> GRanges with 2 ranges and 3 metadata columns:
> seqnames ranges strand | score rsid snpscore
> <Rle> <IRanges> <Rle> | <integer> <character> <integer>
> dnase:a chr1 [8, 12] * | 10 rsid:d 4
> dnase:b chr1 [9, 13] * | 11 rsid:e 5
> ---
> seqlengths:
> chr1
> NA
>
>
> Valerie
>
>
> On 08/20/2013 03:51 AM, Enrico Ferrero wrote:
>>
>> Hi,
>>
>> I have two GRanges objects, the first one is a list of SNPs, the
>> second one are DNase hypersensitivity sites:
>>
>> ##########
>> library(GenomicRanges)
>> ...
>>
>>> snp
>>
>> GRanges with 192 ranges and 1 metadata column:
>> seqnames ranges strand | score
>> <Rle> <IRanges> <Rle> | <integer>
>> rs000001 chr1 [ 37967779, 37967780] + | 0
>> rs000002 chr1 [165967416, 165967417] - | 0
>> rs000003 chr1 [218860069, 218860070] - | 0
>> rs000004 chr1 [ 17306673, 17306674] - | 0
>> rs000005 chr1 [ 41293414, 41293415] + | 0
>> ... ... ... ... ... ...
>> rs000188 chr8 [ 97522507, 97522508] - | 0
>> rs000189 chr8 [ 15532582, 15532583] + | 0
>> rs000190 chr8 [ 72270031, 72270032] + | 0
>> rs000191 chr9 [126511086, 126511087] + | 0
>> rs000192 chr9 [ 98231008, 98231009] + | 0
>> ---
>> seqlengths:
>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 ... chr21
>> chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9
>> NA NA NA NA NA NA NA NA NA ... NA
>> NA NA NA NA NA NA NA NA
>>
>>> dnase
>>
>> GRanges with 145038 ranges and 1 metadata column:
>> seqnames ranges strand | score
>> <Rle> <IRanges> <Rle> | <integer>
>> [1] chr1 [ 10120, 10270] * | 0
>> [2] chr1 [237700, 237850] * | 0
>> [3] chr1 [521440, 521590] * | 0
>> [4] chr1 [565560, 565710] * | 0
>> [5] chr1 [565860, 566010] * | 0
>> ... ... ... ... ... ...
>> [145034] chrX [154543640, 154543790] * | 0
>> [145035] chrX [154560420, 154560570] * | 0
>> [145036] chrX [154563960, 154564110] * | 0
>> [145037] chrX [154842100, 154842250] * | 0
>> [145038] chrX [154862200, 154862350] * | 0
>>
>> ---
>> seqlengths:
>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>> chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX
>> chrY
>> NA NA NA NA NA NA NA NA NA NA NA
>> NA NA NA NA NA NA NA NA NA NA NA NA
>> NA
>> ##########
>>
>> I can use subsetByOverlaps() in both directions to compute the overlap
>> between them and return a GRanges object:
>>
>> ##########
>>>
>>> subsetByOverlaps(dnase, snp)
>>
>> GRanges with 5 ranges and 1 metadata column:
>> seqnames ranges strand | score
>> <Rle> <IRanges> <Rle> | <integer>
>> [1] chr1 [ 17306560, 17306710] * | 0
>> [2] chr2 [169869820, 169869970] * | 0
>> [3] chr4 [145506440, 145506590] * | 0
>> [4] chr5 [ 15014080, 15014230] * | 0
>> [5] chr5 [ 15117400, 15117550] * | 0
>> ---
>> seqlengths:
>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>> chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX
>> chrY
>> NA NA NA NA NA NA NA NA NA NA NA
>> NA NA NA NA NA NA NA NA NA NA NA NA
>> NA
>>
>>> subsetByOverlaps(snp, dnase)
>>
>> GRanges with 6 ranges and 1 metadata column:
>> seqnames ranges strand | score
>> <Rle> <IRanges> <Rle> | <integer>
>> rs2235746 chr1 [ 17306671, 17306672] - | 0
>> rs4157777 chr2 [169869904, 169869904] - | 0
>> rs6858330 chr4 [145506558, 145506559] + | 0
>> rs13146741 chr4 [145506453, 145506454] + | 0
>> rs32847 chr5 [ 15117438, 15117439] + | 0
>> rs7341842 chr5 [ 15014184, 15014185] + | 0
>> ---
>> seqlengths:
>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>> chr2 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9
>> NA NA NA NA NA NA NA NA NA NA NA
>> NA NA NA NA NA NA NA NA NA NA
>> ##########
>>
>> The first GRanges objects stores the DNase genomic locations
>> overlapping with the SNPs, while the second one contains the SNPs IDs
>> (as GRanges names) and genomic locations overlapping with the DNase
>> dataset.
>>
>> Now, what I actually need is a GRanges object that stores the SNPs IDs
>> and the DNase genomic locations. Is this possible?
>>
>> Thank you.
>> Best,
>>
>>
>
--
Enrico Ferrero
Department of Genetics
Cambridge Systems Biology Centre
University of Cambridge
More information about the Bioconductor
mailing list