[BioC] [GenomicRanges] subsetByOverlaps to keep info from both GRanges objects?
Valerie Obenchain
vobencha at fhcrc.org
Tue Aug 20 18:57:38 CEST 2013
Hi Enrico,
Here is a toy example of two GRanges, one with snp locations and the
other with dnase.
gr1 <- GRanges("chr1", IRanges(1:5, width=5,
names=paste0("rsid:", letters[1:5])), score=1:5)
> gr1
GRanges with 5 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <integer>
rsid:a chr1 [1, 5] * | 1
rsid:b chr1 [2, 6] * | 2
rsid:c chr1 [3, 7] * | 3
rsid:d chr1 [4, 8] * | 4
rsid:e chr1 [5, 9] * | 5
---
seqlengths:
chr1
NA
gr2 <- GRanges("chr1", IRanges(8:12, width=5,
names=paste0("dnase:", letters[1:5])), score=10:14)
> gr2
GRanges with 5 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <integer>
dnase:a chr1 [ 8, 12] * | 10
dnase:b chr1 [ 9, 13] * | 11
dnase:c chr1 [10, 14] * | 12
dnase:d chr1 [11, 15] * | 13
dnase:e chr1 [12, 16] * | 14
---
seqlengths:
chr1
NA
ranges <- subsetByOverlaps(gr2, gr1)
> ranges
GRanges with 2 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <integer>
dnase:a chr1 [8, 12] * | 10
dnase:b chr1 [9, 13] * | 11
---
seqlengths:
chr1
NA
The function called by subsetByOverlaps is findOverlaps (described on
same man page as ?subsetByOverlaps). The output of findOverlaps is a
'Hits' object indicating which of the query and subject overlap.
hits <- findOverlaps(gr2, gr1)
> hits
Hits of length 3
queryLength: 5
subjectLength: 5
queryHits subjectHits
<integer> <integer>
1 1 4
2 1 5
3 2 5
We want meta information from gr1 (snps). In the call to findOverlaps
gr1 was the subject so we use the 'subjectHits' indices to subset the
snp GRanges.
idx <- unique(subjectHits(hits))
values <- DataFrame(rsid=names(gr1)[idx], snpscore=gr1$score[idx])
Add the metadata to the dnase ranges.
mcols(ranges) <- c(mcols(ranges), values)
> ranges
GRanges with 2 ranges and 3 metadata columns:
seqnames ranges strand | score rsid snpscore
<Rle> <IRanges> <Rle> | <integer> <character> <integer>
dnase:a chr1 [8, 12] * | 10 rsid:d 4
dnase:b chr1 [9, 13] * | 11 rsid:e 5
---
seqlengths:
chr1
NA
Valerie
On 08/20/2013 03:51 AM, Enrico Ferrero wrote:
> Hi,
>
> I have two GRanges objects, the first one is a list of SNPs, the
> second one are DNase hypersensitivity sites:
>
> ##########
> library(GenomicRanges)
> ...
>
>> snp
> GRanges with 192 ranges and 1 metadata column:
> seqnames ranges strand | score
> <Rle> <IRanges> <Rle> | <integer>
> rs000001 chr1 [ 37967779, 37967780] + | 0
> rs000002 chr1 [165967416, 165967417] - | 0
> rs000003 chr1 [218860069, 218860070] - | 0
> rs000004 chr1 [ 17306673, 17306674] - | 0
> rs000005 chr1 [ 41293414, 41293415] + | 0
> ... ... ... ... ... ...
> rs000188 chr8 [ 97522507, 97522508] - | 0
> rs000189 chr8 [ 15532582, 15532583] + | 0
> rs000190 chr8 [ 72270031, 72270032] + | 0
> rs000191 chr9 [126511086, 126511087] + | 0
> rs000192 chr9 [ 98231008, 98231009] + | 0
> ---
> seqlengths:
> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 ... chr21
> chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9
> NA NA NA NA NA NA NA NA NA ... NA
> NA NA NA NA NA NA NA NA
>
>> dnase
> GRanges with 145038 ranges and 1 metadata column:
> seqnames ranges strand | score
> <Rle> <IRanges> <Rle> | <integer>
> [1] chr1 [ 10120, 10270] * | 0
> [2] chr1 [237700, 237850] * | 0
> [3] chr1 [521440, 521590] * | 0
> [4] chr1 [565560, 565710] * | 0
> [5] chr1 [565860, 566010] * | 0
> ... ... ... ... ... ...
> [145034] chrX [154543640, 154543790] * | 0
> [145035] chrX [154560420, 154560570] * | 0
> [145036] chrX [154563960, 154564110] * | 0
> [145037] chrX [154842100, 154842250] * | 0
> [145038] chrX [154862200, 154862350] * | 0
>
> ---
> seqlengths:
> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
> chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX
> chrY
> NA NA NA NA NA NA NA NA NA NA NA
> NA NA NA NA NA NA NA NA NA NA NA NA
> NA
> ##########
>
> I can use subsetByOverlaps() in both directions to compute the overlap
> between them and return a GRanges object:
>
> ##########
>> subsetByOverlaps(dnase, snp)
> GRanges with 5 ranges and 1 metadata column:
> seqnames ranges strand | score
> <Rle> <IRanges> <Rle> | <integer>
> [1] chr1 [ 17306560, 17306710] * | 0
> [2] chr2 [169869820, 169869970] * | 0
> [3] chr4 [145506440, 145506590] * | 0
> [4] chr5 [ 15014080, 15014230] * | 0
> [5] chr5 [ 15117400, 15117550] * | 0
> ---
> seqlengths:
> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
> chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX
> chrY
> NA NA NA NA NA NA NA NA NA NA NA
> NA NA NA NA NA NA NA NA NA NA NA NA
> NA
>
>> subsetByOverlaps(snp, dnase)
> GRanges with 6 ranges and 1 metadata column:
> seqnames ranges strand | score
> <Rle> <IRanges> <Rle> | <integer>
> rs2235746 chr1 [ 17306671, 17306672] - | 0
> rs4157777 chr2 [169869904, 169869904] - | 0
> rs6858330 chr4 [145506558, 145506559] + | 0
> rs13146741 chr4 [145506453, 145506454] + | 0
> rs32847 chr5 [ 15117438, 15117439] + | 0
> rs7341842 chr5 [ 15014184, 15014185] + | 0
> ---
> seqlengths:
> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
> chr2 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9
> NA NA NA NA NA NA NA NA NA NA NA
> NA NA NA NA NA NA NA NA NA NA
> ##########
>
> The first GRanges objects stores the DNase genomic locations
> overlapping with the SNPs, while the second one contains the SNPs IDs
> (as GRanges names) and genomic locations overlapping with the DNase
> dataset.
>
> Now, what I actually need is a GRanges object that stores the SNPs IDs
> and the DNase genomic locations. Is this possible?
>
> Thank you.
> Best,
>
>
More information about the Bioconductor
mailing list