[BioC] Help with sliding window analysis on GRanges object

Hervé Pagès hpages at fhcrc.org
Thu Apr 17 19:51:31 CEST 2014


Hi Vince, Michael,

On 04/17/2014 10:40 AM, Michael Lawrence wrote:
> On Thu, Apr 17, 2014 at 10:27 AM, Vince S. Buffalo <vsbuffalo at gmail.com> wrote:
>
>> Sorry to return to this older topic, but I'm curious -- what's the
>> reasoning behind allocating tiles of 1L then using resize?
>>
>> tiles <- unlist(tileGenome(seqinfo(snps), tilewidth=1L))
>> windows <- resize(tiles, 500L) # you will get a warning about trimming
>>
>>
> This is the way to generate sliding windows: each 1 bp tile marks a window
> start, and resize then widens it to 500 bp. tileGenome on its own generates
> a partitioning, i.e., non-overlapping windows.
>
>
>> Also, in general, why does tileGenome always return a list rather than a
>> GenomicRanges object?
>>
>>
> tileGenome can potentially generate multiple GRanges elements per tile,
> because by default tiles will cross between chromosomes in an effort to
> achieve constant tile width. Even when that feature is disabled, the result
> is still a GRangesList for consistency.

To disable that feature, use 'cut.last.tile.in.chrom=TRUE'. Then it
returns a GRanges. See ?tileGenome
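
For instance, here is a minimal sketch on a made-up two-chromosome genome
(the seqlengths below are hypothetical, not taken from Stefano's data). It
only illustrates the difference in return type, assuming GenomicRanges is
installed:

```r
library(GenomicRanges)

# Hypothetical toy genome (an assumption for illustration): two short chromosomes.
sl <- c(chr1 = 1000L, chr2 = 750L)

# Default: tiles may span chromosome boundaries, so the result is a GRangesList.
tiles_list <- tileGenome(sl, tilewidth = 500L)
is(tiles_list, "GRangesList")

# With cut.last.tile.in.chrom=TRUE no tile crosses a chromosome boundary,
# and a plain GRanges comes back, so no unlist() is needed.
tiles <- tileGenome(sl, tilewidth = 500L, cut.last.tile.in.chrom = TRUE)
is(tiles, "GRanges")

# The last tile on each chromosome is cut at the chromosome end.
width(tiles)
```

The trade-off is that the last tile on a chromosome can be narrower than
the requested tilewidth.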

Cheers,
H.

>
> Also, until recent changes in devel, it wasn't possible to lapply over a
> GRanges, which is a typical use case when tiling.
>
>
>> Vince
>>
>>
>> On Mon, Mar 10, 2014 at 8:27 AM, Stefano Iantorno <si3 at sanger.ac.uk> wrote:
>>
>>> Thanks, that worked beautifully. I ended up doing the following:
>>>
>>>
>>>
>>> tileranges <- unlist(tileGenome(seqinfo(snps), tilewidth=500))
>>>
>>> hits.df <- as.data.frame(findOverlaps(tileranges, snps))
>>>
>>>
>>>
>>> I can then subset tileranges and snps with hits.df$queryHits or
>>> hits.df$subjectHits to retrieve all the information in the original
>>> GRanges object.
>>>
>>> Although these are not overlapping sliding windows (they are more like
>>> "bins"), I think it might be good enough for my purposes.
>>>
>>> Best,
>>>
>>>
>>>
>>> -          Stefano
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: Michael Lawrence [mailto:lawrence.michael at gene.com]
>>> Sent: 09 March 2014 00:44
>>> To: Stefano Iantorno
>>> Cc: bioconductor at r-project.org
>>> Subject: Re: [BioC] Help with sliding window analysis on GRanges object
>>>
>>>
>>>
>>> I just realized that this will not scale well for the whole genome. So
>>> you might just want to summarize with the Rle utilities or take 500bp
>>> around each SNP to form your windows. Depends on your goal.
>>>
>>> Michael
>>>
>>>
>>>
>>> On Sat, Mar 8, 2014 at 9:38 PM, Michael Lawrence <michafla at gene.com>
>>> wrote:
>>>
>>> One way would be to generate the GRanges for the sliding windows and use
>>> findOverlaps to get the list of indices.
>>>
>>> Something like this:
>>>
>>> tiles <- unlist(tileGenome(seqinfo(snps), tilewidth=1L))
>>>
>>> windows <- resize(tiles, 500L) # you will get a warning about trimming
>>>
>>> answer <- as.list(findOverlaps(windows, snps))
>>>
>>> Good luck. I also like Martin's answer if all you want is e.g. a count.
>>>
>>>
>>>
>>> We might want to think about an argument to tileGenome or some mechanism
>>> for generating a sliding tiling, in addition to the disjoint tiling.
>>>
>>> Michael
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Mar 8, 2014 at 1:41 PM, Stefano Iantorno <si3 at sanger.ac.uk>
>>> wrote:
>>>
>>> Hello
>>>
>>>
>>>
>>> I am trying to conduct a sliding window analysis on a GRanges object. My
>>> ranges are a list of 60272 single nucleotide positions representing high
>>> confidence SNPs stored as an IRanges object. I would like to retrieve the
>>> list of GRanges row IDs for each 500bp window in the genome
>>> (overlapping windows).
>>>
>>>
>>>
>>> All the documentation I could find on sliding window functions such as
>>> runsum, runmean, etc. is for Rle objects.
>>>
>>>
>>>
>>> Any idea where to start from? I can't figure out a way to pick windows
>>> in the IRanges object across intervals, since each interval is
>>> represented by a start and end position (same genomic position since
>>> it's a single nucleotide long).
>>>
>>>
>>>
>>> Any help will be greatly appreciated.
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> -          Stefano
>>>
>>>
>>>
>>>
>>> --
>>>   The Wellcome Trust Sanger Institute is operated by Genome Research
>>>   Limited, a charity registered in England with number 1021457 and a
>>>   company registered in England with number 2742969, whose registered
>>>   office is 215 Euston Road, London, NW1 2BE.
>>>
>>>
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>> --
>> Vince Buffalo
>> Ross-Ibarra Lab (www.rilab.org)
>> Plant Sciences, UC Davis
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


