[Bioc-sig-seq] IRanges::overlap interface, help writing idiomatic code?

Steve Lianoglou mailinglist.honeypot at gmail.com
Fri Sep 11 20:45:27 CEST 2009


Hi,

On Sep 11, 2009, at 2:18 PM, Patrick Aboyoun wrote:

> Steve,
> Good point. I added a clarifying statement in coverage's sift  
> argument description:
>
> \item{shift}{
>   For most methods, an integer vector (recycled to the length of
>   \code{x}) specifying how each element in \code{x} should be
>   (horizontally) shifted before the coverage is computed. Only shifted
>   indices in the range [1, \code{width}] will be included in the  
> coverage
>   calculation.
>   ...
> }

Yes, that's great.

Thanks again,
-steve

>
>
> Patrick
>
>
> Steve Lianoglou wrote:
>> Hi Patrick,
>>
>> Your suggestions were what I was looking for, they're great, thanks!
>>
>> On Sep 11, 2009, at 12:51 PM, Patrick Aboyoun wrote:
>>
>>> Steve,
>>> I have revised your code to simplify it a bit. I added example  
>>> data as well to make it easier for others to run it.
>>
>> Yeah, I should have pasted the objects in an easy-to-paste-into-R- 
>> format, sorry for the oversight.
>>
>>> The shift and width arguments to coverage are discussed in the  
>>> coverage man page.
>>
>> I was wrestling with them trying to figure out the correct way I  
>> was expected to use them ... it just wasn't all that intuitive, to  
>> be honest.
>>
>> I should have noticed in the examples section that the coverage  
>> isn't calculated for negative ranges (which is what you get when  
>> you shift all of your ranges to start at start(bounds)) ... though  
>> it would be clear if I paid more attention to the very first  
>> example of ``coverage`` usage.
>>
>> Perhaps you might consider making an explicit comment in the  
>> documentation for the width parameter regarding this fact?
>>
>> Thanks again,
>> -steve
>>
>>> If you are unsure what the sift argument will do to a set of  
>>> ranges, you can run the shift operation on them (e.g. shift 
>>> (bounds, shift = - start(bounds) + 1)).
>>
>>>
>>> ==== Revised code ====
>>>
>>> suppressMessages(library(IRanges))
>>>
>>> sample.reads <- IRanges(start = c(11869, 11882, 12245, 12554,  
>>> 12555, 12557), width = 32)
>>> bounds <- IRanges(start=11700, end=12550)
>>>
>>> # Get the the reads from my sample.reads
>>> # object that lie within my gene bounds
>>> gene.reads <- sample.reads[bounds]
>>>
>>> # Build my coverage Rle vector
>>> cov <- coverage(gene.reads, shift = - start(bounds) + 1, width =  
>>> width(bounds))
>>>
>>> # Create the non-Rle coverage vector that I will use down stream
>>> # to make a GenomeGraph::BaseTrack object
>>> counts <- as.numeric(cov)
>>>
>>> # Build the base-position vector and filter out positions
>>> # with 0 counts from both vectors
>>> keep <- counts != 0
>>>
>>> # This is (i) from above
>>> bases <- as.integer(bounds)[keep]
>>>
>>> # This is (ii) from above
>>> counts <- counts[keep]
>>>
>>> =====================
>>>
>>>
>>> > sessionInfo()
>>> R version 2.10.0 Under development (unstable) (2009-09-07 r49613)
>>> i386-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> other attached packages:
>>> [1] IRanges_1.3.72
>>>
>>> loaded via a namespace (and not attached):
>>> [1] tools_2.10.0
>>>
>>>
>>>
>>>
>>> Steve Lianoglou wrote:
>>>> Hi Michael,
>>>>
>>>> Thanks for your suggestions so far ... being able to slice a Range
>>>> object w/ another Range object somehow fell off of my radar.
>>>>
>>>> -steve
>>>>
>>>> On Thu, Sep 10, 2009 at 6:25 PM, Michael Lawrence<mflawren at fhcrc.org 
>>>> > wrote:
>>>>
>>>>> On Thu, Sep 10, 2009 at 1:55 PM, Steve Lianoglou
>>>>> <mailinglist.honeypot at gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm finding it hard to shed some of my 10 thumbs when using the
>>>>>> IRanges/Rle classes, so can someone suggest the "correct" way  
>>>>>> to do what my
>>>>>> code does below?
>>>>>>
>>>>>> My goal is to use GenomeGraphs to plot a picture of my gene  
>>>>>> model, along
>>>>>> with the reads from 1 or more samples above it. I've actually  
>>>>>> got that
>>>>>> working, but I feel there's a more idiomatic way to do this, so  
>>>>>> if you have
>>>>>> any comments, I'd be interested in hearing them.
>>>>>>
>>>>>> I'll paste the function in its entirety at the end of the  
>>>>>> email. It works
>>>>>> as advertised, but perhaps not done in the most efficient way.  
>>>>>> For brevity's
>>>>>> sake, however, I've distilled the specific code for your  
>>>>>> feedback.
>>>>>>
>>>>>> Anyway, here's the gist:
>>>>>>
>>>>>> 1. sample.reads : An IRanges object of the reads aligned to a  
>>>>>> particular
>>>>>> chromosome from a particular rna-seq sample:
>>>>>>
>>>>>> R> > head(sample.reads)
>>>>>> IRanges instance:
>>>>>>  start   end width
>>>>>> [1] 11869 11900    32
>>>>>> [2] 11882 11913    32
>>>>>> [3] 12245 12276    32
>>>>>> [4] 12554 12585    32
>>>>>> [5] 12555 12586    32
>>>>>> [6] 12557 12588    32
>>>>>>
>>>>>> 2. bounds: An IRange object that indicates the start and end  
>>>>>> position of
>>>>>> my gene -- I want all the reads that land here:
>>>>>>
>>>>>> R> bounds
>>>>>> IRanges instance:
>>>>>>     start      end width
>>>>>> [1] 12040238 12073571 33334
>>>>>>
>>>>>> Goal:
>>>>>> I wan to build two vectors:
>>>>>> (i) A vector of positions on my chromosome (having length == to  
>>>>>> the length
>>>>>> of my gene)
>>>>>> (ii) A vector of counts over those positions
>>>>>>
>>>>>> Positions with 0 counts are removed from (i) and (ii)
>>>>>>
>>>>>> ==== Here's my code ====
>>>>>>
>>>>>> # Get the index of the reads from my sample.reads
>>>>>> # object that lie within my gene bounds
>>>>>> which.reads <- subjectHits(overlap(sample.reads, bounds))
>>>>>>
>>>>>> # Slice out these reads
>>>>>> gene.reads <- reads[which.reads]
>>>>>>
>>>>>>
>>>>> The above two steps could be simply:
>>>>>
>>>>>> gene.reads <- sample.reads[bounds]
>>>>>>     # Build my coverage Rle vector
>>>>>> cov <- coverage(gene.reads)
>>>>>>
>>>>>> # Only keep the part of the Rle that starts @ the start
>>>>>> # of my gene to the end (I feel like doing this is weird (?))
>>>>>> cov <- cov[start(bounds):length(cov)]
>>>>>>
>>>>>> # Create the non-Rle coverage vector that I will use down stream
>>>>>> # to make a GenomeGraph::BaseTrack object
>>>>>> counts <- numeric(width(bounds))
>>>>>>
>>>>>> # Populate this vector with the correct coverage counts
>>>>>> counts[1:length(cov)] <- as.vector(cov)
>>>>>>
>>>>>>
>>>>> As you say, this looks like it could be done easier using  
>>>>> arguments to
>>>>> coverage(). Maybe someone else could explain this?
>>>>>
>>>>>
>>>>>> # Build the base-position vector and filter out positions
>>>>>> # with 0 counts from both vectors
>>>>>> keep <- counts != 0
>>>>>>
>>>>>> # This is (i) from above
>>>>>> bases <- (start(bounds):end(bounds))[keep]
>>>>>>
>>>>>>
>>>>> A bit shorter:
>>>>> bases <- as.integer(bounds)[keep]
>>>>>
>>>>>
>>>>>> # This is (ii) from above
>>>>>> counts <- counts[keep]
>>>>>>
>>>>>> =====================
>>>>>>
>>>>>> I feel like the coverage interface, using the start/end args  
>>>>>> would help
>>>>>> here, no? Wouldn't this have worked to make things a bit easier?
>>>>>>
>>>>>> cov <- coverage(gene.reads, start=start(bounds), end=end(bounds))
>>>>>>
>>>>>> How would you do that by using the width/shift args? Or  
>>>>>> wouldn't you?
>>>>>> Maybe I'm not clear on what the original start/end args in  
>>>>>> ``coverage`` are
>>>>>> meant to do?
>>>>>>
>>>>>> Anyway, I was just curious if there was a better way to do what  
>>>>>> I need to
>>>>>> do.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -steve
>>>>>>
>>>>>>
>>>>>> ===== Here's the function in all of its glory, use it if you  
>>>>>> like =====
>>>>>>
>>>>>> Here's a sample call:
>>>>>>
>>>>>> plotGenomeGraphReads(list(Sample1=reads.1, Sample2=reads.2),
>>>>>>  gene.id='ENSG00000116688', title="MFN2")
>>>>>>
>>>>>> Here's the function:
>>>>>>
>>>>>> plotGenomeGraphReads <- function(sample.reads, gene.model=NULL,
>>>>>> gene.id=NULL,
>>>>>>                               biomart=NULL, cols=NULL,  
>>>>>> title=NULL) {
>>>>>> # Plots the read counts over a particular gene model.
>>>>>> #
>>>>>> # You can pass your own gene.model as an IRange object, or use  
>>>>>> biomaRt to
>>>>>> # retrieve the model using the Ensembl Gene ID gene.id
>>>>>> #
>>>>>> # Parameters
>>>>>> # ----------
>>>>>> # sample.reads : named list[IRanges] representing reads from  
>>>>>> separate
>>>>>> samples
>>>>>> #                over a given chromosome (the one your gene is  
>>>>>> on!)
>>>>>> # gene.model   : IRanges object of start/end positions of the  
>>>>>> gene
>>>>>> # mart         : biomaRt, if you want to query ensembl for the  
>>>>>> gene model
>>>>>> # cols         : the colors you want to use for the reads in the
>>>>>> respective
>>>>>> #                samples
>>>>>> if (is.null(gene.model) && is.null(gene.id)) {
>>>>>>  stop("Need to get the gene.model from somewhere.")
>>>>>> }
>>>>>>
>>>>>> # Figure out what type of gene.model we're using and get  
>>>>>> appropriate
>>>>>> bounds
>>>>>> if (!is.null(gene.model)) {
>>>>>>  gm <- makeGeneModel(start=start(gene.model), end=end 
>>>>>> (gene.model))
>>>>>>  bounds <- range(gene.model)
>>>>>> } else {
>>>>>>  if (is.null(biomart)) {
>>>>>>    biomart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
>>>>>>  }
>>>>>>  gm <- makeGene(id=gene.id, type='ensembl_gene_id',  
>>>>>> biomart=biomart)
>>>>>>  if (is.null(title)) {
>>>>>>    title <- gene.id
>>>>>>  }
>>>>>>  anno <- gm at ens
>>>>>>  bounds <- range(IRanges(start=anno$exon_chrom_start,
>>>>>>                          end=anno$exon_chrom_end))
>>>>>> }
>>>>>>
>>>>>> sample.names <- names(sample.reads)
>>>>>>
>>>>>> if (is.null(cols)) {
>>>>>>  cols <- rainbow(length(sample.reads))
>>>>>>  names(cols) <- sample.names
>>>>>> }
>>>>>>
>>>>>> if (is.null(title)) {
>>>>>>  title <- 'Gene Reads'
>>>>>> }
>>>>>>
>>>>>> stopifnot(length(cols) == length(sample.reads))
>>>>>> stopifnot(names(cols) == sample.names)
>>>>>>
>>>>>> # Calculate the coverage vectors over the gene model for each  
>>>>>> sample
>>>>>> sample.coverage <- lapply(sample.names, function(name) {
>>>>>>  reads <- sample.reads[[name]]
>>>>>>  which.reads <- subjectHits(overlap(reads, bounds))
>>>>>>  gene.reads <- reads[which.reads]
>>>>>>  cov <- coverage(gene.reads)
>>>>>>  cov <- cov[start(bounds):length(cov)]
>>>>>>
>>>>>>  counts <- numeric(width(bounds))
>>>>>>  counts[1:length(cov)] <- as.vector(cov)
>>>>>>  keep <- counts != 0
>>>>>>  bases <- (start(bounds):end(bounds))[keep]
>>>>>>  counts <- counts[keep]
>>>>>>  list(pos=bases, coverage=counts)
>>>>>> })
>>>>>> names(sample.coverage) <- sample.names
>>>>>>
>>>>>> # Make y-axis the same for each sample
>>>>>> max.counts <- max(sapply(sample.coverage, function(cov)
>>>>>> max(cov$coverage)))
>>>>>>
>>>>>> tracks <- lapply(sample.names, function(name) {
>>>>>>  dp <- DisplayPars(lwd=0.3, color=cols[[name]], ylim=c(0,  
>>>>>> max.counts))
>>>>>>  makeBaseTrack(base=sample.coverage[[name]]$pos,
>>>>>>                value=sample.coverage[[name]]$coverage,
>>>>>>                dp=dp)
>>>>>> })
>>>>>> names(tracks) <- names(sample.reads)
>>>>>>
>>>>>> title <- makeTitle(text=title)
>>>>>> plot.me <- c(title, tracks, list(gm))
>>>>>>
>>>>>> gdPlot(plot.me, minBase=start(bounds), maxBase=end(bounds))
>>>>>> }
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Steve Lianoglou
>>>>>> Graduate Student: Computational Systems Biology
>>>>>> |  Memorial Sloan-Kettering Cancer Center
>>>>>> |  Weill Medical College of Cornell University
>>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-sig-sequencing mailing list
>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>> -- 
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  |  Memorial Sloan-Kettering Cancer Center
>>  |  Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



More information about the Bioc-sig-sequencing mailing list