[Bioc-sig-seq] IRanges::overlap interface, help writing idiomatic code?

Steve Lianoglou mailinglist.honeypot at gmail.com
Fri Sep 11 19:44:07 CEST 2009


Hi Patrick,

Your suggestions were what I was looking for, they're great, thanks!

On Sep 11, 2009, at 12:51 PM, Patrick Aboyoun wrote:

> Steve,
> I have revised your code to simplify it a bit. I added example data  
> as well to make it easier for others to run it.

Yeah, I should have pasted the objects in an easy-to-paste-into-R- 
format, sorry for the oversight.

> The shift and width arguments to coverage are discussed in the  
> coverage man page.

I was wrestling with them trying to figure out the correct way I was  
expected to use them ... it just wasn't all that intuitive, to be  
honest.

I should have noticed in the examples section that the coverage isn't  
calculated for negative ranges (which is what you get when you shift  
all of your ranges to start at start(bounds)) ... though it would be  
clear if I paid more attention to the very first example of  
``coverage`` usage.

Perhaps you might consider making an explicit comment in the  
documentation for the width parameter regarding this fact?

Thanks again,
-steve

> If you are unsure what the sift argument will do to a set of ranges,  
> you can run the shift operation on them (e.g. shift(bounds, shift =  
> - start(bounds) + 1)).

>
> ==== Revised code ====
>
> suppressMessages(library(IRanges))
>
> sample.reads <- IRanges(start = c(11869, 11882, 12245, 12554, 12555,  
> 12557), width = 32)
> bounds <- IRanges(start=11700, end=12550)
>
> # Get the the reads from my sample.reads
> # object that lie within my gene bounds
> gene.reads <- sample.reads[bounds]
>
> # Build my coverage Rle vector
> cov <- coverage(gene.reads, shift = - start(bounds) + 1, width =  
> width(bounds))
>
> # Create the non-Rle coverage vector that I will use down stream
> # to make a GenomeGraph::BaseTrack object
> counts <- as.numeric(cov)
>
> # Build the base-position vector and filter out positions
> # with 0 counts from both vectors
> keep <- counts != 0
>
> # This is (i) from above
> bases <- as.integer(bounds)[keep]
>
> # This is (ii) from above
> counts <- counts[keep]
>
> =====================
>
>
> > sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-09-07 r49613)
> i386-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
> [1] IRanges_1.3.72
>
> loaded via a namespace (and not attached):
> [1] tools_2.10.0
>
>
>
>
> Steve Lianoglou wrote:
>> Hi Michael,
>>
>> Thanks for your suggestions so far ... being able to slice a Range
>> object w/ another Range object somehow fell off of my radar.
>>
>> -steve
>>
>> On Thu, Sep 10, 2009 at 6:25 PM, Michael  
>> Lawrence<mflawren at fhcrc.org> wrote:
>>
>>> On Thu, Sep 10, 2009 at 1:55 PM, Steve Lianoglou
>>> <mailinglist.honeypot at gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm finding it hard to shed some of my 10 thumbs when using the
>>>> IRanges/Rle classes, so can someone suggest the "correct" way to  
>>>> do what my
>>>> code does below?
>>>>
>>>> My goal is to use GenomeGraphs to plot a picture of my gene  
>>>> model, along
>>>> with the reads from 1 or more samples above it. I've actually got  
>>>> that
>>>> working, but I feel there's a more idiomatic way to do this, so  
>>>> if you have
>>>> any comments, I'd be interested in hearing them.
>>>>
>>>> I'll paste the function in its entirety at the end of the email.  
>>>> It works
>>>> as advertised, but perhaps not done in the most efficient way.  
>>>> For brevity's
>>>> sake, however, I've distilled the specific code for your feedback.
>>>>
>>>> Anyway, here's the gist:
>>>>
>>>> 1. sample.reads : An IRanges object of the reads aligned to a  
>>>> particular
>>>> chromosome from a particular rna-seq sample:
>>>>
>>>> R> > head(sample.reads)
>>>> IRanges instance:
>>>>   start   end width
>>>> [1] 11869 11900    32
>>>> [2] 11882 11913    32
>>>> [3] 12245 12276    32
>>>> [4] 12554 12585    32
>>>> [5] 12555 12586    32
>>>> [6] 12557 12588    32
>>>>
>>>> 2. bounds: An IRange object that indicates the start and end  
>>>> position of
>>>> my gene -- I want all the reads that land here:
>>>>
>>>> R> bounds
>>>> IRanges instance:
>>>>      start      end width
>>>> [1] 12040238 12073571 33334
>>>>
>>>> Goal:
>>>> I wan to build two vectors:
>>>> (i) A vector of positions on my chromosome (having length == to  
>>>> the length
>>>> of my gene)
>>>> (ii) A vector of counts over those positions
>>>>
>>>> Positions with 0 counts are removed from (i) and (ii)
>>>>
>>>> ==== Here's my code ====
>>>>
>>>> # Get the index of the reads from my sample.reads
>>>> # object that lie within my gene bounds
>>>> which.reads <- subjectHits(overlap(sample.reads, bounds))
>>>>
>>>> # Slice out these reads
>>>> gene.reads <- reads[which.reads]
>>>>
>>>>
>>> The above two steps could be simply:
>>>
>>>> gene.reads <- sample.reads[bounds]
>>>>      # Build my coverage Rle vector
>>>> cov <- coverage(gene.reads)
>>>>
>>>> # Only keep the part of the Rle that starts @ the start
>>>> # of my gene to the end (I feel like doing this is weird (?))
>>>> cov <- cov[start(bounds):length(cov)]
>>>>
>>>> # Create the non-Rle coverage vector that I will use down stream
>>>> # to make a GenomeGraph::BaseTrack object
>>>> counts <- numeric(width(bounds))
>>>>
>>>> # Populate this vector with the correct coverage counts
>>>> counts[1:length(cov)] <- as.vector(cov)
>>>>
>>>>
>>> As you say, this looks like it could be done easier using  
>>> arguments to
>>> coverage(). Maybe someone else could explain this?
>>>
>>>
>>>> # Build the base-position vector and filter out positions
>>>> # with 0 counts from both vectors
>>>> keep <- counts != 0
>>>>
>>>> # This is (i) from above
>>>> bases <- (start(bounds):end(bounds))[keep]
>>>>
>>>>
>>> A bit shorter:
>>> bases <- as.integer(bounds)[keep]
>>>
>>>
>>>> # This is (ii) from above
>>>> counts <- counts[keep]
>>>>
>>>> =====================
>>>>
>>>> I feel like the coverage interface, using the start/end args  
>>>> would help
>>>> here, no? Wouldn't this have worked to make things a bit easier?
>>>>
>>>> cov <- coverage(gene.reads, start=start(bounds), end=end(bounds))
>>>>
>>>> How would you do that by using the width/shift args? Or wouldn't  
>>>> you?
>>>> Maybe I'm not clear on what the original start/end args in  
>>>> ``coverage`` are
>>>> meant to do?
>>>>
>>>> Anyway, I was just curious if there was a better way to do what I  
>>>> need to
>>>> do.
>>>>
>>>> Thanks,
>>>>
>>>> -steve
>>>>
>>>>
>>>> ===== Here's the function in all of its glory, use it if you like  
>>>> =====
>>>>
>>>> Here's a sample call:
>>>>
>>>> plotGenomeGraphReads(list(Sample1=reads.1, Sample2=reads.2),
>>>>   gene.id='ENSG00000116688', title="MFN2")
>>>>
>>>> Here's the function:
>>>>
>>>> plotGenomeGraphReads <- function(sample.reads, gene.model=NULL,
>>>> gene.id=NULL,
>>>>                                biomart=NULL, cols=NULL,  
>>>> title=NULL) {
>>>> # Plots the read counts over a particular gene model.
>>>> #
>>>> # You can pass your own gene.model as an IRange object, or use  
>>>> biomaRt to
>>>> # retrieve the model using the Ensembl Gene ID gene.id
>>>> #
>>>> # Parameters
>>>> # ----------
>>>> # sample.reads : named list[IRanges] representing reads from  
>>>> separate
>>>> samples
>>>> #                over a given chromosome (the one your gene is on!)
>>>> # gene.model   : IRanges object of start/end positions of the gene
>>>> # mart         : biomaRt, if you want to query ensembl for the  
>>>> gene model
>>>> # cols         : the colors you want to use for the reads in the
>>>> respective
>>>> #                samples
>>>> if (is.null(gene.model) && is.null(gene.id)) {
>>>>   stop("Need to get the gene.model from somewhere.")
>>>> }
>>>>
>>>> # Figure out what type of gene.model we're using and get  
>>>> appropriate
>>>> bounds
>>>> if (!is.null(gene.model)) {
>>>>   gm <- makeGeneModel(start=start(gene.model), end=end(gene.model))
>>>>   bounds <- range(gene.model)
>>>> } else {
>>>>   if (is.null(biomart)) {
>>>>     biomart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
>>>>   }
>>>>   gm <- makeGene(id=gene.id, type='ensembl_gene_id',  
>>>> biomart=biomart)
>>>>   if (is.null(title)) {
>>>>     title <- gene.id
>>>>   }
>>>>   anno <- gm at ens
>>>>   bounds <- range(IRanges(start=anno$exon_chrom_start,
>>>>                           end=anno$exon_chrom_end))
>>>> }
>>>>
>>>> sample.names <- names(sample.reads)
>>>>
>>>> if (is.null(cols)) {
>>>>   cols <- rainbow(length(sample.reads))
>>>>   names(cols) <- sample.names
>>>> }
>>>>
>>>> if (is.null(title)) {
>>>>   title <- 'Gene Reads'
>>>> }
>>>>
>>>> stopifnot(length(cols) == length(sample.reads))
>>>> stopifnot(names(cols) == sample.names)
>>>>
>>>> # Calculate the coverage vectors over the gene model for each  
>>>> sample
>>>> sample.coverage <- lapply(sample.names, function(name) {
>>>>   reads <- sample.reads[[name]]
>>>>   which.reads <- subjectHits(overlap(reads, bounds))
>>>>   gene.reads <- reads[which.reads]
>>>>   cov <- coverage(gene.reads)
>>>>   cov <- cov[start(bounds):length(cov)]
>>>>
>>>>   counts <- numeric(width(bounds))
>>>>   counts[1:length(cov)] <- as.vector(cov)
>>>>   keep <- counts != 0
>>>>   bases <- (start(bounds):end(bounds))[keep]
>>>>   counts <- counts[keep]
>>>>   list(pos=bases, coverage=counts)
>>>> })
>>>> names(sample.coverage) <- sample.names
>>>>
>>>> # Make y-axis the same for each sample
>>>> max.counts <- max(sapply(sample.coverage, function(cov)
>>>> max(cov$coverage)))
>>>>
>>>> tracks <- lapply(sample.names, function(name) {
>>>>   dp <- DisplayPars(lwd=0.3, color=cols[[name]], ylim=c(0,  
>>>> max.counts))
>>>>   makeBaseTrack(base=sample.coverage[[name]]$pos,
>>>>                 value=sample.coverage[[name]]$coverage,
>>>>                 dp=dp)
>>>> })
>>>> names(tracks) <- names(sample.reads)
>>>>
>>>> title <- makeTitle(text=title)
>>>> plot.me <- c(title, tracks, list(gm))
>>>>
>>>> gdPlot(plot.me, minBase=start(bounds), maxBase=end(bounds))
>>>> }
>>>>
>>>>
>>>> --
>>>> Steve Lianoglou
>>>> Graduate Student: Computational Systems Biology
>>>> |  Memorial Sloan-Kettering Cancer Center
>>>> |  Weill Medical College of Cornell University
>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>
>>>
>>
>>
>>
>>
>

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



More information about the Bioc-sig-sequencing mailing list