[Bioc-sig-seq] IRanges::overlap interface, help writing idiomatic code?

Fri Sep 11 19:09:25 CEST 2009

Michael,
I was recrafting Steve's example, in which he created a numeric vector 
for the non-zero coverage values. He might need a numeric vector for 
GenomeGraphs.

Patrick

Michael Lawrence wrote:
>
>
> On Fri, Sep 11, 2009 at 9:51 AM, Patrick Aboyoun <paboyoun at fhcrc.org 
> <mailto:paboyoun at fhcrc.org>> wrote:
>
>     Steve,
>     I have revised your code to simplify it a bit. I added example
>     data as well to make it easier for others to run it. The shift and
>     width arguments to coverage are discussed in the coverage man
>     page. If you are unsure what the sift argument will do to a set of
>     ranges, you can run the shift operation on them (e.g.
>     shift(bounds, shift = - start(bounds) + 1)).
>
>
>     ==== Revised code ====
>
>     suppressMessages(library(IRanges))
>
>     sample.reads <- IRanges(start = c(11869, 11882, 12245, 12554,
>     12555, 12557), width = 32)
>     bounds <- IRanges(start=11700, end=12550)
>
>     # Get the the reads from my sample.reads
>
>     # object that lie within my gene bounds
>     gene.reads <- sample.reads[bounds]
>
>     # Build my coverage Rle vector
>     cov <- coverage(gene.reads, shift = - start(bounds) + 1, width =
>     width(bounds))
>
>
>     # Create the non-Rle coverage vector that I will use down stream
>     # to make a GenomeGraph::BaseTrack object
>     counts <- as.numeric(cov)
>
>
> Would as.integer() be sufficient here?
>  
>
>
>     # Build the base-position vector and filter out positions
>     # with 0 counts from both vectors
>     keep <- counts != 0
>
>     # This is (i) from above
>     bases <- as.integer(bounds)[keep]
>
>     # This is (ii) from above
>     counts <- counts[keep]
>
>     =====================
>
>
>     > sessionInfo()
>     R version 2.10.0 Under development (unstable) (2009-09-07 r49613)
>     i386-apple-darwin9.8.0
>
>     locale:
>     [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
>     attached base packages:
>     [1] stats     graphics  grDevices utils     datasets  methods  
>     base    
>     other attached packages:
>     [1] IRanges_1.3.72
>
>     loaded via a namespace (and not attached):
>     [1] tools_2.10.0
>
>
>
>
>
>     Steve Lianoglou wrote:
>
>         Hi Michael,
>
>         Thanks for your suggestions so far ... being able to slice a Range
>         object w/ another Range object somehow fell off of my radar.
>
>         -steve
>
>         On Thu, Sep 10, 2009 at 6:25 PM, Michael
>         Lawrence<mflawren at fhcrc.org <mailto:mflawren at fhcrc.org>> wrote:
>          
>
>             On Thu, Sep 10, 2009 at 1:55 PM, Steve Lianoglou
>             <mailinglist.honeypot at gmail.com
>             <mailto:mailinglist.honeypot at gmail.com>> wrote:
>                
>
>                 Hi all,
>
>                 I'm finding it hard to shed some of my 10 thumbs when
>                 using the
>                 IRanges/Rle classes, so can someone suggest the
>                 "correct" way to do what my
>                 code does below?
>
>                 My goal is to use GenomeGraphs to plot a picture of my
>                 gene model, along
>                 with the reads from 1 or more samples above it. I've
>                 actually got that
>                 working, but I feel there's a more idiomatic way to do
>                 this, so if you have
>                 any comments, I'd be interested in hearing them.
>
>                 I'll paste the function in its entirety at the end of
>                 the email. It works
>                 as advertised, but perhaps not done in the most
>                 efficient way. For brevity's
>                 sake, however, I've distilled the specific code for
>                 your feedback.
>
>                 Anyway, here's the gist:
>
>                 1. sample.reads : An IRanges object of the reads
>                 aligned to a particular
>                 chromosome from a particular rna-seq sample:
>
>                 R> > head(sample.reads)
>                 IRanges instance:
>                   start   end width
>                 [1] 11869 11900    32
>                 [2] 11882 11913    32
>                 [3] 12245 12276    32
>                 [4] 12554 12585    32
>                 [5] 12555 12586    32
>                 [6] 12557 12588    32
>
>                 2. bounds: An IRange object that indicates the start
>                 and end position of
>                 my gene -- I want all the reads that land here:
>
>                 R> bounds
>                 IRanges instance:
>                      start      end width
>                 [1] 12040238 12073571 33334
>
>                 Goal:
>                 I wan to build two vectors:
>                 (i) A vector of positions on my chromosome (having
>                 length == to the length
>                 of my gene)
>                 (ii) A vector of counts over those positions
>
>                 Positions with 0 counts are removed from (i) and (ii)
>
>                 ==== Here's my code ====
>
>                 # Get the index of the reads from my sample.reads
>                 # object that lie within my gene bounds
>                 which.reads <- subjectHits(overlap(sample.reads, bounds))
>
>                 # Slice out these reads
>                 gene.reads <- reads[which.reads]
>
>                      
>
>             The above two steps could be simply:
>                
>
>                 gene.reads <- sample.reads[bounds]
>                      # Build my coverage Rle vector
>                 cov <- coverage(gene.reads)
>
>                 # Only keep the part of the Rle that starts @ the start
>                 # of my gene to the end (I feel like doing this is
>                 weird (?))
>                 cov <- cov[start(bounds):length(cov)]
>
>                 # Create the non-Rle coverage vector that I will use
>                 down stream
>                 # to make a GenomeGraph::BaseTrack object
>                 counts <- numeric(width(bounds))
>
>                 # Populate this vector with the correct coverage counts
>                 counts[1:length(cov)] <- as.vector(cov)
>
>                      
>
>             As you say, this looks like it could be done easier using
>             arguments to
>             coverage(). Maybe someone else could explain this?
>
>                
>
>                 # Build the base-position vector and filter out positions
>                 # with 0 counts from both vectors
>                 keep <- counts != 0
>
>                 # This is (i) from above
>                 bases <- (start(bounds):end(bounds))[keep]
>
>                      
>
>             A bit shorter:
>             bases <- as.integer(bounds)[keep]
>
>                
>
>                 # This is (ii) from above
>                 counts <- counts[keep]
>
>                 =====================
>
>                 I feel like the coverage interface, using the
>                 start/end args would help
>                 here, no? Wouldn't this have worked to make things a
>                 bit easier?
>
>                 cov <- coverage(gene.reads, start=start(bounds),
>                 end=end(bounds))
>
>                 How would you do that by using the width/shift args?
>                 Or wouldn't you?
>                 Maybe I'm not clear on what the original start/end
>                 args in ``coverage`` are
>                 meant to do?
>
>                 Anyway, I was just curious if there was a better way
>                 to do what I need to
>                 do.
>
>                 Thanks,
>
>                 -steve
>
>
>                 ===== Here's the function in all of its glory, use it
>                 if you like =====
>
>                 Here's a sample call:
>
>                 plotGenomeGraphReads(list(Sample1=reads.1,
>                 Sample2=reads.2),
>                   gene.id <http://gene.id>='ENSG00000116688',
>                 title="MFN2")
>
>                 Here's the function:
>
>                 plotGenomeGraphReads <- function(sample.reads,
>                 gene.model=NULL,
>                 gene.id <http://gene.id>=NULL,
>                                                biomart=NULL,
>                 cols=NULL, title=NULL) {
>                  # Plots the read counts over a particular gene model.
>                  #
>                  # You can pass your own gene.model as an IRange
>                 object, or use biomaRt to
>                  # retrieve the model using the Ensembl Gene ID
>                 gene.id <http://gene.id>
>                  #
>                  # Parameters
>                  # ----------
>                  # sample.reads : named list[IRanges] representing
>                 reads from separate
>                 samples
>                  #                over a given chromosome (the one
>                 your gene is on!)
>                  # gene.model   : IRanges object of start/end
>                 positions of the gene
>                  # mart         : biomaRt, if you want to query
>                 ensembl for the gene model
>                  # cols         : the colors you want to use for the
>                 reads in the
>                 respective
>                  #                samples
>                  if (is.null(gene.model) && is.null(gene.id
>                 <http://gene.id>)) {
>                   stop("Need to get the gene.model from somewhere.")
>                  }
>
>                  # Figure out what type of gene.model we're using and
>                 get appropriate
>                 bounds
>                  if (!is.null(gene.model)) {
>                   gm <- makeGeneModel(start=start(gene.model),
>                 end=end(gene.model))
>                   bounds <- range(gene.model)
>                  } else {
>                   if (is.null(biomart)) {
>                     biomart <- useMart('ensembl',
>                 dataset='hsapiens_gene_ensembl')
>                   }
>                   gm <- makeGene(id=gene.id <http://gene.id>,
>                 type='ensembl_gene_id', biomart=biomart)
>                   if (is.null(title)) {
>                     title <- gene.id <http://gene.id>
>                   }
>                   anno <- gm at ens
>                   bounds <- range(IRanges(start=anno$exon_chrom_start,
>                                           end=anno$exon_chrom_end))
>                  }
>
>                  sample.names <- names(sample.reads)
>
>                  if (is.null(cols)) {
>                   cols <- rainbow(length(sample.reads))
>                   names(cols) <- sample.names
>                  }
>
>                  if (is.null(title)) {
>                   title <- 'Gene Reads'
>                  }
>
>                  stopifnot(length(cols) == length(sample.reads))
>                  stopifnot(names(cols) == sample.names)
>
>                  # Calculate the coverage vectors over the gene model
>                 for each sample
>                  sample.coverage <- lapply(sample.names, function(name) {
>                   reads <- sample.reads[[name]]
>                   which.reads <- subjectHits(overlap(reads, bounds))
>                   gene.reads <- reads[which.reads]
>                   cov <- coverage(gene.reads)
>                   cov <- cov[start(bounds):length(cov)]
>
>                   counts <- numeric(width(bounds))
>                   counts[1:length(cov)] <- as.vector(cov)
>                   keep <- counts != 0
>                   bases <- (start(bounds):end(bounds))[keep]
>                   counts <- counts[keep]
>                   list(pos=bases, coverage=counts)
>                  })
>                  names(sample.coverage) <- sample.names
>
>                  # Make y-axis the same for each sample
>                  max.counts <- max(sapply(sample.coverage, function(cov)
>                 max(cov$coverage)))
>
>                  tracks <- lapply(sample.names, function(name) {
>                   dp <- DisplayPars(lwd=0.3, color=cols[[name]],
>                 ylim=c(0, max.counts))
>                   makeBaseTrack(base=sample.coverage[[name]]$pos,
>                                 value=sample.coverage[[name]]$coverage,
>                                 dp=dp)
>                  })
>                  names(tracks) <- names(sample.reads)
>
>                  title <- makeTitle(text=title)
>                  plot.me <- c(title, tracks, list(gm))
>
>                  gdPlot(plot.me, minBase=start(bounds),
>                 maxBase=end(bounds))
>                 }
>
>
>                 --
>                 Steve Lianoglou
>                 Graduate Student: Computational Systems Biology
>                  |  Memorial Sloan-Kettering Cancer Center
>                  |  Weill Medical College of Cornell University
>                 Contact Info: http://cbio.mskcc.org/~lianos/contact
>                 <http://cbio.mskcc.org/%7Elianos/contact>
>
>                 _______________________________________________
>                 Bioc-sig-sequencing mailing list
>                 Bioc-sig-sequencing at r-project.org
>                 <mailto:Bioc-sig-sequencing at r-project.org>
>                 https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>                      
>
>                
>
>
>
>
>          
>
>
>