[Bioc-sig-seq] IRanges::overlap interface, help writing idiomatic code?

Thu Sep 10 22:55:38 CEST 2009

Hi all,

I'm finding it hard to shed some of my 10 thumbs when using the  
IRanges/Rle classes, so can someone suggest the "correct" way to do  
what my code does below?

My goal is to use GenomeGraphs to plot a picture of my gene model,  
along with the reads from 1 or more samples above it. I've actually  
got that working, but I feel there's a more idiomatic way to do this,  
so if you have any comments, I'd be interested in hearing them.

I'll paste the function in its entirety at the end of the email. It  
works as advertised, but perhaps not done in the most efficient way.  
For brevity's sake, however, I've distilled the specific code for your  
feedback.

Anyway, here's the gist:

1. sample.reads : An IRanges object of the reads aligned to a  
particular chromosome from a particular rna-seq sample:

R> > head(sample.reads)
IRanges instance:
     start   end width
[1] 11869 11900    32
[2] 11882 11913    32
[3] 12245 12276    32
[4] 12554 12585    32
[5] 12555 12586    32
[6] 12557 12588    32

2. bounds: An IRange object that indicates the start and end position  
of my gene -- I want all the reads that land here:

R> bounds
IRanges instance:
        start      end width
[1] 12040238 12073571 33334

Goal:
I wan to build two vectors:
(i) A vector of positions on my chromosome (having length == to the  
length of my gene)
(ii) A vector of counts over those positions

Positions with 0 counts are removed from (i) and (ii)

==== Here's my code ====

# Get the index of the reads from my sample.reads
# object that lie within my gene bounds
which.reads <- subjectHits(overlap(sample.reads, bounds))

# Slice out these reads
gene.reads <- reads[which.reads]

# Build my coverage Rle vector
cov <- coverage(gene.reads)

# Only keep the part of the Rle that starts @ the start
# of my gene to the end (I feel like doing this is weird (?))
cov <- cov[start(bounds):length(cov)]

# Create the non-Rle coverage vector that I will use down stream
# to make a GenomeGraph::BaseTrack object
counts <- numeric(width(bounds))

# Populate this vector with the correct coverage counts
counts[1:length(cov)] <- as.vector(cov)

# Build the base-position vector and filter out positions
# with 0 counts from both vectors
keep <- counts != 0

# This is (i) from above
bases <- (start(bounds):end(bounds))[keep]

# This is (ii) from above
counts <- counts[keep]

=====================

I feel like the coverage interface, using the start/end args would  
help here, no? Wouldn't this have worked to make things a bit easier?

cov <- coverage(gene.reads, start=start(bounds), end=end(bounds))

How would you do that by using the width/shift args? Or wouldn't you?  
Maybe I'm not clear on what the original start/end args in  
``coverage`` are meant to do?

Anyway, I was just curious if there was a better way to do what I need  
to do.

Thanks,

-steve

===== Here's the function in all of its glory, use it if you like =====

Here's a sample call:

plotGenomeGraphReads(list(Sample1=reads.1, Sample2=reads.2),
     gene.id='ENSG00000116688', title="MFN2")

Here's the function:

plotGenomeGraphReads <- function(sample.reads, gene.model=NULL,  
gene.id=NULL,
                                  biomart=NULL, cols=NULL, title=NULL) {
   # Plots the read counts over a particular gene model.
   #
   # You can pass your own gene.model as an IRange object, or use  
biomaRt to
   # retrieve the model using the Ensembl Gene ID gene.id
   #
   # Parameters
   # ----------
   # sample.reads : named list[IRanges] representing reads from  
separate samples
   #                over a given chromosome (the one your gene is on!)
   # gene.model   : IRanges object of start/end positions of the gene
   # mart         : biomaRt, if you want to query ensembl for the gene  
model
   # cols         : the colors you want to use for the reads in the  
respective
   #                samples
   if (is.null(gene.model) && is.null(gene.id)) {
     stop("Need to get the gene.model from somewhere.")
   }

   # Figure out what type of gene.model we're using and get  
appropriate bounds
   if (!is.null(gene.model)) {
     gm <- makeGeneModel(start=start(gene.model), end=end(gene.model))
     bounds <- range(gene.model)
   } else {
     if (is.null(biomart)) {
       biomart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
     }
     gm <- makeGene(id=gene.id, type='ensembl_gene_id', biomart=biomart)
     if (is.null(title)) {
       title <- gene.id
     }
     anno <- gm at ens
     bounds <- range(IRanges(start=anno$exon_chrom_start,
                             end=anno$exon_chrom_end))
   }

   sample.names <- names(sample.reads)

   if (is.null(cols)) {
     cols <- rainbow(length(sample.reads))
     names(cols) <- sample.names
   }

   if (is.null(title)) {
     title <- 'Gene Reads'
   }

   stopifnot(length(cols) == length(sample.reads))
   stopifnot(names(cols) == sample.names)

   # Calculate the coverage vectors over the gene model for each sample
   sample.coverage <- lapply(sample.names, function(name) {
     reads <- sample.reads[[name]]
     which.reads <- subjectHits(overlap(reads, bounds))
     gene.reads <- reads[which.reads]
     cov <- coverage(gene.reads)
     cov <- cov[start(bounds):length(cov)]

     counts <- numeric(width(bounds))
     counts[1:length(cov)] <- as.vector(cov)
     keep <- counts != 0
     bases <- (start(bounds):end(bounds))[keep]
     counts <- counts[keep]
     list(pos=bases, coverage=counts)
   })
   names(sample.coverage) <- sample.names

   # Make y-axis the same for each sample
   max.counts <- max(sapply(sample.coverage, function(cov) max(cov 
$coverage)))

   tracks <- lapply(sample.names, function(name) {
     dp <- DisplayPars(lwd=0.3, color=cols[[name]], ylim=c(0,  
max.counts))
     makeBaseTrack(base=sample.coverage[[name]]$pos,
                   value=sample.coverage[[name]]$coverage,
                   dp=dp)
   })
   names(tracks) <- names(sample.reads)

   title <- makeTitle(text=title)
   plot.me <- c(title, tracks, list(gm))

   gdPlot(plot.me, minBase=start(bounds), maxBase=end(bounds))
}

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact