[Bioc-sig-seq] IRanges::overlap interface, help writing idiomatic code?

Fri Sep 11 19:14:23 CEST 2009

Patrick Aboyoun wrote:
> Steve,
> I have revised your code to simplify it a bit. I added example data as 
> well to make it easier for others to run it. The shift and width 
> arguments to coverage are discussed in the coverage man page. If you are 
> unsure what the sift argument will do to a set of ranges, you can run 
> the shift operation on them (e.g. shift(bounds, shift = - start(bounds) 
> + 1)).

Hi Patrick --


> 
> 
> ==== Revised code ====
> 
> suppressMessages(library(IRanges))
> 
> sample.reads <- IRanges(start = c(11869, 11882, 12245, 12554, 12555, 
> 12557), width = 32)
> bounds <- IRanges(start=11700, end=12550)

How would this generalize, e.g., to

   library(GenomicFeatures)
   data(geneHuman)
   bounds <- ranges(transcripts(geneHuman))[["chr6"]]

or even

   bounds <- ranges(transcripts(geneHuman))

(it would be nice if transcripts filled in the 'names' of IRanges with 
the corresponding 'name' of geneHuman)

Martin

> # Get the the reads from my sample.reads
> # object that lie within my gene bounds
> gene.reads <- sample.reads[bounds]
> 
> # Build my coverage Rle vector
> cov <- coverage(gene.reads, shift = - start(bounds) + 1, width = 
> width(bounds))
> 
> # Create the non-Rle coverage vector that I will use down stream
> # to make a GenomeGraph::BaseTrack object
> counts <- as.numeric(cov)
> 
> # Build the base-position vector and filter out positions
> # with 0 counts from both vectors
> keep <- counts != 0
> 
> # This is (i) from above
> bases <- as.integer(bounds)[keep]
> 
> # This is (ii) from above
> counts <- counts[keep]
> 
> =====================
> 
> 
>  > sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-09-07 r49613)
> i386-apple-darwin9.8.0
> 
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base   
> other attached packages:
> [1] IRanges_1.3.72
> 
> loaded via a namespace (and not attached):
> [1] tools_2.10.0
> 
> 
> 
> 
> Steve Lianoglou wrote:
>> Hi Michael,
>>
>> Thanks for your suggestions so far ... being able to slice a Range
>> object w/ another Range object somehow fell off of my radar.
>>
>> -steve
>>
>> On Thu, Sep 10, 2009 at 6:25 PM, Michael Lawrence<mflawren at fhcrc.org> 
>> wrote:
>>  
>>> On Thu, Sep 10, 2009 at 1:55 PM, Steve Lianoglou
>>> <mailinglist.honeypot at gmail.com> wrote:
>>>    
>>>> Hi all,
>>>>
>>>> I'm finding it hard to shed some of my 10 thumbs when using the
>>>> IRanges/Rle classes, so can someone suggest the "correct" way to do 
>>>> what my
>>>> code does below?
>>>>
>>>> My goal is to use GenomeGraphs to plot a picture of my gene model, 
>>>> along
>>>> with the reads from 1 or more samples above it. I've actually got that
>>>> working, but I feel there's a more idiomatic way to do this, so if 
>>>> you have
>>>> any comments, I'd be interested in hearing them.
>>>>
>>>> I'll paste the function in its entirety at the end of the email. It 
>>>> works
>>>> as advertised, but perhaps not done in the most efficient way. For 
>>>> brevity's
>>>> sake, however, I've distilled the specific code for your feedback.
>>>>
>>>> Anyway, here's the gist:
>>>>
>>>> 1. sample.reads : An IRanges object of the reads aligned to a 
>>>> particular
>>>> chromosome from a particular rna-seq sample:
>>>>
>>>> R> > head(sample.reads)
>>>> IRanges instance:
>>>>    start   end width
>>>> [1] 11869 11900    32
>>>> [2] 11882 11913    32
>>>> [3] 12245 12276    32
>>>> [4] 12554 12585    32
>>>> [5] 12555 12586    32
>>>> [6] 12557 12588    32
>>>>
>>>> 2. bounds: An IRange object that indicates the start and end 
>>>> position of
>>>> my gene -- I want all the reads that land here:
>>>>
>>>> R> bounds
>>>> IRanges instance:
>>>>       start      end width
>>>> [1] 12040238 12073571 33334
>>>>
>>>> Goal:
>>>> I wan to build two vectors:
>>>> (i) A vector of positions on my chromosome (having length == to the 
>>>> length
>>>> of my gene)
>>>> (ii) A vector of counts over those positions
>>>>
>>>> Positions with 0 counts are removed from (i) and (ii)
>>>>
>>>> ==== Here's my code ====
>>>>
>>>> # Get the index of the reads from my sample.reads
>>>> # object that lie within my gene bounds
>>>> which.reads <- subjectHits(overlap(sample.reads, bounds))
>>>>
>>>> # Slice out these reads
>>>> gene.reads <- reads[which.reads]
>>>>
>>>>       
>>> The above two steps could be simply:
>>>    
>>>> gene.reads <- sample.reads[bounds]
>>>>       # Build my coverage Rle vector
>>>> cov <- coverage(gene.reads)
>>>>
>>>> # Only keep the part of the Rle that starts @ the start
>>>> # of my gene to the end (I feel like doing this is weird (?))
>>>> cov <- cov[start(bounds):length(cov)]
>>>>
>>>> # Create the non-Rle coverage vector that I will use down stream
>>>> # to make a GenomeGraph::BaseTrack object
>>>> counts <- numeric(width(bounds))
>>>>
>>>> # Populate this vector with the correct coverage counts
>>>> counts[1:length(cov)] <- as.vector(cov)
>>>>
>>>>       
>>> As you say, this looks like it could be done easier using arguments to
>>> coverage(). Maybe someone else could explain this?
>>>
>>>    
>>>> # Build the base-position vector and filter out positions
>>>> # with 0 counts from both vectors
>>>> keep <- counts != 0
>>>>
>>>> # This is (i) from above
>>>> bases <- (start(bounds):end(bounds))[keep]
>>>>
>>>>       
>>> A bit shorter:
>>> bases <- as.integer(bounds)[keep]
>>>
>>>    
>>>> # This is (ii) from above
>>>> counts <- counts[keep]
>>>>
>>>> =====================
>>>>
>>>> I feel like the coverage interface, using the start/end args would help
>>>> here, no? Wouldn't this have worked to make things a bit easier?
>>>>
>>>> cov <- coverage(gene.reads, start=start(bounds), end=end(bounds))
>>>>
>>>> How would you do that by using the width/shift args? Or wouldn't you?
>>>> Maybe I'm not clear on what the original start/end args in 
>>>> ``coverage`` are
>>>> meant to do?
>>>>
>>>> Anyway, I was just curious if there was a better way to do what I 
>>>> need to
>>>> do.
>>>>
>>>> Thanks,
>>>>
>>>> -steve
>>>>
>>>>
>>>> ===== Here's the function in all of its glory, use it if you like =====
>>>>
>>>> Here's a sample call:
>>>>
>>>> plotGenomeGraphReads(list(Sample1=reads.1, Sample2=reads.2),
>>>>    gene.id='ENSG00000116688', title="MFN2")
>>>>
>>>> Here's the function:
>>>>
>>>> plotGenomeGraphReads <- function(sample.reads, gene.model=NULL,
>>>> gene.id=NULL,
>>>>                                 biomart=NULL, cols=NULL, title=NULL) {
>>>>  # Plots the read counts over a particular gene model.
>>>>  #
>>>>  # You can pass your own gene.model as an IRange object, or use 
>>>> biomaRt to
>>>>  # retrieve the model using the Ensembl Gene ID gene.id
>>>>  #
>>>>  # Parameters
>>>>  # ----------
>>>>  # sample.reads : named list[IRanges] representing reads from separate
>>>> samples
>>>>  #                over a given chromosome (the one your gene is on!)
>>>>  # gene.model   : IRanges object of start/end positions of the gene
>>>>  # mart         : biomaRt, if you want to query ensembl for the gene 
>>>> model
>>>>  # cols         : the colors you want to use for the reads in the
>>>> respective
>>>>  #                samples
>>>>  if (is.null(gene.model) && is.null(gene.id)) {
>>>>    stop("Need to get the gene.model from somewhere.")
>>>>  }
>>>>
>>>>  # Figure out what type of gene.model we're using and get appropriate
>>>> bounds
>>>>  if (!is.null(gene.model)) {
>>>>    gm <- makeGeneModel(start=start(gene.model), end=end(gene.model))
>>>>    bounds <- range(gene.model)
>>>>  } else {
>>>>    if (is.null(biomart)) {
>>>>      biomart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
>>>>    }
>>>>    gm <- makeGene(id=gene.id, type='ensembl_gene_id', biomart=biomart)
>>>>    if (is.null(title)) {
>>>>      title <- gene.id
>>>>    }
>>>>    anno <- gm at ens
>>>>    bounds <- range(IRanges(start=anno$exon_chrom_start,
>>>>                            end=anno$exon_chrom_end))
>>>>  }
>>>>
>>>>  sample.names <- names(sample.reads)
>>>>
>>>>  if (is.null(cols)) {
>>>>    cols <- rainbow(length(sample.reads))
>>>>    names(cols) <- sample.names
>>>>  }
>>>>
>>>>  if (is.null(title)) {
>>>>    title <- 'Gene Reads'
>>>>  }
>>>>
>>>>  stopifnot(length(cols) == length(sample.reads))
>>>>  stopifnot(names(cols) == sample.names)
>>>>
>>>>  # Calculate the coverage vectors over the gene model for each sample
>>>>  sample.coverage <- lapply(sample.names, function(name) {
>>>>    reads <- sample.reads[[name]]
>>>>    which.reads <- subjectHits(overlap(reads, bounds))
>>>>    gene.reads <- reads[which.reads]
>>>>    cov <- coverage(gene.reads)
>>>>    cov <- cov[start(bounds):length(cov)]
>>>>
>>>>    counts <- numeric(width(bounds))
>>>>    counts[1:length(cov)] <- as.vector(cov)
>>>>    keep <- counts != 0
>>>>    bases <- (start(bounds):end(bounds))[keep]
>>>>    counts <- counts[keep]
>>>>    list(pos=bases, coverage=counts)
>>>>  })
>>>>  names(sample.coverage) <- sample.names
>>>>
>>>>  # Make y-axis the same for each sample
>>>>  max.counts <- max(sapply(sample.coverage, function(cov)
>>>> max(cov$coverage)))
>>>>
>>>>  tracks <- lapply(sample.names, function(name) {
>>>>    dp <- DisplayPars(lwd=0.3, color=cols[[name]], ylim=c(0, 
>>>> max.counts))
>>>>    makeBaseTrack(base=sample.coverage[[name]]$pos,
>>>>                  value=sample.coverage[[name]]$coverage,
>>>>                  dp=dp)
>>>>  })
>>>>  names(tracks) <- names(sample.reads)
>>>>
>>>>  title <- makeTitle(text=title)
>>>>  plot.me <- c(title, tracks, list(gm))
>>>>
>>>>  gdPlot(plot.me, minBase=start(bounds), maxBase=end(bounds))
>>>> }
>>>>
>>>>
>>>> -- 
>>>> Steve Lianoglou
>>>> Graduate Student: Computational Systems Biology
>>>>  |  Memorial Sloan-Kettering Cancer Center
>>>>  |  Weill Medical College of Cornell University
>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>       
>>>     
>>
>>
>>
>>
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793