[Bioc-sig-seq] IRanges::overlap interface, help writing idiomatic code?
Patrick Aboyoun
paboyoun at fhcrc.org
Fri Sep 11 19:09:25 CEST 2009
Michael,
I was recrafting Steve's example, in which he created a numeric vector
for the non-zero coverage values. He might need a numeric vector for
GenomeGraphs.
Patrick
Michael Lawrence wrote:
>
>
> On Fri, Sep 11, 2009 at 9:51 AM, Patrick Aboyoun <paboyoun at fhcrc.org
> <mailto:paboyoun at fhcrc.org>> wrote:
>
> Steve,
> I have revised your code to simplify it a bit. I added example
> data as well to make it easier for others to run it. The shift and
> width arguments to coverage are discussed in the coverage man
> page. If you are unsure what the sift argument will do to a set of
> ranges, you can run the shift operation on them (e.g.
> shift(bounds, shift = - start(bounds) + 1)).
>
>
> ==== Revised code ====
>
> suppressMessages(library(IRanges))
>
> sample.reads <- IRanges(start = c(11869, 11882, 12245, 12554,
> 12555, 12557), width = 32)
> bounds <- IRanges(start=11700, end=12550)
>
> # Get the the reads from my sample.reads
>
> # object that lie within my gene bounds
> gene.reads <- sample.reads[bounds]
>
> # Build my coverage Rle vector
> cov <- coverage(gene.reads, shift = - start(bounds) + 1, width =
> width(bounds))
>
>
> # Create the non-Rle coverage vector that I will use down stream
> # to make a GenomeGraph::BaseTrack object
> counts <- as.numeric(cov)
>
>
> Would as.integer() be sufficient here?
>
>
>
> # Build the base-position vector and filter out positions
> # with 0 counts from both vectors
> keep <- counts != 0
>
> # This is (i) from above
> bases <- as.integer(bounds)[keep]
>
> # This is (ii) from above
> counts <- counts[keep]
>
> =====================
>
>
> > sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-09-07 r49613)
> i386-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods
> base
> other attached packages:
> [1] IRanges_1.3.72
>
> loaded via a namespace (and not attached):
> [1] tools_2.10.0
>
>
>
>
>
> Steve Lianoglou wrote:
>
> Hi Michael,
>
> Thanks for your suggestions so far ... being able to slice a Range
> object w/ another Range object somehow fell off of my radar.
>
> -steve
>
> On Thu, Sep 10, 2009 at 6:25 PM, Michael
> Lawrence<mflawren at fhcrc.org <mailto:mflawren at fhcrc.org>> wrote:
>
>
> On Thu, Sep 10, 2009 at 1:55 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com
> <mailto:mailinglist.honeypot at gmail.com>> wrote:
>
>
> Hi all,
>
> I'm finding it hard to shed some of my 10 thumbs when
> using the
> IRanges/Rle classes, so can someone suggest the
> "correct" way to do what my
> code does below?
>
> My goal is to use GenomeGraphs to plot a picture of my
> gene model, along
> with the reads from 1 or more samples above it. I've
> actually got that
> working, but I feel there's a more idiomatic way to do
> this, so if you have
> any comments, I'd be interested in hearing them.
>
> I'll paste the function in its entirety at the end of
> the email. It works
> as advertised, but perhaps not done in the most
> efficient way. For brevity's
> sake, however, I've distilled the specific code for
> your feedback.
>
> Anyway, here's the gist:
>
> 1. sample.reads : An IRanges object of the reads
> aligned to a particular
> chromosome from a particular rna-seq sample:
>
> R> > head(sample.reads)
> IRanges instance:
> start end width
> [1] 11869 11900 32
> [2] 11882 11913 32
> [3] 12245 12276 32
> [4] 12554 12585 32
> [5] 12555 12586 32
> [6] 12557 12588 32
>
> 2. bounds: An IRange object that indicates the start
> and end position of
> my gene -- I want all the reads that land here:
>
> R> bounds
> IRanges instance:
> start end width
> [1] 12040238 12073571 33334
>
> Goal:
> I wan to build two vectors:
> (i) A vector of positions on my chromosome (having
> length == to the length
> of my gene)
> (ii) A vector of counts over those positions
>
> Positions with 0 counts are removed from (i) and (ii)
>
> ==== Here's my code ====
>
> # Get the index of the reads from my sample.reads
> # object that lie within my gene bounds
> which.reads <- subjectHits(overlap(sample.reads, bounds))
>
> # Slice out these reads
> gene.reads <- reads[which.reads]
>
>
>
> The above two steps could be simply:
>
>
> gene.reads <- sample.reads[bounds]
> # Build my coverage Rle vector
> cov <- coverage(gene.reads)
>
> # Only keep the part of the Rle that starts @ the start
> # of my gene to the end (I feel like doing this is
> weird (?))
> cov <- cov[start(bounds):length(cov)]
>
> # Create the non-Rle coverage vector that I will use
> down stream
> # to make a GenomeGraph::BaseTrack object
> counts <- numeric(width(bounds))
>
> # Populate this vector with the correct coverage counts
> counts[1:length(cov)] <- as.vector(cov)
>
>
>
> As you say, this looks like it could be done easier using
> arguments to
> coverage(). Maybe someone else could explain this?
>
>
>
> # Build the base-position vector and filter out positions
> # with 0 counts from both vectors
> keep <- counts != 0
>
> # This is (i) from above
> bases <- (start(bounds):end(bounds))[keep]
>
>
>
> A bit shorter:
> bases <- as.integer(bounds)[keep]
>
>
>
> # This is (ii) from above
> counts <- counts[keep]
>
> =====================
>
> I feel like the coverage interface, using the
> start/end args would help
> here, no? Wouldn't this have worked to make things a
> bit easier?
>
> cov <- coverage(gene.reads, start=start(bounds),
> end=end(bounds))
>
> How would you do that by using the width/shift args?
> Or wouldn't you?
> Maybe I'm not clear on what the original start/end
> args in ``coverage`` are
> meant to do?
>
> Anyway, I was just curious if there was a better way
> to do what I need to
> do.
>
> Thanks,
>
> -steve
>
>
> ===== Here's the function in all of its glory, use it
> if you like =====
>
> Here's a sample call:
>
> plotGenomeGraphReads(list(Sample1=reads.1,
> Sample2=reads.2),
> gene.id <http://gene.id>='ENSG00000116688',
> title="MFN2")
>
> Here's the function:
>
> plotGenomeGraphReads <- function(sample.reads,
> gene.model=NULL,
> gene.id <http://gene.id>=NULL,
> biomart=NULL,
> cols=NULL, title=NULL) {
> # Plots the read counts over a particular gene model.
> #
> # You can pass your own gene.model as an IRange
> object, or use biomaRt to
> # retrieve the model using the Ensembl Gene ID
> gene.id <http://gene.id>
> #
> # Parameters
> # ----------
> # sample.reads : named list[IRanges] representing
> reads from separate
> samples
> # over a given chromosome (the one
> your gene is on!)
> # gene.model : IRanges object of start/end
> positions of the gene
> # mart : biomaRt, if you want to query
> ensembl for the gene model
> # cols : the colors you want to use for the
> reads in the
> respective
> # samples
> if (is.null(gene.model) && is.null(gene.id
> <http://gene.id>)) {
> stop("Need to get the gene.model from somewhere.")
> }
>
> # Figure out what type of gene.model we're using and
> get appropriate
> bounds
> if (!is.null(gene.model)) {
> gm <- makeGeneModel(start=start(gene.model),
> end=end(gene.model))
> bounds <- range(gene.model)
> } else {
> if (is.null(biomart)) {
> biomart <- useMart('ensembl',
> dataset='hsapiens_gene_ensembl')
> }
> gm <- makeGene(id=gene.id <http://gene.id>,
> type='ensembl_gene_id', biomart=biomart)
> if (is.null(title)) {
> title <- gene.id <http://gene.id>
> }
> anno <- gm at ens
> bounds <- range(IRanges(start=anno$exon_chrom_start,
> end=anno$exon_chrom_end))
> }
>
> sample.names <- names(sample.reads)
>
> if (is.null(cols)) {
> cols <- rainbow(length(sample.reads))
> names(cols) <- sample.names
> }
>
> if (is.null(title)) {
> title <- 'Gene Reads'
> }
>
> stopifnot(length(cols) == length(sample.reads))
> stopifnot(names(cols) == sample.names)
>
> # Calculate the coverage vectors over the gene model
> for each sample
> sample.coverage <- lapply(sample.names, function(name) {
> reads <- sample.reads[[name]]
> which.reads <- subjectHits(overlap(reads, bounds))
> gene.reads <- reads[which.reads]
> cov <- coverage(gene.reads)
> cov <- cov[start(bounds):length(cov)]
>
> counts <- numeric(width(bounds))
> counts[1:length(cov)] <- as.vector(cov)
> keep <- counts != 0
> bases <- (start(bounds):end(bounds))[keep]
> counts <- counts[keep]
> list(pos=bases, coverage=counts)
> })
> names(sample.coverage) <- sample.names
>
> # Make y-axis the same for each sample
> max.counts <- max(sapply(sample.coverage, function(cov)
> max(cov$coverage)))
>
> tracks <- lapply(sample.names, function(name) {
> dp <- DisplayPars(lwd=0.3, color=cols[[name]],
> ylim=c(0, max.counts))
> makeBaseTrack(base=sample.coverage[[name]]$pos,
> value=sample.coverage[[name]]$coverage,
> dp=dp)
> })
> names(tracks) <- names(sample.reads)
>
> title <- makeTitle(text=title)
> plot.me <- c(title, tracks, list(gm))
>
> gdPlot(plot.me, minBase=start(bounds),
> maxBase=end(bounds))
> }
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
> <http://cbio.mskcc.org/%7Elianos/contact>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> <mailto:Bioc-sig-sequencing at r-project.org>
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
>
>
>
>
>
>
>
>
>
More information about the Bioc-sig-sequencing
mailing list