[BioC] question about DeSeq

Mon Dec 27 15:44:45 CET 2010

Simon asked for me to put code behind my words, and to provide the
equivalent commands to the counting operations at

  http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

There are many apporaches to this kind of task; the following is one.

## Data

We'd like test cases representing Simon's figure. So create a
`GRangesList` of regions of interest (`subj`) and reads (`query`),
e.g., using a short helper function `rng` for brevity

    library(GenomicFeatures)

    rng <- function(s, w)
        ## ranges by 'start', 'width'; same chromosome, strand
    	GRanges(seq="chr1", IRanges(s, width=w), strand="+")

    subj <- GRangesList(A=rng(1000, 900),
                        B=rng(2000, 900),
                        C=rng(c(3000, 3600), c(500, 300)),
                    	D=rng(c(4000, 4600), c(500, 300)),
                    	E1=rng(5000, 700), E2=rng(5500, 400),
                    	F1=rng(6000, 700), F2=rng(6500, 400),
                    	G1=rng(7000, 700), G2=rng(7500, 400))

    query <- GRangesList(a=rng(1500, 100),
                         b=rng(2850, 100),
                         c=rng(3450, 200),
                     	 d=rng(c(4400, 4600), 100),
                     	 e=rng(5100, 100),
                     	 f=rng(6450, 100),
                     	 g=rng(7600, 100))

Normally one might create `subj` with `makeTranscriptDbFromUCSC`,
`import` (for GFF-like) or other `rtracklayer` functionality, or from
queries to `biomaRt` or using the `AnnotationDbi`, `*org`, and
`BSgenome*` Bioconductor annotation packages. `query` might normally
come from BAM files (e.g., `readGappedAlignements` in `GenomicRanges`
or `scanBam` in `Rsamtools`) or other aligned reads (using base `R`
functionality for pure text files, or perhaps `readAligned` from
`ShortRead`, if the alignments do not contain gaps). These operations
are described in the vignettes of the corresponding packages.

## union

To implement this, we count the number of times each read (`query`)
overlaps a region of interset (`subj`), then drop those reads with
more than one hit and count the number of times each subject overlaps
a (now unambiguous) read

    hitsPerQuery <- countOverlaps(query, subj)
    countOverlaps(subj, query[hitsPerQuery == 1])

## strict intersection

Simon's intersection requires a way to count queries that lie strictly
within regions of interest. This is not available with `GRangesList`
objects, so we create a helper function to coerce our data to a
`RangesList`, and to create an index that maps in the reverse
direction

    grl2rl <- function(x)
        ## coerce GRangesList to RangesList
        as(unlist(x, use.names=FALSE), "RangesList")
    maprl <- function(x)
        ## map to 'x' GRangesList indicies from RangesList
    	rep(seq_len(length(x)), elementLengths(x))

We will also need a helper function to take overlaps found on the
`RangesList` objects and map them to `GRangesList`, taking into
account the query (`qmap`) and subject (`smap`) maps. We also use the
opportunity to remove queries that hit multiple subjects

    recount <-
        function(olap, qmap, smap)
        ## re-map to GRangesList, removing multi-hit queries
    {
        ## re-map hits
        q_hits <- qmap[queryHits(olap)]
        s_hits <- smap[subjectHits(olap)]
        ## remove queries that align to multiple subjects
        hitmap0 <- lapply(split(s_hits, q_hits), unique)
        hitmap <- Filter(function(x) length(x) == 1, hitmap0)
        ## return as vector of counts
        tabulate(as.integer(hitmap), max(smap))
    }

We count with

    olaps <-
        findOverlaps(grl2rl(query), grl2rl(subj), type="within")
    lapply(olaps, recount, maprl(query), maprl(subj))

grl2rl loses strand information. This will be appropriate for some
RNAseq protocols; if not, then the two lines above can be repeated
for appropriate strand subsetes of query, subj.  Interesting exercises
would modify 'recount' to, say, use in the 'union' approach to divide
queries amongst the regions they overlap (rather than discarding) or
to return counts-per-exon rather than summarizing to the original gene
level.

##  Non-empty intersect

The goal is to count reads that overlap, by any number of nucleotides,
any single subject. To do this we write a helper to coerce a
`GRangesList` to disjoint regions that are represented exactly once,
i.e., discarding subject regions overlapping one another.

    grl2udrl <- function(x)
        ## GRangesList to unique disjoint RangesList
    {
	rl <- unlist(x)
        d <- disjoin(rl)
        ud <- d[countOverlaps(d, rl) == 1]
        as(ud, "RangesList")
    }
    mapudrl <- function(x)
        ## map to 'x'  GRangesList from u. d. RangesList
        subjectHits(findOverlaps(grl2udrl(x), x))

and then we count

    olaps <- findOverlaps(grl2rl(query), grl2udrl(subj))
    lapply(olaps, recount, maprl(query), mapudrl(subj))

Martin

----- Original Message -----
> On 12/22/2010 05:24 AM, Simon Anders wrote:
> > Hi
> >
> > On 12/22/2010 08:23 AM, vasu punj wrote:
> >> What kind of input DEseq can take Can it take Cufflinks out put or
> >> we
> >> have to use FKPM conversion.
> >
> > DESeq needs integer counts, i.e., just the raw number of reads that
> > align to a feature, without any normalization for sequencing depth
> > or
> > transcript length. Hence, the FPKM values reported by cufflinks are
> > not
> > suitable. the easiest is to use HTSeq-count to get a table of raw
> > counts.
> >
> > http://www-huber.embl.de/users/anders/HTSeq/
> > http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
> 
> It is as straight-forward and flexible to count overlaps with
> Bioconductor IRanges countOverlaps, with the added advantages of
> integration with annotation resources (e.g., GenomicFeatures
> TranscriptDb) and familiar (to those on the Bioconductor mailing list)
> programming environment.
> 
> Martin
> 
> >
> >   Simon
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> 
> Location: M1-B861
> Telephone: 206 667-2793
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor