[BioC] question about DeSeq
Martin Morgan
mtmorgan at fhcrc.org
Mon Dec 27 15:44:45 CET 2010
Simon asked for me to put code behind my words, and to provide the
equivalent commands to the counting operations at
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
There are many apporaches to this kind of task; the following is one.
## Data
We'd like test cases representing Simon's figure. So create a
`GRangesList` of regions of interest (`subj`) and reads (`query`),
e.g., using a short helper function `rng` for brevity
library(GenomicFeatures)
rng <- function(s, w)
## ranges by 'start', 'width'; same chromosome, strand
GRanges(seq="chr1", IRanges(s, width=w), strand="+")
subj <- GRangesList(A=rng(1000, 900),
B=rng(2000, 900),
C=rng(c(3000, 3600), c(500, 300)),
D=rng(c(4000, 4600), c(500, 300)),
E1=rng(5000, 700), E2=rng(5500, 400),
F1=rng(6000, 700), F2=rng(6500, 400),
G1=rng(7000, 700), G2=rng(7500, 400))
query <- GRangesList(a=rng(1500, 100),
b=rng(2850, 100),
c=rng(3450, 200),
d=rng(c(4400, 4600), 100),
e=rng(5100, 100),
f=rng(6450, 100),
g=rng(7600, 100))
Normally one might create `subj` with `makeTranscriptDbFromUCSC`,
`import` (for GFF-like) or other `rtracklayer` functionality, or from
queries to `biomaRt` or using the `AnnotationDbi`, `*org`, and
`BSgenome*` Bioconductor annotation packages. `query` might normally
come from BAM files (e.g., `readGappedAlignements` in `GenomicRanges`
or `scanBam` in `Rsamtools`) or other aligned reads (using base `R`
functionality for pure text files, or perhaps `readAligned` from
`ShortRead`, if the alignments do not contain gaps). These operations
are described in the vignettes of the corresponding packages.
## union
To implement this, we count the number of times each read (`query`)
overlaps a region of interset (`subj`), then drop those reads with
more than one hit and count the number of times each subject overlaps
a (now unambiguous) read
hitsPerQuery <- countOverlaps(query, subj)
countOverlaps(subj, query[hitsPerQuery == 1])
## strict intersection
Simon's intersection requires a way to count queries that lie strictly
within regions of interest. This is not available with `GRangesList`
objects, so we create a helper function to coerce our data to a
`RangesList`, and to create an index that maps in the reverse
direction
grl2rl <- function(x)
## coerce GRangesList to RangesList
as(unlist(x, use.names=FALSE), "RangesList")
maprl <- function(x)
## map to 'x' GRangesList indicies from RangesList
rep(seq_len(length(x)), elementLengths(x))
We will also need a helper function to take overlaps found on the
`RangesList` objects and map them to `GRangesList`, taking into
account the query (`qmap`) and subject (`smap`) maps. We also use the
opportunity to remove queries that hit multiple subjects
recount <-
function(olap, qmap, smap)
## re-map to GRangesList, removing multi-hit queries
{
## re-map hits
q_hits <- qmap[queryHits(olap)]
s_hits <- smap[subjectHits(olap)]
## remove queries that align to multiple subjects
hitmap0 <- lapply(split(s_hits, q_hits), unique)
hitmap <- Filter(function(x) length(x) == 1, hitmap0)
## return as vector of counts
tabulate(as.integer(hitmap), max(smap))
}
We count with
olaps <-
findOverlaps(grl2rl(query), grl2rl(subj), type="within")
lapply(olaps, recount, maprl(query), maprl(subj))
grl2rl loses strand information. This will be appropriate for some
RNAseq protocols; if not, then the two lines above can be repeated
for appropriate strand subsetes of query, subj. Interesting exercises
would modify 'recount' to, say, use in the 'union' approach to divide
queries amongst the regions they overlap (rather than discarding) or
to return counts-per-exon rather than summarizing to the original gene
level.
## Non-empty intersect
The goal is to count reads that overlap, by any number of nucleotides,
any single subject. To do this we write a helper to coerce a
`GRangesList` to disjoint regions that are represented exactly once,
i.e., discarding subject regions overlapping one another.
grl2udrl <- function(x)
## GRangesList to unique disjoint RangesList
{
rl <- unlist(x)
d <- disjoin(rl)
ud <- d[countOverlaps(d, rl) == 1]
as(ud, "RangesList")
}
mapudrl <- function(x)
## map to 'x' GRangesList from u. d. RangesList
subjectHits(findOverlaps(grl2udrl(x), x))
and then we count
olaps <- findOverlaps(grl2rl(query), grl2udrl(subj))
lapply(olaps, recount, maprl(query), mapudrl(subj))
Martin
----- Original Message -----
> On 12/22/2010 05:24 AM, Simon Anders wrote:
> > Hi
> >
> > On 12/22/2010 08:23 AM, vasu punj wrote:
> >> What kind of input DEseq can take Can it take Cufflinks out put or
> >> we
> >> have to use FKPM conversion.
> >
> > DESeq needs integer counts, i.e., just the raw number of reads that
> > align to a feature, without any normalization for sequencing depth
> > or
> > transcript length. Hence, the FPKM values reported by cufflinks are
> > not
> > suitable. the easiest is to use HTSeq-count to get a table of raw
> > counts.
> >
> > http://www-huber.embl.de/users/anders/HTSeq/
> > http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
>
> It is as straight-forward and flexible to count overlaps with
> Bioconductor IRanges countOverlaps, with the added advantages of
> integration with annotation resources (e.g., GenomicFeatures
> TranscriptDb) and familiar (to those on the Bioconductor mailing list)
> programming environment.
>
> Martin
>
> >
> > Simon
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list