[Bioc-sig-seq] who have RNA-seq pipeline code
Martin Morgan
mtmorgan at fhcrc.org
Sat Sep 3 00:29:38 CEST 2011
On 09/02/2011 10:19 AM, wang peter wrote:
> hi, each one
> who have RNA-seq pipeline code
> can you share them with me?
> thx
> shan gao
* Scenario
This is a simple work flow for single-end RNA-seq looking at
differential representation of known genes. The data is in a directory
dataDir, as implied by the commands below (<...> indicates info to be
provide by the user).
dataDir <- <...>
fastqDir <- file.path(dataDir, "fastq")
bamDir <- file.path(dataDir, "bam")
outputDir <- file.path(dataDir, "output")
The initial data (up to differential representation analysis) is
'big' and we have access to large-memory linux machines where the
multicore package is available.
library(multicore)
lapply <- mclapply
sapply <- function(...) simplify2array(mclapply(...))
The following uses 'devel' versions of R and Bioconductor packages;
these will become the next 'release' at the end of October, 2011.
* Sequencing
Illumina sequencing is done outside R (!). We have access to fastq
files, in the fastqDir directory.
* Quality assessment
Use ShortRead to produce a basic QA report. Down-sample fastq files if
they are big.
library(ShortRead)
fls <- list.files(fastqDir, "fastq$", full=TRUE)
names(fls) <- sub(".fastq", "", basename(fls))
## use FastqSampler if fastq files are large
qas <- lapply(seq_along(fls),
function(i, fls) qa(readFastq(fls[i]), names(fls)[i]),
fls)
qa <- do.call(rbind, qas)
save(qa, file=file.path(outputDir, "qa.rda")
browseURL(report(qa))
* Alignment
Alignment is done outside R. Output is a BAM file, one per
sample. Biostrings::matchPDict (in some circumstances) and Rsubread
might be used.
* Known genes
Known gene information can come from a variety of sources,
conveniently from UCSC or biomaRt using the GenomicFeatures
package. These can be saved to disk as sqlite data bases for future
use or to be shared with lab mates. There are also semi-annual
snapshots available, as used here.
library(TxDb.Scerevisiae.UCSC.sacCer2.ensGene)
txdb <- Scerevisiae_UCSC_sacCer2_ensGene_TxDb
gnModel <- exonsBy(txdb, "gene")
* Counting
There are different ways to count, some of which are implemented in
GenomicRanges::countGappedAlignments. I like a simple approach that
counts any kind of overlap between known genes as a 'hit', and that
discards a read hitting more than one gene. This will be
unsatisfactory in many situations.
bamFls <- list.files(bamDir, "bam$", full=TRUE)
names(bamFls) <- sub("\\..*", "", basename(bamFls))
counter <- function(fl, gnModel)
{
aln <- readGappedAlignments(fl)
strand(aln) <- "*" # for strand-blind sample prep protocol
hits <- countOverlaps(aln, gnModel)
counts <- countOverlaps(gnModel, aln[hits==1])
names(counts) <- names(gnModel)
counts
}
counts <- sapply(bamFls, counter, gnModel)
save(counts, file=file.path(outputDir, "counts.rda"))
* Differential representation
edgeR and DESeq are mature packages for the analysis of differential
representation, taking a sophisticated approach to the statistical
analysis; DEXSeq is a recent package to identify differential exon
use. An edgeR work flow is
library(edgeR)
## identify treatment groups
grp <- factor(<...>)
## create data structures
dge <- DGEList(counts, group=grp)
dge <- calcNormFactors(dge)
## filter uniformative genes
m <- 1e6 * t(t(dge$counts) / dge$samples$lib.size)
ridx <- rowSums(m > 1) >= 2
dge <- dge[ridx,]
## comparison between groups
design <- model.matrix( ~ grp )
dge <- estimateCommonDisp(dge, design)
fit <- glmFit(dge, design, dispersion=dge$common.dispersion)
lrTest <- glmLRT(dge, fit, coef=2)
tt <- topTags(lrTest, Inf)
save(tt, file=file.path(dataDir, "tt.rda"))
Two sanity checks along the way are that the treatment groups are
relatively distinct in an MDS plot, and that the differentially
representation tags really are.
plotMDS.DGEList(dge)
sapply(rownames(tt$table)[1:4],
function(x, grp) tapply(counts[x,], grp, mean),
grp)
* Annotation
Some annotation information is available from the org* model organism
packages; the biomaRt package is an alternative. Here we get the sgd
ids from the top table, and use these to add gene name annotations
library(org.Sc.sgd.db)
ttids <- rownames(tt$table)
sgd2gnname <- mget(ttids, org.Sc.sgdGENENAME, ifnotfound=NA)
ttAnno <- cbind(tt$table, genename=unlist(sgd2gnname))
A next step might involve gene set analyses, as in the goseq package.
Martin
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the Bioc-sig-sequencing
mailing list