[Bioc-sig-seq] who have RNA-seq pipeline code

Martin Morgan mtmorgan at fhcrc.org
Sat Sep 3 00:29:38 CEST 2011

On 09/02/2011 10:19 AM, wang peter wrote:
> hi, each one
> who have RNA-seq pipeline code
> can you share them with me?
> thx
> shan gao

* Scenario

This is a simple work flow for single-end RNA-seq looking at
differential representation of known genes. The data is in a directory
dataDir, as implied by the commands below (<...> indicates info to be
provide by the user).

   dataDir <- <...>
   fastqDir <- file.path(dataDir, "fastq")
   bamDir <- file.path(dataDir, "bam")
   outputDir <- file.path(dataDir, "output")

The initial data (up to differential representation analysis) is
'big' and we have access to large-memory linux machines where the
multicore package is available.

   lapply <- mclapply
   sapply <- function(...) simplify2array(mclapply(...))

The following uses 'devel' versions of R and Bioconductor packages;
these will become the next 'release' at the end of October, 2011.

* Sequencing

Illumina sequencing is done outside R (!). We have access to fastq
files, in the fastqDir directory.

* Quality assessment

Use ShortRead to produce a basic QA report. Down-sample fastq files if
they are big.

   fls <- list.files(fastqDir, "fastq$", full=TRUE)
   names(fls) <- sub(".fastq", "", basename(fls))
   ## use FastqSampler if fastq files are large
   qas <- lapply(seq_along(fls),
                 function(i, fls) qa(readFastq(fls[i]), names(fls)[i]),
   qa <- do.call(rbind, qas)
   save(qa, file=file.path(outputDir, "qa.rda")

* Alignment

Alignment is done outside R. Output is a BAM file, one per
sample. Biostrings::matchPDict (in some circumstances) and Rsubread
might be used.

* Known genes

Known gene information can come from a variety of sources,
conveniently from UCSC or biomaRt using the GenomicFeatures
package. These can be saved to disk as sqlite data bases for future
use or to be shared with lab mates. There are also semi-annual
snapshots available, as used here.

   txdb <- Scerevisiae_UCSC_sacCer2_ensGene_TxDb
   gnModel <- exonsBy(txdb, "gene")

* Counting

There are different ways to count, some of which are implemented in
GenomicRanges::countGappedAlignments. I like a simple approach that
counts any kind of overlap between known genes as a 'hit', and that
discards a read hitting more than one gene. This will be
unsatisfactory in many situations.

   bamFls <- list.files(bamDir, "bam$", full=TRUE)
   names(bamFls) <- sub("\\..*", "", basename(bamFls))
   counter <- function(fl, gnModel)
       aln <- readGappedAlignments(fl)
       strand(aln) <- "*" # for strand-blind sample prep protocol
       hits <- countOverlaps(aln, gnModel)
       counts <- countOverlaps(gnModel, aln[hits==1])
       names(counts) <- names(gnModel)
   counts <- sapply(bamFls, counter, gnModel)
   save(counts, file=file.path(outputDir, "counts.rda"))

* Differential representation

edgeR and DESeq are mature packages for the analysis of differential
representation, taking a sophisticated approach to the statistical
analysis; DEXSeq is a recent package to identify differential exon
use. An edgeR work flow is


   ## identify treatment groups
   grp <- factor(<...>)

   ## create data structures
   dge <- DGEList(counts, group=grp)
   dge <- calcNormFactors(dge)

   ## filter uniformative genes
   m <- 1e6 * t(t(dge$counts) / dge$samples$lib.size)
   ridx <- rowSums(m > 1) >= 2
   dge <- dge[ridx,]

   ## comparison between groups
   design <- model.matrix( ~ grp )
   dge <- estimateCommonDisp(dge, design)
   fit <- glmFit(dge, design, dispersion=dge$common.dispersion)
   lrTest <- glmLRT(dge, fit, coef=2)
   tt <- topTags(lrTest, Inf)
   save(tt, file=file.path(dataDir, "tt.rda"))

Two sanity checks along the way are that the treatment groups are
relatively distinct in an MDS plot, and that the differentially
representation tags really are.

          function(x, grp) tapply(counts[x,], grp, mean),

* Annotation

Some annotation information is available from the org* model organism
packages; the biomaRt package is an alternative. Here we get the sgd
ids from the top table, and use these to add gene name annotations


   ttids <- rownames(tt$table)
   sgd2gnname <- mget(ttids, org.Sc.sgdGENENAME, ifnotfound=NA)
   ttAnno <- cbind(tt$table, genename=unlist(sgd2gnname))

A next step might involve gene set analyses, as in the goseq package.


> 	[[alternative HTML version deleted]]
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

More information about the Bioc-sig-sequencing mailing list