[BioC] easyRNASeq: Number of total counts

Pankaj Agarwal [guest] guest at bioconductor.org
Fri Apr 11 21:55:18 CEST 2014


Hi,
I am using easyRNASeq to estimate the counts from an RNA-seq alignment to hg19/GRCh37 done with bowtie2.  I would like to get the counts "per gene".  The code runs successfully and I get an output table, but the number of records in the output table is ~57,000:
cat count.tsv | wc -l
57774

I am wondering why the number of records is so much greater than the total number of genes (~30,000).
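
One way to sanity-check this would be to count the distinct gene_id values in the annotation file itself and compare that number with the row count of the table. The following is only a minimal sketch: it assumes the GTF stores a gene_id "..." attribute in its attribute column, and the function name count_gene_ids is hypothetical (it is not part of easyRNASeq or any other tool).

```shell
# Hypothetical helper: count distinct gene_id values in a GTF file.
# Assumes each feature line carries an attribute of the form gene_id "XYZ";
count_gene_ids() {
  # Drop header/comment lines, extract the gene_id attribute,
  # keep one copy of each value, and count them.
  grep -v '^#' "$1" \
    | grep -o 'gene_id "[^"]*"' \
    | sort -u \
    | wc -l
}
```

For example, count_gene_ids /general/NGS/index/human/Homo_sapiens.GRCh37.74.gtf would print the number of distinct gene IDs in the annotation used here. If that number is close to the number of rows in count.tsv, the table size would simply reflect what the annotation defines as a gene.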

I am getting some warning messages, which may be related to this, especially #1 and #4:

Warning messages: 
1: Consider using 'synthetic transcripts' as described in the section 7.1 of the vignette instead of the count=genes,summarization=geneModels deprecated paradigm. 
2: In easyRNASeq(filesDirectory = getwd(), filenames = c("BRPC13-1118_L1.D710_501.sorted.bam",  : 
  You enforce UCSC chromosome conventions, however the provided chromosome size list is not compliant. Correcting it. 
3: In easyRNASeq(filesDirectory = getwd(), filenames = c("BRPC13-1118_L1.D710_501.sorted.bam",  : 
  You enforce UCSC chromosome conventions, however the provided annotation is not compliant. Correcting it. 
4: In easyRNASeq(filesDirectory = getwd(), filenames = c("BRPC13-1118_L1.D710_501.sorted.bam",  : 
  There are 18950 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? 
5: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter,  : 
  You enforce UCSC chromosome conventions, however the provided alignments are not compliant. Correcting it. 
6: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter,  : 
  You enforce UCSC chromosome conventions, however the provided alignments are not compliant. Correcting it. 
7: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter,  : 
  You enforce UCSC chromosome conventions, however the provided alignments are not compliant. Correcting it. 
8: In fetchCoverage(rnaSeq, format = format, filename = filename, filter = filter,  : 
  You enforce UCSC chromosome conventions, however the provided alignments are not compliant. Correcting it.

Code I am running for estimating the counts:

> count.table <- easyRNASeq(filesDirectory=getwd(), 
+ filenames=c("A.sorted.bam","B.sorted.bam","C.sorted.bam","D.sorted.bam"), 
+ organism="Hsapiens", 
+ annotationMethod="gtf", 
+ annotationFile="/general/NGS/index/human/Homo_sapiens.GRCh37.74.gtf", 
+ count="genes", 
+ summarization="geneModels") 

 -- output of sessionInfo(): 

> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
> 

> packageDescription("easyRNASeq")
Package: easyRNASeq
Version: 1.8.7
Date: 2014-03-25
Type: Package
Title: Count summarization and normalization for RNA-Seq data.
Author: Nicolas Delhomme, Ismael Padioleau, Bastian Schiffthaler
Maintainer: Nicolas Delhomme <delhomme at embl.de>
Description: Calculates the coverage of high-throughput short-reads
        against a genome of reference and summarizes it per feature of
        interest (e.g. exon, gene, transcript). The data can be
        normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.
Depends: genomeIntervals (>= 1.18.0), Biobase (>= 2.22.0), biomaRt (>=
        2.18.0), edgeR (>= 3.4.0), Biostrings (>= 2.30.0), DESeq (>=
        1.14.0), GenomicRanges (>= 1.14.3), IRanges (>= 1.20.5),
        Rsamtools (>= 1.14.1), ShortRead (>= 1.20.0)
Imports: graphics, methods, parallel, utils, BiocGenerics (>= 0.8.0),
        LSD (>= 2.5)
Suggests: BSgenome (>= 1.30.0), BSgenome.Dmelanogaster.UCSC.dm3 (>=
        1.3.19), GenomicFeatures (>= 1.14.0), RnaSeqTutorial (>=
        0.0.13), BiocStyle (>= 1.0.0)
License: Artistic-2.0
LazyLoad: yes
biocViews: GeneExpression, RNAseq, Genetics, Preprocessing
Packaged: 2014-03-26 04:53:07 UTC; biocbuild
Built: R 3.0.3; ; 2014-03-31 20:30:18 UTC; unix


--
Sent via the guest posting facility at bioconductor.org.


