[BioC] Problems Counting Reads with summarizeOverlaps

Sun Dec 8 15:55:53 CET 2013

Hi Sam,

Are you getting any error messages related to seqlevels when you count? 
If you have confirmed that annotation seqlevels match the bam files next 
look at the maximum possible overlap in a single bam file.

   bf <- singleBamFile
   reads <- readGAlignments(bf)  ## assuming single-end reads
   so <- summarizeOverlaps(tx, reads, mode=Union, inter.feature=FALSE)

'inter.feature=FALSE' with 'mode=Union' is countOverlaps with 
'type=ANY'. Reads that hit multiple features will be double counted in 
this case. The idea to to see if there are any common regions at all.

If there are still no counts, you could look more closely at a smaller 
chromosome region in 'reads' where you would expect counts. This visual 
inspection might give you some insight as to why there is no overlap.

Valerie

On 12/05/13 18:51, Sam McInturf wrote:
> Hello Bioconductors,
>     I am working on an Arabidopsis RNA seq set, and after executing my count
> reads script, I get a count table with no reads counted (ie
> sum(colSums(mat)) = 0).
>
> I mapped my reads using Tophat2 (without supplying a gtf/gff file) and used
> samtools to sort and index my accepted_hits.bam files.  I loaded these bam
> files into IGV (integrated Genome Browser) and the reads appear to have
> been mapped (additionally the align_summary.txt says I have good mapping).
>   Script and session info below.
>
> Essentially I use Rsamtools to load in my bamfiles, and then
> makeTranscriptsFromGFF to make a txdb object, transcriptsBy to get
> transcripts, and then summarizeOverlaps to count the reads.  My gff file is
> TAIR10.  I am relatively sure that the genome I mapped to was the TAIR10
> release, but I am not 100% on that.
>
> I have also used biomaRt to do this (commented out), but I had the same
> results.
>
> I am currently upgrading to R 3.0.2 and re-installing bioconductor (maybe
> it will change something?)
>
> Thanks for any thoughts!
>
> Script:
>
> gffdir <-  "/data/smcintur/Annotation/TAIR10_GFF3_genes.gff";
> library(ShortRead)
> library(GenomicFeatures)
> library(Rsamtools)
> #library(biomaRt)
> setwd("/data/smcintur/RNASeq/NMseq/newTophat/BamFiles/")
> bf <- list.files(pattern = ".bam$")
> bi <- list.files(pattern = "bam.bai$")
> bfl <- BamFileList(bf, bi)
>
> #mart <- makeTranscriptDbFromBiomart(biomart = "Public_TAIRV10", dataset=
> "tairv10")
> tx <- transcriptsBy(mart)
> gff <- makeTranscriptDbFromGFF(gffdir, format = "gff")
> gff
>
> cds <- cdsBy(gff)
> tx <- transcriptsBy(gff)
>
> olapTx <- summarizeOverlaps(tx, bfl)
> olapCds <- summarizeOverlaps(cds, bfl)
>
> txMat <- assays(olapTx)$counts
> cdsMat <- assays(olapCds)$counts
>
> write.table(txMat, file = "countMatTxGff.txt", sep = "\t")
> write.table(cdsMat, file = "countMatCdsGff.txt", sep = "\t")
>
> sessionInfo
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] BiocInstaller_1.12.0
>
> loaded via a namespace (and not attached):
>   [1] AnnotationDbi_1.24.0   Biobase_2.22.0         BiocGenerics_0.8.0
>   [4] biomaRt_2.18.0         Biostrings_2.30.1      bitops_1.0-6
>   [7] BSgenome_1.30.0        DBI_0.2-7              GenomicFeatures_1.14.2
> [10] GenomicRanges_1.14.3   IRanges_1.20.6         parallel_3.0.0
> [13] RCurl_1.95-4.1         Rsamtools_1.14.2       RSQLite_0.11.4
> [16] rtracklayer_1.22.0     stats4_3.0.0           tools_3.0.0
> [19] XML_3.98-1.1           XVector_0.2.0          zlibbioc_1.8.0
>