[BioC] Problems Counting Reads with summarizeOverlaps

Valerie Obenchain vobencha at fhcrc.org
Mon Dec 9 18:08:54 CET 2013


Reema,

Thank you for pointing out the error in the summarizeOverlaps vignette. 
That sentence was a leftover from a previous version and should have 
been removed. I have updated the vignette in both release (GenomicRanges 
1.14.4) and devel (GenomicAlignments 0.99.8).

When the 'reads' argument is a GAlignmentPairs object the stand-alone 
functions of Union(), IntersectionStrict() and IntersectionNotEmpty() 
count the data as paired-end. If 'reads' is a GAlignments object they 
are counted as single-end. summarizeOverlaps() reads the data into a 
GAlignments or GAlignmentPairs object and calls these functions 
internally; they are exactly the same.

Valerie


On 12/09/2013 02:46 AM, Reema Singh wrote:
> Hi all,
>
> Indeed summarizeOverlaps() works on paired end data. But different
> counting modes [Union, Intersection and IntersectionNotEmpty] they do
> not handle paired end reads. Here the reference
> [http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf]-
> Page 2 [under Counting Modes].
>
> Kind Regards
>
>
>
> On Sun, Dec 8, 2013 at 6:31 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
>     summarizeOverlaps() does count paired-end reads. Use option
>     singleEnd=FALSE. Details under 'singleEnd' in the 'Arguments'
>     section on the man page.
>
>     library(GenomicAlignments) ## devel
>     library(GenomicRanges) ## release
>     ?summarizeOverlaps
>
>     Valerie
>
>
>     On 12/08/13 09:42, Reema Singh wrote:
>
>         Hi Sam,
>
>         It depends on whether you are using single read or paired end reads
>         files. As far as I known summarizeOverlaps function only working
>         in case
>         of single read. Anyone please correct me if I am wrong.
>
>         Kind Regards
>
>
>         On Sun, Dec 8, 2013 at 8:25 PM, Valerie Obenchain
>         <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>         <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
>              Hi Sam,
>
>              Are you getting any error messages related to seqlevels
>         when you
>              count? If you have confirmed that annotation seqlevels
>         match the bam
>              files next look at the maximum possible overlap in a single
>         bam file.
>
>                 bf <- singleBamFile
>                 reads <- readGAlignments(bf)  ## assuming single-end reads
>                 so <- summarizeOverlaps(tx, reads, mode=Union,
>         inter.feature=FALSE)
>
>              'inter.feature=FALSE' with 'mode=Union' is countOverlaps with
>              'type=ANY'. Reads that hit multiple features will be double
>         counted
>              in this case. The idea to to see if there are any common
>         regions at all.
>
>              If there are still no counts, you could look more closely at a
>              smaller chromosome region in 'reads' where you would expect
>         counts.
>              This visual inspection might give you some insight as to
>         why there
>              is no overlap.
>
>
>              Valerie
>
>
>              On 12/05/13 18:51, Sam McInturf wrote:
>
>                  Hello Bioconductors,
>                       I am working on an Arabidopsis RNA seq set, and after
>                  executing my count
>                  reads script, I get a count table with no reads counted (ie
>                  sum(colSums(mat)) = 0).
>
>                  I mapped my reads using Tophat2 (without supplying a
>         gtf/gff
>                  file) and used
>                  samtools to sort and index my accepted_hits.bam files.
>           I loaded
>                  these bam
>                  files into IGV (integrated Genome Browser) and the
>         reads appear
>                  to have
>                  been mapped (additionally the align_summary.txt says I
>         have good
>                  mapping).
>                     Script and session info below.
>
>                  Essentially I use Rsamtools to load in my bamfiles, and
>         then
>                  makeTranscriptsFromGFF to make a txdb object,
>         transcriptsBy to get
>                  transcripts, and then summarizeOverlaps to count the
>         reads.  My
>                  gff file is
>                  TAIR10.  I am relatively sure that the genome I mapped
>         to was
>                  the TAIR10
>                  release, but I am not 100% on that.
>
>                  I have also used biomaRt to do this (commented out),
>         but I had
>                  the same
>                  results.
>
>                  I am currently upgrading to R 3.0.2 and re-installing
>                  bioconductor (maybe
>                  it will change something?)
>
>                  Thanks for any thoughts!
>
>                  Script:
>
>                  gffdir <-
>           "/data/smcintur/Annotation/____TAIR10_GFF3_genes.gff";
>                  library(ShortRead)
>                  library(GenomicFeatures)
>                  library(Rsamtools)
>                  #library(biomaRt)
>
>         setwd("/data/smcintur/RNASeq/____NMseq/newTophat/BamFiles/")
>
>                  bf <- list.files(pattern = ".bam$")
>                  bi <- list.files(pattern = "bam.bai$")
>                  bfl <- BamFileList(bf, bi)
>
>                  #mart <- makeTranscriptDbFromBiomart(____biomart =
>
>                  "Public_TAIRV10", dataset=
>                  "tairv10")
>                  tx <- transcriptsBy(mart)
>                  gff <- makeTranscriptDbFromGFF(____gffdir, format = "gff")
>
>                  gff
>
>                  cds <- cdsBy(gff)
>                  tx <- transcriptsBy(gff)
>
>                  olapTx <- summarizeOverlaps(tx, bfl)
>                  olapCds <- summarizeOverlaps(cds, bfl)
>
>                  txMat <- assays(olapTx)$counts
>                  cdsMat <- assays(olapCds)$counts
>
>                  write.table(txMat, file = "countMatTxGff.txt", sep = "\t")
>                  write.table(cdsMat, file = "countMatCdsGff.txt", sep =
>         "\t")
>
>                  sessionInfo
>                  R version 3.0.0 (2013-04-03)
>                  Platform: x86_64-unknown-linux-gnu (64-bit)
>
>                  locale:
>                     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>                     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>                     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>                     [7] LC_PAPER=C                 LC_NAME=C
>                     [9] LC_ADDRESS=C               LC_TELEPHONE=C
>                  [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>                  attached base packages:
>                  [1] stats     graphics  grDevices utils     datasets
>           methods   base
>
>                  other attached packages:
>                  [1] BiocInstaller_1.12.0
>
>                  loaded via a namespace (and not attached):
>                     [1] AnnotationDbi_1.24.0   Biobase_2.22.0
>                  BiocGenerics_0.8.0
>                     [4] biomaRt_2.18.0         Biostrings_2.30.1
>           bitops_1.0-6
>                     [7] BSgenome_1.30.0        DBI_0.2-7
>                    GenomicFeatures_1.14.2
>                  [10] GenomicRanges_1.14.3   IRanges_1.20.6
>         parallel_3.0.0
>                  [13] RCurl_1.95-4.1         Rsamtools_1.14.2
>         RSQLite_0.11.4
>                  [16] rtracklayer_1.22.0     stats4_3.0.0
>         tools_3.0.0
>                  [19] XML_3.98-1.1           XVector_0.2.0
>           zlibbioc_1.8.0
>
>
>              ___________________________________________________
>              Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>
>         https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>              <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>              Search the archives:
>         http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>
>
>         --
>         Reema Singh
>         PhD Scholar
>         Computational Biology and Bioinformatics
>         School of Computational and Integrative Sciences
>         Jawaharlal Nehru University
>         New Delhi-110067
>         INDIA
>
>
>
>
>
> --
> Reema Singh
> Postdoctoral Research Assistant
> College of Life Sciences
> University of Dundee,
> Dundee DD1 4HN, Scotland
> United Kingdom


-- 
Valerie Obenchain

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: vobencha at fhcrc.org
Phone:  (206) 667-3158
Fax:    (206) 667-1319



More information about the Bioconductor mailing list