[BioC] Problems Counting Reads with summarizeOverlaps

Valerie Obenchain vobencha at fhcrc.org
Sun Dec 22 01:18:51 CET 2013


On 12/21/13 05:14, Sam McInturf wrote:
> Valerie and Reema,
> I am using single end reads.
>
> I did as Valerie suggested
>
>   so <- summarizeOverlaps(tx, reads, mode="Union",type = "any")

This is not the example I used below. You did not include 
'inter.feature=FALSE'. 'type' is not an argument to summarizeOverlaps(). 
In other words, do not use 'type', use 'inter.feature'. Please see my 
previous explanation and the man page for summarizeOverlaps().


Valerie


> colSums(assays(so)$counts
> #185,940
>
> which feels silly, allow for double counts and return fewer counts than
> prior
>
> P.S. thank you for your prompt reply, my email "hid" it from me
> Best
>
> Sam
>
>
>
> On Mon, Dec 9, 2013 at 6:08 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
>     Reema,
>
>     Thank you for pointing out the error in the summarizeOverlaps
>     vignette. That sentence was a leftover from a previous version and
>     should have been removed. I have updated the vignette in both
>     release (GenomicRanges 1.14.4) and devel (GenomicAlignments 0.99.8).
>
>     When the 'reads' argument is a GAlignmentPairs object the
>     stand-alone functions of Union(), IntersectionStrict() and
>     IntersectionNotEmpty() count the data as paired-end. If 'reads' is a
>     GAlignments object they are counted as single-end.
>     summarizeOverlaps() reads the data into a GAlignments or
>     GAlignmentPairs object and calls these functions internally; they
>     are exactly the same.
>
>     Valerie
>
>
>
>     On 12/09/2013 02:46 AM, Reema Singh wrote:
>
>         Hi all,
>
>         Indeed summarizeOverlaps() works on paired end data. But different
>         counting modes [Union, Intersection and IntersectionNotEmpty]
>         they do
>         not handle paired end reads. Here the reference
>         [http://www.bioconductor.org/__packages/release/bioc/__vignettes/GenomicRanges/inst/__doc/summarizeOverlaps.pdf]-
>         <http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf%5D->
>         Page 2 [under Counting Modes].
>
>         Kind Regards
>
>
>
>         On Sun, Dec 8, 2013 at 6:31 PM, Valerie Obenchain
>         <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>         <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
>              summarizeOverlaps() does count paired-end reads. Use option
>              singleEnd=FALSE. Details under 'singleEnd' in the 'Arguments'
>              section on the man page.
>
>              library(GenomicAlignments) ## devel
>              library(GenomicRanges) ## release
>              ?summarizeOverlaps
>
>              Valerie
>
>
>              On 12/08/13 09:42, Reema Singh wrote:
>
>                  Hi Sam,
>
>                  It depends on whether you are using single read or
>         paired end reads
>                  files. As far as I known summarizeOverlaps function
>         only working
>                  in case
>                  of single read. Anyone please correct me if I am wrong.
>
>                  Kind Regards
>
>
>                  On Sun, Dec 8, 2013 at 8:25 PM, Valerie Obenchain
>                  <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>         <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>                  <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>         <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>>> wrote:
>
>                       Hi Sam,
>
>                       Are you getting any error messages related to
>         seqlevels
>                  when you
>                       count? If you have confirmed that annotation seqlevels
>                  match the bam
>                       files next look at the maximum possible overlap in
>         a single
>                  bam file.
>
>                          bf <- singleBamFile
>                          reads <- readGAlignments(bf)  ## assuming
>         single-end reads
>                          so <- summarizeOverlaps(tx, reads, mode=Union,
>                  inter.feature=FALSE)
>
>                       'inter.feature=FALSE' with 'mode=Union' is
>         countOverlaps with
>                       'type=ANY'. Reads that hit multiple features will
>         be double
>                  counted
>                       in this case. The idea to to see if there are any
>         common
>                  regions at all.
>
>                       If there are still no counts, you could look more
>         closely at a
>                       smaller chromosome region in 'reads' where you
>         would expect
>                  counts.
>                       This visual inspection might give you some insight
>         as to
>                  why there
>                       is no overlap.
>
>
>                       Valerie
>
>
>                       On 12/05/13 18:51, Sam McInturf wrote:
>
>                           Hello Bioconductors,
>                                I am working on an Arabidopsis RNA seq
>         set, and after
>                           executing my count
>                           reads script, I get a count table with no
>         reads counted (ie
>                           sum(colSums(mat)) = 0).
>
>                           I mapped my reads using Tophat2 (without
>         supplying a
>                  gtf/gff
>                           file) and used
>                           samtools to sort and index my
>         accepted_hits.bam files.
>                    I loaded
>                           these bam
>                           files into IGV (integrated Genome Browser) and the
>                  reads appear
>                           to have
>                           been mapped (additionally the
>         align_summary.txt says I
>                  have good
>                           mapping).
>                              Script and session info below.
>
>                           Essentially I use Rsamtools to load in my
>         bamfiles, and
>                  then
>                           makeTranscriptsFromGFF to make a txdb object,
>                  transcriptsBy to get
>                           transcripts, and then summarizeOverlaps to
>         count the
>                  reads.  My
>                           gff file is
>                           TAIR10.  I am relatively sure that the genome
>         I mapped
>                  to was
>                           the TAIR10
>                           release, but I am not 100% on that.
>
>                           I have also used biomaRt to do this (commented
>         out),
>                  but I had
>                           the same
>                           results.
>
>                           I am currently upgrading to R 3.0.2 and
>         re-installing
>                           bioconductor (maybe
>                           it will change something?)
>
>                           Thanks for any thoughts!
>
>                           Script:
>
>                           gffdir <-
>                    "/data/smcintur/Annotation/______TAIR10_GFF3_genes.gff";
>                           library(ShortRead)
>                           library(GenomicFeatures)
>                           library(Rsamtools)
>                           #library(biomaRt)
>
>
>         setwd("/data/smcintur/RNASeq/______NMseq/newTophat/BamFiles/")
>
>
>                           bf <- list.files(pattern = ".bam$")
>                           bi <- list.files(pattern = "bam.bai$")
>                           bfl <- BamFileList(bf, bi)
>
>                           #mart <-
>         makeTranscriptDbFromBiomart(______biomart =
>
>
>                           "Public_TAIRV10", dataset=
>                           "tairv10")
>                           tx <- transcriptsBy(mart)
>                           gff <- makeTranscriptDbFromGFF(______gffdir,
>         format = "gff")
>
>
>                           gff
>
>                           cds <- cdsBy(gff)
>                           tx <- transcriptsBy(gff)
>
>                           olapTx <- summarizeOverlaps(tx, bfl)
>                           olapCds <- summarizeOverlaps(cds, bfl)
>
>                           txMat <- assays(olapTx)$counts
>                           cdsMat <- assays(olapCds)$counts
>
>                           write.table(txMat, file = "countMatTxGff.txt",
>         sep = "\t")
>                           write.table(cdsMat, file =
>         "countMatCdsGff.txt", sep =
>                  "\t")
>
>                           sessionInfo
>                           R version 3.0.0 (2013-04-03)
>                           Platform: x86_64-unknown-linux-gnu (64-bit)
>
>                           locale:
>                              [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>                              [3] LC_TIME=en_US.UTF-8
>           LC_COLLATE=en_US.UTF-8
>                              [5] LC_MONETARY=en_US.UTF-8
>           LC_MESSAGES=en_US.UTF-8
>                              [7] LC_PAPER=C                 LC_NAME=C
>                              [9] LC_ADDRESS=C               LC_TELEPHONE=C
>                           [11] LC_MEASUREMENT=en_US.UTF-8
>         LC_IDENTIFICATION=C
>
>                           attached base packages:
>                           [1] stats     graphics  grDevices utils
>         datasets
>                    methods   base
>
>                           other attached packages:
>                           [1] BiocInstaller_1.12.0
>
>                           loaded via a namespace (and not attached):
>                              [1] AnnotationDbi_1.24.0   Biobase_2.22.0
>                           BiocGenerics_0.8.0
>                              [4] biomaRt_2.18.0         Biostrings_2.30.1
>                    bitops_1.0-6
>                              [7] BSgenome_1.30.0        DBI_0.2-7
>                             GenomicFeatures_1.14.2
>                           [10] GenomicRanges_1.14.3   IRanges_1.20.6
>                  parallel_3.0.0
>                           [13] RCurl_1.95-4.1         Rsamtools_1.14.2
>                  RSQLite_0.11.4
>                           [16] rtracklayer_1.22.0     stats4_3.0.0
>                  tools_3.0.0
>                           [19] XML_3.98-1.1           XVector_0.2.0
>                    zlibbioc_1.8.0
>
>
>                       _____________________________________________________
>
>                       Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>
>                  <mailto:Bioconductor at r-____project.org
>         <mailto:Bioconductor at r-__project.org>
>                  <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>>
>         https://stat.ethz.ch/mailman/______listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/____listinfo/bioconductor>
>                  <https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>>
>
>
>
>           <https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>                  <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>>>
>                       Search the archives:
>         http://news.gmane.org/gmane.______science.biology.informatics.______conductor
>         <http://news.gmane.org/gmane.____science.biology.informatics.____conductor>
>
>         <http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>>
>
>
>         <http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>>>
>
>
>
>
>
>                  --
>                  Reema Singh
>                  PhD Scholar
>                  Computational Biology and Bioinformatics
>                  School of Computational and Integrative Sciences
>                  Jawaharlal Nehru University
>                  New Delhi-110067
>                  INDIA
>
>
>
>
>
>         --
>         Reema Singh
>         Postdoctoral Research Assistant
>         College of Life Sciences
>         University of Dundee,
>         Dundee DD1 4HN, Scotland
>         United Kingdom
>
>
>
>     --
>     Valerie Obenchain
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B155
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>     Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
> --
> Sam McInturf



More information about the Bioconductor mailing list