[BioC] Problems Counting Reads with summarizeOverlaps
Valerie Obenchain
vobencha at fhcrc.org
Mon Dec 9 18:08:54 CET 2013
Reema,
Thank you for pointing out the error in the summarizeOverlaps vignette.
That sentence was a leftover from a previous version and should have
been removed. I have updated the vignette in both release (GenomicRanges
1.14.4) and devel (GenomicAlignments 0.99.8).
When the 'reads' argument is a GAlignmentPairs object the stand-alone
functions of Union(), IntersectionStrict() and IntersectionNotEmpty()
count the data as paired-end. If 'reads' is a GAlignments object they
are counted as single-end. summarizeOverlaps() reads the data into a
GAlignments or GAlignmentPairs object and calls these functions
internally; they are exactly the same.
Valerie
On 12/09/2013 02:46 AM, Reema Singh wrote:
> Hi all,
>
> Indeed summarizeOverlaps() works on paired end data. But different
> counting modes [Union, Intersection and IntersectionNotEmpty] they do
> not handle paired end reads. Here the reference
> [http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf]-
> Page 2 [under Counting Modes].
>
> Kind Regards
>
>
>
> On Sun, Dec 8, 2013 at 6:31 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
> summarizeOverlaps() does count paired-end reads. Use option
> singleEnd=FALSE. Details under 'singleEnd' in the 'Arguments'
> section on the man page.
>
> library(GenomicAlignments) ## devel
> library(GenomicRanges) ## release
> ?summarizeOverlaps
>
> Valerie
>
>
> On 12/08/13 09:42, Reema Singh wrote:
>
> Hi Sam,
>
> It depends on whether you are using single read or paired end reads
> files. As far as I known summarizeOverlaps function only working
> in case
> of single read. Anyone please correct me if I am wrong.
>
> Kind Regards
>
>
> On Sun, Dec 8, 2013 at 8:25 PM, Valerie Obenchain
> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
> Hi Sam,
>
> Are you getting any error messages related to seqlevels
> when you
> count? If you have confirmed that annotation seqlevels
> match the bam
> files next look at the maximum possible overlap in a single
> bam file.
>
> bf <- singleBamFile
> reads <- readGAlignments(bf) ## assuming single-end reads
> so <- summarizeOverlaps(tx, reads, mode=Union,
> inter.feature=FALSE)
>
> 'inter.feature=FALSE' with 'mode=Union' is countOverlaps with
> 'type=ANY'. Reads that hit multiple features will be double
> counted
> in this case. The idea to to see if there are any common
> regions at all.
>
> If there are still no counts, you could look more closely at a
> smaller chromosome region in 'reads' where you would expect
> counts.
> This visual inspection might give you some insight as to
> why there
> is no overlap.
>
>
> Valerie
>
>
> On 12/05/13 18:51, Sam McInturf wrote:
>
> Hello Bioconductors,
> I am working on an Arabidopsis RNA seq set, and after
> executing my count
> reads script, I get a count table with no reads counted (ie
> sum(colSums(mat)) = 0).
>
> I mapped my reads using Tophat2 (without supplying a
> gtf/gff
> file) and used
> samtools to sort and index my accepted_hits.bam files.
> I loaded
> these bam
> files into IGV (integrated Genome Browser) and the
> reads appear
> to have
> been mapped (additionally the align_summary.txt says I
> have good
> mapping).
> Script and session info below.
>
> Essentially I use Rsamtools to load in my bamfiles, and
> then
> makeTranscriptsFromGFF to make a txdb object,
> transcriptsBy to get
> transcripts, and then summarizeOverlaps to count the
> reads. My
> gff file is
> TAIR10. I am relatively sure that the genome I mapped
> to was
> the TAIR10
> release, but I am not 100% on that.
>
> I have also used biomaRt to do this (commented out),
> but I had
> the same
> results.
>
> I am currently upgrading to R 3.0.2 and re-installing
> bioconductor (maybe
> it will change something?)
>
> Thanks for any thoughts!
>
> Script:
>
> gffdir <-
> "/data/smcintur/Annotation/____TAIR10_GFF3_genes.gff";
> library(ShortRead)
> library(GenomicFeatures)
> library(Rsamtools)
> #library(biomaRt)
>
> setwd("/data/smcintur/RNASeq/____NMseq/newTophat/BamFiles/")
>
> bf <- list.files(pattern = ".bam$")
> bi <- list.files(pattern = "bam.bai$")
> bfl <- BamFileList(bf, bi)
>
> #mart <- makeTranscriptDbFromBiomart(____biomart =
>
> "Public_TAIRV10", dataset=
> "tairv10")
> tx <- transcriptsBy(mart)
> gff <- makeTranscriptDbFromGFF(____gffdir, format = "gff")
>
> gff
>
> cds <- cdsBy(gff)
> tx <- transcriptsBy(gff)
>
> olapTx <- summarizeOverlaps(tx, bfl)
> olapCds <- summarizeOverlaps(cds, bfl)
>
> txMat <- assays(olapTx)$counts
> cdsMat <- assays(olapCds)$counts
>
> write.table(txMat, file = "countMatTxGff.txt", sep = "\t")
> write.table(cdsMat, file = "countMatCdsGff.txt", sep =
> "\t")
>
> sessionInfo
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets
> methods base
>
> other attached packages:
> [1] BiocInstaller_1.12.0
>
> loaded via a namespace (and not attached):
> [1] AnnotationDbi_1.24.0 Biobase_2.22.0
> BiocGenerics_0.8.0
> [4] biomaRt_2.18.0 Biostrings_2.30.1
> bitops_1.0-6
> [7] BSgenome_1.30.0 DBI_0.2-7
> GenomicFeatures_1.14.2
> [10] GenomicRanges_1.14.3 IRanges_1.20.6
> parallel_3.0.0
> [13] RCurl_1.95-4.1 Rsamtools_1.14.2
> RSQLite_0.11.4
> [16] rtracklayer_1.22.0 stats4_3.0.0
> tools_3.0.0
> [19] XML_3.98-1.1 XVector_0.2.0
> zlibbioc_1.8.0
>
>
> ___________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>
> https://stat.ethz.ch/mailman/____listinfo/bioconductor
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
> Search the archives:
> http://news.gmane.org/gmane.____science.biology.informatics.____conductor
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>
>
> --
> Reema Singh
> PhD Scholar
> Computational Biology and Bioinformatics
> School of Computational and Integrative Sciences
> Jawaharlal Nehru University
> New Delhi-110067
> INDIA
>
>
>
>
>
> --
> Reema Singh
> Postdoctoral Research Assistant
> College of Life Sciences
> University of Dundee,
> Dundee DD1 4HN, Scotland
> United Kingdom
--
Valerie Obenchain
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: vobencha at fhcrc.org
Phone: (206) 667-3158
Fax: (206) 667-1319
More information about the Bioconductor
mailing list