[BioC] Problems Counting Reads with summarizeOverlaps
Valerie Obenchain
vobencha at fhcrc.org
Sun Dec 22 01:18:51 CET 2013
On 12/21/13 05:14, Sam McInturf wrote:
> Valerie and Reema,
> I am using single end reads.
>
> I did as Valerie suggested
>
> so <- summarizeOverlaps(tx, reads, mode="Union",type = "any")
This is not the example I used below. You did not include
'inter.feature=FALSE'. 'type' is not an argument to summarizeOverlaps().
In other words, do not use 'type', use 'inter.feature'. Please see my
previous explanation and the man page for summarizeOverlaps().
Valerie
> colSums(assays(so)$counts
> #185,940
>
> which feels silly, allow for double counts and return fewer counts than
> prior
>
> P.S. thank you for your prompt reply, my email "hid" it from me
> Best
>
> Sam
>
>
>
> On Mon, Dec 9, 2013 at 6:08 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
> Reema,
>
> Thank you for pointing out the error in the summarizeOverlaps
> vignette. That sentence was a leftover from a previous version and
> should have been removed. I have updated the vignette in both
> release (GenomicRanges 1.14.4) and devel (GenomicAlignments 0.99.8).
>
> When the 'reads' argument is a GAlignmentPairs object the
> stand-alone functions of Union(), IntersectionStrict() and
> IntersectionNotEmpty() count the data as paired-end. If 'reads' is a
> GAlignments object they are counted as single-end.
> summarizeOverlaps() reads the data into a GAlignments or
> GAlignmentPairs object and calls these functions internally; they
> are exactly the same.
>
> Valerie
>
>
>
> On 12/09/2013 02:46 AM, Reema Singh wrote:
>
> Hi all,
>
> Indeed summarizeOverlaps() works on paired end data. But different
> counting modes [Union, Intersection and IntersectionNotEmpty]
> they do
> not handle paired end reads. Here the reference
> [http://www.bioconductor.org/__packages/release/bioc/__vignettes/GenomicRanges/inst/__doc/summarizeOverlaps.pdf]-
> <http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf%5D->
> Page 2 [under Counting Modes].
>
> Kind Regards
>
>
>
> On Sun, Dec 8, 2013 at 6:31 PM, Valerie Obenchain
> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
> summarizeOverlaps() does count paired-end reads. Use option
> singleEnd=FALSE. Details under 'singleEnd' in the 'Arguments'
> section on the man page.
>
> library(GenomicAlignments) ## devel
> library(GenomicRanges) ## release
> ?summarizeOverlaps
>
> Valerie
>
>
> On 12/08/13 09:42, Reema Singh wrote:
>
> Hi Sam,
>
> It depends on whether you are using single read or
> paired end reads
> files. As far as I known summarizeOverlaps function
> only working
> in case
> of single read. Anyone please correct me if I am wrong.
>
> Kind Regards
>
>
> On Sun, Dec 8, 2013 at 8:25 PM, Valerie Obenchain
> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>>> wrote:
>
> Hi Sam,
>
> Are you getting any error messages related to
> seqlevels
> when you
> count? If you have confirmed that annotation seqlevels
> match the bam
> files next look at the maximum possible overlap in
> a single
> bam file.
>
> bf <- singleBamFile
> reads <- readGAlignments(bf) ## assuming
> single-end reads
> so <- summarizeOverlaps(tx, reads, mode=Union,
> inter.feature=FALSE)
>
> 'inter.feature=FALSE' with 'mode=Union' is
> countOverlaps with
> 'type=ANY'. Reads that hit multiple features will
> be double
> counted
> in this case. The idea to to see if there are any
> common
> regions at all.
>
> If there are still no counts, you could look more
> closely at a
> smaller chromosome region in 'reads' where you
> would expect
> counts.
> This visual inspection might give you some insight
> as to
> why there
> is no overlap.
>
>
> Valerie
>
>
> On 12/05/13 18:51, Sam McInturf wrote:
>
> Hello Bioconductors,
> I am working on an Arabidopsis RNA seq
> set, and after
> executing my count
> reads script, I get a count table with no
> reads counted (ie
> sum(colSums(mat)) = 0).
>
> I mapped my reads using Tophat2 (without
> supplying a
> gtf/gff
> file) and used
> samtools to sort and index my
> accepted_hits.bam files.
> I loaded
> these bam
> files into IGV (integrated Genome Browser) and the
> reads appear
> to have
> been mapped (additionally the
> align_summary.txt says I
> have good
> mapping).
> Script and session info below.
>
> Essentially I use Rsamtools to load in my
> bamfiles, and
> then
> makeTranscriptsFromGFF to make a txdb object,
> transcriptsBy to get
> transcripts, and then summarizeOverlaps to
> count the
> reads. My
> gff file is
> TAIR10. I am relatively sure that the genome
> I mapped
> to was
> the TAIR10
> release, but I am not 100% on that.
>
> I have also used biomaRt to do this (commented
> out),
> but I had
> the same
> results.
>
> I am currently upgrading to R 3.0.2 and
> re-installing
> bioconductor (maybe
> it will change something?)
>
> Thanks for any thoughts!
>
> Script:
>
> gffdir <-
> "/data/smcintur/Annotation/______TAIR10_GFF3_genes.gff";
> library(ShortRead)
> library(GenomicFeatures)
> library(Rsamtools)
> #library(biomaRt)
>
>
> setwd("/data/smcintur/RNASeq/______NMseq/newTophat/BamFiles/")
>
>
> bf <- list.files(pattern = ".bam$")
> bi <- list.files(pattern = "bam.bai$")
> bfl <- BamFileList(bf, bi)
>
> #mart <-
> makeTranscriptDbFromBiomart(______biomart =
>
>
> "Public_TAIRV10", dataset=
> "tairv10")
> tx <- transcriptsBy(mart)
> gff <- makeTranscriptDbFromGFF(______gffdir,
> format = "gff")
>
>
> gff
>
> cds <- cdsBy(gff)
> tx <- transcriptsBy(gff)
>
> olapTx <- summarizeOverlaps(tx, bfl)
> olapCds <- summarizeOverlaps(cds, bfl)
>
> txMat <- assays(olapTx)$counts
> cdsMat <- assays(olapCds)$counts
>
> write.table(txMat, file = "countMatTxGff.txt",
> sep = "\t")
> write.table(cdsMat, file =
> "countMatCdsGff.txt", sep =
> "\t")
>
> sessionInfo
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils
> datasets
> methods base
>
> other attached packages:
> [1] BiocInstaller_1.12.0
>
> loaded via a namespace (and not attached):
> [1] AnnotationDbi_1.24.0 Biobase_2.22.0
> BiocGenerics_0.8.0
> [4] biomaRt_2.18.0 Biostrings_2.30.1
> bitops_1.0-6
> [7] BSgenome_1.30.0 DBI_0.2-7
> GenomicFeatures_1.14.2
> [10] GenomicRanges_1.14.3 IRanges_1.20.6
> parallel_3.0.0
> [13] RCurl_1.95-4.1 Rsamtools_1.14.2
> RSQLite_0.11.4
> [16] rtracklayer_1.22.0 stats4_3.0.0
> tools_3.0.0
> [19] XML_3.98-1.1 XVector_0.2.0
> zlibbioc_1.8.0
>
>
> _____________________________________________________
>
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>
> <mailto:Bioconductor at r-____project.org
> <mailto:Bioconductor at r-__project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>>
> https://stat.ethz.ch/mailman/______listinfo/bioconductor
> <https://stat.ethz.ch/mailman/____listinfo/bioconductor>
> <https://stat.ethz.ch/mailman/____listinfo/bioconductor
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor>>
>
>
>
> <https://stat.ethz.ch/mailman/____listinfo/bioconductor
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>>
> Search the archives:
> http://news.gmane.org/gmane.______science.biology.informatics.______conductor
> <http://news.gmane.org/gmane.____science.biology.informatics.____conductor>
>
> <http://news.gmane.org/gmane.____science.biology.informatics.____conductor
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>>
>
>
> <http://news.gmane.org/gmane.____science.biology.informatics.____conductor
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>>
>
>
>
>
>
> --
> Reema Singh
> PhD Scholar
> Computational Biology and Bioinformatics
> School of Computational and Integrative Sciences
> Jawaharlal Nehru University
> New Delhi-110067
> INDIA
>
>
>
>
>
> --
> Reema Singh
> Postdoctoral Research Assistant
> College of Life Sciences
> University of Dundee,
> Dundee DD1 4HN, Scotland
> United Kingdom
>
>
>
> --
> Valerie Obenchain
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B155
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
> Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
> --
> Sam McInturf
More information about the Bioconductor
mailing list