[BioC] Using GenomicAlignments to get Exon-centric Counts
Valerie Obenchain
vobencha at fhcrc.org
Tue Jul 1 03:55:13 CEST 2014
Hi,
On 06/30/2014 04:51 PM, Fong Chun Chan wrote:
> Hi Valerie,
>
> Thanks for your reply. After re-reading that paired-end statement, I
> understand what you meant now and was definitely confused because I
> thought it was being treated as two single-end reads which would have
> defeated the purpose of using reading it in as a GAlignmentPairs object.
>
> Thanks for the clarification on the inter.feature = FALSE parameter.
> That is exactly what I want actually. I wanted split reads across two
> exons should be counted as contributing to each exon. Even more
> importantly are paired-reads where the other pair is mapped to another
> distinct exon needs to be included also as contributing one to each exon.
Great! I'm glad it fits that niche.
>
> The default counting mode settings are likely more applicable to when
> working on gene-centric models where you have GRangeList objects with
> each exon as a GRange object (hopefully I am correct in my understanding
> of that) and in those situations you probably don't want to double count.
Yes, the modes w/ default settings were intended for gene-based models,
pattered after Simon Anders' HTSeq. 'inter.feature' was added later by
request and has extended the functionality to situations like you describe.
Valerie
>
>
>
>
> On Mon, Jun 30, 2014 at 4:29 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
> Hi Fong,
>
> readGAlignments() is for single-end data, readGAlignmentPairs() and
> readGAlignmentsList() are for paired-end. If you read paired-end
> data with readGAlignments() and perform counting, a read that
> overlaps both pair parts will be counted twice (once for each part).
> This is probably not what you want.
>
> More below.
>
>
> On 06/30/2014 03:19 PM, Fong Chun Chan wrote:
>
> Hi,
>
> Was wondering if anyone had any experience working with the
> GenomicAlignments R package when trying to retrieve per-exon
> count data?
> Specifically, whether these is rationale for using
> readGAlignmentsPairs()
> over readGAlignments() when using the summarizeOverlaps()
> function? From my
> understanding, the readGAlignmentsPairs() function will traverse
> the input
> bam file and return each read pair as one GAlignment.
>
>
> Data read with readGAlignmentPairs() is put in a GAlignmentPairs
> class. Counting operations on a GAlignmentPairs class are aware that
> each pair should be counted as one record. If a range overlaps one
> or both portions of the pair it will be counted as one hit.
>
> The same paired-end data read in with readGAlignments() is put in a
> GAlignments class. Counting operations on this class treat each
> record/row individually so you will get more counts than with the
> GAlignmentPairs class (you would get exactly double the counts if
> each read overlapped both parts of the pair).
>
>
> Looking at the
>
> summarizeOverlaps() function document, it states that:
>
> "paired-end reads : Paired-end reads are counted the same as
> single-end
> reads with a junction (N operation in the CIGAR); each pair
> registers as a
> single hit. Paired-end records can be counted in a GAlignmentPairs
> container or BAM file."
>
>
> This statement is confusing and should be removed. Thanks for
> pointing it out. I'll update the docs.
>
> summarizeOverlaps() uses different read functions depending on what
> arguments are used to call it. Here is a summary.
>
> - readGAlignments() when 'sigleEnd=TRUE'
> - readGAlignmentPairs() when 'singleEnd=FALSE' AND 'fragments=FALSE'
> - readGAlignmentsList() when 'singleEnd=FALSE' AND 'fragments=TRUE'
>
> If you have paired-end data you should be setting 'singleEnd=FALSE'
> in summarizeOverlaps(). If you are interested in all reads, not just
> those that are paired according to the algorithem, use
> 'fragments=TRUE'. The ?readGAlignmentsList man page describes the
> pairing algorithm and how to use the 'mate_status' metadata column
> to investigate reads of different pairing status.
>
>
>
> If the paired-end reads are counted as single-end reads, does
> this not
> suggest that using the readGAlignments() and
> readGAlignmentsPairs() returns
> you somewhat similar results?
>
>
> From above: paired-end are not counted as single-end.
>
>
>
> Another question I have is if a single-read is actually split
> between two
> exons and the exons features are stored as a GRanges (and NOT a
> GRangesList) object then the default summarizeOverlaps()
> function would
> technically discard this read if I understand correctly? Since
> the default
> mode is Union and the read overlaps with both features. In fact,
> wouldn't
> all the modes actually discard the read? If so, is it correct to use
> inter.feature = FALSE, to include split-reads into the equationb?
>
>
> The default is 'Union' but you can change that to
> 'IntersectionStrict' or 'IntersectionNotEmpty'. It is possible that
> none of the modes will be able to 'make a decision' about what
> feature the read should be assigned to. There is a graphic in the
> vignette that shows what types of overlaps are handled.
>
> browseVignettes('__GenomicAlignments')
>
> Using 'inter.feature=FALSE' will allow double counting; the spanning
> read will be counted once for each feature it overlaps. See the
> 'Counting multi-hit reads' example section at the bottom of the
> summarizeOverlaps man page.
>
>
>
> Valerie
>
>
> Thanks for any advice on this. sessionInfo() below:
>
> sessionInfo()
>
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> LC_PAPER=en_US.UTF-8
> LC_NAME=C LC_ADDRESS=C
> LC_TELEPHONE=C
> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
> methods
> base
>
> other attached packages:
> [1] argparse_1.0.1 proto_0.3-10
> GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.1
> Biostrings_2.32.0 XVector_0.4.0
> GenomicFeatures_1.16.2
> AnnotationDbi_1.26.0 Biobase_2.24.0
> GenomicRanges_1.16.3
> GenomeInfoDb_1.0.2 IRanges_1.22.9
> BiocGenerics_0.10.0
> vimcom.plus_0.9-93
> [16] setwidth_1.0-3 colorout_1.0-2
>
> loaded via a namespace (and not attached):
> [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1
> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6
> checkmate_1.0
> codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2
> findpython_1.0.1 foreach_1.4.2 getopt_1.20.0
> iterators_1.0.7
> plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1
> [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2
> sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 tools_3.1.0
> XML_3.98-1.1 zlibbioc_1.10.0
>
> [[alternative HTML version deleted]]
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
> --
> Valerie Obenchain
> Program in Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, Seattle, WA 98109
>
> Email: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
> Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>
>
More information about the Bioconductor
mailing list