[BioC] function precede() not working with GRanges

Wed Jan 12 22:08:51 CET 2011

Hi Michael,

On 01/12/2011 05:45 AM, Michael Lawrence wrote:
>
>
> 2011/1/12 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>     Hi Steve,
>
>
>     On 01/11/2011 07:20 PM, Steve Lianoglou wrote:
>
>         Hi,
>
>         I'm not Michael, but:
>
>             How general is the "peaks do not have a direction" statement?
>             In my experience (RNA-seq experiments), peaks are "stranded
>             features"
>             i.e. they belong to a particular strand.
>
>
>         I'm actually a bit surprised by that. It was my impression that most
>         of the rna-seq data to date have been done with a protocol that
>         doesn't honor the strand of the reads.
>
>
>     Just to clarify, I was not talking about protocols that honor or not
>     the original strand of the reads. I was talking about the strand to
>     which each read aligns. This information is stored in the SAM/BAM file
>     and propagates to the GRanges object you get when loading this file
>     (with something like 'as(read.GappedAlignments(myfile), "GRanges")'.
>     The strand information is here whatever type of *-seq experiment it
>     is (RNA-seq, CHIP-seq, ...)
>
>
> Right, but if you're getting reads from both strands of a transcript,
> you certainly don't want to ignore the reads coming from the strand
> opposite transcription.

Point taken. I somehow overlooked the fact that with a non-stranded
RNA-seq protocol, whatever strand a transcript is coming from, it will
generate reads that align to both strands. Thanks for the reminder!

> Steve is right about stranded protocols becoming
> more popular, but in many cases their advantages might not outweigh the
> increased cost/complexity. Time will tell, I guess. Anyway, let's not
> make GenomicRanges the RNA-seq package. Suffice it to say that we deal
> RNA-seq, ChIP-seq, other DNA-seqs and up till now we always need to
> remove strand before looking for overlaps.

Why only before looking for overlaps? I mean, if the strand is
meaningless, why not just remove the strand from the very beginning
i.e. when you make your GRanges object from your SAM/BAM file (like
Steve does)? So you only need to do strand(gr) <- "*" once. Isn't
that easier/safer than having to remember to set an additional
'ignore.strand' parameter to FALSE every time you need to call a 
strand-aware function like findOverlaps?

Adding an 'ignore.strand' argument to findOverlaps/countOverlaps (and
probably to other strand-aware GRanges methods if we start to go that
route) is not a big deal (just 2 lines of code in each method), but
still, that means adding yet another argument (findOverlaps already
has 9), and documenting it for all these methods. I'm just wondering
if it's worth it...

H.

>
>
>
>         Recently, I've come across several protocols where strandedness is
>         maintained and I'm sure that these will be increasingly the norm as
>         time goes on.
>
>         I've been working on/with different variants of some (rna)
>         tag-sequencing protocols, and these do typically keep strand
>         info, but
>         I think the "old" Illumina-driven transcriptome-wide rna-seq
>         data that
>         "most" people think of as RNA-seq is unstranded.
>
>         Still. I like that GenomicRanges can honor strandedness in
>         overlap/etc. functions when the strand is set.
>
>                 One thing that often confuses me with GenomicRanges is
>                 the dual meaning of
>                 strand. It seems to indicate both a direction and a
>                 coordinate. It makes
>                 sense to me to have resize() and precede() consider
>                 strand as a direction.
>                 But findOverlaps() treats strands as coordinates such
>                 that items on
>                 different strands do not overlap. This makes less sense.
>
>
>             Why? IMO a positive read (i.e. a read that was aligned to
>             the plus
>             strand) shouldn't hit features (genes/transcripts/exons)
>             that are
>             located on the minus strand. If for whatever reason this is
>             not what
>             the user wants, then s/he can always set the strand of the
>             query and
>             subject to '*' but I wouldn't say this is the typical usecase.
>
>                 Another good
>                 example is gaps() which will yield sequence-long ranges
>                 on the strands not
>                 represented in the original object. This is very often
>                 undesirable. But of
>                 course my perspective is limited.
>
>
>             I agree that the behaviour of gaps() on GRanges is awkward.
>             You not only
>             get those sequence-long ranges on the strands not
>             represented in the
>             original object but you also get them for sequences not
>             represented at
>             all (as long as they are listed in levels(seqnames(x))). But
>             if gaps()
>             wasn't doing this, gaps() wouldn't be reversible anymore (i.e.
>             'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind
>             of an
>             important property for gaps(). In particular, gaps(x) would give
>             the same result whether 'x' as a sequence-long range on one
>             strand
>             or not (very unlikely to happen with real data though).
>
>             Maybe an extra arg to gaps() would help control this?
>             Suggestions are
>             welcome. I'll make the change. Thanks!
>
>
>         I'd be in favor of adding a param to GRanges::gaps() that would
>         allow
>         me to explicitly turn off this "feature" ;-)
>
>
>     OK, will see what I can do. Thanks for your feedback!
>
>     H.
>
>
>         -steve
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M2-B876
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone:  (206) 667-5791
>     Fax:    (206) 667-1319
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319