[BioC] function precede() not working with GRanges
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Thu Jan 13 14:30:15 CET 2011
I seem to be a bit late in joining the discussion (been away), but I
want to lend my support to Michael's suggestion (which I have been
advocating in older emails).
If we differentiate between the 'true' signal and what comes out of a
specific assay, it is clear that true signals can be stranded (say, a
gene) or unstranded (nucleosome positions, histone modifications) and
it is also possible to have an unstranded assay for a stranded true
signal (the standard Illumina RNA-Seq prep). On top of this, mapped
reads are always stranded (at least, I don't know of an unstranded
mapper) which may or may represent the signal. So doing comparisons
between stranded and unstranded data/annotation is a fundamental
issue.
Now, it has been suggested in this thread and in the past that one
could just change the strand of the objects we compare (and then do
the comparison). However, that is a destructive modification and will
almost alway mean that we will have two objects lying around (the
original one and the strand-modified one). I don't like that both
from a memory and a clutter perspective. For example, while I could
set my RNA-Seq reads to all have strand *, this precludes me from
obtaining the original nucleotide sequence for the read if all I have
is chr, start, end and no strand. So now I need two sets of RNA-Seq
reads.... Furthermore, I often have quite a few objects lying around
representing different kinds of signals and data and some of these are
strand specific, others are not. So this really becomes an issue when
you deal with an integrative analysis.
While it introduces an additional argument to many functions dealing
with GRanges, I find that solution to be better, from a user
perspective (it is clearly more work for a developer).
Furthermore, someone (and I am happy to do so), should explicitly
write down how strand comparisons ought to be interpreted in a section
in the vignette so we all have it for future reference. This includes
how to treat objects that have a mix of *, + and -. An explicit
reference will help us all make sure that all functions use the same
underlying logic (comments in this thread seems to indicate that this
is not the case). Even if an additional ignore.strand argument is not
implemented, it will be a nice section for future reference and bug
hunting. I believe it is important to be very precise about this.
In Genominator, we do have a ignoreStrand argument to all relevant
functions and we also interpret a strand of * as "present on both
strands" so that a 1-4,* overlaps 1-4,+. (It is my understanding that
right now, GenomicRanges essentially treat * as a separate strand).
So my comments above about preferring an ignoreStrand argument from a
user perspective is based on actual experience.
Kasper
2011/1/12 Michael Lawrence <lawrence.michael at gene.com>:
> 2011/1/12 Hervé Pagès <hpages at fhcrc.org>
>
>> Hi Michael,
>>
>> On 01/12/2011 05:45 AM, Michael Lawrence wrote:
>>
>>>
>>>
>>> 2011/1/12 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>>>
>>>
>>> Hi Steve,
>>>
>>>
>>> On 01/11/2011 07:20 PM, Steve Lianoglou wrote:
>>>
>>> Hi,
>>>
>>> I'm not Michael, but:
>>>
>>> How general is the "peaks do not have a direction" statement?
>>> In my experience (RNA-seq experiments), peaks are "stranded
>>> features"
>>> i.e. they belong to a particular strand.
>>>
>>>
>>> I'm actually a bit surprised by that. It was my impression that
>>> most
>>> of the rna-seq data to date have been done with a protocol that
>>> doesn't honor the strand of the reads.
>>>
>>>
>>> Just to clarify, I was not talking about protocols that honor or not
>>> the original strand of the reads. I was talking about the strand to
>>> which each read aligns. This information is stored in the SAM/BAM file
>>> and propagates to the GRanges object you get when loading this file
>>> (with something like 'as(read.GappedAlignments(myfile), "GRanges")'.
>>> The strand information is here whatever type of *-seq experiment it
>>> is (RNA-seq, CHIP-seq, ...)
>>>
>>>
>>> Right, but if you're getting reads from both strands of a transcript,
>>> you certainly don't want to ignore the reads coming from the strand
>>> opposite transcription.
>>>
>>
>> Point taken. I somehow overlooked the fact that with a non-stranded
>> RNA-seq protocol, whatever strand a transcript is coming from, it will
>> generate reads that align to both strands. Thanks for the reminder!
>>
>>
>> Steve is right about stranded protocols becoming
>>> more popular, but in many cases their advantages might not outweigh the
>>> increased cost/complexity. Time will tell, I guess. Anyway, let's not
>>> make GenomicRanges the RNA-seq package. Suffice it to say that we deal
>>> RNA-seq, ChIP-seq, other DNA-seqs and up till now we always need to
>>> remove strand before looking for overlaps.
>>>
>>
>> Why only before looking for overlaps? I mean, if the strand is
>> meaningless, why not just remove the strand from the very beginning
>> i.e. when you make your GRanges object from your SAM/BAM file (like
>> Steve does)? So you only need to do strand(gr) <- "*" once. Isn't
>> that easier/safer than having to remember to set an additional
>> 'ignore.strand' parameter to FALSE every time you need to call a
>> strand-aware function like findOverlaps?
>>
>>
> For RNA-seq, that's reasonable. There are cases where the strand of the
> reads is important. Fragment length estimation algorithms, mostly designed
> for enrichment experiments, use the strand (see chipseq
> estimate.mean.fraglen). For ChIP-seq, we detect peaks before doing any
> overlaps, but for MEDIP-seq, we deal with reads. Anyway, the point is that
> there are a variety of use cases. I'll admit that an extra parameter is not
> entirely justified at this point. Just brought this up for discussion.
>
> Thanks,
> Michael
>
>
>
>> Adding an 'ignore.strand' argument to findOverlaps/countOverlaps (and
>> probably to other strand-aware GRanges methods if we start to go that
>> route) is not a big deal (just 2 lines of code in each method), but
>> still, that means adding yet another argument (findOverlaps already
>> has 9), and documenting it for all these methods. I'm just wondering
>> if it's worth it...
>>
>> H.
>>
>>
>>>
>>>
>>> Recently, I've come across several protocols where strandedness is
>>> maintained and I'm sure that these will be increasingly the norm as
>>> time goes on.
>>>
>>> I've been working on/with different variants of some (rna)
>>> tag-sequencing protocols, and these do typically keep strand
>>> info, but
>>> I think the "old" Illumina-driven transcriptome-wide rna-seq
>>> data that
>>> "most" people think of as RNA-seq is unstranded.
>>>
>>> Still. I like that GenomicRanges can honor strandedness in
>>> overlap/etc. functions when the strand is set.
>>>
>>> One thing that often confuses me with GenomicRanges is
>>> the dual meaning of
>>> strand. It seems to indicate both a direction and a
>>> coordinate. It makes
>>> sense to me to have resize() and precede() consider
>>> strand as a direction.
>>> But findOverlaps() treats strands as coordinates such
>>> that items on
>>> different strands do not overlap. This makes less sense.
>>>
>>>
>>> Why? IMO a positive read (i.e. a read that was aligned to
>>> the plus
>>> strand) shouldn't hit features (genes/transcripts/exons)
>>> that are
>>> located on the minus strand. If for whatever reason this is
>>> not what
>>> the user wants, then s/he can always set the strand of the
>>> query and
>>> subject to '*' but I wouldn't say this is the typical usecase.
>>>
>>> Another good
>>> example is gaps() which will yield sequence-long ranges
>>> on the strands not
>>> represented in the original object. This is very often
>>> undesirable. But of
>>> course my perspective is limited.
>>>
>>>
>>> I agree that the behaviour of gaps() on GRanges is awkward.
>>> You not only
>>> get those sequence-long ranges on the strands not
>>> represented in the
>>> original object but you also get them for sequences not
>>> represented at
>>> all (as long as they are listed in levels(seqnames(x))). But
>>> if gaps()
>>> wasn't doing this, gaps() wouldn't be reversible anymore (i.e.
>>> 'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind
>>> of an
>>> important property for gaps(). In particular, gaps(x) would
>>> give
>>> the same result whether 'x' as a sequence-long range on one
>>> strand
>>> or not (very unlikely to happen with real data though).
>>>
>>> Maybe an extra arg to gaps() would help control this?
>>> Suggestions are
>>> welcome. I'll make the change. Thanks!
>>>
>>>
>>> I'd be in favor of adding a param to GRanges::gaps() that would
>>> allow
>>> me to explicitly turn off this "feature" ;-)
>>>
>>>
>>> OK, will see what I can do. Thanks for your feedback!
>>>
>>> H.
>>>
>>>
>>> -steve
>>>
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M2-B876
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>>
>>>
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
>
> [[alternative HTML version deleted]]
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list