[BioC] ChIPpeakAnno: TSS at - strand

Wed Jan 15 15:40:47 CET 2014

Hi Guangchuang,

This is a great idea to separate the sense and antisense strand. This is
what we will update in next release version. Now it is on testing. The new
version will based on GRanges, that is why it costs long time to test. If
you are interested in helping us to test the new version, I can send you
the codes. 

In the new version, we will consider to add ignore.strand when do
annotation. However, the distance between the peaks and features still
need to be separated by shortestDistance (as you mentioned from both ends)
and the distanceFromFeature as user defined. This will greatly improve the
annotation. However, usually the peaks we got from peak detection
software, such as MACS, are not strand-specific. And some gene expression
regulation models are not only consider the promoter regions. At this
point, we did design the distance from peaks and features without
considering the strands. And now in the new version, the setting of
ignore.strand is automatically detected by user's inputs (with useful
information about strands or not).

In the old version, as it cleared by Julie, if you want to consider about
the strands, you can separate annotationData into two groups, "+" and "-",
use start and end of the feature to get the annotation and combine them
later. 

Again, we thank you for your discussion about this question.

Yours sincerely,

Jianhong Ou

LRB 670A
Program in Gene Function and Expression
364 Plantation Street Worcester,
MA 01605

On 1/15/14 1:43 AM, "Yu, Guangchuang" <gcyu at connect.hku.hk> wrote:

>Dear Julie,
>
>Let me explain my point of view, if we know the strand of a peak, then
>what
>we would like to know is only the nearest feature located in the same
>strand.
>
>As we both know, peaks are strand-less and we don't have that information.
>
>Then we assume the peak could be located in + strand and - strand. When we
>consider it is in + strand, we identify the nearest feature by
>min(abs(peakStart-featureStart)). While when we consider it is in -
>strand,
>we should calculate the distance by min(abs(peakEnd-featureEnd)) as the
>"End" coordination is actually the start when it is in the - strand no
>matter it is a feature or a peak.
>
>If a feature didn't have strand information, it make sense to set
>PeakLocForDistance="middle".
>
>But if we have the feature strand information, I don't agree with your
>approach by setting the parameter PeakLocForDistance to "start" or "end".
>The function should automatically decide using "start" or "end"
>considering
>the feature strand information.
>
>Bests,
>Guangchuang
>
>
>On Wed, Jan 15, 2014 at 2:10 AM, Zhu, Lihua (Julie)
><Julie.Zhu at umassmed.edu>wrote:
>
>>  Guangchuang,
>>
>> Thanks for the clarification!
>>
>> annotatePeakInBatch uses gene "start" when gene is in + strand and use
>> gene "end" when gene is in - strand to calculate distance from peak to
>>TSS.
>> Peaks from ChIP-seq/ChIP-chip are not necessary always at the upstream
>>of
>> genes, so it is rather arbitrary to pick peak start for genes located
>>in +
>> and peak end for genes located in - strand . annotatePeakInBatch lets
>>users
>> to specify PeakLocForDistance (start, middle, end) regardless the gene
>> strand to calculate distance.
>>
>> If you want to use peak start to calculate distance to genes located in
>>+
>> strand, but use peak end to calculate distance to genes located in -
>> strand, I suggest you create two annotation files, one for + strand and
>>the
>> other for - strand, and specify PeakLocForDistance ="start" for +
>> annotation and PeakLocForDistance="end" for - annotation.
>>
>> Hope this is helpful. Thanks!
>>
>> Best regards,
>>
>> Julie
>>
>>
>> On 1/14/14 12:15 PM, "YGC" <no-reply at ygc.name> wrote:
>>
>>  I don't understand why setting PeakLocForDistance parameter. My point
>>is
>> it should use peak "start" when feature in + strand and use peak "end"
>>when
>> feature in - strand.
>>
>>
>>
>>
>>
>> On 1/14/14 11:41 AM, "Lihua Julie Zhu" <julie.zhu at umassmed.edu> wrote:
>>
>> > Dear  Guangchuang,
>> >
>> > Thank you very much for the feedback! Any feedbacks and responses sent
>> to
>> > bioconductor at r-project.org are archived and web searchable.  I would
>> > appreciate if you could send email to Bioconductor list for feedbacks
>> forward.
>> >
>> > If you type ?annotatePeakInBatch, you will notice that
>> FeatureLocForDistance
>> > is set to TSS by default, meaning that using start of feature when
>> feature is
>> > on plus strand and using end of feature when feature is on minus
>>strand.
>> I
>> > think this is consistent with your expectation.
>> >
>> > The discrepancy described by you is due to the different annotation
>> source
>> > used. You used transcript coordinates instead of gene coordinates
>>stored
>> in
>> > TSS.human.NCBI36. Another difference is that you used end(peak) to
>> calculate
>> > distance while annotatePeakInBatch uses start(peak) by default. As a
>> user, you
>> > can set PeakLocForDistance as start, end or middle of the peak.
>>However,
>> the
>> > result will be different. Please type ?annotatePeakInBatch to get the
>> details
>> > to customize your parameter setting to fit your needs. Also annotation
>> from
>> > different databases of different versions will be slightly different
>>as
>> you
>> > already pointed out. You can use getAnnotation function in
>>ChIPpeakAnno
>> to
>> > obtain the most recent Ensemble annotation. Alternatively, you could
>> download
>> > annotation from UCSC genome browser and convert it to RangedData  as
>> > annotation source. You also can download transcript coordinates to
>>feed
>> into
>> > annotatePeakInBatch to get the nearest transcript instead of gene.
>> >
>> > To test whether there exists this bug in the annotatePeakInBatch , let
>> us use
>> > your example  with a simple annotation consisting only two genes with
>> the
>> > exact coordinates provided by you instead of using the built-in
>> annotation.
>> > Looks like you are interested in peak end to TSS, so I will set
>> > PeakLocForDistance = "end" here.
>> >
>> > You can see that the results obtained, by customizing the parameters
>>in
>> > annoatePeakInBatch, is consistent with what you get.
>> >
>> > require(ChIPpeakAnno)
>> > packageVersion("ChIPpeakAnno")
>> > peak <- RangedData(space="chr1", IRanges(24736757, 24737528))
>> >
>> > TSS <- RangedData(space="chr1", IRanges(start=c(24742284
>> > ,24683489,24683489,24683489,24695211), end = c( 24799466,
>> 24700300,24740262,
>> > 24741587,24718169), names=c("NIPAL3",
>> > "STPG1-1","STPG1-2","STPG1-3","STPG1-4")), strand=c("+","-", "-",
>> "-","-"))
>> > TSS
>> > RangedData with 5 rows and 1 value column across 1 space
>> >            space               ranges |      strand
>> >         <factor>            <IRanges> | <character>
>> > NIPAL3      chr1 [24742284, 24799466] |           +
>> > STPG1-1     chr1 [24683489, 24700300] |           -
>> > STPG1-2     chr1 [24683489, 24740262] |           -
>> > STPG1-3     chr1 [24683489, 24741587] |           -
>> > STPG1-4     chr1 [24695211, 24718169] |           -
>> > ap <- annotatePeakInBatch(peak, Annotation=TSS,
>>PeakLocForDistance="end")
>> >  ap
>> > RangedData with 1 row and 9 value columns across 1 space
>> >              space               ranges |        peak      strand
>>     feature
>> > start_position end_position insideFeature distancetoFeature
>> shortestDistance
>> >           <factor>            <IRanges> | <character> <character>
>> <character>
>> > <numeric>    <numeric>   <character>         <numeric>
>><numeric>
>> > 1 STPG1-2     chr1 [24736757, 24737528] |           1           -
>>     STPG1-2
>> > 24683489     24740262        inside              2734             2734
>> >           fromOverlappingOrNearest
>> >
>>
>>
>>
>>
>>
>
>
>-- 
>--~--~---------~--~----~------------~-------~--~----~
>Guangchuang Yu, PhD Candidate
>School of Biological Sciences
>Room 7N-07, Kadoorie Biological Sciences Building
>The University of Hong Kong
>Hong Kong SAR, China
>-~----------~----~----~----~------~----~------~--~---
>
>	[[alternative HTML version deleted]]
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor