[BioC] ChIPpeakAnno Strandedness and distance calculation

Dario Strbenac D.Strbenac at garvan.org.au
Fri May 14 06:20:55 CEST 2010


Oh, actually I thought of a problem with making start = midpoint. If you modify the start position to be the average, then you might get the wrong values in insideFeature column.

e.g.

Peak
Real coordinates
--------------------
chr  | start | end  |
-------------------
chr1 | 5000  | 5500 |

Peak
modified coordinates (start is midpoint)
--------------------
chr  | start | end  |
-------------------
chr1 | 5250  | 5500 |

A gene
-----------------------------
chr  | start | end | strand |
-----------------------------
chr1 | 1000  | 5100 |    -  |

So, instead of being 'overlapStart', it is called as 'upstream'.

It would be good if the package worked out an extra column for the tables, called 'position' and used the position for the distances, and the real start and end positions for the overlapping.

e.g. Something like this

if("strand" %in% colnames(table))
{
  table$position = ifelse(table$strand == '+', table$start, table$end)
} else {
  table$position = round((table$start + table$end) / 2)
}

- Dario.
---- Original message ----
>Date: Thu, 13 May 2010 22:40:46 -0400
>From: "Zhu, Julie" <Julie.Zhu at umassmed.edu>  
>Subject: Re: ChIPpeakAnno Strandedness and distance calculation  
>To: "D.Strbenac at garvan.org.au" <D.Strbenac at garvan.org.au>
>Cc: "bioconductor" <bioconductor at stat.math.ethz.ch>
>
>Hi Dario,
>
>You can create the annotation with strand = c(³+²). For example,
>
>AnnotationRangedData = RangedData(IRanges(start = c(967659, 2010898,
>2496700, 3075866,
>+ 3123260), end = c(967869, 2011108, 2496920, 3076166, 3123470), names =
>c("t1",
>+ "t2", "t3", "t4", "t5")), space = c("1", "2", "3", "1", "2"), strand
>=c("+"))
>
>Please take a look at the examples given on the paper just published on BMC
>Bioinformatics
>http://www.biomedcentral.com/1471-2105/11/237. In case you could not open
>the link, I also attached the pdf file.
>
>Regarding your other question about distance calculation, I suggest to
>create your AnnotationRangedData and PeakRangedData with start=midpoint to
>get the distance between midpoints.  The distance is calculated differently
>for features in plus strand and minus strand. For example, to calculate the
>distance between peak and TSS, the distance is calculated as the distance
>between the start of the binding site and the TSS, which is the gene start
>for genes located on the forward strand and the gene end for genes located
>on the reverse strand. Therefore, adding another parameter would mean to
>overwrite the way how the distance is calculated based on strandedness.
>After you tried the above suggested way and still prefer having a new
>parameter, I will be happy to add it to the next release.
>
>Best regards,
>
>Julie
>
>
>*******************************************
>Lihua Julie Zhu, Ph.D
>Research Associate Professor
>Program in Gene Function and Expression
>Program in Molecular Medicine
>University of Massachusetts Medical School
>364 Plantation Street, Room 613
>Worcester, MA 01605
>508-856-5256
>http://www.umassmed.edu/pgfe/faculty/zhu.cfm
>
>
>
>
>
>On 5/13/10 9:00 PM, "Dario Strbenac" <D.Strbenac at garvan.org.au> wrote:
>
>> Hello again,
>> 
>> Just one more question. When we are looking at DNA methtylation, we don't have
>> the strand of the peak (because the reverse complement of CG is CG). It seems
>> that it might not be possible to do this with ChipPeakAnno ?
>> 
>> e.g.
>> 
>>> > head(peaksT)
>>     chr     start       end
>> 1 chr13  83351701  83352000
>> 2 chr13  83351401  83351700
>> 3 chr20  25011901  25012200
>> 4 chr13  83352001  83352300
>> 5  chr8 143402101 143402400
>> 6  chr2 238246801 238247100
>> 
>>> > head(featTable)
>>      name  chr strand  start    end
>> 1 7896759 chr1      + 781253 783614
>> 2 7896761 chr1      + 850983 869824
>> 3 7896779 chr1      + 885829 890958
>> 4 7896798 chr1      + 891739 900345
>> 5 7896817 chr1      + 938709 939782
>> 6 7896822 chr1      + 945365 981355
>> 
>> Also, sometimes our feature table is a table of CpG islands, which don't have
>> a strand associated with them.
>> 
>> e.g.
>> 
>>> > head(featTable2)
>>    chr  start    end CpG Island Name
>> 1 chr1  18598  19673        CpG:_116
>> 2 chr1 124987 125426         CpG:_30
>> 3 chr1 317653 318092         CpG:_29
>> 4 chr1 427014 428027         CpG:_84
>> 5 chr1 439136 440407         CpG:_99
>> 6 chr1 523082 523977         CpG:_94
>> 
>> Is it possible to do this annotation with ChipPeakAnno ? Currently, the
>> annotatePeakInBatch function gives me an error when I don't give it strand
>> information when I create my RangedData object.
>> 
>> Thanks,
>>        Dario.
>> 
>> --------------------------------------
>> Dario Strbenac
>> Research Assistant
>> Cancer Epigenetics
>> Garvan Institute of Medical Research
>> Darlinghurst NSW 2010
>> Australia
>> 
>> 
>
>On 5/13/10 8:10 PM, "Dario Strbenac" <D.Strbenac at garvan.org.au> wrote:
>
>> Hello,
>> 
>> Firstly, thank you for making this package. It seems so useful ! We were
>> thinking of writing something like this ourselves, until I saw your package,
>> because we do a lot of ChIP-Seq here.
>> 
>> I just have a small feature request. In your distance calculation, you do
>> start of peak - start of feature. Would it be possible to allow the user to
>> choose if they want the distance calculation to use the start or the middle of
>> the feature (and also for the peak) ? This is because we do a lot of
>> methylation studies, and for CpG island features, we like to use the midpoint
>> as the position of our feature. It would also be nice to be able to use the
>> midpoint of the peak as the peak's position, since this is usually where the
>> signal is strongest.
>> 
>> Thanks,
>>       Dario.
>> 
>> --------------------------------------
>> Dario Strbenac
>> Research Assistant
>> Cancer Epigenetics
>> Garvan Institute of Medical Research
>> Darlinghurst NSW 2010
>> Australia
>
>


--------------------------------------
Dario Strbenac
Research Assistant
Cancer Epigenetics
Garvan Institute of Medical Research
Darlinghurst NSW 2010
Australia



More information about the Bioconductor mailing list