[BioC] ChIPpeakAnno: New Features: distance calculation based on user selection of location of the peak and location of feature
Dario Strbenac
D.Strbenac at garvan.org.au
Wed May 19 07:47:40 CEST 2010
Thanks, this looks great !
---- Original message ----
>Date: Tue, 18 May 2010 17:19:17 -0400
>From: "Zhu, Julie" <Julie.Zhu at umassmed.edu>
>Subject: Re: ChIPpeakAnno: New Features: distance calculation based on user selection of location of the peak and location of feature
>To: "D.Strbenac at garvan.org.au" <D.Strbenac at garvan.org.au>
>Cc: "bioconductor" <bioconductor at stat.math.ethz.ch>
>
> Hi Dario,
>
> I have added the new feature you requested into dev
> version 1.5.2 and fixed the strandness issue (strand
> now is optional) in the dev version.
>
> Here is the code snippet using your example to test
> the distance calculations for a few common
> scenarios. Please let me know if you have problems.
> Thank you very much for the input!
>
> peaksT <- data.frame(chr = c("chr1", "chr1",
> "chr1"), start = c(1000, 2000, 3000), end = c(2000,
> 3000, 4000))
> featuresT <- data.frame(chr = c("chr1", "chr1",
> "chr1"), start = c(1500, 2500, 3500), end = c(2500,
> 3500, 4500), strand = c("-","+","+"))
> myPeakList <- RangedData(space = peaksT$chr, ranges
> = IRanges(start = peaksT$start, end = peaksT$end))
> TSS.ordered <- RangedData(space = featuresT$chr,
> strand = featuresT$strand, ranges = IRanges(start =
> featuresT$start, end = featuresT$end,
> names=c("t1","t2","t3")))
>
> library(ChIPpeakAnno)
> #To be compatible with old versions, by default
> distance is calculated as peak start to TSS (start
> for + strand and end for - strand)
> annotatePeakInBatch(myPeakList,
> AnnotationData=TSS.ordered)
>
> #By specifying PeakLocForDistance to "m" or
> "middle", the distance now is calculated as middle
> of the peak to TSS
> annotatePeakInBatch(myPeakList,
> AnnotationData=TSS.ordered, PeakLocForDistance="m")
>
> #By specifying PeakLocForDistance="m" and
> FeatureLocForDistance="m", the distance is
> calculated as middle of the peak to middle of the
> feature.
> annotatePeakInBatch(myPeakList,
> AnnotationData=TSS.ordered,
> PeakLocForDistance="m",FeatureLocForDistance="m")
>
> Best regards,
>
> Julie
>
> *******************************************
> Lihua Julie Zhu, Ph.D
> Research Associate Professor
> Program in Gene Function and Expression
> Program in Molecular Medicine
> University of Massachusetts Medical School
> 364 Plantation Street, Room 613
> Worcester, MA 01605
> 508-856-5256
> http://www.umassmed.edu/pgfe/faculty/zhu.cfm
>
> On 5/16/10 7:04 PM, "Dario Strbenac"
> <D.Strbenac at garvan.org.au> wrote:
>
> Not quite. I mean that midpoint to midpoint
> distance would be good when the feature didn't
> have a strand. If the feature does have strand
> information, then it'd be nice to have midpoint of
> peak to start (start for + strand, end for -
> strand) of feature.
>
> Thanks,
> Dario.
>
> ---- Original message ----
> >Date: Fri, 14 May 2010 07:43:38 -0400
> >From: "Zhu, Julie" <Julie.Zhu at umassmed.edu>
> >Subject: Re: ChIPpeakAnno Strandedness and
> distance calculation
> >To: "D.Strbenac at garvan.org.au"
> <D.Strbenac at garvan.org.au>
> >Cc: "bioconductor"
> <bioconductor at stat.math.ethz.ch>
> >
> > Hi Dario,
> >
> > Thanks for the clarification! So you still
> want the
> > nearest feature and insideFeature column to
> be
> > calculated using start (+ strand) and end (-
> > strand). You only need an extra column with
> distance
> > from midpoint to midpoint to be included?
> >
> > Best regards,
> >
> > Julie
> >
> > On 5/14/10 12:20 AM, "Dario Strbenac"
> > <D.Strbenac at garvan.org.au> wrote:
> >
> > Oh, actually I thought of a problem with
> making
> > start = midpoint. If you modify the start
> position
> > to be the average, then you might get the
> wrong
> > values in insideFeature column.
> >
> > e.g.
> >
> > Peak
> > Real coordinates
> > --------------------
> > chr | start | end |
> > -------------------
> > chr1 | 5000 | 5500 |
> >
> > Peak
> > modified coordinates (start is midpoint)
> > --------------------
> > chr | start | end |
> > -------------------
> > chr1 | 5250 | 5500 |
> >
> > A gene
> > -----------------------------
> > chr | start | end | strand |
> > -----------------------------
> > chr1 | 1000 | 5100 | - |
> >
> > So, instead of being 'overlapStart', it is
> called
> > as 'upstream'.
> >
> > It would be good if the package worked out
> an
> > extra column for the tables, called
> 'position' and
> > used the position for the distances, and the
> real
> > start and end positions for the overlapping.
> >
> > e.g. Something like this
> >
> > if("strand" %in% colnames(table))
> > {
> > table$position = ifelse(table$strand ==
> '+',
> > table$start, table$end)
> > } else {
> > table$position = round((table$start +
> table$end)
> > / 2)
> > }
> >
> > - Dario.
> > ---- Original message ----
> > >Date: Thu, 13 May 2010 22:40:46 -0400
> > >From: "Zhu, Julie" <Julie.Zhu at umassmed.edu>
> > >Subject: Re: ChIPpeakAnno Strandedness and
> > distance calculation
> > >To: "D.Strbenac at garvan.org.au"
> > <D.Strbenac at garvan.org.au>
> > >Cc: "bioconductor"
> > <bioconductor at stat.math.ethz.ch>
> > >
> > >Hi Dario,
> > >
> > >You can create the annotation with strand =
> > c("+"). For example,
> > >
> > >AnnotationRangedData =
> RangedData(IRanges(start =
> > c(967659, 2010898,
> > >2496700, 3075866,
> > >+ 3123260), end = c(967869, 2011108,
> 2496920,
> > 3076166, 3123470), names =
> > >c("t1",
> > >+ "t2", "t3", "t4", "t5")), space = c("1",
> "2",
> > "3", "1", "2"), strand
> > >=c("+"))
> > >
> > >Please take a look at the examples given on
> the
> > paper just published on BMC
> > >Bioinformatics
> >
> >http://www.biomedcentral.com/1471-2105/11/237.
> In
> > case you could not open
> > >the link, I also attached the pdf file.
> > >
> > >Regarding your other question about
> distance
> > calculation, I suggest to
> > >create your AnnotationRangedData and
> > PeakRangedData with start=midpoint to
> > >get the distance between midpoints. The
> distance
> > is calculated differently
> > >for features in plus strand and minus
> strand. For
> > example, to calculate the
> > >distance between peak and TSS, the distance
> is
> > calculated as the distance
> > >between the start of the binding site and
> the
> > TSS, which is the gene start
> > >for genes located on the forward strand and
> the
> > gene end for genes located
> > >on the reverse strand. Therefore, adding
> another
> > parameter would mean to
> > >overwrite the way how the distance is
> calculated
> > based on strandedness.
> > >After you tried the above suggested way and
> still
> > prefer having a new
> > >parameter, I will be happy to add it to the
> next
> > release.
> > >
> > >Best regards,
> > >
> > >Julie
> > >
> > >
> > >*******************************************
> > >Lihua Julie Zhu, Ph.D
> > >Research Associate Professor
> > >Program in Gene Function and Expression
> > >Program in Molecular Medicine
> > >University of Massachusetts Medical School
> > >364 Plantation Street, Room 613
> > >Worcester, MA 01605
> > >508-856-5256
> >
> >http://www.umassmed.edu/pgfe/faculty/zhu.cfm
> > >
> > >
> > >
> > >
> > >
> > >On 5/13/10 9:00 PM, "Dario Strbenac"
> > <D.Strbenac at garvan.org.au> wrote:
> > >
> > >> Hello again,
> > >>
> > >> Just one more question. When we are
> looking at
> > DNA methtylation, we don't have
> > >> the strand of the peak (because the
> reverse
> > complement of CG is CG). It seems
> > >> that it might not be possible to do this
> with
> > ChipPeakAnno ?
> > >>
> > >> e.g.
> > >>
> > >>> > head(peaksT)
> > >> chr start end
> > >> 1 chr13 83351701 83352000
> > >> 2 chr13 83351401 83351700
> > >> 3 chr20 25011901 25012200
> > >> 4 chr13 83352001 83352300
> > >> 5 chr8 143402101 143402400
> > >> 6 chr2 238246801 238247100
> > >>
> > >>> > head(featTable)
> > >> name chr strand start end
> > >> 1 7896759 chr1 + 781253 783614
> > >> 2 7896761 chr1 + 850983 869824
> > >> 3 7896779 chr1 + 885829 890958
> > >> 4 7896798 chr1 + 891739 900345
> > >> 5 7896817 chr1 + 938709 939782
> > >> 6 7896822 chr1 + 945365 981355
> > >>
> > >> Also, sometimes our feature table is a
> table of
> > CpG islands, which don't have
> > >> a strand associated with them.
> > >>
> > >> e.g.
> > >>
> > >>> > head(featTable2)
> > >> chr start end CpG Island Name
> > >> 1 chr1 18598 19673 CpG:_116
> > >> 2 chr1 124987 125426 CpG:_30
> > >> 3 chr1 317653 318092 CpG:_29
> > >> 4 chr1 427014 428027 CpG:_84
> > >> 5 chr1 439136 440407 CpG:_99
> > >> 6 chr1 523082 523977 CpG:_94
> > >>
> > >> Is it possible to do this annotation with
> > ChipPeakAnno ? Currently, the
> > >> annotatePeakInBatch function gives me an
> error
> > when I don't give it strand
> > >> information when I create my RangedData
> object.
> > >>
> > >> Thanks,
> > >> Dario.
> > >>
> > >> --------------------------------------
> > >> Dario Strbenac
> > >> Research Assistant
> > >> Cancer Epigenetics
> > >> Garvan Institute of Medical Research
> > >> Darlinghurst NSW 2010
> > >> Australia
> > >>
> > >>
> > >
> > >On 5/13/10 8:10 PM, "Dario Strbenac"
> > <D.Strbenac at garvan.org.au> wrote:
> > >
> > >> Hello,
> > >>
> > >> Firstly, thank you for making this
> package. It
> > seems so useful ! We were
> > >> thinking of writing something like this
> > ourselves, until I saw your package,
> > >> because we do a lot of ChIP-Seq here.
> > >>
> > >> I just have a small feature request. In
> your
> > distance calculation, you do
> > >> start of peak - start of feature. Would
> it be
> > possible to allow the user to
> > >> choose if they want the distance
> calculation to
> > use the start or the middle of
> > >> the feature (and also for the peak) ?
> This is
> > because we do a lot of
> > >> methylation studies, and for CpG island
> > features, we like to use the midpoint
> > >> as the position of our feature. It would
> also
> > be nice to be able to use the
> > >> midpoint of the peak as the peak's
> position,
> > since this is usually where the
> > >> signal is strongest.
> > >>
> > >> Thanks,
> > >> Dario.
> > >>
> > >> --------------------------------------
> > >> Dario Strbenac
> > >> Research Assistant
> > >> Cancer Epigenetics
> > >> Garvan Institute of Medical Research
> > >> Darlinghurst NSW 2010
> > >> Australia
> > >
> > >
> >
> > --------------------------------------
> > Dario Strbenac
> > Research Assistant
> > Cancer Epigenetics
> > Garvan Institute of Medical Research
> > Darlinghurst NSW 2010
> > Australia
--------------------------------------
Dario Strbenac
Research Assistant
Cancer Epigenetics
Garvan Institute of Medical Research
Darlinghurst NSW 2010
Australia
More information about the Bioconductor
mailing list