[BioC] [Bioc-sig-seq] ChIPpeakAnno annotatePeakInBatch function

Tue Mar 9 20:48:50 CET 2010

It seems that we are getting quite excited about ChIPpeakAnno. Good!

Two additions to this discussion

##########
# Topic 1
##########

Just a comment about
AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),]

That statement will work ONLY if rownames have been defined in
AannotatedRd. That means that AannotatedRd must inherit the Ard's
rownames.

In the simplest case, when you import a BED file into a RangedData
instance, the row names will not be defined.

My solution, currently in use:

So as to preserve the rows in their original order then, what I do is
name the rows with numbers as soon as I import my BED. Like this

library(rtracklayer)
Ard <- import('mypeaks.bed', 'bed')
rownames(Ard) <- formatC(1:dim(Ard)[1], width=nchar(dim(Ard)[1]), flag='0')

Now
AannotatedRd = annotatePeakInBatch(Ard, AnnotationData = Srd)
AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),]
will work like a charm.

##########
# Topic 2
##########

I do not know how the AnnotationData (Srd above) that ships with the
package is built but this is how I build it on the fly

library(RMySQL)
con <- dbConnect(MySQL(), user='genome', dbname='mm9',
host='genome-mysql.cse.ucsc.edu')
Srd <- as(dbGetQuery(con, "SELECT a.chrom AS space, a.txStart+1 AS
start, a.txEnd AS end, a.strand AS strand FROM mm9.ensGene AS a GROUP
BY a.name2"), "RangedData")
rownames(Srd) <- as.list(dbGetQuery(con, "SELECT a.name2 AS name FROM
mm9.ensGene AS a GROUP BY a.name2"))[[1]]
Srd <- Srd[order(as.data.frame(Srd)$space, start(Srd)), ]
dbDisconnect(con)

Notice that I get ONE txStart per gene while in fact there may be
multiple transcripts per single gene, with different start sites. So,
Srd is suboptimal but I need it because ChIPpeakAnno needs gene names.

In the ideal case, ChIPpeakAnno should be able to link transcripts to
GO terms. That means that, instead of using Srd, we should be using
Trd, which you can obtain on the fly like this:

con <- dbConnect(MySQL(), user='genome', dbname='mm9',
host='genome-mysql.cse.ucsc.edu')
Trd <- as(dbGetQuery(con, "SELECT a.chrom AS space, a.txStart+1 AS
start, a.txEnd AS end, a.strand AS strand FROM mm9.ensGene AS a GROUP
BY a.name"), "RangedData")
rownames(Trd) <- as.list(dbGetQuery(con, "SELECT a.name AS name FROM
mm9.ensGene AS a GROUP BY a.name"))[[1]]
Trd <- Trd[order(as.data.frame(Trd)$space, start(Trd)), ]
dbDisconnect(con)

I hope this helps improve the package. It's worth it.

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878

On Tue, Mar 9, 2010 at 10:44 AM, Zhu, Julie <Julie.Zhu at umassmed.edu> wrote:
> Hi Ivan,
>
> Thank you very much for the feedback! Sure, I will add  AannotatedRd <-
> AannotatedRd[order(rownames(AannotatedRd)),] inside the function. Thanks!
>
> Best regards,
>
> Julie
>
>
> On 3/9/10 10:24 AM, "Ivan Gregoretti" <ivangreg at gmail.com> wrote:
>
> I haven't used Cisgenome or CEAS so I can't speak about them.
>
> I use ChIPpeakAnno because, while working in Bioconductor, I get the
> enriched GO terms in three lines of code. (Four lines, if you count
> library(ChIPpeakAnno)).
>
> By the way, since you are in the process of upgrading the package,
> would you consider this fixable problem?:
>
> say Ard and Srd are RangedData instances. When I do
>
> AannotatedRd = annotatePeakInBatch(Ard, AnnotationData = Srd)
>
> the order of rows in AannotatedRd is different from the order of rows
> in Ard. Currently, I fix this by hand doing
>
> AannotatedRd <- AannotatedRd[order(rownames(AannotatedRd)),]
>
> Would you mind incorporating this fix of row shuffling in the next upgrade?
>
> Both Amy and me use the stable release of R and Bioc, not the devel
> version. I would highly appreciate if you could push the upgrade for
> both stable and devel but I won't hold you accountable if you don't.
>
> Have a great day!
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1592
> Fax: 1-301-496-9878
>
>
>
> On Tue, Mar 9, 2010 at 10:02 AM, Zhu, Julie <Julie.Zhu at umassmed.edu> wrote:
>> Hi Amy,
>>
>> Thank you very much for the feedback!
>>
>> The inside feature is TRUE if the peakSummit (stored as Start in the
>> RangedData) is inside the nearest annotated feature, FALSE otherwise.  The
>> feature you are proposing is useful for finding overlapping fragments
>> between query peak ranges and target ranges. The timing of your suggestion
>> cannot be better. We plan to add  a function called FindOverlappingPeaks
>> that will be added to the dev version, your ideas will be incorporated.
>> Please let me know if you and others think that it is better to
>> incorporate
>> your ideas into annotatePeakInBatch. Thanks again for your help making
>> this
>> package more useful.
>>
>> I have a favor to ask you and those who have used or are aware of the
>> ChIPpeakAnno package. We submitted a paper describing this package and
>> just
>> got the reviewers’ comments back. One question comes up is that what this
>> package offers that the existing annotation tools such as Cisgenome and
>> CEAS
>> do not. I thought it might be useful to get the feedback from the users of
>> this package. If possible, could you please send me your thoughts on this,
>> especially the reasons you chose using this package? Thanks a lot for your
>> time and help!
>>
>> Best regards,
>>
>> Julie
>>
>>
>> *******************************************
>> Lihua Julie Zhu, Ph.D
>> Research Associate Professor
>> Program Gene Function and Expression
>> University of Massachusetts Medical School
>> 364 Plantation Street, Room 613
>> Worcester, MA 01605
>> 508-856-5256
>> http://www.umassmed.edu/pgfe/faculty/zhu.cfm
>>
>>
>>
>> On 3/9/10 6:23 AM, "Amy Molesworth" <amy.m.molesworth at gsk.com> wrote:
>>
>>
>>
>> Firstly I'd like to thank the authors of the very useful package
>> ChIPpeakAnno. I'd like to report a feature in ChIPpeakAnno
>> annotatePeakInBatch function results that other users may or may not be
>> aware of. I also propose improvements to compensate.
>>
>> The resulting insideFeature column reports TRUE if the query peak is
>> either
>> contained within an annotated feature, and also reports TRUE if it
>> overlaps
>> the end of an annotated feature.
>>
>> I think its worth noting that it reports FALSE if the peak overlaps the
>> beginning of an annotated feature, and also reports FALSE if the peak
>> overlaps in entirety an annotated feature(s).
>>
>> So, perhaps the insideFeature column (or additional new column called
>> overlappingFeature) could report five options:
>> ("false","inside","overlapStart","overlapEnd","super"). I haven't looked
>> into the effects on how distanceToFeature should/could be called for each
>> different scenario.
>>
>> Apologies if this has already been addressed, or if others do not consider
>> this useful.
>>
>> Details with dummy example are described below.
>>
>> Many thanks,
>> Amy.
>>
>> #####
>>
>> In the dummy example below, p1 is bigger than f1 and consequently p1
>> overlaps it in entirety. It would be nice if ChIPpeakAnno could report
>> this
>> - although I accept it may overlap more than one feature,
>> so would need to consider how to deal with that.
>>
>> And another example from below, p3 in fact overlaps with the start of f3,
>> but is called as insideFeature=FALSE. It would be nice if ChIPpeakAnno
>> could
>> report it as OverlapStart.
>>
>> p4 is called as insideFeature = TRUE for overlapping with f4, but it would
>> be nice if ChIPpeakAnno could report it as OverlapEnd or something
>> similar.
>>
>> And correctly p2 is called as insideFeature = TRUE for overlap with f2, in
>> this case p2 ranges are within the f2 ranges as you would expect.
>>
>>
>> library(ChIPpeakAnno)
>> peaks =
>>
>> RangedData(IRanges(start=c(1543200,1557200,1563000,1569800,167889600),end=c(1555199,1560599,1565199,1573799,167893599),names=c("p1","p2","p3","p4","p5")),strand=as.integer(1),space=c(6,6,6,6,5))
>> features =
>>
>>  RangedData(IRanges(start=c(1549800,1554400,1565000,1569400,167888600),end=c(1550599,1560799,1565399,1571199,167888999),names=c("f1","f2","f3","f4","f5")),strand=as.integer(1),space=c(6,6,6,6,5))
>>
>> annoPeaks = annotatePeakInBatch(peaks,AnnotationData=features)
>>
>> as.data.frame(annoPeaks)
>>
>>   space     start       end width names strand feature start_position
>> 1     5 167889600 167893599  4000    p5      1      f5      167888600
>> 2     6   1543200   1555199 12000    p1      1      f1        1549800
>> 3     6   1557200   1560599  3400    p2      1      f2        1554400
>> 4     6   1563000   1565199  2200    p3      1      f3        1565000
>> 5     6   1569800   1573799  4000    p4      1      f4        1569400
>>   end_position insideFeature distancetoFeature
>> 1    167888999         FALSE              1000
>> 2      1550599         FALSE             -6600
>> 3      1560799          TRUE              2800
>> 4      1565399         FALSE             -2000
>> 5      1571199          TRUE               400
>>
>>
>>> sessionInfo()
>> R version 2.10.0 (2009-10-26)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=C
>>  [5] LC_MONETARY=C              LC_MESSAGES=C
>>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>  [1] ChIPpeakAnno_1.3.0                  org.Hs.eg.db_2.3.6
>>  [3] GO.db_2.3.5                         RSQLite_0.7-3
>>  [5] DBI_0.2-4                           AnnotationDbi_1.8.0
>>  [7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.14.0
>>  [9] Biostrings_2.14.2                   IRanges_1.5.18
>> [11] multtest_2.2.0                      Biobase_2.6.0
>> [13] biomaRt_2.3.0
>>
>> loaded via a namespace (and not attached):
>> [1] MASS_7.3-3      RCurl_1.3-0     XML_2.6-0       splines_2.10.0
>> [5] survival_2.35-7
>>
>>
>> -----------------------------------------------------------
>> This e-mail was sent by GlaxoSmithKline Services Unlimited
>> (registered in England and Wales No. 1047315), which is a
>> member of the GlaxoSmithKline group of companies. The
>> registered address of GlaxoSmithKline Services Unlimited
>> is 980 Great West Road, Brentford, Middlesex TW8 9GS.
>> -----------------------------------------------------------
>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>>
>
>
>