[BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
Stefan Pinter
pinter at molbio.mgh.harvard.edu
Sun Mar 17 20:47:27 CET 2013
Hi Julie,
maybe I'm misunderstanding something about this function. Below is what I've done as a positive control (using refseq genes as "peaks" compared to ensembl annotation), can you tell me whether these numbers make sense? I was expecting to see close to 100% overlap (according to bedtools closest on refseq genes against ensembl genes, 958/967 unique refseq X-linked genes are overlapping an ensembl gene (distance = 0)).
But with prox. promoter cutoff = 0, I get only get 42% prox. promoter and 45% "enhancer", which I believe could also be called intergenic, right? Even with default of 1k cutoff, only 81% would be counted as "genic" (sum of everything - "Enhancer").
Thanks,
- Stefan.
First, I pulled in the ensembl gene set from archive, because my data are in mm9:
ensembl67=useMart(host='may2012.archive.ensembl.org', biomart='ENSEMBL_MART_ENSEMBL', dataset='mmusculus_gene_ensembl')
> TSS <- getAnnotation(ensembl67, featureType ="TSS")
> Exon <- getAnnotation(ensembl67, featureType ="Exon")
> utr5 <- getAnnotation(ensembl67, featureType ="5utr")
> utr3 <- getAnnotation(ensembl67, featureType ="3utr")
Then, I used all X-linked genes from the conservative mm9 refseq gene set (merged to contain unique names only) as my "peak" list [as a positive control].
> bed.rGm<-read.table("genes/chrX.rGm4.bed", header = FALSE)
> rd.rGm<-BED2RangedData(bed.rGm, header=FALSE)
> rd.rGm
RangedData with 967 rows and 2 value columns across 1 space
Finally, I annotated these "peaks" relative to TSS and used assignChromosomeRegion with 0k or 1k cutoff:
> rda.rGm = annotatePeakInBatch(rd.rGm, AnnotationData=TSS, PeakLocForDistance = "middle")
> l.feat0k.rGm <-ChIPpeakAnno:::assignChromosomeRegion(rda.rGm, TSS, Exon, utr5, utr3, proximal.promoter.cutoff=0, immediate.downstream.cutoff=0)
> l.feat0k.rGm
$Exon
[1] 3.205791
$Intron
[1] 0
$`5UTR`
[1] 3.516029
$`3UTR`
[1] 6.30817
$`Proximal Promoter%`
[1] 41.98552
$`Immediate Downstream`
[1] 0
$Enhancer
[1] 44.98449
> l.feat1k.rGm <-ChIPpeakAnno:::assignChromosomeRegion(rda.rGm, TSS, Exon, utr5, utr3)
> l.feat1k.rGm
$Exon
[1] 3.205791
$Intron
[1] 0
$`5UTR`
[1] 3.516029
$`3UTR`
[1] 6.30817
$`Proximal Promoter%`
[1] 80.97208
$`Immediate Downstream`
[1] 0.1034126
$Enhancer
[1] 5.894519
----- Original Message -----
From: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
To: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>, "<bioconductor at r-project.org>" <bioconductor at r-project.org>
Cc: "Jianhong Ou" <Jianhong.Ou at umassmed.edu>
Sent: Saturday, March 16, 2013 8:25:44 PM
Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
Stefan,
Could you please send me a simple dataset to see what went wrong? Thanks for
letting know!
Thanks for hashing out the chromosome region enrichment score calculation!
Looks like you are looking for something very similar to GO enrichment
analysis (getEnrichedGO function in ChIPpeakAnno).
For option a, as you mentioned, we will need the genome level estimation of
the distribution for each category, which could be tricky. Once we have
these estimation, then Hypergeometric test can be applied to find whether a
category is enriched/depleted.
Jaccard index (option b) is an interesting alternative. We could implement
Jaccard index at the peak level first and add options for computing Jaccard
index at the nucleotide level later.
Best regards,
Julie
On 3/16/13 7:41 PM, "Stefan Pinter" <pinter at molbio.mgh.harvard.edu> wrote:
> Julie, I compared my results from ChipPeakAnno with a simple overlap in
> bedtools: the % of peaks inside genes (exon + intron + 3utr + 5utr) reported
> by ChIPpeakAnno is over an order of magnitude lower than what I learned from
> bedtools. Browsing the data indicates that bedtools is right. Something is
> off, do you have a test.data set you can check against?
> Thanks,
> -S.
>
> ----- Original Message -----
> From: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>
> To: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
> Cc: "bioconductor" <bioconductor at r-project.org>
> Sent: Saturday, March 16, 2013 1:30:55 PM
> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>
> Hi Julie,
>
> well, I was more thinking of something along the lines of either:
>
> a.) Over/Under-representation:
> Percentage of exons in genome = 2%
> Percentage of peaks overlapping exons = 10%
> Exons are 5-fold over-represented
>
> But I realize now, that will only work with a full annotation, not a custom
> annotation.
>
> b.) something like the jaccard index used in bedtools:
>
> total length of intersection / total length of union
>
> http://en.wikipedia.org/wiki/Jaccard_index
>
> Just suggestions, there are probably many more possible ways to calculate a
> score.
> Thank you,
> - Stefan.
>
> ----- Original Message -----
> From: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
> To: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>
> Cc: "<bioconductor at r-project.org>" <bioconductor at r-project.org>
> Sent: Friday, March 15, 2013 1:28:09 PM
> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>
> Stefan,
>
> Regarding the enrichment/depletion score, Does the following toy example
> illustrate how to calculate such score? If not, could you please give a toy
> example? Thanks!
>
> For example, if the total number of peaks = 100, number of peaks assigned to
> promoter = 90, number of peaks assigned to enhancer = 10, then the
> enrichment score = 90% and 10% for promoter region and enhancer region
> respectively.
>
> We will add closest to annotatePeakInBatch in the dev version. Thanks for
> the great suggestion!
>
> Please cc bioconductor at r-project.org. Thanks!
>
> Best regards,
>
> Julie
>
>
> On 3/15/13 12:26 PM, "Stefan Pinter" <pinter at molbio.mgh.harvard.edu> wrote:
>
>> Sure, thank you Julie.
>>
>> Yes, that is what I meant. Since all the necessary information (peak
>> locations, feature locations, total count, total length) are already
>> available
>> in RangedData & Annotation, it would be very convenient to calculate
>> enrichment/depletion and possibly even significance scores (by permutation)
>> on
>> the fly. The significance score may be overkill, but if the function at least
>> reported enrichment/depletion scores, the user could always supply a number
>> of
>> shuffled ranges to build a random model of enrichment scores and calculate
>> significance after.
>>
>> In addition, would it be possible to add another definition for
>> PeakLocForDistance in annotatePeakInBatch?
>> PeakLocForDistance = "start means using start of the peak to calculate the
>> distance to feature"
>>
>> It would be helpful to have "closest", meaning distance to feature measured
>> from peak start or peak end, whichever is closer. That would help with broad
>> peaks, which using "middle" for isn't very helpful.
>>
>> Thank you and Best wishes,
>> - Stefan.
>>
>> ----- Original Message -----
>> From: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
>> To: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>
>> Cc: "<bioconductor at r-project.org>" <bioconductor at r-project.org>
>> Sent: Friday, March 15, 2013 8:06:03 AM
>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>
>> Stefan,
>>
>> You can download 2.6.1 to directly use assignChromosomeRegion without the
>> package prefix (I updated last night and it might take sometime to propagate
>> to the installation site). To find out what parameters supported by the
>> function, in R, please type help( ChIPpeakAnno:::assignChromosomeRegion).
>> You will see that indeed PeakLocForDistance is not supported. If needed, I
>> can add such parameter.
>>
>> Also regarding your previous suggestion of adding enrichment status of
>> feature length, do you mean enrichment of peaks in certain category of
>> chromosome region? For example, a significant enrichment score with 90%
>> peaks in promoter region would be interesting.
>>
>> Could you please keep the thread in the Bioc for others to
>> contribute/benefit? Thanks!
>>
>> Best regards,
>>
>> Julie
>>
>>
>> On 3/14/13 10:10 PM, "Stefan Pinter" <pinter at molbio.mgh.harvard.edu> wrote:
>>
>>> PPS. I think PeakLocForDistance is not working in that function:
>>>
>>>> l.feat.dRF8.100k <-ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
>>>> Exon,
>>>> utr5, utr3, proximal.promoter.cutoff=100000,
>>>> immediate.downstream.cutoff=100000, PeakLocForDistance="middle")
>>> Error in ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS, Exon, utr5, :
>>> unused argument(s) (PeakLocForDistance = "middle")
>>>> l.feat.dRF8.100k <-ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
>>>> Exon,
>>>> utr5, utr3, proximal.promoter.cutoff=100000,
>>>> immediate.downstream.cutoff=100000, PeakLocForDistance= middle)
>>> Error in ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS, Exon, utr5, :
>>> unused argument(s) (PeakLocForDistance = middle)
>>>> l.feat.dRF8.100k <-ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
>>>> Exon,
>>>> utr5, utr3, proximal.promoter.cutoff=100000,
>>>> immediate.downstream.cutoff=100000, PeakLocForDistance = c("middle"))
>>> Error in ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS, Exon, utr5, :
>>> unused argument(s) (PeakLocForDistance = c("middle"))
>>>
>>> ----- Original Message -----
>>> From: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>
>>> To: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
>>> Sent: Thursday, March 14, 2013 9:39:10 PM
>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>
>>> PS. if I might add a simple feature addition to that function - it would be
>>> a
>>> TRUE/FALSE trigger for whether to also report enrichment/depletion stats
>>> based
>>> on feature lengths, i.e. 50% peaks labeled as intergenic (or enhancers) is
>>> less interesting than 50% of peaks reported as exonic (much greater
>>> enrichment
>>> as total exonic feature length is smaller). Thanks, Best...S
>>>
>>> ----- Original Message -----
>>> From: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>
>>> To: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
>>> Cc: "bioconductor" <bioconductor at r-project.org>
>>> Sent: Thursday, March 14, 2013 9:32:33 PM
>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>
>>> Hi Julie,
>>>
>>> yes, that worked! Thank you for the quick help, I very much appreciate it!
>>> Thank you and Best wishes,
>>> - Stefan.
>>>
>>> ----- Original Message -----
>>> From: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
>>> To: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>, "Stefan Pinter"
>>> <pinter at molbio.mgh.harvard.edu>
>>> Cc: "<bioconductor at r-project.org>" <bioconductor at r-project.org>
>>> Sent: Thursday, March 14, 2013 8:50:51 PM
>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>
>>> Stefan,
>>>
>>> Thanks for reporting this! Actually the function assignChromosomeRegion is
>>> not exported.
>>>
>>> Could you please try the following to see if you can use it? Thanks!
>>>
>>> ChIPpeakAnno:::assignChromosomeRegion
>>>
>>> Best regards,
>>>
>>> Julie
>>>
>>>
>>> On 3/14/13 7:36 PM, "Lihua Julie Zhu" <julie.zhu at umassmed.edu> wrote:
>>>
>>>> Stefan,
>>>>
>>>> Could you please send me the sessionInfo? I just want to make sure you
>>>> installed version 2.6.0. Thanks!
>>>>
>>>> I noticed that when I try to install ChIPpeakAnno with the following code,
>>>> I
>>>> get 2.4 version instead.
>>>>
>>>> source("http://bioconductor.org/biocLite.R")
>>>> biocLite("ChIPpeakAnno")
>>>>
>>>> Best regards,
>>>>
>>>> Julie
More information about the Bioconductor
mailing list