[BioC] ChIPpeakAnno results

Tue Aug 10 16:26:37 CEST 2010

Sorry, my explanation wasn't quite clear.

I have a RangedData object, as obtained from ChIPpeakAnno, with lots of
peaks which are annotated like this:

> library(ChIPpeakAnno)
> data(TSS.mouse.NCBIM37)
> peaks <- read.table("H3K4me3vsIgG_peaks.bed")
> bed <- BED2RangedData(peaks, header=TRUE)
> annotatedPeaks <- annotatePeakInBatch(bed, AnnotationData=TSS.mouse.NCBIM37)
> annotatedPeaks[1:2,]
RangedData with 2 rows and 9 value columns across 21 spaces
                                       space               ranges |
                                 <character>            <IRanges> |
MACS_peak_10 ENSMUSG00000025907            1 [ 6204139,  6204741] |
MACS_peak_102 ENSMUSG00000061518           1 [36748352, 36749344] |
                                          peak      strand            feature
                                   <character> <character>        <character>
MACS_peak_10 ENSMUSG00000025907   MACS_peak_10           1 ENSMUSG00000025907
MACS_peak_102 ENSMUSG00000061518 MACS_peak_102           1 ENSMUSG00000061518
                                 start_position end_position insideFeature
                                      <numeric>    <numeric>   <character>
MACS_peak_10 ENSMUSG00000025907         6204743      6265656      upstream
MACS_peak_102 ENSMUSG00000061518       36748425     36750230  overlapStart
                                 distancetoFeature shortestDistance
                                         <numeric>        <numeric>
MACS_peak_10 ENSMUSG00000025907               -604                2
MACS_peak_102 ENSMUSG00000061518               -73               73
                                 fromOverlappingOrNearest
                                              <character>
MACS_peak_10 ENSMUSG00000025907              NearestStart
MACS_peak_102 ENSMUSG00000061518             NearestStart

If I look up a single ID like this:
> org.Mm.egENSEMBL2EG$`ENSMUSG00000025907`
[1] "12421"
> org.Mm.egSYMBOL$`12421`
[1] "Rb1cc1"

I get the symbol I am interested in. I would like to add the symbol
corresponding to each row in the RangedData object as a new column.

I have tried this, with mget:

> annotatedPeaks$entrez<-mget(annotatedPeaks$feature,org.Mm.egENSEMBL2EG,ifnotfound=NA)

but:

> annotatedPeaks[1:2,]
RangedData with 2 rows and 10 value columns across 21 spaces
                                       space               ranges |
                                 <character>            <IRanges> |
MACS_peak_10 ENSMUSG00000025907            1 [ 6204139,  6204741] |
MACS_peak_102 ENSMUSG00000061518           1 [36748352, 36749344] |
                                          peak      strand            feature
                                   <character> <character>        <character>
MACS_peak_10 ENSMUSG00000025907   MACS_peak_10           1 ENSMUSG00000025907
MACS_peak_102 ENSMUSG00000061518 MACS_peak_102           1 ENSMUSG00000061518
                                 start_position end_position insideFeature
                                      <numeric>    <numeric>   <character>
MACS_peak_10 ENSMUSG00000025907         6204743      6265656      upstream
MACS_peak_102 ENSMUSG00000061518       36748425     36750230  overlapStart
                                 distancetoFeature shortestDistance
                                         <numeric>        <numeric>
MACS_peak_10 ENSMUSG00000025907               -604                2
MACS_peak_102 ENSMUSG00000061518               -73               73
                                 fromOverlappingOrNearest   entrez
                                              <character>   <list>
MACS_peak_10 ENSMUSG00000025907              NearestStart ########
MACS_peak_102 ENSMUSG00000061518             NearestStart ########

as mget returns a list. Note that some Ensembl IDs map to more than one
gene ID.
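
A minimal sketch of the list problem on toy data, in base R only. The environment `toy_ens2eg` is a made-up stand-in for `org.Mm.egENSEMBL2EG`, and the two Entrez IDs given for the second gene are invented for illustration. It shows one way the `mget` result could be collapsed to a single string per peak so that it fits in an ordinary column:

```r
# Toy stand-in for org.Mm.egENSEMBL2EG (hypothetical data, base R only).
toy_ens2eg <- list2env(list(
  ENSMUSG00000025907 = "12421",
  ENSMUSG00000061518 = c("100001", "100002")  # invented one-to-many mapping
))

features <- c("ENSMUSG00000025907", "ENSMUSG00000061518", "ENSMUSG_unknown")

# mget() returns a list with one element per key; unknown keys become NA.
hits <- mget(features, envir = toy_ens2eg, ifnotfound = NA)

# Collapse each list element into one string, so the result is a plain
# character vector with the same length as 'features'.
entrez <- vapply(hits, function(x) {
  if (all(is.na(x))) NA_character_ else paste(x, collapse = ";")
}, character(1), USE.NAMES = FALSE)

print(entrez)
```

Joining multiple IDs with ";" is just one choice; keeping only the first ID per key would also give a vector of the right length, at the cost of dropping information.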
I also tried convert2EntrezID from the same ChIPpeakAnno package:

> annotatedPeaks$EntrezID<-convert2EntrezID(IDs=annotatedPeaks$feature,orgAnn="org.Mm.eg.db",ID_type="ensembl_gene_id")

Error in `[[<-`(`*tmp*`, name, value = c("12421", "12859", "67387",
"623661",  :
  9633 elements in value to replace 13721 elements
>

The call returns a matrix (dim 9633 x 1), as some Ensembl IDs map to the same
gene ID, so it cannot fill the 13721 rows of the object.
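
The length mismatch can be reproduced on toy data. The sketch below assumes the conversion result can be arranged as a lookup table with one row per distinct ID; the data frame `lookup` and its second Entrez value are made up, and what `convert2EntrezID` actually returns has not been checked beyond its dimensions. `match` then stretches the table back to one value per peak:

```r
# Hypothetical peak annotation: one Ensembl ID per peak, with repeats.
features <- c("ENSMUSG00000025907", "ENSMUSG00000061518",
              "ENSMUSG00000025907", "ENSMUSG_unknown")

# Hypothetical lookup table with one row per distinct, mappable ID.
# It is shorter than 'features', like the 9633 vs 13721 case above.
lookup <- data.frame(
  ensembl = c("ENSMUSG00000025907", "ENSMUSG00000061518"),
  entrez  = c("12421", "100001"),   # second value is invented
  stringsAsFactors = FALSE
)

# match() gives, for every peak, the row of 'lookup' to use (NA if absent),
# so the result has exactly length(features) elements.
entrez_per_peak <- lookup$entrez[match(features, lookup$ensembl)]

print(entrez_per_peak)
```

A vector built this way has one entry per row, so assigning it as a new column would not trigger the "elements in value to replace" error.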

As far as I understand, I need to use the gene ID as an intermediate step to
map an Ensembl ID to a SYMBOL, so without it I cannot get the symbols.
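
One untested idea for the two-step mapping, sketched on toy data. The environments `toy_ens2eg` and `toy_eg2symbol` are hypothetical stand-ins for `org.Mm.egENSEMBL2EG` and `org.Mm.egSYMBOL`, and the helper `ensembl_to_symbol` is a made-up name. Whether it carries over unchanged to the real annotation maps is an assumption:

```r
# Toy stand-ins for the two annotation maps (hypothetical, base R only).
toy_ens2eg    <- list2env(list(ENSMUSG00000025907 = "12421"))
toy_eg2symbol <- list2env(list("12421" = "Rb1cc1"))

# Map each Ensembl ID to Entrez IDs, then each Entrez ID to a symbol.
# Returns one string per input ID (NA when nothing is found).
ensembl_to_symbol <- function(ids, ens2eg, eg2symbol) {
  vapply(ids, function(id) {
    eg <- unlist(mget(id, envir = ens2eg, ifnotfound = NA), use.names = FALSE)
    eg <- eg[!is.na(eg)]
    if (length(eg) == 0) return(NA_character_)
    sym <- unlist(mget(eg, envir = eg2symbol, ifnotfound = NA), use.names = FALSE)
    sym <- unique(sym[!is.na(sym)])
    if (length(sym) == 0) NA_character_ else paste(sym, collapse = ";")
  }, character(1), USE.NAMES = FALSE)
}

symbols <- ensembl_to_symbol(c("ENSMUSG00000025907", "ENSMUSG_unknown"),
                             toy_ens2eg, toy_eg2symbol)
print(symbols)
```

Because the helper always returns a character vector as long as its input, the result could in principle be assigned to a column such as `annotatedPeaks$symbol`.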

So, I am stuck here.

Thanks again,
Marc

> sessionInfo()
R version 2.11.0 (2010-04-22)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
 [9] LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
 [1] org.Mm.eg.db_2.4.1                  ChIPpeakAnno_1.4.1                
 [3] limma_3.4.0                         org.Hs.eg.db_2.4.1                
 [5] GO.db_2.4.1                         RSQLite_0.9-0                     
 [7] DBI_0.2-5                           AnnotationDbi_1.10.1              
 [9] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.1                   
[11] GenomicRanges_1.0.1                 Biostrings_2.16.0                 
[13] IRanges_1.6.1                       multtest_2.4.0                    
[15] Biobase_2.8.0                       biomaRt_2.4.0                     

loaded via a namespace (and not attached):
[1] MASS_7.3-5      RCurl_1.4-2     splines_2.11.0  survival_2.35-8
[5] tools_2.11.0    XML_3.1-0     
>

James W. MacDonald wrote:
> Hi Marc,
>
> On 8/10/2010 6:26 AM, Marc Noguera wrote:
>   
>> Hi all,
>> this may be a very naive question but I have been trying to solve it
>> myself and I can't get through it.
>>
>> I have this RangedData object obtained from the ChIPpeakAnno package, which
>> corresponds to a ChIP-seq experiment with annotated peaks, with ENSEMBL
>> identifiers.
>> I can use this output already but would like to transform the ENSEMBL ID
>> to a gene symbol,
>>
>> for instance: ENSMUSG00000025907 to the "Rb1cc1" symbol. It would also be
>> useful to add a field linking to an Entrez Gene web URL.
>>
>> I have been looking at the org.Mm.eg.db package and although I can
>> retrieve the symbol for a particular ENSEMBL ID, I can't get it for all the
>> elements in the object.
>>     
>
> What have you tried so far? Unless you give an example of what you have 
> done and how it didn't perform as you expect, it is very difficult for 
> anybody to help.
>
> As a shot in the dark, have you looked at the help page for mget()?
>
> I don't really understand how the field linking to Entrez Gene would 
> work, considering a RangedData object isn't an HTML page. However, 
> building a URL to Entrez Gene isn't that difficult. You can hijack some 
> internal code from the annotate package:
>
>  > suppressMessages(library(annotate))
>  > egids <- 1:5
>  > annotate:::.repositories[["en"]](egids)
> [1] "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=1"
> [2] "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=2"
> [3] "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=3"
> [4] "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=4"
> [5] "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=5"
>
> But doing it by hand wouldn't be that much more difficult. If you strip 
> out the error checking from the above function, all it really consists of is
>
> thefunction <- function(ids){
>     paste("http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=",
>           ids, sep = "")
> }
>
> Best,
>
> Jim
>
>
>   
>> Many thanks
>> Marc
>>
>>     
>
>   

-- 
-----------------------------------------------------
Marc Noguera i Julian, PhD
Genomics unit / Bioinformatics
Institut de Medicina Predictiva i Personalitzada
del Càncer (IMPPC)
B-10 Office
Carretera de Can Ruti
Camí de les Escoles s/n
08916 Badalona, Barcelona