[BioC] ChIPpeakAnno results

Tue Aug 10 22:53:30 CEST 2010

Hi Marc,

On 8/10/2010 10:26 AM, Marc Noguera wrote:
> My explanation wasn't quite clear, excuse me.
>
> I have a RangedData object, as obtained from ChipPeakAnno, with lots of
> peaks which are annotated like this:
>
>> table<- read.table("H3K4me3vsIgG_peaks.bed")
>> bed<- BED2RangedData(bed,header=T)
>> annotatedPeaks<-annotatePeakInBatch(bed,AnnotationData=TSS.mouse.NCBIM37)
>> annotatedPeaks[1:2,]
> RangedData with 2 rows and 9 value columns across 21 spaces
>                                         space               ranges |
>                                   <character>             <IRanges>  |
> MACS_peak_10 ENSMUSG00000025907            1 [ 6204139,  6204741] |
> MACS_peak_102 ENSMUSG00000061518           1 [36748352, 36749344] |
>
>       peak                  strand            feature
>
> <character>                  <character>         <character>
> MACS_peak_10 ENSMUSG00000025907   MACS_peak_10           1
> ENSMUSG00000025907
> MACS_peak_102 ENSMUSG00000061518 MACS_peak_102           1
> ENSMUSG00000061518
>
> start_position end_position insideFeature
>
>       <numeric>     <numeric>    <character>
> MACS_peak_10 ENSMUSG00000025907         6204743      6265656      upstream
> MACS_peak_102 ENSMUSG00000061518       36748425     36750230  overlapStart
>
> distancetoFeature shortestDistance
>
>              <numeric>         <numeric>
> MACS_peak_10 ENSMUSG00000025907               -604                2
> MACS_peak_102 ENSMUSG00000061518               -73               73
>
> fromOverlappingOrNearest
>
>                 <character>
> MACS_peak_10 ENSMUSG00000025907              NearestStart
> MACS_peak_102 ENSMUSG00000061518             NearestStart
>
> If I do like this:
>> org.Mm.egENSEMBL2EG$`ENSMUSG00000025907`
> [1] "12421"
>> org.Mm.egSYMBOL$`12421`
> [1] "Rb1cc1"
>
> Which is the symbol I am itnerested in. I would like to add the symbol
> corresponding to each row in the RangedData Object as a new column.
>
> I have tried this, with mget:
>
>>
> annotatedPeaks$entrez<-mget(annotatedPeaks$feature,org.Mm.egENSEMBL2EG,ifnotfound=NA)
>
> but:
>
>> annotatedPeaks[1:2,]
> RangedData with 2 rows and 10 value columns across 21 spaces
>                                         space               ranges |
>                                   <character>             <IRanges>  |
> MACS_peak_10 ENSMUSG00000025907            1 [ 6204139,  6204741] |
> MACS_peak_102 ENSMUSG00000061518           1 [36748352, 36749344] |
>                                            peak      strand
> feature
>                                     <character>  <character>
> <character>
> MACS_peak_10 ENSMUSG00000025907   MACS_peak_10           1
> ENSMUSG00000025907
> MACS_peak_102 ENSMUSG00000061518 MACS_peak_102           1
> ENSMUSG00000061518
>                                   start_position end_position insideFeature
>                                        <numeric>     <numeric>    <character>
> MACS_peak_10 ENSMUSG00000025907         6204743      6265656      upstream
> MACS_peak_102 ENSMUSG00000061518       36748425     36750230  overlapStart
>                                   distancetoFeature shortestDistance
>                                           <numeric>         <numeric>
> MACS_peak_10 ENSMUSG00000025907               -604                2
> MACS_peak_102 ENSMUSG00000061518               -73               73
>                                   fromOverlappingOrNearest   entrez
>                                                <character>    <list>
> MACS_peak_10 ENSMUSG00000025907              NearestStart ########
> MACS_peak_102 ENSMUSG00000061518             NearestStart ########
>
> as mget returns a list. Note that some ensemble IDs map to more than one
> gene ID.

Well, that's true and unavoidable. You will simply have to choose which 
symbol you want to use. Or else, you can concatenate with commas if you 
want them all.

symbs <-mget(annotatedPeaks$feature,org.Mm.egENSEMBL2EG,ifnotfound=NA)
symbs <- sapply(symbs, function(x) paste(x, collapse = ", "))
annotatedPeaks$entrez <- symbs

And as you already noted, it is a two step process to go from Ensembl 
gene IDs to gene symbols, so what has been done above isn't quite 
sufficient. But I assume you get the idea.

You can always do things in one shot using SQL queries. An example you 
can work from can be found in in section 2.0.9 of the AnnotationDbi 
vignette. That can be fun if you like SQL stuff.

Alternatively, you can use biomaRt. Note that you will always want to 
return the Ensembl gene ID when you use biomaRt, so you can line things 
up after the fact. As an example:

 > suppressMessages(library(biomaRt))
 > mart <- useMart("ensembl","mmusculus_gene_ensembl")
## fake up some IDs
 > ids <- paste("ENSMUSG", sprintf("%011.0f", 
c(1,3,28,37,49,56,58,78,85,88,93,94,103)), sep="")
 > ids
  [1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
  [4] "ENSMUSG00000000037" "ENSMUSG00000000049" "ENSMUSG00000000056"
  [7] "ENSMUSG00000000058" "ENSMUSG00000000078" "ENSMUSG00000000085"
[10] "ENSMUSG00000000088" "ENSMUSG00000000093" "ENSMUSG00000000094"
[13] "ENSMUSG00000000103"
 > out <- getBM(c("ensembl_gene_id", "mgi_symbol"), "ensembl_gene_id", 
ids, mart)
 > out
       ensembl_gene_id mgi_symbol
1  ENSMUSG00000000001      Gnai3
2  ENSMUSG00000000003       Pbsn
3  ENSMUSG00000000028      Cdc45
4  ENSMUSG00000000037      Scml2
5  ENSMUSG00000000049       Apoh
6  ENSMUSG00000000056       Narf
7  ENSMUSG00000000058       Cav2
8  ENSMUSG00000000078       Klf6
9  ENSMUSG00000000085      Scmh1
10 ENSMUSG00000000088      Cox5a
11 ENSMUSG00000000093       Tbx2
12 ENSMUSG00000000094       Tbx4
13 ENSMUSG00000000103       Zfy2

Here if you have any duplicated symbols, you can use the trick from 
above. First convert to a list:

outlist <- tapply(1:nrow(out), out[,1], function(x) out[x,2])

then

outlist <- sapply(outlist, function(x) paste(x, collapse = ", "))

and append to your RangedData object.

Also note that Ensembl gene IDs that don't map to a symbol will be 
dropped silently, so you might need to do some manipulations of the 
returned data.frame to insert the gene IDs (and a '' for the symbol) in 
order to have things line up correctly when you put the symbols into 
your RangedData object.

Best,

Jim

> Also, using the convert2EntrezID from the same ChIPpeakAnno package:
>
>>
> annotatedPeaks$EntrezID<-convert2EntrezID(IDs=annotatedPeaks$feature,orgAnn="org.Mm.eg.db",ID_type="ensembl_gene_id")
>
> Error in `[[<-`(`*tmp*`, name, value = c("12421", "12859", "67387",
> "623661",  :
>    9633 elements in value to replace 13721 elements
>>
>
>
> Which returns a matrix (dim 9633,1) as some ensemblID map to same gene ID.
>
> As far as I understand I need to get use the geneID to map ensemblID to
> SYMBOL. So i cannot get the Symbols.
>
> So, I am stuck here.
>
>
> Thanks again,
> Marc
>
>> sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-unknown-linux-gnu
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] org.Mm.eg.db_2.4.1                  ChIPpeakAnno_1.4.1
>   [3] limma_3.4.0                         org.Hs.eg.db_2.4.1
>   [5] GO.db_2.4.1                         RSQLite_0.9-0
>   [7] DBI_0.2-5                           AnnotationDbi_1.10.1
>   [9] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.1
> [11] GenomicRanges_1.0.1                 Biostrings_2.16.0
> [13] IRanges_1.6.1                       multtest_2.4.0
> [15] Biobase_2.8.0                       biomaRt_2.4.0
>
> loaded via a namespace (and not attached):
> [1] MASS_7.3-5      RCurl_1.4-2     splines_2.11.0  survival_2.35-8
> [5] tools_2.11.0    XML_3.1-0
>>
>
>
>
> James W. MacDonald wrote:
>> Hi Marc,
>>
>> On 8/10/2010 6:26 AM, Marc Noguera wrote:
>>
>>> Hi all,
>>> this may be a very naive question by I have been trying to solve it
>>> myself and i can't get through it.
>>>
>>> I have this RangedData object obtained from ChIPpeakAnno Package, which
>>> correspond toa  Chipseq experiment with annotated peaks, with ENSEMBL
>>> identificators.
>>> I can use this output already but like to transform the ENSEMBLID to a
>>> gene symbol id
>>>
>>> for instance: ENSMUSG00000025907 to "Rb1cc1" symbol. It also would be
>>> useful to add a field linking to a entrez gene web url.
>>>
>>> I have been looking at the org.Mm.eg.db package and although I can
>>> retrieve the symbol for a particular ENSEMBLID can't get it for all the
>>> elements in the object.
>>>
>>
>> What have you tried so far? Unless you give an example of what you have
>> done and how it didn't perform as you expect, it is very difficult for
>> anybody to help.
>>
>> As a shot in the dark, have you looked at the help page for mget()?
>>
>> I don't really understand how the field linking to Entrez Gene would
>> work, considering a RangedData object isn't an HTML page. However,
>> building a URL to Entrez Gene isn't that difficult. You can hijack some
>> internal code from the annotate package:
>>
>>   >  suppressMessages(library(annotate))
>>   >  egids<- 1:5
>>   >  annotate:::.repositories[["en"]](egids)
>> [1]
>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=1"
>> [2]
>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=2"
>> [3]
>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=3"
>> [4]
>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=4"
>> [5]
>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=5"
>>
>> But doing it by hand wouldn't be that much more difficult. If you strip
>> out the error checking from the above function, all it really consists of is
>>
>> thefunction<- function(ids){
>> paste("http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=",
>>
>>           ids, sep = "")
>> }
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>> Many thanks
>>> Marc
>>>
>>>
>>
>>
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues