[BioC] ChIPpeakAnno results

Wed Aug 11 16:04:47 CEST 2010

Thanks for the detailed explanation.
I have tried as indicated, as some other things. Just for the record:

> entrez<-mget(annotatedPeaks$feature,org.Mm.egENSEMBL2EG,ifnotfound=NA)
> entrez[1:3]
$ENSMUSG00000025907
[1] "12421"

$ENSMUSG00000061518
[1] "12859"     "100046079"

$ENSMUSG00000026111
[1] "67387"

Note that there are some entries in the list with NA, which are not
shown. I still don't want to collapse the inner lists as I need to
transform to symbol. Like this:

>symbols <-
lapply(entrez[!is.na(entrez)],mget,org.Mm.egSYMBOL,ifnotfound=NA)

Which gives me a shorter list, with as many entries for each inner list
as Entrez id existed. For the same entries:

> symbols[1:3]
$ENSMUSG00000025907
$ENSMUSG00000025907$`12421`
[1] "Rb1cc1"

$ENSMUSG00000061518
$ENSMUSG00000061518$`12859`
[1] "Cox5b"

$ENSMUSG00000061518$`100046079`
[1] "LOC100046079"

$ENSMUSG00000026111
$ENSMUSG00000026111$`67387`
[1] "Unc50"

Which I can now collapse:

>symbols <- sapply(symbols, function(x) paste(x,collapse=","))
>symbols[1:3]
 ENSMUSG00000025907   ENSMUSG00000061518   ENSMUSG00000026111
            "Rb1cc1" "Cox5b,LOC100046079"              "Unc50"

Now I want to add symbols to AnnotatedPeaks joined by Ensembl Id
(annotatedPeaks$feature), which I do with:

>df<-as.data.frame(annotatedPeaks)
>df$symbols<-symbols[as.character(df$feature)]
> subset(df,T,c(feature,symbols))[1:3,]
             feature            symbols
1 ENSMUSG00000025907             Rb1cc1
2 ENSMUSG00000061518 Cox5b,LOC100046079
3 ENSMUSG00000026111              Unc50

Then I write it to a table.
I know that this is probably a very non-optimal way to do it, so any
advice about improving it will be appreciated.

Thanks for all

Marc
James W. MacDonald wrote:
> Hi Marc,
>
> On 8/10/2010 10:26 AM, Marc Noguera wrote:
>   
>> My explanation wasn't quite clear, excuse me.
>>
>> I have a RangedData object, as obtained from ChipPeakAnno, with lots of
>> peaks which are annotated like this:
>>
>>     
>>> table<- read.table("H3K4me3vsIgG_peaks.bed")
>>> bed<- BED2RangedData(bed,header=T)
>>> annotatedPeaks<-annotatePeakInBatch(bed,AnnotationData=TSS.mouse.NCBIM37)
>>> annotatedPeaks[1:2,]
>>>       
>> RangedData with 2 rows and 9 value columns across 21 spaces
>>                                         space               ranges |
>>                                   <character>             <IRanges>  |
>> MACS_peak_10 ENSMUSG00000025907            1 [ 6204139,  6204741] |
>> MACS_peak_102 ENSMUSG00000061518           1 [36748352, 36749344] |
>>
>>       peak                  strand            feature
>>
>> <character>                  <character>         <character>
>> MACS_peak_10 ENSMUSG00000025907   MACS_peak_10           1
>> ENSMUSG00000025907
>> MACS_peak_102 ENSMUSG00000061518 MACS_peak_102           1
>> ENSMUSG00000061518
>>
>> start_position end_position insideFeature
>>
>>       <numeric>     <numeric>    <character>
>> MACS_peak_10 ENSMUSG00000025907         6204743      6265656      upstream
>> MACS_peak_102 ENSMUSG00000061518       36748425     36750230  overlapStart
>>
>> distancetoFeature shortestDistance
>>
>>              <numeric>         <numeric>
>> MACS_peak_10 ENSMUSG00000025907               -604                2
>> MACS_peak_102 ENSMUSG00000061518               -73               73
>>
>> fromOverlappingOrNearest
>>
>>                 <character>
>> MACS_peak_10 ENSMUSG00000025907              NearestStart
>> MACS_peak_102 ENSMUSG00000061518             NearestStart
>>
>> If I do like this:
>>     
>>> org.Mm.egENSEMBL2EG$`ENSMUSG00000025907`
>>>       
>> [1] "12421"
>>     
>>> org.Mm.egSYMBOL$`12421`
>>>       
>> [1] "Rb1cc1"
>>
>> Which is the symbol I am itnerested in. I would like to add the symbol
>> corresponding to each row in the RangedData Object as a new column.
>>
>> I have tried this, with mget:
>>
>>     
>> annotatedPeaks$entrez<-mget(annotatedPeaks$feature,org.Mm.egENSEMBL2EG,ifnotfound=NA)
>>
>> but:
>>
>>     
>>> annotatedPeaks[1:2,]
>>>       
>> RangedData with 2 rows and 10 value columns across 21 spaces
>>                                         space               ranges |
>>                                   <character>             <IRanges>  |
>> MACS_peak_10 ENSMUSG00000025907            1 [ 6204139,  6204741] |
>> MACS_peak_102 ENSMUSG00000061518           1 [36748352, 36749344] |
>>                                            peak      strand
>> feature
>>                                     <character>  <character>
>> <character>
>> MACS_peak_10 ENSMUSG00000025907   MACS_peak_10           1
>> ENSMUSG00000025907
>> MACS_peak_102 ENSMUSG00000061518 MACS_peak_102           1
>> ENSMUSG00000061518
>>                                   start_position end_position insideFeature
>>                                        <numeric>     <numeric>    <character>
>> MACS_peak_10 ENSMUSG00000025907         6204743      6265656      upstream
>> MACS_peak_102 ENSMUSG00000061518       36748425     36750230  overlapStart
>>                                   distancetoFeature shortestDistance
>>                                           <numeric>         <numeric>
>> MACS_peak_10 ENSMUSG00000025907               -604                2
>> MACS_peak_102 ENSMUSG00000061518               -73               73
>>                                   fromOverlappingOrNearest   entrez
>>                                                <character>    <list>
>> MACS_peak_10 ENSMUSG00000025907              NearestStart ########
>> MACS_peak_102 ENSMUSG00000061518             NearestStart ########
>>
>> as mget returns a list. Note that some ensemble IDs map to more than one
>> gene ID.
>>     
>
> Well, that's true and unavoidable. You will simply have to choose which 
> symbol you want to use. Or else, you can concatenate with commas if you 
> want them all.
>
> symbs <-mget(annotatedPeaks$feature,org.Mm.egENSEMBL2EG,ifnotfound=NA)
> symbs <- sapply(symbs, function(x) paste(x, collapse = ", "))
> annotatedPeaks$entrez <- symbs
>
> And as you already noted, it is a two step process to go from Ensembl 
> gene IDs to gene symbols, so what has been done above isn't quite 
> sufficient. But I assume you get the idea.
>
> You can always do things in one shot using SQL queries. An example you 
> can work from can be found in in section 2.0.9 of the AnnotationDbi 
> vignette. That can be fun if you like SQL stuff.
>
> Alternatively, you can use biomaRt. Note that you will always want to 
> return the Ensembl gene ID when you use biomaRt, so you can line things 
> up after the fact. As an example:
>
>  > suppressMessages(library(biomaRt))
>  > mart <- useMart("ensembl","mmusculus_gene_ensembl")
> ## fake up some IDs
>  > ids <- paste("ENSMUSG", sprintf("%011.0f", 
> c(1,3,28,37,49,56,58,78,85,88,93,94,103)), sep="")
>  > ids
>   [1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
>   [4] "ENSMUSG00000000037" "ENSMUSG00000000049" "ENSMUSG00000000056"
>   [7] "ENSMUSG00000000058" "ENSMUSG00000000078" "ENSMUSG00000000085"
> [10] "ENSMUSG00000000088" "ENSMUSG00000000093" "ENSMUSG00000000094"
> [13] "ENSMUSG00000000103"
>  > out <- getBM(c("ensembl_gene_id", "mgi_symbol"), "ensembl_gene_id", 
> ids, mart)
>  > out
>        ensembl_gene_id mgi_symbol
> 1  ENSMUSG00000000001      Gnai3
> 2  ENSMUSG00000000003       Pbsn
> 3  ENSMUSG00000000028      Cdc45
> 4  ENSMUSG00000000037      Scml2
> 5  ENSMUSG00000000049       Apoh
> 6  ENSMUSG00000000056       Narf
> 7  ENSMUSG00000000058       Cav2
> 8  ENSMUSG00000000078       Klf6
> 9  ENSMUSG00000000085      Scmh1
> 10 ENSMUSG00000000088      Cox5a
> 11 ENSMUSG00000000093       Tbx2
> 12 ENSMUSG00000000094       Tbx4
> 13 ENSMUSG00000000103       Zfy2
>
> Here if you have any duplicated symbols, you can use the trick from 
> above. First convert to a list:
>
> outlist <- tapply(1:nrow(out), out[,1], function(x) out[x,2])
>
> then
>
> outlist <- sapply(outlist, function(x) paste(x, collapse = ", "))
>
> and append to your RangedData object.
>
> Also note that Ensembl gene IDs that don't map to a symbol will be 
> dropped silently, so you might need to do some manipulations of the 
> returned data.frame to insert the gene IDs (and a '' for the symbol) in 
> order to have things line up correctly when you put the symbols into 
> your RangedData object.
>
> Best,
>
> Jim
>
>
>
>
>   
>> Also, using the convert2EntrezID from the same ChIPpeakAnno package:
>>
>>     
>> annotatedPeaks$EntrezID<-convert2EntrezID(IDs=annotatedPeaks$feature,orgAnn="org.Mm.eg.db",ID_type="ensembl_gene_id")
>>
>> Error in `[[<-`(`*tmp*`, name, value = c("12421", "12859", "67387",
>> "623661",  :
>>    9633 elements in value to replace 13721 elements
>>     
>> Which returns a matrix (dim 9633,1) as some ensemblID map to same gene ID.
>>
>> As far as I understand I need to get use the geneID to map ensemblID to
>> SYMBOL. So i cannot get the Symbols.
>>
>> So, I am stuck here.
>>
>>
>> Thanks again,
>> Marc
>>
>>     
>>> sessionInfo()
>>>       
>> R version 2.11.0 (2010-04-22)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>   [1] org.Mm.eg.db_2.4.1                  ChIPpeakAnno_1.4.1
>>   [3] limma_3.4.0                         org.Hs.eg.db_2.4.1
>>   [5] GO.db_2.4.1                         RSQLite_0.9-0
>>   [7] DBI_0.2-5                           AnnotationDbi_1.10.1
>>   [9] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.1
>> [11] GenomicRanges_1.0.1                 Biostrings_2.16.0
>> [13] IRanges_1.6.1                       multtest_2.4.0
>> [15] Biobase_2.8.0                       biomaRt_2.4.0
>>
>> loaded via a namespace (and not attached):
>> [1] MASS_7.3-5      RCurl_1.4-2     splines_2.11.0  survival_2.35-8
>> [5] tools_2.11.0    XML_3.1-0
>>     
>>
>> James W. MacDonald wrote:
>>     
>>> Hi Marc,
>>>
>>> On 8/10/2010 6:26 AM, Marc Noguera wrote:
>>>
>>>       
>>>> Hi all,
>>>> this may be a very naive question by I have been trying to solve it
>>>> myself and i can't get through it.
>>>>
>>>> I have this RangedData object obtained from ChIPpeakAnno Package, which
>>>> correspond toa  Chipseq experiment with annotated peaks, with ENSEMBL
>>>> identificators.
>>>> I can use this output already but like to transform the ENSEMBLID to a
>>>> gene symbol id
>>>>
>>>> for instance: ENSMUSG00000025907 to "Rb1cc1" symbol. It also would be
>>>> useful to add a field linking to a entrez gene web url.
>>>>
>>>> I have been looking at the org.Mm.eg.db package and although I can
>>>> retrieve the symbol for a particular ENSEMBLID can't get it for all the
>>>> elements in the object.
>>>>
>>>>         
>>> What have you tried so far? Unless you give an example of what you have
>>> done and how it didn't perform as you expect, it is very difficult for
>>> anybody to help.
>>>
>>> As a shot in the dark, have you looked at the help page for mget()?
>>>
>>> I don't really understand how the field linking to Entrez Gene would
>>> work, considering a RangedData object isn't an HTML page. However,
>>> building a URL to Entrez Gene isn't that difficult. You can hijack some
>>> internal code from the annotate package:
>>>
>>>   >  suppressMessages(library(annotate))
>>>   >  egids<- 1:5
>>>   >  annotate:::.repositories[["en"]](egids)
>>> [1]
>>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=1"
>>> [2]
>>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=2"
>>> [3]
>>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=3"
>>> [4]
>>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=4"
>>> [5]
>>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=5"
>>>
>>> But doing it by hand wouldn't be that much more difficult. If you strip
>>> out the error checking from the above function, all it really consists of is
>>>
>>> thefunction<- function(ids){
>>> paste("http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=",
>>>
>>>           ids, sep = "")
>>> }
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>>       
>>>> Many thanks
>>>> Marc
>>>>
>>>>
>>>>         
>>>       
>>     
>
>   

-- 
-----------------------------------------------------
Marc Noguera i Julian, PhD
Genomics unit / Bioinformatics
Institut de Medicina Predictiva i Personalitzada
del Càncer (IMPPC)
B-10 Office
Carretera de Can Ruti
Camí de les Escoles s/n
08916 Badalona, Barcelona