[BioC] ChIPpeakAnno results
Marc Noguera
mnoguera at imppc.org
Wed Aug 11 16:04:47 CEST 2010
Thanks for the detailed explanation.
I have tried as indicated, as some other things. Just for the record:
> entrez<-mget(annotatedPeaks$feature,org.Mm.egENSEMBL2EG,ifnotfound=NA)
> entrez[1:3]
$ENSMUSG00000025907
[1] "12421"
$ENSMUSG00000061518
[1] "12859" "100046079"
$ENSMUSG00000026111
[1] "67387"
Note that there are some entries in the list with NA, which are not
shown. I still don't want to collapse the inner lists as I need to
transform to symbol. Like this:
>symbols <-
lapply(entrez[!is.na(entrez)],mget,org.Mm.egSYMBOL,ifnotfound=NA)
Which gives me a shorter list, with as many entries for each inner list
as Entrez id existed. For the same entries:
> symbols[1:3]
$ENSMUSG00000025907
$ENSMUSG00000025907$`12421`
[1] "Rb1cc1"
$ENSMUSG00000061518
$ENSMUSG00000061518$`12859`
[1] "Cox5b"
$ENSMUSG00000061518$`100046079`
[1] "LOC100046079"
$ENSMUSG00000026111
$ENSMUSG00000026111$`67387`
[1] "Unc50"
Which I can now collapse:
>symbols <- sapply(symbols, function(x) paste(x,collapse=","))
>symbols[1:3]
ENSMUSG00000025907 ENSMUSG00000061518 ENSMUSG00000026111
"Rb1cc1" "Cox5b,LOC100046079" "Unc50"
Now I want to add symbols to AnnotatedPeaks joined by Ensembl Id
(annotatedPeaks$feature), which I do with:
>df<-as.data.frame(annotatedPeaks)
>df$symbols<-symbols[as.character(df$feature)]
> subset(df,T,c(feature,symbols))[1:3,]
feature symbols
1 ENSMUSG00000025907 Rb1cc1
2 ENSMUSG00000061518 Cox5b,LOC100046079
3 ENSMUSG00000026111 Unc50
Then I write it to a table.
I know that this is probably a very non-optimal way to do it, so any
advice about improving it will be appreciated.
Thanks for all
Marc
James W. MacDonald wrote:
> Hi Marc,
>
> On 8/10/2010 10:26 AM, Marc Noguera wrote:
>
>> My explanation wasn't quite clear, excuse me.
>>
>> I have a RangedData object, as obtained from ChipPeakAnno, with lots of
>> peaks which are annotated like this:
>>
>>
>>> table<- read.table("H3K4me3vsIgG_peaks.bed")
>>> bed<- BED2RangedData(bed,header=T)
>>> annotatedPeaks<-annotatePeakInBatch(bed,AnnotationData=TSS.mouse.NCBIM37)
>>> annotatedPeaks[1:2,]
>>>
>> RangedData with 2 rows and 9 value columns across 21 spaces
>> space ranges |
>> <character> <IRanges> |
>> MACS_peak_10 ENSMUSG00000025907 1 [ 6204139, 6204741] |
>> MACS_peak_102 ENSMUSG00000061518 1 [36748352, 36749344] |
>>
>> peak strand feature
>>
>> <character> <character> <character>
>> MACS_peak_10 ENSMUSG00000025907 MACS_peak_10 1
>> ENSMUSG00000025907
>> MACS_peak_102 ENSMUSG00000061518 MACS_peak_102 1
>> ENSMUSG00000061518
>>
>> start_position end_position insideFeature
>>
>> <numeric> <numeric> <character>
>> MACS_peak_10 ENSMUSG00000025907 6204743 6265656 upstream
>> MACS_peak_102 ENSMUSG00000061518 36748425 36750230 overlapStart
>>
>> distancetoFeature shortestDistance
>>
>> <numeric> <numeric>
>> MACS_peak_10 ENSMUSG00000025907 -604 2
>> MACS_peak_102 ENSMUSG00000061518 -73 73
>>
>> fromOverlappingOrNearest
>>
>> <character>
>> MACS_peak_10 ENSMUSG00000025907 NearestStart
>> MACS_peak_102 ENSMUSG00000061518 NearestStart
>>
>> If I do like this:
>>
>>> org.Mm.egENSEMBL2EG$`ENSMUSG00000025907`
>>>
>> [1] "12421"
>>
>>> org.Mm.egSYMBOL$`12421`
>>>
>> [1] "Rb1cc1"
>>
>> Which is the symbol I am itnerested in. I would like to add the symbol
>> corresponding to each row in the RangedData Object as a new column.
>>
>> I have tried this, with mget:
>>
>>
>> annotatedPeaks$entrez<-mget(annotatedPeaks$feature,org.Mm.egENSEMBL2EG,ifnotfound=NA)
>>
>> but:
>>
>>
>>> annotatedPeaks[1:2,]
>>>
>> RangedData with 2 rows and 10 value columns across 21 spaces
>> space ranges |
>> <character> <IRanges> |
>> MACS_peak_10 ENSMUSG00000025907 1 [ 6204139, 6204741] |
>> MACS_peak_102 ENSMUSG00000061518 1 [36748352, 36749344] |
>> peak strand
>> feature
>> <character> <character>
>> <character>
>> MACS_peak_10 ENSMUSG00000025907 MACS_peak_10 1
>> ENSMUSG00000025907
>> MACS_peak_102 ENSMUSG00000061518 MACS_peak_102 1
>> ENSMUSG00000061518
>> start_position end_position insideFeature
>> <numeric> <numeric> <character>
>> MACS_peak_10 ENSMUSG00000025907 6204743 6265656 upstream
>> MACS_peak_102 ENSMUSG00000061518 36748425 36750230 overlapStart
>> distancetoFeature shortestDistance
>> <numeric> <numeric>
>> MACS_peak_10 ENSMUSG00000025907 -604 2
>> MACS_peak_102 ENSMUSG00000061518 -73 73
>> fromOverlappingOrNearest entrez
>> <character> <list>
>> MACS_peak_10 ENSMUSG00000025907 NearestStart ########
>> MACS_peak_102 ENSMUSG00000061518 NearestStart ########
>>
>> as mget returns a list. Note that some ensemble IDs map to more than one
>> gene ID.
>>
>
> Well, that's true and unavoidable. You will simply have to choose which
> symbol you want to use. Or else, you can concatenate with commas if you
> want them all.
>
> symbs <-mget(annotatedPeaks$feature,org.Mm.egENSEMBL2EG,ifnotfound=NA)
> symbs <- sapply(symbs, function(x) paste(x, collapse = ", "))
> annotatedPeaks$entrez <- symbs
>
> And as you already noted, it is a two step process to go from Ensembl
> gene IDs to gene symbols, so what has been done above isn't quite
> sufficient. But I assume you get the idea.
>
> You can always do things in one shot using SQL queries. An example you
> can work from can be found in in section 2.0.9 of the AnnotationDbi
> vignette. That can be fun if you like SQL stuff.
>
> Alternatively, you can use biomaRt. Note that you will always want to
> return the Ensembl gene ID when you use biomaRt, so you can line things
> up after the fact. As an example:
>
> > suppressMessages(library(biomaRt))
> > mart <- useMart("ensembl","mmusculus_gene_ensembl")
> ## fake up some IDs
> > ids <- paste("ENSMUSG", sprintf("%011.0f",
> c(1,3,28,37,49,56,58,78,85,88,93,94,103)), sep="")
> > ids
> [1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
> [4] "ENSMUSG00000000037" "ENSMUSG00000000049" "ENSMUSG00000000056"
> [7] "ENSMUSG00000000058" "ENSMUSG00000000078" "ENSMUSG00000000085"
> [10] "ENSMUSG00000000088" "ENSMUSG00000000093" "ENSMUSG00000000094"
> [13] "ENSMUSG00000000103"
> > out <- getBM(c("ensembl_gene_id", "mgi_symbol"), "ensembl_gene_id",
> ids, mart)
> > out
> ensembl_gene_id mgi_symbol
> 1 ENSMUSG00000000001 Gnai3
> 2 ENSMUSG00000000003 Pbsn
> 3 ENSMUSG00000000028 Cdc45
> 4 ENSMUSG00000000037 Scml2
> 5 ENSMUSG00000000049 Apoh
> 6 ENSMUSG00000000056 Narf
> 7 ENSMUSG00000000058 Cav2
> 8 ENSMUSG00000000078 Klf6
> 9 ENSMUSG00000000085 Scmh1
> 10 ENSMUSG00000000088 Cox5a
> 11 ENSMUSG00000000093 Tbx2
> 12 ENSMUSG00000000094 Tbx4
> 13 ENSMUSG00000000103 Zfy2
>
> Here if you have any duplicated symbols, you can use the trick from
> above. First convert to a list:
>
> outlist <- tapply(1:nrow(out), out[,1], function(x) out[x,2])
>
> then
>
> outlist <- sapply(outlist, function(x) paste(x, collapse = ", "))
>
> and append to your RangedData object.
>
> Also note that Ensembl gene IDs that don't map to a symbol will be
> dropped silently, so you might need to do some manipulations of the
> returned data.frame to insert the gene IDs (and a '' for the symbol) in
> order to have things line up correctly when you put the symbols into
> your RangedData object.
>
> Best,
>
> Jim
>
>
>
>
>
>> Also, using the convert2EntrezID from the same ChIPpeakAnno package:
>>
>>
>> annotatedPeaks$EntrezID<-convert2EntrezID(IDs=annotatedPeaks$feature,orgAnn="org.Mm.eg.db",ID_type="ensembl_gene_id")
>>
>> Error in `[[<-`(`*tmp*`, name, value = c("12421", "12859", "67387",
>> "623661", :
>> 9633 elements in value to replace 13721 elements
>>
>> Which returns a matrix (dim 9633,1) as some ensemblID map to same gene ID.
>>
>> As far as I understand I need to get use the geneID to map ensemblID to
>> SYMBOL. So i cannot get the Symbols.
>>
>> So, I am stuck here.
>>
>>
>> Thanks again,
>> Marc
>>
>>
>>> sessionInfo()
>>>
>> R version 2.11.0 (2010-04-22)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] org.Mm.eg.db_2.4.1 ChIPpeakAnno_1.4.1
>> [3] limma_3.4.0 org.Hs.eg.db_2.4.1
>> [5] GO.db_2.4.1 RSQLite_0.9-0
>> [7] DBI_0.2-5 AnnotationDbi_1.10.1
>> [9] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.1
>> [11] GenomicRanges_1.0.1 Biostrings_2.16.0
>> [13] IRanges_1.6.1 multtest_2.4.0
>> [15] Biobase_2.8.0 biomaRt_2.4.0
>>
>> loaded via a namespace (and not attached):
>> [1] MASS_7.3-5 RCurl_1.4-2 splines_2.11.0 survival_2.35-8
>> [5] tools_2.11.0 XML_3.1-0
>>
>>
>> James W. MacDonald wrote:
>>
>>> Hi Marc,
>>>
>>> On 8/10/2010 6:26 AM, Marc Noguera wrote:
>>>
>>>
>>>> Hi all,
>>>> this may be a very naive question by I have been trying to solve it
>>>> myself and i can't get through it.
>>>>
>>>> I have this RangedData object obtained from ChIPpeakAnno Package, which
>>>> correspond toa Chipseq experiment with annotated peaks, with ENSEMBL
>>>> identificators.
>>>> I can use this output already but like to transform the ENSEMBLID to a
>>>> gene symbol id
>>>>
>>>> for instance: ENSMUSG00000025907 to "Rb1cc1" symbol. It also would be
>>>> useful to add a field linking to a entrez gene web url.
>>>>
>>>> I have been looking at the org.Mm.eg.db package and although I can
>>>> retrieve the symbol for a particular ENSEMBLID can't get it for all the
>>>> elements in the object.
>>>>
>>>>
>>> What have you tried so far? Unless you give an example of what you have
>>> done and how it didn't perform as you expect, it is very difficult for
>>> anybody to help.
>>>
>>> As a shot in the dark, have you looked at the help page for mget()?
>>>
>>> I don't really understand how the field linking to Entrez Gene would
>>> work, considering a RangedData object isn't an HTML page. However,
>>> building a URL to Entrez Gene isn't that difficult. You can hijack some
>>> internal code from the annotate package:
>>>
>>> > suppressMessages(library(annotate))
>>> > egids<- 1:5
>>> > annotate:::.repositories[["en"]](egids)
>>> [1]
>>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=1"
>>> [2]
>>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=2"
>>> [3]
>>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=3"
>>> [4]
>>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=4"
>>> [5]
>>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=5"
>>>
>>> But doing it by hand wouldn't be that much more difficult. If you strip
>>> out the error checking from the above function, all it really consists of is
>>>
>>> thefunction<- function(ids){
>>> paste("http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=",
>>>
>>> ids, sep = "")
>>> }
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>>
>>>> Many thanks
>>>> Marc
>>>>
>>>>
>>>>
>>>
>>
>
>
--
-----------------------------------------------------
Marc Noguera i Julian, PhD
Genomics unit / Bioinformatics
Institut de Medicina Predictiva i Personalitzada
del Càncer (IMPPC)
B-10 Office
Carretera de Can Ruti
Camí de les Escoles s/n
08916 Badalona, Barcelona
More information about the Bioconductor
mailing list