[BioC] ChIPpeakAnno results
James W. MacDonald
jmacdon at med.umich.edu
Tue Aug 10 22:53:30 CEST 2010
Hi Marc,
On 8/10/2010 10:26 AM, Marc Noguera wrote:
> My explanation wasn't quite clear, excuse me.
>
> I have a RangedData object, as obtained from ChipPeakAnno, with lots of
> peaks which are annotated like this:
>
>> table<- read.table("H3K4me3vsIgG_peaks.bed")
>> bed<- BED2RangedData(bed,header=T)
>> annotatedPeaks<-annotatePeakInBatch(bed,AnnotationData=TSS.mouse.NCBIM37)
>> annotatedPeaks[1:2,]
> RangedData with 2 rows and 9 value columns across 21 spaces
> space ranges |
> <character> <IRanges> |
> MACS_peak_10 ENSMUSG00000025907 1 [ 6204139, 6204741] |
> MACS_peak_102 ENSMUSG00000061518 1 [36748352, 36749344] |
>
> peak strand feature
>
> <character> <character> <character>
> MACS_peak_10 ENSMUSG00000025907 MACS_peak_10 1
> ENSMUSG00000025907
> MACS_peak_102 ENSMUSG00000061518 MACS_peak_102 1
> ENSMUSG00000061518
>
> start_position end_position insideFeature
>
> <numeric> <numeric> <character>
> MACS_peak_10 ENSMUSG00000025907 6204743 6265656 upstream
> MACS_peak_102 ENSMUSG00000061518 36748425 36750230 overlapStart
>
> distancetoFeature shortestDistance
>
> <numeric> <numeric>
> MACS_peak_10 ENSMUSG00000025907 -604 2
> MACS_peak_102 ENSMUSG00000061518 -73 73
>
> fromOverlappingOrNearest
>
> <character>
> MACS_peak_10 ENSMUSG00000025907 NearestStart
> MACS_peak_102 ENSMUSG00000061518 NearestStart
>
> If I do like this:
>> org.Mm.egENSEMBL2EG$`ENSMUSG00000025907`
> [1] "12421"
>> org.Mm.egSYMBOL$`12421`
> [1] "Rb1cc1"
>
> Which is the symbol I am itnerested in. I would like to add the symbol
> corresponding to each row in the RangedData Object as a new column.
>
> I have tried this, with mget:
>
>>
> annotatedPeaks$entrez<-mget(annotatedPeaks$feature,org.Mm.egENSEMBL2EG,ifnotfound=NA)
>
> but:
>
>> annotatedPeaks[1:2,]
> RangedData with 2 rows and 10 value columns across 21 spaces
> space ranges |
> <character> <IRanges> |
> MACS_peak_10 ENSMUSG00000025907 1 [ 6204139, 6204741] |
> MACS_peak_102 ENSMUSG00000061518 1 [36748352, 36749344] |
> peak strand
> feature
> <character> <character>
> <character>
> MACS_peak_10 ENSMUSG00000025907 MACS_peak_10 1
> ENSMUSG00000025907
> MACS_peak_102 ENSMUSG00000061518 MACS_peak_102 1
> ENSMUSG00000061518
> start_position end_position insideFeature
> <numeric> <numeric> <character>
> MACS_peak_10 ENSMUSG00000025907 6204743 6265656 upstream
> MACS_peak_102 ENSMUSG00000061518 36748425 36750230 overlapStart
> distancetoFeature shortestDistance
> <numeric> <numeric>
> MACS_peak_10 ENSMUSG00000025907 -604 2
> MACS_peak_102 ENSMUSG00000061518 -73 73
> fromOverlappingOrNearest entrez
> <character> <list>
> MACS_peak_10 ENSMUSG00000025907 NearestStart ########
> MACS_peak_102 ENSMUSG00000061518 NearestStart ########
>
> as mget returns a list. Note that some ensemble IDs map to more than one
> gene ID.
Well, that's true and unavoidable. You will simply have to choose which
symbol you want to use. Or else, you can concatenate with commas if you
want them all.
symbs <-mget(annotatedPeaks$feature,org.Mm.egENSEMBL2EG,ifnotfound=NA)
symbs <- sapply(symbs, function(x) paste(x, collapse = ", "))
annotatedPeaks$entrez <- symbs
And as you already noted, it is a two step process to go from Ensembl
gene IDs to gene symbols, so what has been done above isn't quite
sufficient. But I assume you get the idea.
You can always do things in one shot using SQL queries. An example you
can work from can be found in in section 2.0.9 of the AnnotationDbi
vignette. That can be fun if you like SQL stuff.
Alternatively, you can use biomaRt. Note that you will always want to
return the Ensembl gene ID when you use biomaRt, so you can line things
up after the fact. As an example:
> suppressMessages(library(biomaRt))
> mart <- useMart("ensembl","mmusculus_gene_ensembl")
## fake up some IDs
> ids <- paste("ENSMUSG", sprintf("%011.0f",
c(1,3,28,37,49,56,58,78,85,88,93,94,103)), sep="")
> ids
[1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
[4] "ENSMUSG00000000037" "ENSMUSG00000000049" "ENSMUSG00000000056"
[7] "ENSMUSG00000000058" "ENSMUSG00000000078" "ENSMUSG00000000085"
[10] "ENSMUSG00000000088" "ENSMUSG00000000093" "ENSMUSG00000000094"
[13] "ENSMUSG00000000103"
> out <- getBM(c("ensembl_gene_id", "mgi_symbol"), "ensembl_gene_id",
ids, mart)
> out
ensembl_gene_id mgi_symbol
1 ENSMUSG00000000001 Gnai3
2 ENSMUSG00000000003 Pbsn
3 ENSMUSG00000000028 Cdc45
4 ENSMUSG00000000037 Scml2
5 ENSMUSG00000000049 Apoh
6 ENSMUSG00000000056 Narf
7 ENSMUSG00000000058 Cav2
8 ENSMUSG00000000078 Klf6
9 ENSMUSG00000000085 Scmh1
10 ENSMUSG00000000088 Cox5a
11 ENSMUSG00000000093 Tbx2
12 ENSMUSG00000000094 Tbx4
13 ENSMUSG00000000103 Zfy2
Here if you have any duplicated symbols, you can use the trick from
above. First convert to a list:
outlist <- tapply(1:nrow(out), out[,1], function(x) out[x,2])
then
outlist <- sapply(outlist, function(x) paste(x, collapse = ", "))
and append to your RangedData object.
Also note that Ensembl gene IDs that don't map to a symbol will be
dropped silently, so you might need to do some manipulations of the
returned data.frame to insert the gene IDs (and a '' for the symbol) in
order to have things line up correctly when you put the symbols into
your RangedData object.
Best,
Jim
> Also, using the convert2EntrezID from the same ChIPpeakAnno package:
>
>>
> annotatedPeaks$EntrezID<-convert2EntrezID(IDs=annotatedPeaks$feature,orgAnn="org.Mm.eg.db",ID_type="ensembl_gene_id")
>
> Error in `[[<-`(`*tmp*`, name, value = c("12421", "12859", "67387",
> "623661", :
> 9633 elements in value to replace 13721 elements
>>
>
>
> Which returns a matrix (dim 9633,1) as some ensemblID map to same gene ID.
>
> As far as I understand I need to get use the geneID to map ensemblID to
> SYMBOL. So i cannot get the Symbols.
>
> So, I am stuck here.
>
>
> Thanks again,
> Marc
>
>> sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] org.Mm.eg.db_2.4.1 ChIPpeakAnno_1.4.1
> [3] limma_3.4.0 org.Hs.eg.db_2.4.1
> [5] GO.db_2.4.1 RSQLite_0.9-0
> [7] DBI_0.2-5 AnnotationDbi_1.10.1
> [9] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.1
> [11] GenomicRanges_1.0.1 Biostrings_2.16.0
> [13] IRanges_1.6.1 multtest_2.4.0
> [15] Biobase_2.8.0 biomaRt_2.4.0
>
> loaded via a namespace (and not attached):
> [1] MASS_7.3-5 RCurl_1.4-2 splines_2.11.0 survival_2.35-8
> [5] tools_2.11.0 XML_3.1-0
>>
>
>
>
> James W. MacDonald wrote:
>> Hi Marc,
>>
>> On 8/10/2010 6:26 AM, Marc Noguera wrote:
>>
>>> Hi all,
>>> this may be a very naive question by I have been trying to solve it
>>> myself and i can't get through it.
>>>
>>> I have this RangedData object obtained from ChIPpeakAnno Package, which
>>> correspond toa Chipseq experiment with annotated peaks, with ENSEMBL
>>> identificators.
>>> I can use this output already but like to transform the ENSEMBLID to a
>>> gene symbol id
>>>
>>> for instance: ENSMUSG00000025907 to "Rb1cc1" symbol. It also would be
>>> useful to add a field linking to a entrez gene web url.
>>>
>>> I have been looking at the org.Mm.eg.db package and although I can
>>> retrieve the symbol for a particular ENSEMBLID can't get it for all the
>>> elements in the object.
>>>
>>
>> What have you tried so far? Unless you give an example of what you have
>> done and how it didn't perform as you expect, it is very difficult for
>> anybody to help.
>>
>> As a shot in the dark, have you looked at the help page for mget()?
>>
>> I don't really understand how the field linking to Entrez Gene would
>> work, considering a RangedData object isn't an HTML page. However,
>> building a URL to Entrez Gene isn't that difficult. You can hijack some
>> internal code from the annotate package:
>>
>> > suppressMessages(library(annotate))
>> > egids<- 1:5
>> > annotate:::.repositories[["en"]](egids)
>> [1]
>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=1"
>> [2]
>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=2"
>> [3]
>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=3"
>> [4]
>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=4"
>> [5]
>> "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=5"
>>
>> But doing it by hand wouldn't be that much more difficult. If you strip
>> out the error checking from the above function, all it really consists of is
>>
>> thefunction<- function(ids){
>> paste("http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=",
>>
>> ids, sep = "")
>> }
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>> Many thanks
>>> Marc
>>>
>>>
>>
>>
>
>
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
More information about the Bioconductor
mailing list