[BioC] conversion of geneset species ID

Mon Sep 5 23:31:37 CEST 2011

Hi Iain --

On 09/05/2011 07:57 AM, Iain Gallagher wrote:
> Dear List
>
> I wonder if someone could help me re-annotate the Broad c2 genesets from human to bovine IDs. Here's what I have so far:
>
> rm(list=ls())
> library(biomaRt)
> library(GSEABase)
>
> setwd('/home/iain/Documents/Work/Results/bovineMacRNAData/deAnalysis/GSEAData/')
>
> cowGenes<- read.table('cowGenesENID.csv', header=F, sep='\t')
>
> cow = useMart("ensembl",dataset="btaurus_gene_ensembl")
> orth = getBM(c("ensembl_gene_id","human_ensembl_gene"), filters="ensembl_gene_id",values = cowGenes[,1], mart = cow)
> orth2<- orth[which(orth[,2]!=''), ]#drop those with no human ortho
>
> orth3<- orth2[!duplicated(orth2[,1]), ]#keep the first row for each cow ID, i.e. one cow ID to one human ID (unlike -which(), this is safe when there are no duplicates)
>
> head(orth3)
>
>
> This gets me a data frame of bovine ENSEMBL gene IDs and their human orthologs (again ENSEMBL IDs).
>
> broadSets<- getGmt('/home/iain/Documents/Work/Results/bovineMacRNAData/deAnalysis/GSEAData/c2.all.v3.0.entrez.gmt', geneIdType = EntrezIdentifier('org.Hs.eg.db'))
>
> broadSetsENS<- mapIdentifiers(broadSets, ENSEMBLIdentifier())
>
> I now have the c2 Broad geneset with gene IDs converted to human ENSEMBL IDs. I would like to look up the position of each of these ENSEMBL IDs in my data frame (orth3), substitute in the bovine ID, and then clean up any NAs.
>
> I am at rather a loss as to how to do this and wondered if someone with more familiarity with GSEABase would be able to help (or perhaps suggest a different strategy!)?

Not sure that I follow entirely, but along the lines of

   lst = lapply(broadSetsENS, function(gs, map) {
      huids = geneIds(gs)
      ## map, not sure what the columns are? 'huids' and 'cowids' are
      ## placeholders for the human and cow ID columns of the ortholog table
      geneIds(gs) = map[map$huids %in% huids, "cowids"]
      geneIdType(gs) = ENSEMBLIdentifier()
      gs
   }, orth3)
   GeneSetCollection(lst)

This is a bit of a guess; I could be more specific if you provided a
reproducible example.
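
For illustration, here is a minimal base-R sketch of the lookup step on its
own, without GSEABase. It is only a toy under stated assumptions: the IDs are
invented, the gene sets are a plain named list rather than a
GeneSetCollection, the column names are assumed to follow the getBM()
attributes requested in the question, and humanToCow is a hypothetical helper
name.

```r
## Toy ortholog table. Column names are assumed to match the getBM()
## attributes from the question; the IDs themselves are invented.
orth3 <- data.frame(
    ensembl_gene_id    = c("ENSBTAG01", "ENSBTAG02", "ENSBTAG03"),
    human_ensembl_gene = c("ENSG01", "ENSG02", "ENSG03"),
    stringsAsFactors   = FALSE
)

## Toy gene sets: a named list of human Ensembl IDs (invented).
humanSets <- list(
    setA = c("ENSG01", "ENSG02", "ENSG99"),   # ENSG99 has no cow ortholog
    setB = c("ENSG03", "ENSG98")              # ENSG98 has no cow ortholog
)

## Hypothetical helper: find each human ID in the table and return the
## corresponding cow IDs, dropping human IDs that have no ortholog.
humanToCow <- function(huids, map) {
    idx <- match(huids, map$human_ensembl_gene)   # NA where no ortholog
    unique(map$ensembl_gene_id[idx[!is.na(idx)]])
}

cowSets <- lapply(humanSets, humanToCow, map = orth3)
cowSets
```

Note that match() keeps only the first cow ID for each human ID, whereas the
%in% subsetting above keeps every matching row; which is preferable depends
on how one-to-many orthologs should be handled.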

Hope that helps,

Martin

> Thanks
>
> Iain
>
>> sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
>   [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
>   [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8
>   [7] LC_PAPER=en_GB.utf8       LC_NAME=C
>   [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] GSEABase_1.14.0      graph_1.30.0         annotate_1.30.0
>   [4] org.Hs.eg.db_2.5.0   org.Bt.eg.db_2.5.0   RSQLite_0.9-4
>   [7] DBI_0.2-5            AnnotationDbi_1.14.1 Biobase_2.12.2
> [10] biomaRt_2.8.1
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.6-9  tools_2.13.1 XML_3.4-2    xtable_1.5-6

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793