[BioC] How to match Locus IDs with Gene Ontology IDs?
sdurinck@ebi.ac.uk
sdurinck at ebi.ac.uk
Wed Nov 23 09:32:25 CET 2005
Hi,
There are two ways to do this in biomaRt. The result will be the same but
the second way is a very general one and will allow you to extract
anything you want that is available from the BioMart databases.
1) Using the simple getGO function:
> mart=martConnect()
connected to: ensembl_mart_35
#Now retrieve the GO annotation for your id, note that you can also give a
vector of id's
> getGO(id="NM_001533",type="refseq",species="hsapiens",mart=mart)
id GOID description
evidence
1 NM_001533 GO:0000166 nucleotide binding
IEA
2 NM_001533 GO:0003723 RNA binding
TAS
3 NM_001533 GO:0006397 mRNA processing
IEA
4 NM_001533 GO:0005654 nucleoplasm
TAS
5 NM_001533 GO:0030530 heterogeneous nuclear ribonucleoprotein complex
TAS
6 NM_001533 GO:0005634 nucleus
IEA
martID
1 ENSG00000104824
2 ENSG00000104824
3 ENSG00000104824
4 ENSG00000104824
5 ENSG00000104824
6 ENSG00000104824
>martDisconnect(mart)
2) Using the more powerful getBM function:
#List the available BioMart databases
> listMarts()
[1] "ensembl_mart_35" "vega_mart_35" "snp_mart_35" "msd_mart_4"
"uniprot_mart_17"
#Select a BioMart to use with the function useMart() and verify database
connection.
> mart=useMart("ensembl_mart_35")
connected to: ensembl_mart_35
#List the available datasets in the selected BioMart database.
> listDatasets(mart)
dataset version
1 ptroglodytes_gene_ensembl CHIMP1
2 ggallus_gene_ensembl WASHUC1
3 rnorvegicus_gene_ensembl RGSC3.4
4 scerevisiae_gene_ensembl SGD1
5 tnigroviridis_gene_ensembl TETRAODON7
6 xtropicalis_gene_ensembl JGI3
7 frubripes_gene_ensembl FUGU2
8 cintestinalis_gene_ensembl CINT1.95
9 agambiae_gene_ensembl MOZ2a
10 amellifera_gene_ensembl AMEL2.0
11 btaurus_gene_ensembl BDGP4
12 celegans_gene_ensembl CEL140
13 mmusculus_gene_ensembl NCBIM34
14 cfamiliaris_gene_ensembl BROADD1
15 dmelanogaster_gene_ensembl BDGP4
16 drerio_gene_ensembl ZFISH5
17 hsapiens_gene_ensembl NCBI35
18 mdomestica_gene_ensembl JGI3
#Update the mart object by selecting a dataset
> mart = useDataset(dataset = "hsapiens_gene_ensembl",mart = mart)
Reading database configuration of: hsapiens_gene_ensembl
Checking attributes ... ok
Checking filters ... ok
Checking main tables ... ok
#List the available attributes for this dataset. This shows you all the
attributes that can be retrieved.
> listAttributes(mart)
[1] "chr_name"
[2] "chrom_start"
[3] "chrom_end"
[4] "chrom_strand"
.
#List the available filters for this dataset. This shows all possible
filters that can be used on this dataset.
> listFilters(mart)
[1] "chr_name"
[2] "gene_chrom_start"
[3] "gene_chrom_end"
[4] "in_encode"
.
#Now you can do a query e.g. Get the GO id, evidence code, and description
for refseq id as filter and with values NM_001533. Note that you can also
give a vector of identifiers as values.
> getBM(attributes=c("go_id", "go_description","evidence_code"),
filter="refseq_dna",values="NM_001533",mart=mart)
refseq_dna go_id go_description
1 NM_001533 GO:0000166 nucleotide binding
2 NM_001533 GO:0003723 RNA binding
3 NM_001533 GO:0006397 mRNA processing
4 NM_001533 GO:0005654 nucleoplasm
5 NM_001533 GO:0030530 heterogeneous nuclear ribonucleoprotein complex
6 NM_001533 GO:0005634 nucleus
evidence_code
1 IEA
2 TAS
3 IEA
4 TAS
5 TAS
6 IEA
best,
Steffen
> Hi
>
> Earl F. Glynn wrote:
>> I looked at several Bioconductor packages that deal with Gene Ontology
>> (GO,
>> goTools, ontoTools), and I don't seem to find functionality that does
>> the
>> following:
>>
>>
>>
>> Given Locus ID NM_001533 I can go to NCBI
>
> I think that is a RefSeq ID and I am also pretty sure that LocusLink
> has been retired in favor of Entrez Gene (although we are a bit slow in
> moving).
>
>>
>> http://www.ncbi.nlm.nih.gov/
>>
>> and search "Nucleotide" for "NM_001533"
>>
>>
>>
>> I can click on the NM_0015333 hit returned, and about 2/3rds of the way
>> down
>> the page under the CDS section, the go_component, go_function, and
>> go_process subsections give Gene Ontology info for NM_0015333.
>>
>
> biomaRt might be your best choice
>
>>
>>
>> Likewise, if I do the same thing with Locus ID BC001721, I see a hit and
>> a
>> CDS section, but no gene ontology information. That's OK, I'm not
>> expecting
>> everything to have GO information. (E.g, of the 45,101 probesets on the
>> Mouse430_2 Affy chip, only about 4693 have GO Biological process
>> information, 2573 have celleular info, and 4875 have molecular function
>> info. I'm not working with Affy data, but I know many IDs won't have GO
>> info, but some will.)
>
> Again I do not believe that BC001721 is an Entrez Gene ID, and it does
> matter a bit.
>
> You can of course always use AnnBuilder to build your own annotation
> for a microarray (if that is what you are working off).
>
> Robert
>
>>
>>
>>
>> If I have a long list of Locus IDs, e.g., NM_001533, BC001721, ., are
>> there
>> any Bioconductor packages that "connect" these identifiers to gene
>> ontology
>> identifiers, or perhaps some other identifier (say LocusLink, aka
>> Enterez
>> Gene) that is mapped to the Gene Ontology information?
>>
>>
>>
>> Thanks for any suggestions on how this might be automated using
>> Bioconductor
>> and R.
>>
>>
>>
>> Earl F. Glynn
>>
>> Scientific Programmer
>>
>> Bioinformatics Department
>>
>> Stowers Institute for Medical Research
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>
>
> --
> Robert Gentleman, PhD
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> PO Box 19024
> Seattle, Washington 98109-1024
> 206-667-7700
> rgentlem at fhcrc.org
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
>
More information about the Bioconductor
mailing list