[BioC] How to know in advance what kind of ID one has in using getGO or GetGene with biomaRt?

Earl F. Glynn efg at stowers-institute.org
Mon Jan 30 23:40:40 CET 2006


I have a list of about 20,000 "Accession Reference" IDs and I want to find
corresponding Gene and GO information.

The IDs that start with "NM_" all seem to work fine as type="refseq", but
others, starting with "S", "AB", or "AF", can be found only as type="embl".
Those starting with "XM" seemingly cannot be found at all.

What information is stored in the prefix of an ID?  What do NM_, S, AB, AF,
or XM mean, and where is information about these prefixes?

Does it make sense to have a function that returns the type of an ID?  Does
it make sense to have biomaRt functions automatically "know" about the
various kinds of IDs?  I don't see how to vectorize any of this when one
must check the type of ID with each call.
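
As a rough illustration of what such a type-guessing function could look like, here is a minimal sketch in plain R. It is not part of biomaRt: the name guessIdType and the two regular expressions are assumptions, based on the usual accession shapes (RefSeq: two letters, an underscore, then digits, where NM_ marks a curated mRNA and XM_ a predicted one; classic GenBank/EMBL/DDBJ nucleotide accessions: one letter plus five digits, or two letters plus six digits). An ID classified as "refseq" may still be absent from the mart, as seems to be the case for the XM_ entries.

```r
# Hypothetical helper (not a biomaRt function): guess the ID type of each
# accession from its shape. Works on a whole character vector at once.
guessIdType <- function(ids) {
  ids  <- sub("\\.[0-9]+$", "", ids)            # drop a version suffix such as ".2"
  type <- rep(NA_character_, length(ids))       # NA means "shape not recognised"
  # RefSeq-style: two letters, underscore, digits (e.g. NM_001533)
  type[grepl("^[A-Z]{2}_[0-9]+$", ids)] <- "refseq"
  # GenBank/EMBL/DDBJ-style: 1 letter + 5 digits, or 2 letters + 6 digits
  type[grepl("^([A-Z][0-9]{5}|[A-Z]{2}[0-9]{6})$", ids)] <- "embl"
  type
}

probe.list  <- c("NM_001533", "NM_031990", "S76822", "AF232742", "AB035863")
probe.types <- guessIdType(probe.list)

# Group the IDs by guessed type; unrecognised (NA) IDs are dropped by split()
by.type <- split(probe.list, probe.types)
```

With the IDs grouped this way, each group could be handled with its own type= value instead of testing every ID individually. Whether getGO and getGene accept a whole vector of IDs in one call is an assumption that would need checking against the installed biomaRt version.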

Below I try "embl" IDs first because after a first pass I know I can only
connect about 3,000 out of 20,000 identifiers as "refseq".   Overall, trying
both "embl" and then "refseq" matches perhaps 90% of the dataset of 20,000,
but this doesn't seem very "clean", and perhaps about 1,000 XM probes were
never matched:

> # Show problem in knowing type of identifier while fetching GO or Gene info
> # using biomaRt.  efg, 30 Jan 2006
>
> library(biomaRt)
Loading required package: RMySQL
Loading required package: DBI
Loading required package: XML
Warning message:
DLL attempted to change FPU control word from 8001f to 9001f
> mart <- martConnect()
connected to:  ensembl_mart_36
>
> # First five "Accession Reference" IDs from CAMDA06-related probe dataset:
> # http://ecom2.mwgdna.com/download/arrays/arrays/gene_id/xls/gene_id_human_40k_a.xls
> # (discard _N or _NN in IDs)
> probe.list <- c("NM_001533", "NM_031990", "S76822", "AF232742", "AB035863")
>
> GeneInfo.List <- NULL
>
> for (i in 1:length(probe.list))
+ {
+   probe <- probe.list[i]
+
+   # Assume embl ID
+   GOinfo <- getGO(id=probe,type="embl",species="hsapiens",mart=mart)
+   if ( (length(GOinfo@table$GOID) == 1) & is.na(GOinfo@table$GOID[1]) )
+   {
+     # If embl ID fails, try as refseq (perhaps 15% refseqs with NM_)
+     GOinfo   <- getGO(id=probe,type="refseq",species="hsapiens",mart=mart)
+     GeneInfo <- getGene(id=probe,type="refseq",species="hsapiens",mart=mart)
+     cat(i, "refseq", probe, "\n")
+
+
+   } else {
+     cat(i, "embl", probe, "\n")
+     GeneInfo <- getGene(id=probe,type="embl",species="hsapiens",mart=mart)
+   }
+
+   GeneInfo.List <-  rbind( GeneInfo.List,
+                            c(probe,
+                              unlist( GeneInfo@table[c(1,3,4,5,6,7,2)]) ))
+
+   cat(GOinfo@id[1], GOinfo@table$GOID, "\n")
+ }
1 refseq NM_001533
NM_001533 GO:0000166 GO:0003723 GO:0006397 GO:0005654 GO:0030530 GO:0005634
2 refseq NM_031990
NM_031990 GO:0000166 GO:0005515 GO:0008187 GO:0000398 GO:0008380 GO:0005654
GO:0005730 GO:0030530 GO:0003676 GO:0003723 GO:0006397 GO:0005634
3 embl S76822
S76822 GO:0000287 GO:0004310 GO:0016491 GO:0016740 GO:0006695 GO:0008299
GO:0005783 GO:0016021
4 embl AF232742
AF232742 GO:0003807 GO:0004263 GO:0004295 GO:0008233 GO:0006508 GO:0006954
GO:0007596 GO:0042730 GO:0005615
5 embl AB035863
AB035863 GO:0016874 GO:0008152 GO:0004775 GO:0006099 GO:0006104 GO:0006781
GO:0005739
>
> print(GeneInfo.List)
                 symbol   band    chromosome start       end         martID
[1,] "NM_001533" "HNRPL"  "q13.2" "19"       "44018883"  "44032452"  "ENSG00000104824"
[2,] "NM_031990" "PTBP1"  "p13.3" "19"       "748411"    "763327"    "ENSG00000011304"
[3,] "S76822"    "FDFT1"  "p23.1" "8"        "11697664"  "11734215"  "ENSG00000079459"
[4,] "AF232742"  "KLKB1"  "q35.2" "4"        "187523815" "187554773" "ENSG00000164344"
[5,] "AB035863"  "SUCLA2" "q14.2" "13"       "47414793"  "47473463"  "ENSG00000136143"
     description
[1,] "Heterogeneous nuclear ribonucleoprotein L (hnRNP L). [Source:Uniprot/SWISSPROT;Acc:P14866]"
[2,] "Polypyrimidine tract-binding protein 1 (PTB) (Heterogeneous nuclear ribonucleoprotein I) (hnRNP I) (57 kDa RNA-binding protein PPTB-1). [Source:Uniprot/SWISSPROT;Acc:P26599]"
[3,] "Squalene synthetase (EC 2.5.1.21) (SQS) (SS) (Farnesyl-diphosphate farnesyltransferase) (FPP:FPP farnesyltransferase). [Source:Uniprot/SWISSPROT;Acc:P37268]"
[4,] "Plasma kallikrein precursor (EC 3.4.21.34) (Plasma prekallikrein) (Kininogenin) (Fletcher factor) [Contains: Plasma kallikrein heavy chain; Plasma kallikrein light chain]. [Source:Uniprot/SWISSPROT;Acc:P03952]"
[5,] "Succinyl-CoA ligase [ADP-forming] beta-chain, mitochondrial precursor (EC 6.2.1.5) (Succinyl-CoA synthetase, betaA chain) (SCS-betaA) (ATP-specific succinyl-CoA synthetase beta subunit). [Source:Uniprot/SWISSPROT;Acc:Q9P2R7]"
> write.csv(GeneInfo.List, row.names=F, file="GeneInfo.csv")
>
> martDisconnect(mart)
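
The "try embl, then fall back to refseq" logic of the loop above can also be written so that only the IDs that failed are retried. The following is a sketch under stated assumptions, not biomaRt code: lookupWithFallback, toyDb and toyLookup are hypothetical names, and toyLookup stands in for a real query function that returns one value per ID, with NA where nothing was found. The gene symbols in the toy table are copied from the session output above; "XM_000001" is a made-up placeholder for an ID that matches nothing.

```r
# Hypothetical helper: try each ID type in turn, re-querying only the misses.
# lookupFn(ids, type) must return one value per ID, NA where nothing was found.
lookupWithFallback <- function(ids, types, lookupFn) {
  result  <- rep(NA_character_, length(ids))   # value found for each ID
  matched <- rep(NA_character_, length(ids))   # ID type that produced the hit
  for (type in types) {
    todo <- is.na(result)                      # IDs still without a hit
    if (!any(todo)) break
    hits <- lookupFn(ids[todo], type)
    result[todo]  <- hits
    matched[todo] <- ifelse(is.na(hits), NA_character_, type)
  }
  data.frame(id = ids, type = matched, value = result,
             stringsAsFactors = FALSE)
}

# Toy stand-in for the database, so the sketch runs without a mart connection
toyDb <- list(
  embl   = c(S76822 = "FDFT1", AF232742 = "KLKB1", AB035863 = "SUCLA2"),
  refseq = c(NM_001533 = "HNRPL", NM_031990 = "PTBP1")
)
toyLookup <- function(ids, type) unname(toyDb[[type]][ids])

res <- lookupWithFallback(c("NM_001533", "S76822", "XM_000001"),
                          c("embl", "refseq"), toyLookup)
print(res)
```

Rows whose type column is still NA after all types have been tried are the unmatched IDs, which makes it easy to list the XM probes that were never found.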


efg
Bioinformatics
Stowers Institute


