[BioC] How to know in advance what kind of ID one has in using getGO or GetGene with biomaRt?

Tue Jan 31 09:26:22 CET 2006

Hi,

Ideally your microarray provider should have added an extra column to the
excel sheet indicating which from database each identifier comes.

This is what I found on "XM_*" IDs on the NCBI website:

>RefSeq Model (predicted) Sequence Records from the Human Genome
annotation >process 	Two letters (XM, XP, or XR), an underscore bar, and
six >digits, e.g.:
>XM_000483

Your XM IDs have an extra underscore and digit, removing that and querying
at NCBI gives (for example for ID XM_170432_1)

 XM_170432 	Reports

    gi|20542485|ref|XM_170432.1|[20542485]
    This record was removed as a result of standard genome annotation
processing. See the genome build documentation at
http://www.ncbi.nlm.nih.gov/genome/guide/build.html for further
information, or contact info at ncbi.nlm.nih.gov.

best,
Steffen

> I have a list of about 20,000 "Accession Reference" IDs and I want to find
> corresponding Gene and GO information.
>
> The IDs that start with "NM_" all seem to work fine as type="refseq", but
> others, starting with, "S" or "AB" or "AF" can be found only as
> type="embl".
> Those with "XM" seemingly cannot be found.
>
> What information is stored in the prefix of an ID?  What do NM_, S, AB,
> AF,
> or XM mean, and where is information about these prefixes?
>
> Does it make sense to have a function that returns the type of an ID?
> Does
> it make sense to have biomaRt functions automtically "know" about the
> various kinds of IDs?  I don't see how to vectorize any of this when one
> must check the type of ID with each call.
>
> Below I try "embl" IDs first because after a first pass I know I can only
> connect about 3,000 out of 20,000 identifiers as "refseq".   Overall,
> trying
> both "embl" and then "refseq" matches perhaps 90% of the dataset of
> 20,000,
> but this doesn't seem very "clean", and perhaps about 1,000 XM probes were
> never matched:
>
>> # Show problem in knowing type of identifier while fetching GO or Gene
> info
>> # using biomaRt.  efg, 30 Jan 2006
>>
>> library(biomaRt)
> Loading required package: RMySQL
> Loading required package: DBI
> Loading required package: XML
> Warning message:
> DLL attempted to change FPU control word from 8001f to 9001f
>> mart <- martConnect()
> connected to:  ensembl_mart_36
>>
>> # First five "Accession Reference" IDs from CAMDA06-related probe
>> dataset:
>> #
> http://ecom2.mwgdna.com/download/arrays/arrays/gene_id/xls/gene_id_human_40k_a.xls
>> # (discard _N or _NN in IDs)
>> probe.list <- c("NM_001533", "NM_031990", "S76822", "AF232742",
> "AB035863")
>>
>> GeneInfo.List <- NULL
>>
>> for (i in 1:length(probe.list))
> + {
> +   probe <- probe.list[i]
> +
> +   # Assume embl ID
> +   GOinfo <- getGO(id=probe,type="embl",species="hsapiens",mart=mart)
> +   if ( (length(GOinfo at table$GOID) == 1) & is.na(GOinfo at table$GOID[1]) )
> +   {
> +     # IF embl ID fails, try as refseq (perhaps 15% refseqs with NM_
> +     GOinfo   <- getGO(
> id=probe,type="refseq",species="hsapiens",mart=mart)
> +     GeneInfo <-
> getGene(id=probe,type="refseq",species="hsapiens",mart=mart)
> +     cat(i, "refseq", probe, "\n")
> +
> +
> +   } else {
> +     cat(i, "embl", probe, "\n")
> +     GeneInfo <-
> getGene(id=probe,type="embl",species="hsapiens",mart=mart)
> +   }
> +
> +   GeneInfo.List <-  rbind( GeneInfo.List,
> +                            c(probe,
> +                              unlist( GeneInfo at table[c(1,3,4,5,6,7,2)])
> ))
> +
> +   cat(GOinfo at id[1], GOinfo at table$GOID, "\n")
> + }
> 1 refseq NM_001533
> NM_001533 GO:0000166 GO:0003723 GO:0006397 GO:0005654 GO:0030530
> GO:0005634
> 2 refseq NM_031990
> NM_031990 GO:0000166 GO:0005515 GO:0008187 GO:0000398 GO:0008380
> GO:0005654
> GO:0005730 GO:0030530 GO:0003676 GO:0003723 GO:0006397 GO:0005634
> 3 embl S76822
> S76822 GO:0000287 GO:0004310 GO:0016491 GO:0016740 GO:0006695 GO:0008299
> GO:0005783 GO:0016021
> 4 embl AF232742
> AF232742 GO:0003807 GO:0004263 GO:0004295 GO:0008233 GO:0006508 GO:0006954
> GO:0007596 GO:0042730 GO:0005615
> 5 embl AB035863
> AB035863 GO:0016874 GO:0008152 GO:0004775 GO:0006099 GO:0006104 GO:0006781
> GO:0005739
>>
>> print(GeneInfo.List)
>                  symbol   band    chromosome start       end
> martID
> [1,] "NM_001533" "HNRPL"  "q13.2" "19"       "44018883"  "44032452"
> "ENSG00000104824"
> [2,] "NM_031990" "PTBP1"  "p13.3" "19"       "748411"    "763327"
> "ENSG00000011304"
> [3,] "S76822"    "FDFT1"  "p23.1" "8"        "11697664"  "11734215"
> "ENSG00000079459"
> [4,] "AF232742"  "KLKB1"  "q35.2" "4"        "187523815" "187554773"
> "ENSG00000164344"
> [5,] "AB035863"  "SUCLA2" "q14.2" "13"       "47414793"  "47473463"
> "ENSG00000136143"
>      description
> [1,] "Heterogeneous nuclear ribonucleoprotein L (hnRNP L).
> [Source:Uniprot/SWISSPROT;Acc:P14866]"
> [2,] "Polypyrimidine tract-binding protein 1 (PTB) (Heterogeneous nuclear
> ribonucleoprotein I) (hnRNP I) (57 kDa RNA-binding protein PPTB-1).
> [Source:Uniprot/SWISSPROT;Acc:P26599]"
> [3,] "Squalene synthetase (EC 2.5.1.21) (SQS) (SS) (Farnesyl-diphosphate
> farnesyltransferase) (FPP:FPP farnesyltransferase).
> [Source:Uniprot/SWISSPROT;Acc:P37268]"
> [4,] "Plasma kallikrein precursor (EC 3.4.21.34) (Plasma prekallikrein)
> (Kininogenin) (Fletcher factor) [Contains: Plasma kallikrein heavy chain;
> Plasma kallikrein light chain]. [Source:Uniprot/SWISSPROT;Acc:P03952]"
> [5,] "Succinyl-CoA ligase [ADP-forming] beta-chain, mitochondrial
> precursor
> (EC 6.2.1.5) (Succinyl-CoA synthetase, betaA chain) (SCS-betaA) (ATP-
> specific succinyl-CoA synthetase beta subunit).
> [Source:Uniprot/SWISSPROT;Acc:Q9P2R7]"
>> write.csv(GeneInfo.List, row.names=F, file="GeneInfo.csv")
>>
>> martDisconnect(mart)
>
>
> efg
> Bioinformatics
> Stowers Institute
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
>