[BioC] How to know in advance what kind of ID one has in using getGO or GetGene with biomaRt?
sdurinck@ebi.ac.uk
sdurinck at ebi.ac.uk
Tue Jan 31 09:26:22 CET 2006
Hi,
Ideally, your microarray provider should have added an extra column to the
Excel sheet indicating which database each identifier comes from.
This is what I found on "XM_*" IDs on the NCBI website:
> RefSeq Model (predicted) Sequence Records from the Human Genome
> annotation process
> Two letters (XM, XP, or XR), an underscore bar, and six digits, e.g.:
> XM_000483
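As a rough illustration of the format described in that NCBI note, the check below tests whether an ID is "XM, XP, or XR, an underscore, and six digits". The function name is_refseq_model is hypothetical and not part of biomaRt; it is only a sketch of the pattern, not a complete RefSeq validator.

```r
# Hypothetical helper (not a biomaRt function).
# Pattern taken from the NCBI description quoted above:
# XM, XP, or XR, then an underscore, then exactly six digits.
is_refseq_model <- function(ids) {
  grepl("^X[MPR]_[0-9]{6}$", ids)
}

# An ID with an extra "_1" suffix does not fit the documented format.
is_refseq_model(c("XM_000483", "XM_170432_1", "NM_001533"))
# [1]  TRUE FALSE FALSE
```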
Your XM IDs have an extra underscore and digit. Removing that suffix and
querying NCBI gives (for example, for ID XM_170432_1):
XM_170432 Reports
gi|20542485|ref|XM_170432.1|[20542485]
This record was removed as a result of standard genome annotation
processing. See the genome build documentation at
http://www.ncbi.nlm.nih.gov/genome/guide/build.html for further
information, or contact info at ncbi.nlm.nih.gov.
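The suffix removal described above can be done for a whole vector of IDs at once, and the prefixes mentioned in the question below can then be used to sort the IDs into groups before any lookup. The following is only a sketch under assumptions: strip_probe_suffix and guess_id_type are hypothetical helpers (not biomaRt functions), the suffix is assumed to be an underscore plus one or two digits, and the prefix rules simply mirror what was reported in this thread (NM_ found as "refseq", XM_ being a RefSeq model record, everything else tried as "embl").

```r
# Hypothetical helpers, not part of biomaRt.

# Drop a trailing "_N" or "_NN" probe suffix (assumed to be 1-2 digits).
# "NM_001533" is left alone because its underscore is followed by six digits.
strip_probe_suffix <- function(ids) {
  sub("_[0-9]{1,2}$", "", ids)
}

# Guess an ID type from the prefix alone. The labels are assumptions based
# on this thread: only "refseq" and "embl" were used as type= values there;
# "refseq_model" is just a label for XM/XP/XR records.
guess_id_type <- function(ids) {
  ifelse(grepl("^X[MPR]_", ids), "refseq_model",
         ifelse(grepl("^N[MPR]_", ids), "refseq", "embl"))
}

raw_ids <- c("NM_001533", "XM_170432_1", "S76822", "AF232742", "AB035863")
ids <- strip_probe_suffix(raw_ids)

# One character vector of IDs per guessed type.
id_groups <- split(ids, guess_id_type(ids))
print(id_groups)
```

Each group could then be queried with its matching type rather than testing every ID against every type, and the "refseq_model" group can be set aside for checking at NCBI, since such records may have been removed.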
best,
Steffen
> I have a list of about 20,000 "Accession Reference" IDs and I want to find
> corresponding Gene and GO information.
>
> The IDs that start with "NM_" all seem to work fine as type="refseq", but
> others, starting with, "S" or "AB" or "AF" can be found only as
> type="embl".
> Those with "XM" seemingly cannot be found.
>
> What information is stored in the prefix of an ID? What do NM_, S, AB, AF,
> or XM mean, and where is information about these prefixes?
>
> Does it make sense to have a function that returns the type of an ID? Does
> it make sense to have biomaRt functions automatically "know" about the
> various kinds of IDs? I don't see how to vectorize any of this when one
> must check the type of ID with each call.
>
> Below I try "embl" IDs first because after a first pass I know I can only
> connect about 3,000 out of 20,000 identifiers as "refseq". Overall, trying
> both "embl" and then "refseq" matches perhaps 90% of the dataset of 20,000,
> but this doesn't seem very "clean", and perhaps about 1,000 XM probes were
> never matched:
>
>> # Show problem in knowing type of identifier while fetching GO or Gene info
>> # using biomaRt. efg, 30 Jan 2006
>>
>> library(biomaRt)
> Loading required package: RMySQL
> Loading required package: DBI
> Loading required package: XML
> Warning message:
> DLL attempted to change FPU control word from 8001f to 9001f
>> mart <- martConnect()
> connected to: ensembl_mart_36
>>
>> # First five "Accession Reference" IDs from CAMDA06-related probe dataset:
>> # http://ecom2.mwgdna.com/download/arrays/arrays/gene_id/xls/gene_id_human_40k_a.xls
>> # (discard _N or _NN in IDs)
>> probe.list <- c("NM_001533", "NM_031990", "S76822", "AF232742", "AB035863")
>>
>> GeneInfo.List <- NULL
>>
>> for (i in 1:length(probe.list))
> + {
> + probe <- probe.list[i]
> +
> + # Assume embl ID
> + GOinfo <- getGO(id=probe,type="embl",species="hsapiens",mart=mart)
> + if ( (length(GOinfo@table$GOID) == 1) & is.na(GOinfo@table$GOID[1]) )
> + {
> + # If embl ID fails, try as refseq (perhaps 15% refseqs with NM_)
> + GOinfo <- getGO(id=probe,type="refseq",species="hsapiens",mart=mart)
> + GeneInfo <- getGene(id=probe,type="refseq",species="hsapiens",mart=mart)
> + cat(i, "refseq", probe, "\n")
> +
> +
> + } else {
> + cat(i, "embl", probe, "\n")
> + GeneInfo <- getGene(id=probe,type="embl",species="hsapiens",mart=mart)
> + }
> +
> + GeneInfo.List <- rbind( GeneInfo.List,
> + c(probe,
> + unlist( GeneInfo@table[c(1,3,4,5,6,7,2)]) ))
> +
> + cat(GOinfo@id[1], GOinfo@table$GOID, "\n")
> + }
> 1 refseq NM_001533
> NM_001533 GO:0000166 GO:0003723 GO:0006397 GO:0005654 GO:0030530
> GO:0005634
> 2 refseq NM_031990
> NM_031990 GO:0000166 GO:0005515 GO:0008187 GO:0000398 GO:0008380
> GO:0005654
> GO:0005730 GO:0030530 GO:0003676 GO:0003723 GO:0006397 GO:0005634
> 3 embl S76822
> S76822 GO:0000287 GO:0004310 GO:0016491 GO:0016740 GO:0006695 GO:0008299
> GO:0005783 GO:0016021
> 4 embl AF232742
> AF232742 GO:0003807 GO:0004263 GO:0004295 GO:0008233 GO:0006508 GO:0006954
> GO:0007596 GO:0042730 GO:0005615
> 5 embl AB035863
> AB035863 GO:0016874 GO:0008152 GO:0004775 GO:0006099 GO:0006104 GO:0006781
> GO:0005739
>>
>> print(GeneInfo.List)
> symbol band chromosome start end
> martID
> [1,] "NM_001533" "HNRPL" "q13.2" "19" "44018883" "44032452"
> "ENSG00000104824"
> [2,] "NM_031990" "PTBP1" "p13.3" "19" "748411" "763327"
> "ENSG00000011304"
> [3,] "S76822" "FDFT1" "p23.1" "8" "11697664" "11734215"
> "ENSG00000079459"
> [4,] "AF232742" "KLKB1" "q35.2" "4" "187523815" "187554773"
> "ENSG00000164344"
> [5,] "AB035863" "SUCLA2" "q14.2" "13" "47414793" "47473463"
> "ENSG00000136143"
> description
> [1,] "Heterogeneous nuclear ribonucleoprotein L (hnRNP L).
> [Source:Uniprot/SWISSPROT;Acc:P14866]"
> [2,] "Polypyrimidine tract-binding protein 1 (PTB) (Heterogeneous nuclear
> ribonucleoprotein I) (hnRNP I) (57 kDa RNA-binding protein PPTB-1).
> [Source:Uniprot/SWISSPROT;Acc:P26599]"
> [3,] "Squalene synthetase (EC 2.5.1.21) (SQS) (SS) (Farnesyl-diphosphate
> farnesyltransferase) (FPP:FPP farnesyltransferase).
> [Source:Uniprot/SWISSPROT;Acc:P37268]"
> [4,] "Plasma kallikrein precursor (EC 3.4.21.34) (Plasma prekallikrein)
> (Kininogenin) (Fletcher factor) [Contains: Plasma kallikrein heavy chain;
> Plasma kallikrein light chain]. [Source:Uniprot/SWISSPROT;Acc:P03952]"
> [5,] "Succinyl-CoA ligase [ADP-forming] beta-chain, mitochondrial
> precursor
> (EC 6.2.1.5) (Succinyl-CoA synthetase, betaA chain) (SCS-betaA) (ATP-
> specific succinyl-CoA synthetase beta subunit).
> [Source:Uniprot/SWISSPROT;Acc:Q9P2R7]"
>> write.csv(GeneInfo.List, row.names=F, file="GeneInfo.csv")
>>
>> martDisconnect(mart)
>
>
> efg
> Bioinformatics
> Stowers Institute
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
>