[BioC] Genbank to Unigene IDs
John Zhang
jzhang at jimmy.harvard.edu
Fri Apr 16 15:24:03 CEST 2004
>I have a list of GenBank IDs for which I'd like the corresponding Unigene
>cluster IDs. What is the easiest way to do this using Bioconductor
>functions? (I've scanned annotate and AnnBuilder help and vignettes,
>although way too quickly.)
>
>For the sake of being specific, here's a concrete example. What's Unigene
>for GB="NM_004551"?
Sorry for the delayed reply (I took a day off yesterday).
I think the most direct way of getting the IDs mapped is to use the source files
available at LocusLink (ftp://ftp.ncbi.nih.gov/refseq/LocusLink). If your target
file contains GenBank accession numbers (e.g. "AC010642", "AF414429", ...), read
ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc using read.table (sep = "\t")
and then match your IDs against it. If your target file contains RefSeq IDs
(e.g. "NM_130786", "NM_000014", ...), read
ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2ref instead. An example:
> ids <- c("AC010642", "AF414429", "X56654", "Y08432")
> ids2ll <- as.matrix(read.table(
+     "ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc",
+     header = FALSE, sep = "\t", strip.white = TRUE))
> # We only need the second and third columns
> ids2ll <- ids2ll[, c(2, 3)]
> colnames(ids2ll) <- c("GB", "LL")
> # Drop the version number
> ids2ll[, 1] <- gsub("\\..*", "", ids2ll[, 1])
> mapped <- ids2ll[is.element(ids2ll[, 1], ids), ]
> mapped
      GB         LL
1     "AC010642" "-"
4     "AF414429" "15778556"
10671 "X56654"   "30506"
10677 "Y08432"   "-"
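
The matching step can also be tried without the FTP download. The following is
only a minimal sketch under stated assumptions: the small table stands in for
the one read from loc2acc and reuses the values printed above, the version
suffixes (".1", ".2") and the accession "Q00000" are made up for illustration,
and match() is swapped in for is.element() so that the result keeps the order of
the query IDs and reports unmapped ones as NA.

```r
# Toy stand-in for the table read from loc2acc. The GB/LL values are copied
# from the output above; the version suffixes are made up for illustration.
ids2ll <- data.frame(
    GB = c("AC010642.1", "AF414429.1", "X56654.2", "Y08432.1"),
    LL = c("-", "15778556", "30506", "-"),
    stringsAsFactors = FALSE
)

# Drop the version number, as in the session above
ids2ll$GB <- sub("\\..*$", "", ids2ll$GB)

# Query IDs; "Q00000" is a hypothetical accession that is not in the table
ids <- c("X56654", "Q00000", "Y08432")

# match() gives, for each query ID, the row of its first occurrence (or NA)
hits <- ids2ll$LL[match(ids, ids2ll$GB)]

# Treat the "-" placeholder as missing too
hits[!is.na(hits) & hits == "-"] <- NA

# One row per query ID, in the order the IDs were supplied
mapped <- data.frame(GB = ids, LL = hits, stringsAsFactors = FALSE)
print(mapped)
```

Note that match() only returns the first matching row, so if an accession can
appear on several rows of the real file, is.element() (as used above) or
merge() would be the safer choice.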
>
>Thanks a lot
>Gordon
Jianhua Zhang
Department of Biostatistics
Dana-Farber Cancer Institute
44 Binney Street
Boston, MA 02115-6084