[BioC] Genbank to Unigene IDs
Gordon Smyth
smyth at wehi.edu.au
Sat Apr 17 08:49:01 CEST 2004
Dear John,
Thanks for your suggestion. I can see the attraction of going through
LocusLink because the LocusLink files are relatively small. But the fact
that LocusLink covers only a subset of GenBank (as pointed out by Dave Waddell)
seems disastrous. I tried your code on a set of GenBank IDs from a human
oligo array based on the Compugen 19k library. The code found LocusLink IDs
for only 4587 of the GenBank IDs. Meanwhile, SOURCE found UniGene IDs for
16230 of them. So going through LocusLink found the UniGene ID in less than
30% of the cases in which there was one to find (4587/16230, about 28%).
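
As a rough sketch only (not the code used for the test above), the coverage
comparison can be written in R on a toy table. The table below is a
hypothetical stand-in for the two-column GenBank-to-LocusLink matrix from the
quoted example; the ".1" version suffixes and the extra query ID are
assumptions added for illustration. It treats the "-" placeholder as
"no mapping", which a plain is.element() match would otherwise count as a hit.

```r
# Toy stand-in for the GenBank -> LocusLink table. The ".1" version suffixes
# are hypothetical; the LL values echo the quoted output in this thread.
ids2ll <- data.frame(
  GB = c("AC010642.1", "AF414429.1", "X56654.1", "Y08432.1"),
  LL = c("-", "15778556", "30506", "-"),
  stringsAsFactors = FALSE
)

# Query IDs; "NM_004551" is included as an ID that is absent from the table.
ids <- c("AC010642", "AF414429", "X56654", "Y08432", "NM_004551")

# Drop the version suffix, then look up each query ID (NA when not present).
ids2ll$GB <- sub("\\..*$", "", ids2ll$GB)
ll <- ids2ll$LL[match(ids, ids2ll$GB)]

# Treat both "no row" (NA) and the "-" placeholder as unmapped.
ll[!is.na(ll) & ll == "-"] <- NA
n_mapped <- sum(!is.na(ll))
coverage <- n_mapped / length(ids)

# Ratio of the two counts reported in this message.
reported <- 4587 / 16230

print(data.frame(id = ids, LL = ll))
print(c(n_mapped = n_mapped, coverage = coverage, reported = reported))
```

Using match() rather than is.element() keeps the result aligned with the
query vector, so unmapped IDs stay visible as NA instead of silently dropping
out of the result.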
Gordon
At 11:24 PM 16/04/2004, John Zhang wrote:
> >I have a list of GenBank IDs for which I'd like the corresponding Unigene
> >cluster IDs. What is the easiest way to do this using Bioconductor
> >functions? (I've scanned annotate and AnnBuilder help and vignettes,
> >although way too quickly.)
> >
> >For the sake of being specific, here's a concrete example. What's Unigene
> >for GB="NM_004551"?
>
>Sorry for this delayed posting (I took one day off yesterday)
>
>I think the most direct way of getting the IDs mapped is to use the source files
>available at LocusLink (ftp://ftp.ncbi.nih.gov/refseq/LocusLink). If your target file
>contains GenBank accession numbers (e.g. "AC010642", "AC010642", ...), read
>ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc using read.table (sep = "\t")
>and then match against it. If your target file contains RefSeq IDs (e.g.
>"NM_130786", "NM_000014", ...), read
>ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2ref instead. An example:
>
> > ids <- c("AC010642", "AF414429", "X56654", "Y08432")
> > ids2ll <- as.matrix(read.table(
> +     "ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc",
> +     header = FALSE, sep = "\t", strip.white = TRUE))
># We only need the second and third column
> > ids2ll <- ids2ll[, c(2, 3)]
> > colnames(ids2ll) <- c("GB", "LL")
># Drop the version number
> > ids2ll[,1] <- gsub("\\..*", "", ids2ll[,1])
> > mapped <- ids2ll[is.element(ids2ll[,1], ids),]
> > mapped
>      GB         LL
>1     "AC010642" "-"
>4     "AF414429" "15778556"
>10671 "X56654"   "30506"
>10677 "Y08432"   "-"
>
>
>
> >
> >Thanks a lot
> >Gordon
> >
>
>Jianhua Zhang
>Department of Biostatistics
>Dana-Farber Cancer Institute
>44 Binney Street
>Boston, MA 02115-6084