[BioC] biomart to a data.frame
Steve Lianoglou
mailinglist.honeypot at gmail.com
Wed Jan 25 16:02:30 CET 2012
Hi Assa,
Sorry for top posting.
Your intuition is correct: you should not be querying biomart
inside a for loop. The idea is to create one query for all of your
protein IDs, and run it once.
This is how you might go about it. First, let's look at the protein
IDs you already seem to have somewhere:
> 45 FBpp0070037
> 46 FBpp0070039;FBpp0070040
> 47 FBpp0070041;FBpp0070042;FBpp0070043
> 48 FBpp0070044;FBpp0110571
It seems you have multiple IDs jammed into one column of a data.frame,
maybe? The rows which have more than one ID (e.g.
"FBpp0070039;FBpp0070040") will have to be split up so that each row
(or element in a vector) has only one ID. Look into using `strsplit`.
You will need to get a character vector of protein IDs -- one protein
per element. It might look like so:
pids <- c('FBpp0070037', 'FBpp0070039', 'FBpp0070040', 'FBpp0070041',
'FBpp0070042', 'FBpp0070043')
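As a rough sketch of how `strsplit` could build that vector from your table: the data.frame name `dat` and the column name `protein_ids` below are made up for illustration, so swap in whatever your object is really called. The "REV__" stripping is carried over from your loop.

```r
# Hypothetical input: `dat` and `protein_ids` are assumed names, not from the original data
dat <- data.frame(
  row_id = c(45, 46, 47, 48),
  protein_ids = c("FBpp0070037",
                  "FBpp0070039;FBpp0070040",
                  "FBpp0070041;FBpp0070042;FBpp0070043",
                  "FBpp0070044;FBpp0110571"),
  stringsAsFactors = FALSE
)

# Split each row on ";" -> a list with one character vector per row
id_list <- strsplit(dat$protein_ids, ";", fixed = TRUE)

# Drop a leading "REV__" prefix, as the original loop did with gsub
id_list <- lapply(id_list, function(x) sub("^REV__", "", x))

# Flatten to one ID per element and remove duplicates before querying
pids <- unique(unlist(id_list))
```

Keeping `id_list` around is handy, since it remembers which IDs came from which row.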
Now ... you're basically done. Let's rig up an object to query biomart with:
library(biomaRt)
mart <- useMart('ensembl', dataset='dmelanogaster_gene_ensembl')
ans <- getBM(attributes=c("flybase_translation_id","flybase_gene_id","flybasename_gene"),
filters="flybase_translation_id", values=pids, mart=mart)
Your answer will look like so:
  flybase_translation_id flybase_gene_id flybasename_gene
1            FBpp0070037     FBgn0010215        alpha-Cat
2            FBpp0070039     FBgn0052230          CG32230
3            FBpp0070040     FBgn0052230          CG32230
4            FBpp0070041     FBgn0000258        CkIIalpha
5            FBpp0070042     FBgn0000258        CkIIalpha
6            FBpp0070043     FBgn0000258        CkIIalpha
Now you're left with figuring out what to do with multiple
"flybase_translation_id"s that map to the same "flybasename_gene".
You would have to do this anyway, but the key point here is that you
can now do it without querying biomart in a loop.
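One possible way to fold the result back into one semicolon-separated gene column per original row is sketched here. It is only an illustration: `ans` is rebuilt by hand from the rows shown above so the snippet runs without a biomart connection, `id_list` and `gene_for` are hypothetical names, and collapsing duplicate gene IDs with `unique` is an assumption about what you want.

```r
# Rows copied from the example result above (normally this comes from getBM)
ans <- data.frame(
  flybase_translation_id = c("FBpp0070037", "FBpp0070039", "FBpp0070040",
                             "FBpp0070041", "FBpp0070042", "FBpp0070043"),
  flybase_gene_id = c("FBgn0010215", "FBgn0052230", "FBgn0052230",
                      "FBgn0000258", "FBgn0000258", "FBgn0000258"),
  stringsAsFactors = FALSE
)

# Hypothetical: protein IDs per original row, e.g. the output of strsplit
id_list <- list(
  "FBpp0070037",
  c("FBpp0070039", "FBpp0070040"),
  c("FBpp0070041", "FBpp0070042", "FBpp0070043")
)

# Look up the gene ID for each protein ID in a row, drop IDs with no hit,
# collapse duplicates, and join with ";".
# Note: match() keeps only the first hit if a protein ID appears more than once in `ans`.
gene_for <- function(ids) {
  genes <- ans$flybase_gene_id[match(ids, ans$flybase_translation_id)]
  paste(unique(genes[!is.na(genes)]), collapse = ";")
}

gene_col <- vapply(id_list, gene_for, character(1))
```

If you would rather keep one gene ID per protein ID (so the two columns line up position by position), leave out the `unique` call.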
HTH,
-steve
> For each of these protein Ids (FBpp...), I would like to extract the gene
> id (Fbgn....) in a third column. the output table should looks like that:
>
> 45 FBpp0070037 FBgn001234
> 46 FBpp0070039;FBpp0070040 FBgn00094432;FBgn002345
> 47 FBpp0070041;FBpp0070042;FBpp0070043 FBgn0001936;FBgn000102;FBgn004527
> 48 FBpp0070044;FBpp0110571 FBgn0097234;FBgn00183
> ...
>
> I was thinking using biomaRt, but I could find a way of automating it for
> the complete protein ids in the line.
>
> What I have done so far is this for loop:
>
> for(i in 1:dim(data)[1]){
> temp=unlist(strsplit(data[i,2],";"))
> temp= gsub("REV__", "", temp)
> result=
> getBM(attributes=c("flybase_translation_id","flybase_gene_id","flybasename_gene"),filters="flybase_translation_id",values=temp,
> mart=mart, )
> charresult =""
> for (j in 1:length(result[[1]])) {
> # charresult<-paste(charresult,">",
> result[[1]][j],":",result[[2]][j], "\t", sep="")
> charresult<-paste(charresult, result[[2]][j], ";", sep="")
> }
> out<-"CompleteResults.txt"
> cat("line: ", i-1,"\t", "was written to ");cat(out);cat("\n")
> write.table(paste(i-1, charresult, sep="\t"),out, sep="\t", quote=F,
> col.names=F, row.names=F,append=T)
> }
>
> What I am doing is converting the string of FBpp Ids into a character
> vector and than run each line into the getBM command. I first think it is a
> bad idea, as I am using a loop to inquire an online data base, but i don't
> have a better option at the moment.
>
> The second problem is that it just takes a lot of time.
>
> I would appreciate your Ideas, If there is a better/faster way of doing it
>
> Thanks A.
>
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact