[BioC] biomart to a data.frame
Hans-Rudolf Hotz
hrh at fmi.ch
Thu Jan 26 10:24:43 CET 2012
On 01/26/2012 08:28 AM, Assa Yeroslaviz wrote:
> Hi Steve,
>
> thanks for the help.
>
> I know about the strsplit function, and I used it to split each row on its
> own at the ';' symbol.
> The problem is that I need to keep each row's information with that row
> (or at least restore it after the biomaRt extraction).
>
> The table I have contains not only the protein IDs but also a lot of other
> stuff, which is connected to each of the proteins. This is why I need to
> know which proteins came from which line (Id).
>
> It would be nice if it were possible to do it as you suggested: take
> all the protein IDs, write them into one vector and run them through biomaRt.
> But then I would like to be able to put them back together row-wise,
> as I suggested at the beginning.
>
Hi
Please allow me to jump in:
If I understand your question correctly, then there is no other (easy)
solution than querying biomart inside a loop.
The problem is not the Bioconductor package biomaRt, but the actual
biomart server behind the scenes: apparently there is no way to preserve
the order of the input (or to keep duplicates, or to indicate which ID has
no result, etc.).
I recently asked the biomart folks about this issue, and the answer was
that I need to post-process the output to get my original order back. I
was lazy and queried the server in a loop (in my defense: it was
only a handful of IDs).
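
For what it's worth, the post-processing route could look roughly like
the sketch below: split the IDs while remembering which row each one came
from, query once, then use match() to put the results back in input order
and paste them together per row. This is only a sketch under some
assumptions: the column names (id, proteins, row, pid, genes) are made up
for illustration, 'ans' is a stand-in built from the getBM() output Steve
posted (shuffled on purpose, to mimic a server that does not keep the
input order), and each protein ID is assumed to map to at most one gene ID.

```r
# Input table: one row per line id, protein ids joined by ';'
# (rows taken from the example in this thread; column names are illustrative).
data <- data.frame(
  id = c(45, 46, 47),
  proteins = c("FBpp0070037",
               "FBpp0070039;FBpp0070040",
               "FBpp0070041;FBpp0070042;FBpp0070043"),
  stringsAsFactors = FALSE
)

# Split every row and record which row each protein id came from.
parts <- strsplit(data$proteins, ";", fixed = TRUE)
long <- data.frame(
  row = rep(seq_along(parts), lengths(parts)),
  pid = gsub("REV__", "", unlist(parts), fixed = TRUE),
  stringsAsFactors = FALSE
)

# One query for all unique ids. With a live connection this would be:
#   ans <- getBM(attributes = c("flybase_translation_id", "flybase_gene_id"),
#                filters = "flybase_translation_id",
#                values = unique(long$pid), mart = mart)
# Stand-in for that result (values from Steve's output, order shuffled):
ans <- data.frame(
  flybase_translation_id = c("FBpp0070043", "FBpp0070037", "FBpp0070040",
                             "FBpp0070039", "FBpp0070042", "FBpp0070041"),
  flybase_gene_id = c("FBgn0000258", "FBgn0010215", "FBgn0052230",
                      "FBgn0052230", "FBgn0000258", "FBgn0000258"),
  stringsAsFactors = FALSE
)

# match() looks up each input id in the result, so the input order and
# duplicates are preserved; ids without a hit come back as NA.
long$gene <- ans$flybase_gene_id[match(long$pid, ans$flybase_translation_id)]

# Collapse back to one ';'-separated string per original row.
by_row <- split(long$gene, factor(long$row, levels = seq_len(nrow(data))))
data$genes <- vapply(by_row, paste, character(1), collapse = ";",
                     USE.NAMES = FALSE)

print(data)
```

Note that match() only returns the first hit, so if a protein ID can map
to several gene IDs, merge() on the long table would be the safer choice.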
Regards, Hans
> Thanks again
> Assa
>
> On Wed, Jan 25, 2012 at 16:02, Steve Lianoglou<
> mailinglist.honeypot at gmail.com> wrote:
>
>> Hi Assa,
>>
>> Sorry for top posting.
>>
>> Your intuition is correct: you should not be querying biomart
>> inside a for loop. The idea is to create one query for all of your
>> protein IDs, and query it once.
>>
>> This is how you might go about it. First, let's look at the protein
>> IDs you already seem to have somewhere:
>>
>>> 45 FBpp0070037
>>> 46 FBpp0070039;FBpp0070040
>>> 47 FBpp0070041;FBpp0070042;FBpp0070043
>>> 48 FBpp0070044;FBpp0110571
>>
>> It seems you have multiple IDs jammed into one column of a data.frame
>> maybe? The rows which have more than one ID, (eg.
>> "FBpp0070039;FBpp0070040") will have to be split up so that each row
>> (or element in a vector) only has one ID. Look into using `strsplit`.
>>
>> You will need to get a character vector of protein ids -- one protein
>> per bin, it might look like so:
>>
>> pids<- c('FBpp0070037', 'FBpp0070039', 'FBpp0070040', 'FBpp0070041',
>> 'FBpp0070042', 'FBpp0070043')
>>
>> Now ... you're basically done. Let's rig up an object to query biomart
>> with:
>>
>> library(biomaRt)
>> mart<- useMart('ensembl', dataset='dmelanogaster_gene_ensembl')
>> ans<-
>> getBM(attributes=c("flybase_translation_id","flybase_gene_id","flybasename_gene"),
>> filters="flybase_translation_id", values=pids,
>> mart=mart)
>>
>> Your answer will look like so:
>>
>> flybase_translation_id flybase_gene_id flybasename_gene
>> 1 FBpp0070037 FBgn0010215 alpha-Cat
>> 2 FBpp0070039 FBgn0052230 CG32230
>> 3 FBpp0070040 FBgn0052230 CG32230
>> 4 FBpp0070041 FBgn0000258 CkIIalpha
>> 5 FBpp0070042 FBgn0000258 CkIIalpha
>> 6 FBpp0070043 FBgn0000258 CkIIalpha
>>
>> Now you're left with figuring out what to do with multiple
>> "flybase_translation_id"s that map to the same "flybasename_gene".
>>
>> You would have to do this anyway, but the key point here is that you
>> can now do it without querying biomart in a loop.
>>
>> HTH,
>> -steve
>>
>>
>>
>>> For each of these protein IDs (FBpp...), I would like to extract the gene
>>> ID (FBgn...) into a third column. The output table should look like this:
>>>
>>> 45 FBpp0070037 FBgn001234
>>> 46 FBpp0070039;FBpp0070040 FBgn00094432;FBgn002345
>>> 47 FBpp0070041;FBpp0070042;FBpp0070043 FBgn0001936;FBgn000102;FBgn004527
>>> 48 FBpp0070044;FBpp0110571 FBgn0097234;FBgn00183
>>> ...
>>>
>>> I was thinking of using biomaRt, but I could not find a way of automating
>>> it for all the protein IDs in a line.
>>>
>>> What I have done so far is this for loop:
>>>
>>> for (i in seq_len(nrow(data))) {
>>>   # split the ';'-separated protein IDs of row i, drop the REV__ prefix
>>>   temp <- unlist(strsplit(data[i, 2], ";"))
>>>   temp <- gsub("REV__", "", temp)
>>>   result <- getBM(attributes = c("flybase_translation_id",
>>>                                  "flybase_gene_id", "flybasename_gene"),
>>>                   filters = "flybase_translation_id", values = temp,
>>>                   mart = mart)
>>>   # join the gene IDs (second column) into one ';'-terminated string
>>>   charresult <- ""
>>>   for (j in seq_along(result[[1]])) {
>>>     # charresult <- paste(charresult, ">", result[[1]][j], ":", result[[2]][j], "\t", sep = "")
>>>     charresult <- paste(charresult, result[[2]][j], ";", sep = "")
>>>   }
>>>   out <- "CompleteResults.txt"
>>>   cat("line: ", i - 1, "\t", "was written to "); cat(out); cat("\n")
>>>   write.table(paste(i - 1, charresult, sep = "\t"), out, sep = "\t",
>>>               quote = FALSE, col.names = FALSE, row.names = FALSE,
>>>               append = TRUE)
>>> }
>>>
>>> What I am doing is converting the string of FBpp IDs into a character
>>> vector and then running each line through the getBM command. First, I
>>> think it is a bad idea, as I am using a loop to query an online database,
>>> but I don't have a better option at the moment.
>>>
>>> The second problem is that it just takes a lot of time.
>>>
>>> I would appreciate your ideas, if there is a better/faster way of doing it.
>>>
>>> Thanks A.
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>