[BioC] biomart to a data.frame

Thu Jan 26 10:24:43 CET 2012

On 01/26/2012 08:28 AM, Assa Yeroslaviz wrote:
> Hi Steve,
>
> thanks for the help.
>
> I know about the strsplit function, and I used it to split each row on its
> own at the ';' symbol.
> The problem is that I need to keep track of which row each ID came from
> (or at least be able to restore that after the biomaRt extraction).
>
> The table I have contains not only the protein IDs but also a lot of other
> stuff, which is connected to each of the proteins. This is why I need to
> know which proteins came from which line (Id).
>
> It would be nice if it were possible to do it as you suggested: take
> all the protein IDs, write them into one vector and run them with biomaRt.
> But then I would like to be able to put them back together in a row-wise
> fashion like I suggested at the beginning.
>

Hi

Please allow me to jump in:

If I understand your question correctly, then there is no other (easy) 
solution than querying biomart inside a loop.

The problem is not the Bioconductor package biomaRt, but the actual
biomart server behind the scenes: apparently there is no way to preserve
the order of the input (or keep duplicates, or indicate which id does
not have a result, etc.).

I recently asked the biomart folks about this issue, and the answer was
that I need to post-process the output to get my original order back. I
was lazy, and I queried the server in a loop (in my defense: it was
only a handful of ids).
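
A minimal sketch of that post-processing, in case it helps. It assumes the
IDs sit in a ';'-separated column as in the table quoted below. The names
`tab`, `long` and `gene_ids` are made up for illustration, and `ans` is a
hand-written stand-in for the result of a single getBM() call (shuffled,
and with two IDs missing on purpose), so nothing here touches the server:

```r
# Toy input: one row per original table line, IDs joined by ';'.
tab <- data.frame(
  line = c(45, 46, 47, 48),
  protein_ids = c("FBpp0070037",
                  "FBpp0070039;FBpp0070040",
                  "FBpp0070041;FBpp0070042;FBpp0070043",
                  "FBpp0070044;FBpp0110571"),
  stringsAsFactors = FALSE
)

# Split each row, remembering which row every ID came from.
id_list <- strsplit(tab$protein_ids, ";", fixed = TRUE)
long <- data.frame(
  row = rep(seq_along(id_list), lengths(id_list)),
  pid = unlist(id_list),
  stringsAsFactors = FALSE
)

# Stand-in for ONE getBM() call with values = unique(long$pid).
# Assumption: rows come back in arbitrary order and misses are dropped.
ans <- data.frame(
  flybase_translation_id = c("FBpp0070041", "FBpp0070037", "FBpp0070043",
                             "FBpp0070040", "FBpp0070039", "FBpp0070042"),
  flybase_gene_id = c("FBgn0000258", "FBgn0010215", "FBgn0000258",
                      "FBgn0052230", "FBgn0052230", "FBgn0000258"),
  stringsAsFactors = FALSE
)

# match() finds each input ID in the result; IDs without a hit become NA.
long$gene <- ans$flybase_gene_id[match(long$pid, ans$flybase_translation_id)]

# Collapse back to one string per original row, in the original ID order.
tab$gene_ids <- unname(vapply(
  split(long$gene, long$row),
  function(g) paste(g, collapse = ";"),
  character(1)
))

print(tab[, c("line", "gene_ids")])
```

With this toy data, line 48 ends up as "NA;NA" because neither of its IDs
is in the stand-in result. Note that match() only keeps the first hit, so
an ID that maps to several genes would need extra handling.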

Regards, Hans

> Thanks again
> Assa
>
> On Wed, Jan 25, 2012 at 16:02, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>
>> Hi Assa,
>>
>> Sorry for top posting.
>>
>> Your intuition is correct: you should not be querying biomart
>> inside a for loop. The idea is to create one query for all of your
>> protein IDs, and query it once.
>>
>> This is how you might go about it. First, let's look at the protein
>> IDs you already seem to have somewhere:
>>
>>> 45  FBpp0070037
>>> 46  FBpp0070039;FBpp0070040
>>> 47  FBpp0070041;FBpp0070042;FBpp0070043
>>> 48  FBpp0070044;FBpp0110571
>>
>> It seems you have multiple IDs jammed into one column of a data.frame,
>> maybe? The rows which have more than one ID (e.g.
>> "FBpp0070039;FBpp0070040") will have to be split up so that each row
>> (or element in a vector) only has one ID. Look into using `strsplit`.
>>
>> You will need to get a character vector of protein ids -- one protein
>> per bin, it might look like so:
>>
>> pids <- c('FBpp0070037', 'FBpp0070039', 'FBpp0070040', 'FBpp0070041',
>>           'FBpp0070042', 'FBpp0070043')
>>
>> Now ... you're basically done. Let's rig up an object to query biomart
>> with:
>>
>> library(biomaRt)
>> mart <- useMart('ensembl', dataset='dmelanogaster_gene_ensembl')
>> ans <- getBM(attributes=c("flybase_translation_id", "flybase_gene_id",
>>                           "flybasename_gene"),
>>              filters="flybase_translation_id", values=pids, mart=mart)
>>
>> Your answer will look like so:
>>
>>   flybase_translation_id flybase_gene_id flybasename_gene
>> 1            FBpp0070037     FBgn0010215        alpha-Cat
>> 2            FBpp0070039     FBgn0052230          CG32230
>> 3            FBpp0070040     FBgn0052230          CG32230
>> 4            FBpp0070041     FBgn0000258        CkIIalpha
>> 5            FBpp0070042     FBgn0000258        CkIIalpha
>> 6            FBpp0070043     FBgn0000258        CkIIalpha
>>
>> Now you're left with figuring out what to do with multiple
>> "flybase_translation_id"s that map to the same "flybasename_gene".
>>
>> You would have to do this anyway, but the key point here is that you
>> can now do it without querying biomart in a loop.
>>
>> HTH,
>> -steve
>>
>>
>>
>>> For each of these protein IDs (FBpp...), I would like to extract the gene
>>> ID (FBgn...) in a third column. The output table should look like this:
>>>
>>> 45  FBpp0070037                          FBgn001234
>>> 46  FBpp0070039;FBpp0070040              FBgn00094432;FBgn002345
>>> 47  FBpp0070041;FBpp0070042;FBpp0070043  FBgn0001936;FBgn000102;FBgn004527
>>> 48  FBpp0070044;FBpp0110571              FBgn0097234;FBgn00183
>>> ...
>>>
>>> I was thinking of using biomaRt, but I could not find a way of automating
>>> it for all the protein IDs in a line.
>>>
>>> What I have done so far is this for loop:
>>>
>>> out <- "CompleteResults.txt"
>>> for (i in seq_len(nrow(data))) {
>>>   # split the ';'-separated IDs of row i and strip the "REV__" prefix
>>>   temp <- unlist(strsplit(as.character(data[i, 2]), ";"))
>>>   temp <- gsub("REV__", "", temp)
>>>   result <- getBM(attributes=c("flybase_translation_id", "flybase_gene_id",
>>>                                "flybasename_gene"),
>>>                   filters="flybase_translation_id", values=temp, mart=mart)
>>>   # join the gene IDs found for this row into one ';'-terminated string
>>>   charresult <- ""
>>>   for (j in seq_along(result[[2]])) {
>>>     # charresult <- paste(charresult, ">", result[[1]][j], ":",
>>>     #                     result[[2]][j], "\t", sep="")
>>>     charresult <- paste(charresult, result[[2]][j], ";", sep="")
>>>   }
>>>   cat("line: ", i-1, "\t", "was written to "); cat(out); cat("\n")
>>>   write.table(paste(i-1, charresult, sep="\t"), out, sep="\t", quote=FALSE,
>>>               col.names=FALSE, row.names=FALSE, append=TRUE)
>>> }
>>>
>>> What I am doing is converting the string of FBpp IDs into a character
>>> vector and then passing each line to the getBM command. I think it is a
>>> bad idea, as I am using a loop to query an online database, but I don't
>>> have a better option at the moment.
>>>
>>> The second problem is that it just takes a lot of time.
>>>
>>> I would appreciate your ideas if there is a better/faster way of doing it.
>>>
>>> Thanks A.
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>   | Memorial Sloan-Kettering Cancer Center
>>   | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>