[BioC] biomart to a data.frame
Sebastian Thieme
thieme at mi.fu-berlin.de
Thu Jan 26 09:52:47 CET 2012
Hi Assa,
You can try this:
con <- textConnection(data2seperate)
# split on ';'; fill=TRUE pads rows that have fewer IDs than the longest row
seperatedData <- read.table(con,sep=";",fill=TRUE,stringsAsFactors=FALSE)
close(con)
It does nearly the same as the strsplit function, but you get a table as
output, with the rows in the same order as your input. I hope this helps.
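
To put the gene IDs back row-wise after a single query, something along these lines might work. It is only a rough sketch: `dat`, `long` and `genes_by_row` are made-up names, and `ans` is a hand-typed stand-in for the getBM() result (values copied from Steve's example output below), so nothing here actually contacts biomart.

```r
# Toy input mirroring the rows quoted in this thread (hypothetical object name).
dat <- data.frame(
  id   = c(45, 46, 47),
  pids = c("FBpp0070037",
           "FBpp0070039;FBpp0070040",
           "FBpp0070041;FBpp0070042;FBpp0070043"),
  stringsAsFactors = FALSE
)

# Split each row and remember which row every protein ID came from.
parts <- strsplit(dat$pids, ";", fixed = TRUE)
long  <- data.frame(
  id  = rep(dat$id, lengths(parts)),
  pid = unlist(parts),
  stringsAsFactors = FALSE
)

# Stand-in for the getBM() result; in real use this would come from ONE
# getBM() call with values = unique(long$pid).
ans <- data.frame(
  flybase_translation_id = c("FBpp0070037", "FBpp0070039", "FBpp0070040",
                             "FBpp0070041", "FBpp0070042", "FBpp0070043"),
  flybase_gene_id = c("FBgn0010215", "FBgn0052230", "FBgn0052230",
                      "FBgn0000258", "FBgn0000258", "FBgn0000258"),
  stringsAsFactors = FALSE
)

# Look up the gene ID for every protein ID; IDs without a match become NA.
long$gene <- ans$flybase_gene_id[match(long$pid, ans$flybase_translation_id)]

# Collapse back to one ';'-joined string per original row, in the input order.
genes_by_row <- tapply(long$gene, long$id, paste, collapse = ";")
dat$genes <- as.vector(genes_by_row[as.character(dat$id)])
print(dat)
```

Because the row ID travels along with every protein ID in `long`, the other columns of the original table stay attached to the right line. Protein IDs that biomart does not return would show up as "NA" in the pasted string, so you may want to check for those first.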
Best
Basti
2012/1/26 Assa Yeroslaviz <frymor at gmail.com>:
> Hi Steve,
>
> thanks for the help.
>
> I know about the strsplit function and I used it to split each row on its
> own by the ';' symbol.
> The problem I have is that I need to keep the information of each row in
> the row (or at least to put it back after the biomaRt extraction).
>
> The table I have contains not only the protein IDs but also a lot of other
> stuff, which is connected to each of the proteins. This is why I need to
> know which proteins came from which line (Id).
>
> It would be nice if there were a way to do it as you suggested: take
> all the protein IDs, write them into one vector and run them through biomaRt.
> But then I would like to be able to put them back together in a row-wise
> fashion, as I suggested at the beginning.
>
> Thanks again
> Assa
>
> On Wed, Jan 25, 2012 at 16:02, Steve Lianoglou <
> mailinglist.honeypot at gmail.com> wrote:
>
>> Hi Assa,
>>
>> Sorry for top posting.
>>
>> Your intuition is correct: you should not be querying biomart
>> inside a for loop. The idea is to create one query for all of your
>> protein IDs, and query it once.
>>
>> This is how you might go about it. First, let's look at the protein
>> IDs you already seem to have somewhere:
>>
>> > 45 FBpp0070037
>> > 46 FBpp0070039;FBpp0070040
>> > 47 FBpp0070041;FBpp0070042;FBpp0070043
>> > 48 FBpp0070044;FBpp0110571
>>
>> It seems you have multiple IDs jammed into one column of a data.frame
>> maybe? The rows which have more than one ID, (eg.
>> "FBpp0070039;FBpp0070040") will have to be split up so that each row
>> (or element in a vector) only has one ID. Look into using `strsplit`.
>>
>> You will need to get a character vector of protein ids -- one protein
>> per bin, it might look like so:
>>
>> pids <- c('FBpp0070037', 'FBpp0070039', 'FBpp0070040', 'FBpp0070041',
>> 'FBpp0070042', 'FBpp0070043')
>>
>> Now ... you're basically done. Let's rig up an object to query biomart
>> with:
>>
>> library(biomaRt)
>> mart <- useMart('ensembl', dataset='dmelanogaster_gene_ensembl')
>> ans <-
>> getBM(attributes=c("flybase_translation_id","flybase_gene_id","flybasename_gene"),
>> filters="flybase_translation_id", values=pids,
>> mart=mart)
>>
>> Your answer will look like so:
>>
>> flybase_translation_id flybase_gene_id flybasename_gene
>> 1 FBpp0070037 FBgn0010215 alpha-Cat
>> 2 FBpp0070039 FBgn0052230 CG32230
>> 3 FBpp0070040 FBgn0052230 CG32230
>> 4 FBpp0070041 FBgn0000258 CkIIalpha
>> 5 FBpp0070042 FBgn0000258 CkIIalpha
>> 6 FBpp0070043 FBgn0000258 CkIIalpha
>>
>> Now you're left with figuring out what to do with multiple
>> "flybase_translation_id"s that map to the same "flybasename_gene".
>>
>> You would have to do this anyway, but the key point here is that you
>> can now do it without querying biomart in a loop.
>>
>> HTH,
>> -steve
>>
>>
>>
>> > For each of these protein IDs (FBpp...), I would like to extract the gene
>> > ID (FBgn...) into a third column. The output table should look like this:
>> >
>> > 45 FBpp0070037 FBgn001234
>> > 46 FBpp0070039;FBpp0070040 FBgn00094432;FBgn002345
>> > 47 FBpp0070041;FBpp0070042;FBpp0070043 FBgn0001936;FBgn000102;FBgn004527
>> > 48 FBpp0070044;FBpp0110571 FBgn0097234;FBgn00183
>> > ...
>> >
>> > I was thinking of using biomaRt, but I could not find a way of automating it for
>> > all the protein IDs in a line.
>> >
>> > What I have done so far is this for loop:
>> >
>> > for(i in 1:dim(data)[1]){
>> > temp=unlist(strsplit(data[i,2],";"))
>> > temp= gsub("REV__", "", temp)
>> > result=getBM(attributes=c("flybase_translation_id","flybase_gene_id","flybasename_gene"),
>> > filters="flybase_translation_id",values=temp,
>> > mart=mart)
>> > charresult =""
>> > for (j in 1:length(result[[1]])) {
>> > # charresult<-paste(charresult,">", result[[1]][j],":",result[[2]][j], "\t", sep="")
>> > charresult<-paste(charresult, result[[2]][j], ";", sep="")
>> > }
>> > out<-"CompleteResults.txt"
>> > cat("line: ", i-1,"\t", "was written to ");cat(out);cat("\n")
>> > write.table(paste(i-1, charresult, sep="\t"),out, sep="\t", quote=F,
>> > col.names=F, row.names=F,append=T)
>> > }
>> >
>> > What I am doing is converting the string of FBpp IDs into a character
>> > vector and then running each line through the getBM command. First, I think it
>> > is a bad idea, as I am using a loop to query an online database, but I
>> > don't have a better option at the moment.
>> >
>> > The second problem is that it just takes a lot of time.
>> >
>> > I would appreciate your ideas if there is a better/faster way of doing
>> > it.
>> >
>> > Thanks A.
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>