[BioC] libraries or commands to help with parsing or handling web based database queries

Benjamin Otto b.otto at uke.uni-hamburg.de
Tue Feb 20 14:07:36 CET 2007


Hi Alan,

Which parts are you interested in exactly?
Looking at the page, the MID, MASS, Name and Formula fields seem fairly
easy to extract from the page source. The structure, however, seems a
little trickier to me.
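
As a rough sketch of how those fields could be pulled out with base R
only: the HTML string below is made up for illustration and is not the
real METLIN markup, and the compound names, masses and formulas are
placeholders. It keeps the text between </form> and </table> (the region
Alan mentions), splits it into rows and cells, and strips the remaining
tags.

```r
# Hypothetical page source: this markup and its values are made up for
# illustration and are NOT the real METLIN page layout.
page <- paste0(
  "<html><form>search box</form>",
  "<table>",
  "<tr><td>MID</td><td>MASS</td><td>Name</td><td>Formula</td></tr>",
  "<tr><td>1</td><td>112.05</td><td>Compound A</td><td>C4H4N2O2</td></tr>",
  "<tr><td>2</td><td>112.05</td><td>Compound B</td><td>C5H8N2O</td></tr>",
  "</table></html>")

# 1. Keep only the text between "</form>" and the next "</table>".
#    (?s) lets "." match newlines too, which real page source will contain.
#    If the pattern does not match, sub() returns its input unchanged.
block <- sub("(?s).*</form>(.*?)</table>.*", "\\1", page, perl = TRUE)

# 2. Split the block into table rows, then each row into cells,
#    and drop whatever tags are left in each cell.
rows <- strsplit(block, "</tr>", fixed = TRUE)[[1]]
cells <- lapply(rows, function(r) {
  parts <- strsplit(r, "</td>", fixed = TRUE)[[1]]
  gsub("<[^>]+>", "", parts)
})

# 3. First row is the header, the rest is data.
tab <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(tab) <- cells[[1]]
tab$MASS <- as.numeric(tab$MASS)
print(tab)
```

On a real page the cells will probably carry extra whitespace, links or
attributes, so the patterns would need adjusting after a look at the
actual source.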

Regards

Benjamin





-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch
[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Thomas Girke
Sent: 19 February 2007 19:34
To: ALAN SMITH
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] libraries or commands to help with parsing or
handling web based database queries

Alan,
For this you will need some basic knowledge of how to use regular
expressions with R's grep() and gsub() functions. Other useful
functions are paste() and Sys.sleep().

The RCurl package also provides some useful utilities for this approach.

Below is a short example on a similar problem for obtaining peptide MW
information from the Expasy site (http://ca.expasy.org/tools/pi_tool.html).


###################################################################
myentries <- c("MKWVTFISLLFLFSSAYS", "MWVTFISLL", "MFISLLFLFSSAYS")
myresult <- NULL
for(i in myentries) {
	# build the query URL for this peptide
	myurl <- paste("http://ca.expasy.org/cgi-bin/pi_tool?protein=", 
			i, "&resolution=monoisotopic", sep="")
	x <- url(myurl)
	res <- readLines(x)
	close(x)
	# first line holding the result; NA if the page has no such line
	mylines <- res[grep('Theoretical pI/Mw:',res)][1]
	# drop everything up to the last "/ " so only the MW is left
	myresult <- c(myresult, as.numeric(gsub('.*/ ','', mylines)))
	print(myresult)
	Sys.sleep(1) # halts process for one sec to give database a break
}
final <- data.frame(Pep=myentries, MW=myresult)
cat("\n The MW values for my peptides are:\n")
print(final)
###################################################################
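
To see what the gsub() step in the loop above does, here is a minimal
sketch on a single result line. The line and its numbers are made up for
illustration; the real Expasy output may be formatted differently, and
the variable names are only placeholders.

```r
# Made-up example line; the real Expasy output may be formatted differently.
line <- "Theoretical pI/Mw: 5.50 / 1000.25"

# ".*/ " is greedy: it matches everything up to the LAST "/ " in the line,
# so only the molecular weight is left after the substitution.
mw <- as.numeric(gsub(".*/ ", "", line))

# The pI can be captured the same way with a parenthesised group:
# "\\1" refers to the digits found between "pI/Mw:" and the next "/".
pI <- as.numeric(sub(".*pI/Mw: *([0-9.]+) */.*", "\\1", line))

print(c(pI = pI, MW = mw))
```

The same idea (match the surrounding text, keep only the captured part)
should carry over to other pages once the text around the value of
interest is known.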


Thomas


On Mon 02/19/07 11:41, ALAN SMITH wrote:
> Hello Bioconductors
> I am having a very hard time figuring out how to make web based
> database query results into a nice neat table (if such a thing is
> possible in R).  I am constantly searching the metabolite database
> METLIN by copying and pasting addresses.  I have to search this
> database with several hundred entries, often, and would like to
> automate the process to remove the HUGE amount of time I spend doing
> this carpal-tunnel-inducing routine.  I have found several ways to get
> the page source, such as:
> 
> library(RCurl)
> test <- getURL("http://metlin.scripps.edu/metabo_list.php?mass_min=112.04885&mass_max=112.0555")
> #OR
> 


> Once I get the URL info I notice that the data I am interested in is
> between  </form>  and  </table>.
> 
> Are there any packages or methods in R to extract the information I am
> interested in?  I am having problems manipulating strings in R, such as
> selecting all of the text between two strings.  I am not a programmer.
> 
> Thanks,
> Alan
> 
> Note I am able to use KEGGSOAP without any trouble.
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
Thomas Girke, Ph.D.
1008 Noel T. Keen Hall
Center for Plant Cell Biology (CEPCEB)
University of California
Riverside, CA 92521

E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469
Fax: 951-827-4437
