[BioC] PostForm() with KEGG

Martin Morgan mtmorgan at fhcrc.org
Tue Feb 28 17:19:13 CET 2012


On 02/28/2012 06:14 AM, Ovokeraye Achinike-Oduaran wrote:
> Hi Duncan,
>
> My understanding is that xpathSApply() combines both the geneSetNode()
> and the sapply(). I hope that this is a correct assumption. In
> attempting to retrieve nodes in general from the pathway, I used  both
>
> xpathSApply(doc, "//li/node()",  xmlGetAttr, "href")
> and
> xpathSApply(doc, "//li/a/node()",  xmlGetAttr, "href")
>
> and then I get nothing (NULL) back even though no visible error pops
> up. Is something wrong with the way I'm using the path or do I just not
> yet grasp the whole XPath concept (I did read the online tutorial)?

The NULL means that your XPath query did not select any node with an 
'href' attribute.
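
Note that "//li/a/node()" selects the children of each 'a' element 
(usually just its text), not the 'a' element itself, and text nodes 
carry no attributes for xmlGetAttr() to find. A small sketch on a 
made-up fragment (the HTML is invented for illustration and is not 
KEGG's actual markup):

```r
library(XML)

# Invented HTML, only to show which nodes each path selects.
html <- '<ul><li><a href="/x/ko00010">ko00010 Glycolysis</a></li></ul>'
doc <- htmlParse(html, asText = TRUE)

# node() steps down to the children of <a>; here that is a single text node
kinds <- xpathSApply(doc, "//li/a/node()", xmlName)

# stopping at <a> itself reaches the element that carries href
hrefs <- xpathSApply(doc, "//li/a", xmlGetAttr, "href")
```

Here kinds should contain only "text", while hrefs holds the link.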

>
> Sorry to drag this on, but please help.

I used Duncan's RHTMLForms suggestion

   library(RHTMLForms)
   url = "http://www.genome.jp/kegg/tool/map_pathway1.html"
   u = "http://www.genome.jp/kegg-bin/search_pathway_object"
   ff = getHTMLFormDescription(url)
   fun = createFunction(ff[[1]])
   txt = fun(unclassified = "ko:K01803 cpd:C00111 cpd:C00118 K00134 C00236",
             target = "alias", .url = u)

to retrieve the text and then

   library(XML)
   xml = htmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)

to parse to xml (maybe there is a more direct way, using the reader 
argument to createFunction?). If I experiment a little, I see for 
instance that

   getNodeSet(xml, "//li/a")

returns the 'a' elements that are direct children of 'li' elements, and

   getNodeSet(xml, "//li/a[@target]")

returns the subset of those elements that have a 'target' attribute. Finally

 > head(xpathSApply(xml, "//li/a[@target]", xmlValue))
[1] "ko00010 Glycolysis / Gluconeogenesis"
[2] "ko01100 Metabolic pathways"
[3] "ko01110 Biosynthesis of secondary metabolites"
[4] "ko01120 Microbial metabolism in diverse environments"
[5] "ko00710 Carbon fixation in photosynthetic organisms"
[6] "ko00562 Inositol phosphate metabolism"

seems to be about what you want, or


 > head(xpathSApply(xml, "//li/a/@href"))
                                                 href
"/kegg-bin/show_pathway?13304448561022/ko00010.args"
                                                 href
                      "javascript:display('ko00010')"
                                                 href
"/kegg-bin/show_pathway?13304448561022/ko01100.args"
                                                 href
                      "javascript:display('ko01100')"
                                                 href
"/kegg-bin/show_pathway?13304448561022/ko01110.args"
                                                 href
                      "javascript:display('ko01110')"
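
If you want to try these paths without querying KEGG, here is a minimal 
sketch on a hand-written fragment. The HTML is my guess at the general 
shape of the result page, not a copy of it, and the last two steps 
(dropping the javascript: links, pulling the id off the label) are just 
one way to tidy the results.

```r
library(XML)

# Hand-written stand-in for the result page; the real markup may differ.
html <- paste(
  '<ul>',
  '<li><a href="/kegg-bin/show_pathway?ko00010" target="_map">ko00010 Glycolysis</a>',
  '    <a href="javascript:display(\'ko00010\')">Show</a></li>',
  '<li><a href="/kegg-bin/show_pathway?ko01100" target="_map">ko01100 Metabolic pathways</a></li>',
  '</ul>',
  sep = "\n")
doc <- htmlParse(html, asText = TRUE)

# every <a> that is a direct child of an <li>
all_links <- getNodeSet(doc, "//li/a")

# only those <a> elements that have a target attribute
target_links <- getNodeSet(doc, "//li/a[@target]")

# text content of the targeted links
labels <- xpathSApply(doc, "//li/a[@target]", xmlValue)

# href of every link, then keep only the ones that are not javascript: calls
hrefs <- xpathSApply(doc, "//li/a", xmlGetAttr, "href")
page_links <- hrefs[!grepl("^javascript:", hrefs)]

# pathway id is the first word of each label
ids <- sub(" .*$", "", labels)
```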

Maybe the KEGGSOAP package already does what you're interested in? The 
web scraping you're doing is going to break as soon as the web site 
tweaks its presentation.

Or maybe

 > library(org.Hs.eg.db)
 > head(toTable(revmap(org.Hs.egPATH)[c("00232", "04142")]))
   gene_id path_id
1       9   00232
2      10   00232
3      20   04142
4      53   04142
5      54   04142
6     162   04142

(The KEGG information in the org.* and KEGG packages dates to the last 
free public release, and so is starting to be dated.)
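
Whichever source the table comes from, regrouping it by pathway needs 
only base R. A minimal sketch, with the six rows from the toTable() 
output typed in by hand so that it runs without org.Hs.eg.db (the names 
tbl, genes_by_path and n_genes are mine):

```r
# The six rows from the toTable() output, entered by hand.
tbl <- data.frame(
  gene_id = c("9", "10", "20", "53", "54", "162"),
  path_id = c("00232", "00232", "04142", "04142", "04142", "04142"),
  stringsAsFactors = FALSE
)

# one character vector of gene ids per pathway id
genes_by_path <- split(tbl$gene_id, tbl$path_id)

# number of genes listed for each pathway
n_genes <- lengths(genes_by_path)
```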

Martin

>
> Thanks.
>
> Avoks
>
> On Mon, Feb 27, 2012 at 4:09 PM, Ovokeraye Achinike-Oduaran
> <ovokeraye at gmail.com>  wrote:
>> Thank you so very much, Duncan. I will go get myself enlightened:).
>> Thanks again.
>>
>> Avoks
>>
>> On Mon, Feb 27, 2012 at 3:50 PM, Duncan Temple Lang
>> <duncan at wald.ucdavis.edu>  wrote:
>>>
>>> Use
>>>
>>>    target = "alias"
>>>
>>> in the call.
>>>
>>> If you don't know how to map form elements to parameters in the request, you
>>> can either read  a tutorial on HTML forms, or alternatively, use
>>> the RHTMLForms package which you have loaded according to your search path, e.g.
>>>
>>>   # read the form  and then turn the information into an R function.
>>> ff = getHTMLFormDescription("http://www.genome.jp/kegg/tool/map_pathway1.html")
>>> fun = createFunction(ff[[1]])
>>>
>>>   # Since the action in the form is javascript, we'll provide the
>>>   # URL manually.
>>> u = "http://www.genome.jp/kegg-bin/search_pathway_object"
>>> out = fun(unclassified = "ko:K01803 cpd:C00111 cpd:C00118 K00134 C00236",
>>>           target = "alias", .url = u)
>>>
>>> The benefits of RHTMLForms include using the same defaults
>>> as the form on the Web page, adding hidden parameters, and identifying
>>> the names of the parameters.
>>>
>>>    D
>>>
>>>
>>> On 2/27/12 3:08 AM, Ovokeraye Achinike-Oduaran wrote:
>>>> Hi Duncan,
>>>>
>>>> I noticed that with the script as is, it doesn't take into
>>>> consideration the "include alias" checkbox. I tried modifying the
>>>> script to force include that option but it still did not work. Any
>>>> ideas?
>>>>
>>>> u = "http://www.genome.jp/kegg-bin/search_pathway_object"
>>>> data = postForm(u,
>>>>                 .params = list(org_name = "hsadd",
>>>>                 unclassified = paste(readLines(file.choose()), collapse = "\n"),
>>>>                 file = "", checkbox = "alias", submit = "Exec"))
>>>>
>>>>
>>>> Thanks again.
>>>>
>>>> Avoks
>>>>
>>>>
>>>> On Mon, Feb 27, 2012 at 10:24 AM, Ovokeraye Achinike-Oduaran
>>>> <ovokeraye at gmail.com>  wrote:
>>>>> Hi Duncan,
>>>>>
>>>>> Thanks a bunch.
>>>>>
>>>>> -Avoks
>>>>>
>>>>> On Fri, Feb 24, 2012 at 11:09 PM, Duncan Temple Lang
>>>>> <duncan at wald.ucdavis.edu>  wrote:
>>>>>> Hi Avoks
>>>>>>
>>>>>> While the form is provided by KEGG and so bio-related,
>>>>>> you might have been better posting this to the more general r-help mailing list.
>>>>>>
>>>>>>
>>>>>> You are posting the HTTP request to the wrong URL. That is the URL
>>>>>> of the Web page that displays the form, not the URL that processes
>>>>>> the input from the form.
>>>>>> You have to look at the JavaScript that is referenced in the action
>>>>>> attribute of the HTML form element.
>>>>>>
>>>>>> The second issue is that you are submitting the name of a local file.
>>>>>> This won't work as is.  You either need to indicate that this is the name of a file and not the contents
>>>>>> of the file to send, or else send the contents.  In this form, you can send the
>>>>>> contents via the unclassified parameter.
>>>>>>
>>>>>>
>>>>>> u = "http://www.genome.jp/kegg-bin/search_pathway_object"
>>>>>> data = postForm(u,
>>>>>>                 .params = list(org_name = "hsadd",
>>>>>>                                unclassified = "hsa:7167 hsa:GPI cpd:C00118\nALDOA 1.2.1.12 C00236",
>>>>>>                                file = "", submit = "Exec"))
>>>>>>
>>>>>>
>>>>>> If your input is in a file, you can use
>>>>>>
>>>>>>   unclassified = paste(readLines(file.choose()), collapse = "\n")
>>>>>>
>>>>>> as the value for the unclassified parameter.
>>>>>>
>>>>>>
>>>>>> There are additional parameters that the form accepts that may be relevant for your search.
>>>>>>
>>>>>>
>>>>>> As for processing the results, you will want to use
>>>>>>
>>>>>>   doc = htmlParse(data, asText = TRUE)
>>>>>>
>>>>>> and then use getNodeSet()/xpathSApply() or direct tree extraction to access the nodes you want, e.g.
>>>>>>
>>>>>>   xpathSApply(doc, "//li/a",  xmlGetAttr, "href")
>>>>>>
>>>>>>
>>>>>>   D.
>>>>>>
>>>>>>
>>>>>> On 2/24/12 6:09 AM, Ovokeraye Achinike-Oduaran wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am trying to use postForm() with the KEGG website but I am stuck on
>>>>>>> how to get my results. Is it possible (code below) or am I using
>>>>>>> postForm() wrongly? The code appears to run but I'm not quite sure how
>>>>>>> to read the results assuming there are any. Please help.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Avoks
>>>>>>> ____
>>>>>>>
>>>>>>> data = postForm("http://www.genome.jp/kegg/tool/map_pathway1.html",
>>>>>>> org_name = "hsadd",
>>>>>>> file = file.choose(),
>>>>>>> submit = "Exec")
>>>>>>>
>>>>>>>> sessionInfo()
>>>>>>> R version 2.14.1 (2011-12-22)
>>>>>>> Platform: i386-pc-mingw32/i386 (32-bit)
>>>>>>>
>>>>>>> locale:
>>>>>>> [1] LC_COLLATE=English_xxx.1252  LC_CTYPE=English_xxx.1252
>>>>>>> [3] LC_MONETARY=English_xxx.1252 LC_NUMERIC=C
>>>>>>> [5] LC_TIME=English_xxx.1252
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>> [1] RHTMLForms_0.5-1 XML_3.9-4.1      RCurl_1.91-1.1   bitops_1.0-4.1
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] tools_2.14.1
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


