[R] Problem with handling of attributes in xmlToList in XML package

santiago gil sg.ccnr at gmail.com
Wed Apr 17 00:39:19 CEST 2013


I apologize for the multiple postings, then. I had received those emails
saying that my post was awaiting approval, and more than four days went
by without news. Sorry for the lack of patience.

Thank you very much, Ben. Indeed that's how I've been doing it so far,
but I have accrued too many reasons not to work with the XML object
any more and move all my coding to a list formulation.

I wonder what you mean by

> [...] but I find what you try below works if you specify useInternalNodes = TRUE in your invocation of xmlTreeParse

Actually, the error output that I included is what I get when I use
useInternalNodes=T (my bad).  With useInternalNodes=F I get

> mylist[["ports"]][[2]][["service"]]$.attrs["name"]
NULL
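
A minimal sketch of a wrapper that would hide the difference, assuming
(going by the "$ operator is invalid for atomic vectors" error) that a
service node without children comes back as a plain named character
vector, while one with children comes back as a list carrying a .attrs
element. The name get_attr is made up for illustration, and the two test
objects are hand-built stand-ins, not real xmlToList() output. It does
not try to explain the NULL above.

```r
# Hypothetical helper: read one attribute from an element of an
# xmlToList()-style result, whichever of the two shapes it has.
get_attr <- function(x, attr = "name") {
  if (is.list(x)) {
    # Assumed shape for a node with children: attributes live in ".attrs".
    x <- x[[".attrs"]]
  }
  # Assumed shape for a childless node: x is already the named vector.
  if (is.null(x) || !(attr %in% names(x))) {
    return(NA_character_)  # attribute (or attribute set) is missing
  }
  unname(x[[attr]])
}

# Hand-built stand-ins for the two shapes (not produced by the XML package).
with_children <- list(cpe = "cpe:/o:microsoft:windows",
                      .attrs = c(name = "msrpc", conf = "10"))
without_children <- c(name = "netbios-ssn", method = "probed", conf = "10")

get_attr(with_children)      # "msrpc"
get_attr(without_children)   # "netbios-ssn"
```

Returning NA_character_ for a missing attribute keeps the result usable
inside sapply() over all ports, instead of stopping at the first odd node.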

The useInternalNodes option has proven fatally dangerous for me
before. If I parse a tree with useInternalNodes=T, save the workspace,
close R and reopen it, load the workspace and try to read the tree, it
will completely crash my computer, which has already cost me too many
lost days of work. On the other hand, with useInternalNodes=F every
XML operation is ridiculously slow. So the intention was to
move everything to a more R-friendly object like a list.

Any tips?

Best,


Santiago

2013/4/16 Ben Tupper <btupper at bigelow.org>:
> Hi,
>
> On Apr 16, 2013, at 2:49 PM, santiago gil wrote:
>>
>> 2013/4/14 santiago gil <sg.ccnr at gmail.com>:
>>> Hello all,
>>>
>>> I have a problem with the way attributes are dealt with in the
>>> function xmlToList(), and I haven't been able to figure it out for
>>> days now.
>>>
>
> I have not used xmlToList(), but I find what you try below works if you specify useInternalNodes = TRUE in your invocation of xmlTreeParse.  Often that is the solution for many issues with xml.  Also, I have found it best to write a relatively generic getter style function.  So, in the example below I have written a function called getPortAttr - it will get attributes for the child node you name.  I used your example as the defaults: "service" is the child to query and "name" is the attribute to retrieve from that child.  It's a heck of a lot easier to write a function than building the longish parse strings with lots of [[this]][[and]][[that]] stuff, and it is reusable to boot.
>
> Cheers,
> Ben
>
> library(XML)
>
> mydoc <- '<host starttime="1365204834" endtime="1365205860">
>  <status state="up" reason="echo-reply" reason_ttl="127"/>
>  <address addr="XXX.XXX.XXX.XXX" addrtype="ipv4"/>
>  <ports>
>   <port protocol="tcp" portid="135">
>    <state state="open" reason="syn-ack" reason_ttl="127"/>
>    <service name="msrpc" product="Microsoft Windows RPC" ostype="Windows" method="probed" conf="10">
>     <cpe>cpe:/o:microsoft:windows</cpe>
>    </service>
>   </port>
>   <port protocol="tcp" portid="139">
>    <state state="open" reason="syn-ack" reason_ttl="127"/>
>    <service name="netbios-ssn" method="probed" conf="10"/>
>   </port>
>  </ports>
>  <times srtt="647" rttvar="71" to="100000"/>
> </host>'
>
> mytree<-xmlTreeParse(mydoc, useInternalNodes = TRUE)
> myroot<-xmlRoot(mytree)
>
> myports <- myroot[["ports"]]["port"]
>
>
> getPortAttr <- function(x, child = "service", attr = "name") {
>    kid <- x[[child]]
>    att <- xmlAttrs(kid)[[attr]]
>    att
> }
> portNames <- sapply(myports, getPortAttr)
> #> portNames
> #         port          port
> #      "msrpc" "netbios-ssn"
> portReason <- sapply(myports, getPortAttr, child = "state", attr = "reason")
> #> portReason
> #     port      port
> #"syn-ack" "syn-ack"
>
>>> Say I have a document (produced by nmap) like this:
>>>
>>>> mydoc <- '<host starttime="1365204834" endtime="1365205860"><status state="up" reason="echo-reply" reason_ttl="127"/>
>>>    <address addr="XXX.XXX.XXX.XXX" addrtype="ipv4"/>
>>>    <ports><port protocol="tcp" portid="135"><state state="open"
>>> reason="syn-ack" reason_ttl="127"/><service name="msrpc"
>>> product="Microsoft Windows RPC" ostype="Windows" method="probed"
>>> conf="10"><cpe>cpe:/o:microsoft:windows</cpe></service></port>
>>>    <port protocol="tcp" portid="139"><state state="open"
>>> reason="syn-ack" reason_ttl="127"/><service name="netbios-ssn"
>>> method="probed" conf="10"/></port>
>>>    </ports>
>>>    <times srtt="647" rttvar="71" to="100000"/>
>>>    </host>'
>>>
>>> I want to store this as a list of lists, so I do:
>>>
>>> mytree<-xmlTreeParse(mydoc)
>>> myroot<-xmlRoot(mytree)
>>> mylist<-xmlToList(myroot)
>>>
>>> Now my problem is that when I want to fetch the attributes of the
>>> services running on each port, the behavior is not consistent:
>>>
>>>> mylist[["ports"]][[1]][["service"]]$.attrs["name"]
>>>   name
>>> "msrpc"
>>>> mylist[["ports"]][[2]][["service"]]$.attrs["name"]
>>> Error in trash_list[["ports"]][[2]][["service"]]$.attrs :
>>>  $ operator is invalid for atomic vectors
>>>
>>> I understand that the way they are defined in the document is not the
>>> same, but I think there should still be a consistent behavior. I've
>>> tried many combinations of parameters for xmlTreeParse() but nothing
>>> has helped me. I can't find a way to call up the name of the service
>>> consistently regardless of whether the node has children or not. Any
>>> tips?
>>>
>>> All the best,
>>>
>>>
>>> S.G.
>>>
>>> --
>>> -------------------------------------------------------------------------------
>>> http://barabasilab.neu.edu/people/gil/
>
> Ben Tupper
> Bigelow Laboratory for Ocean Sciences
> 60 Bigelow Drive, P.O. Box 380
> East Boothbay, Maine 04544
> http://www.bigelow.org
>



--
-------------------------------------------------------------------------------
http://barabasilab.neu.edu/people/gil/


