[R] Problem with handling of attributes in xmlToList in XML package

Ben Tupper btupper at bigelow.org
Tue Apr 16 22:19:02 CEST 2013


On Apr 16, 2013, at 2:49 PM, santiago gil wrote:
> 2013/4/14 santiago gil <sg.ccnr at gmail.com>:
>> Hello all,
>> I have a problem with the way attributes are dealt with in the
>> function xmlToList(), and I haven't been able to figure it out for
>> days now.

I have not used xmlToList(), but I find that what you are trying below works if you specify useInternalNodes = TRUE in your call to xmlTreeParse().  That is often the solution to issues with the XML package.  I have also found it best to write a relatively generic getter-style function.  In the example below I have written a function called getPortAttr, which gets an attribute from the child node you name.  I used your example as the defaults: "service" is the child to query and "name" is the attribute to retrieve from that child.  It's a heck of a lot easier to write a function than to build longish parse strings with lots of [[this]][[and]][[that]] stuff, and it is reusable to boot.



library(XML)

mydoc <- '<host starttime="1365204834" endtime="1365205860">
 <status state="up" reason="echo-reply" reason_ttl="127"/>
 <address addr="XXX.XXX.XXX.XXX" addrtype="ipv4"/>
 <ports>
  <port protocol="tcp" portid="135">
   <state state="open" reason="syn-ack" reason_ttl="127"/>
   <service name="msrpc" product="Microsoft Windows RPC" ostype="Windows" method="probed" conf="10">
    <cpe>cpe:/o:microsoft:windows</cpe>
   </service>
  </port>
  <port protocol="tcp" portid="139">
   <state state="open" reason="syn-ack" reason_ttl="127"/>
   <service name="netbios-ssn" method="probed" conf="10"/>
  </port>
 </ports>
 <times srtt="647" rttvar="71" to="100000"/>
</host>'

# parse to internal (C-level) nodes rather than R-level list nodes
mytree <- xmlTreeParse(mydoc, useInternalNodes = TRUE)
myroot <- xmlRoot(mytree)

# all <port> children of <ports>
myports <- myroot[["ports"]]["port"]

# return attribute 'attr' of the child node named 'child' of node 'x'
getPortAttr <- function(x, child = "service", attr = "name") {
   kid <- x[[child]]
   att <- xmlAttrs(kid)[[attr]]
   return(att)
}

portNames <- sapply(myports, getPortAttr)
#> portNames
#         port          port 
#      "msrpc" "netbios-ssn" 
portReason <- sapply(myports, getPortAttr, child = "state", attr = "reason")
#> portReason
#     port      port 
#"syn-ack" "syn-ack" 
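
If you are comfortable with XPath, the same lookups can probably be done without walking the tree by hand. What follows is only a minimal sketch: doc_text is a shortened stand-in for your document and the variable names are hypothetical, not from your post. xpathSApply() and xmlGetAttr() are both functions in the XML package.

```r
library(XML)

# Hypothetical, shortened stand-in for the nmap document in this thread.
doc_text <- '<host>
 <ports>
  <port protocol="tcp" portid="135">
   <state state="open" reason="syn-ack"/>
   <service name="msrpc" method="probed"><cpe>cpe:/o:microsoft:windows</cpe></service>
  </port>
  <port protocol="tcp" portid="139">
   <state state="open" reason="syn-ack"/>
   <service name="netbios-ssn" method="probed"/>
  </port>
 </ports>
</host>'

# xmlParse() returns an internal document, which XPath queries require.
doc <- xmlParse(doc_text, asText = TRUE)

# One query per attribute: xmlGetAttr() is applied to every matching node,
# whether or not that node has children of its own.
service_names <- xpathSApply(doc, "//port/service", xmlGetAttr, "name")
state_reasons <- xpathSApply(doc, "//port/state", xmlGetAttr, "reason")
port_ids      <- xpathSApply(doc, "//port", xmlGetAttr, "portid")
```

The trade-off is that a port lacking the queried child simply drops out of the result, so the vectors may not line up one-to-one with the ports; the getter function above keeps that alignment.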

>> Say I have a document (produced by nmap) like this:
>>> mydoc <- '<host starttime="1365204834" endtime="1365205860"><status state="up" reason="echo-reply" reason_ttl="127"/>
>>    <address addr="XXX.XXX.XXX.XXX" addrtype="ipv4"/>
>>    <ports><port protocol="tcp" portid="135"><state state="open"
>> reason="syn-ack" reason_ttl="127"/><service name="msrpc"
>> product="Microsoft Windows RPC" ostype="Windows" method="probed"
>> conf="10"><cpe>cpe:/o:microsoft:windows</cpe></service></port>
>>    <port protocol="tcp" portid="139"><state state="open"
>> reason="syn-ack" reason_ttl="127"/><service name="netbios-ssn"
>> method="probed" conf="10"/></port>
>>    </ports>
>>    <times srtt="647" rttvar="71" to="100000"/>
>>    </host>'
>> I want to store this as a list of lists, so I do:
>> mytree<-xmlTreeParse(mydoc)
>> myroot<-xmlRoot(mytree)
>> mylist<-xmlToList(myroot)
>> Now my problem is that when I want to fetch the attributes of the
>> services running on each port, the behavior is not consistent:
>>> mylist[["ports"]][[1]][["service"]]$.attrs["name"]
>>   name
>> "msrpc"
>>> mylist[["ports"]][[2]][["service"]]$.attrs["name"]
>> Error in trash_list[["ports"]][[2]][["service"]]$.attrs :
>>  $ operator is invalid for atomic vectors
>> I understand that the way they are defined in the document is not the
>> same, but I think there still should be a consistent behavior. I've
>> tried many combinations of parameters for xmlTreeParse() but nothing
>> has helped me. I can't find a way to call up the name of the service
>> consistently regardless of whether the node has children or not. Any
>> tips?
>> All the best,
>> S.G.
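
If you would rather stay with xmlToList(), the error above suggests that a node without children comes back as a plain named character vector of its attributes, while a node with children comes back as a list with the attributes tucked into ".attrs". A small accessor that checks for both shapes might be enough. This is a sketch under that assumption; serviceAttr and doc_text are hypothetical names, and the document is a shortened stand-in for yours, pasted together without whitespace between tags to keep the parsed list simple.

```r
library(XML)

# Hypothetical, shortened stand-in for the nmap document in this thread.
doc_text <- paste0(
  '<host><ports>',
  '<port protocol="tcp" portid="135">',
  '<state state="open" reason="syn-ack"/>',
  '<service name="msrpc" method="probed"><cpe>cpe:/o:microsoft:windows</cpe></service>',
  '</port>',
  '<port protocol="tcp" portid="139">',
  '<state state="open" reason="syn-ack"/>',
  '<service name="netbios-ssn" method="probed"/>',
  '</port>',
  '</ports></host>'
)

mylist <- xmlToList(xmlParse(doc_text, asText = TRUE))

# Fetch one attribute of a port's <service> entry, whichever shape it has:
#  - list with a ".attrs" element (the node had children), or
#  - bare named character vector  (the node had attributes only).
serviceAttr <- function(port, attr = "name") {
  svc <- port[["service"]]
  attrs <- if (is.list(svc)) svc[[".attrs"]] else svc
  attrs[[attr]]
}

service_names <- sapply(mylist[["ports"]], serviceAttr)
```

The same check would be needed for any other node whose children are optional, which is one reason a node-based getter may be less fragile than the converted list.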
>> --
>> -------------------------------------------------------------------------------
>> http://barabasilab.neu.edu/people/gil/

Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
