[R] Problem with handling of attributes in xmlToList in XML package

Ben Tupper btupper at bigelow.org
Wed Apr 17 04:05:52 CEST 2013


Hi,

On Apr 16, 2013, at 6:39 PM, santiago gil wrote:
> 
> Thank you very much, Ben. Indeed that's how I've been doing it so far,
> but I have accrued too many reasons not to work with the XML object
> any more and move all my coding to a list formulation.
> 
> I wonder what you mean with
> 
>> [...] but I find what you try below works if you specify useInternalNodes = TRUE in your invocation of xmlTreeParse
> 
> Actually, the output error that I included happens when I use
> useInternalNodes=T (my bad).  

My bad right back at you.  It doesn't work here now (and didn't before I guess).  I can't explain why xmlToList splits the two nodes so differently.  That's another good reason for me to shy away from it.
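
One possible reading of the error message, offered as a sketch rather than a diagnosis: the "$ operator is invalid for atomic vectors" complaint suggests that the childless service node comes back as a bare named character vector of attributes, while the node with a child comes back as a list carrying a .attrs element.  Assuming those two shapes, a small accessor can cover both.  The name get_attr and the two hand-built objects below are made up for illustration; they are not part of the XML package and no parsing is involved.

```r
# Hypothetical helper (not in the XML package): fetch one attribute from an
# element of xmlToList-style output, whichever of the two assumed shapes it has.
get_attr <- function(node, attr) {
  # Assumed shape 1: a list with the attributes stored under ".attrs".
  # Assumed shape 2: a bare named character vector of attributes.
  attrs <- if (is.list(node)) node[[".attrs"]] else node
  if (is.null(attrs) || !(attr %in% names(attrs))) {
    return(NA_character_)
  }
  unname(attrs[[attr]])
}

# Hand-built stand-ins for the two service nodes in the example document.
with_children <- list(cpe = "cpe:/o:microsoft:windows",
                      .attrs = c(name = "msrpc", conf = "10"))
leaf <- c(name = "netbios-ssn", method = "probed", conf = "10")

get_attr(with_children, "name")
get_attr(leaf, "name")
get_attr(leaf, "product")   # attribute absent, so NA
```

If that reading holds for your list, something like sapply(mylist[["ports"]], function(p) get_attr(p[["service"]], "name")) should avoid the $ error, though I have not checked it against every parsing option.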

> If I use useInternalNodes=F I get
> 
>> mylist[["ports"]][[2]][["service"]]$.attrs["name"]
> NULL
> 
> The useInternalNodes clause has proven fatally dangerous for me
> before. If I parse a tree with useInternalNodes=T, save the workspace,
> close R and reopen it, load the workspace and try to read the tree, it
> will completely crash my computer, which has already cost me too many
> lost days of work. On the other hand, useInternalNodes=F will result
> in any xml operation being ridiculously slow. So the intention was to
> move everything to a more R-friendly object like a list.

My experience with the XML package seems to be quite different from yours regarding useInternalNodes = TRUE/FALSE.  I get satisfactory and stable performance with useInternalNodes = TRUE, so your experience is very puzzling to me.  I never save workspaces - heck, I'm not sure what XML does with the external pointers in that case.  Can you save an address and expect to get the same address later?  Instead I save the XML data using saveXML, which dumps it to a nicely formed text file.
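
A minimal sketch of that save-as-text habit, assuming the XML package is installed.  The tiny document and the temporary file path are placeholders chosen for illustration.

```r
library(XML)

# Parse a small stand-in document into an internal (C-level) tree.
doc <- xmlParse('<host><ports><port portid="135"/></ports></host>',
                asText = TRUE)

# Write the tree out as plain text instead of saving the workspace.
path <- tempfile(fileext = ".xml")
saveXML(doc, file = path)

# Later, or in a fresh session, parse the file again to rebuild the tree.
doc2 <- xmlParse(path)
portid <- xmlGetAttr(xmlRoot(doc2)[["ports"]][["port"]], "portid")
portid
```

The point is only that the text file, not the in-memory pointer, is what survives between sessions.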

I guess I'm not much help!  You might want to contact the maintainer of XML with a small example, such as the one you posted.  He has been very responsive and helpful to me in the past.

Cheers,
Ben

> Best,
> 
> 
> Santiago
> 
> 2013/4/16 Ben Tupper <btupper at bigelow.org>:
>> Hi,
>> 
>> On Apr 16, 2013, at 2:49 PM, santiago gil wrote:
>>> 
>>> 2013/4/14 santiago gil <sg.ccnr at gmail.com>:
>>>> Hello all,
>>>> 
>>>> I have a problem with the way attributes are dealt with in the
>>>> function xmlToList(), and I haven't been able to figure it out for
>>>> days now.
>>>> 
>> 
>> I have not used xmlToList(), but I find what you try below works if you specify useInternalNodes = TRUE in your invocation of xmlTreeParse.  Often that is the solution for many issues with xml.  Also, I have found it best to write a relatively generic getter style function.  So, in the example below I have written a function called getPortAttr - it will get attributes for the child node you name.  I used your example as the defaults: "service" is the child to query and "name" is the attribute to retrieve from that child.  It's a heck of a lot easier to write a function than building the longish parse strings with lots of [[this]][[and]][[that]] stuff, and it is reusable to boot.
>> 
>> Cheers,
>> Ben
>> 
>> library(XML)
>> 
>> mydoc <- '<host starttime="1365204834" endtime="1365205860">
>> <status state="up" reason="echo-reply" reason_ttl="127"/>
>> <address addr="XXX.XXX.XXX.XXX" addrtype="ipv4"/>
>> <ports>
>>  <port protocol="tcp" portid="135">
>>   <state state="open" reason="syn-ack" reason_ttl="127"/>
>>   <service name="msrpc" product="Microsoft Windows RPC" ostype="Windows" method="probed" conf="10">
>>    <cpe>cpe:/o:microsoft:windows</cpe>
>>   </service>
>>  </port>
>>  <port protocol="tcp" portid="139">
>>   <state state="open" reason="syn-ack" reason_ttl="127"/>
>>   <service name="netbios-ssn" method="probed" conf="10"/>
>>  </port>
>> </ports>
>> <times srtt="647" rttvar="71" to="100000"/>
>> </host>'
>> 
>> mytree<-xmlTreeParse(mydoc, useInternalNodes = TRUE)
>> myroot<-xmlRoot(mytree)
>> 
>> myports <- myroot[["ports"]]["port"]
>> 
>> 
>> getPortAttr <- function(x, child = "service", attr = "name") {
>>   kid <- x[[child]]
>>   att <- xmlAttrs(kid)[[attr]]
>>   att
>> }
>> portNames <- sapply(myports, getPortAttr)
>> #> portNames
>> #         port          port
>> #      "msrpc" "netbios-ssn"
>> portReason <- sapply(myports, getPortAttr, child = "state", attr = "reason")
>> #> portReason
>> #     port      port
>> #"syn-ack" "syn-ack"
>> 
>>>> Say I have a document (produced by nmap) like this:
>>>> 
>>>>> mydoc <- '<host starttime="1365204834" endtime="1365205860"><status state="up" reason="echo-reply" reason_ttl="127"/>
>>>>   <address addr="XXX.XXX.XXX.XXX" addrtype="ipv4"/>
>>>>   <ports><port protocol="tcp" portid="135"><state state="open"
>>>> reason="syn-ack" reason_ttl="127"/><service name="msrpc"
>>>> product="Microsoft Windows RPC" ostype="Windows" method="probed"
>>>> conf="10"><cpe>cpe:/o:microsoft:windows</cpe></service></port>
>>>>   <port protocol="tcp" portid="139"><state state="open"
>>>> reason="syn-ack" reason_ttl="127"/><service name="netbios-ssn"
>>>> method="probed" conf="10"/></port>
>>>>   </ports>
>>>>   <times srtt="647" rttvar="71" to="100000"/>
>>>>   </host>'
>>>> 
>>>> I want to store this as a list of lists, so I do:
>>>> 
>>>> mytree<-xmlTreeParse(mydoc)
>>>> myroot<-xmlRoot(mytree)
>>>> mylist<-xmlToList(myroot)
>>>> 
>>>> Now my problem is that when I want to fetch the attributes of the
>>>> services running on each port, the behavior is not consistent:
>>>> 
>>>>> mylist[["ports"]][[1]][["service"]]$.attrs["name"]
>>>>  name
>>>> "msrpc"
>>>>> mylist[["ports"]][[2]][["service"]]$.attrs["name"]
>>>> Error in mylist[["ports"]][[2]][["service"]]$.attrs :
>>>> $ operator is invalid for atomic vectors
>>>> 
>>>> I understand that the way they are defined in the document is not the
>>>> same, but I think there still should be a consistent behavior. I've
>>>> tried many combinations of parameters for xmlTreeParse() but nothing
>>>> has helped me. I can't find a way to call up the name of the service
>>>> consistently regardless of whether the node has children or not. Any
>>>> tips?
>>>> 
>>>> All the best,
>>>> 
>>>> 
>>>> S.G.
>>>> 
>>>> --
>>>> -------------------------------------------------------------------------------
>>>> http://barabasilab.neu.edu/people/gil/
>>> 
>>> 
>>> 
>>> --
>>> -------------------------------------------------------------------------------
>>> http://barabasilab.neu.edu/people/gil/
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> 
>> Ben Tupper
>> Bigelow Laboratory for Ocean Sciences
>> 60 Bigelow Drive, P.O. Box 380
>> East Boothbay, Maine 04544
>> http://www.bigelow.org
>> 
> 
> 
> 
> --
> -------------------------------------------------------------------------------
> http://barabasilab.neu.edu/people/gil/

Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org


