[R] Problem with handling of attributes in xmlToList in XML package
Ben Tupper
btupper at bigelow.org
Wed Apr 17 04:05:52 CEST 2013
Hi,
On Apr 16, 2013, at 6:39 PM, santiago gil wrote:
>
> Thank you very much, Ben. Indeed that's how I've been doing it so far,
> but I have accrued too many reasons not to work with the XML object
> any more and move all my coding to a list formulation.
>
> I wonder what you mean with
>
>> [...] but I find what you try below works if you specify useInternalNodes = TRUE in your invocation of xmlTreeParse
>
> Actually, the output error that I included happens when I use
> useInternalNodes=T (my bad).
My bad right back at you. It doesn't work here now (and didn't before I guess). I can't explain why xmlToList splits the two nodes so differently. That's another good reason for me to shy away from it.
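One way to sidestep the inconsistency on the list side is a small accessor that accepts either shape. This is only a sketch. It assumes the childless service element comes back as a bare named character vector of attributes (which is what the "$ operator is invalid for atomic vectors" error suggests), while the element with children comes back as a list carrying a .attrs entry. The name getListAttr is made up for illustration and is not part of the XML package.

```r
# Hypothetical helper, not part of the XML package.
getListAttr <- function(x, attr = "name") {
  # Node with children: attributes are assumed to sit in the ".attrs" element
  if (is.list(x)) x <- x[[".attrs"]]
  # Missing node or missing attribute: return NA instead of failing
  if (is.null(x) || !(attr %in% names(x))) return(NA_character_)
  unname(x[[attr]])
}

# Hand-built stand-ins for the two shapes described in the thread
withChild <- list(cpe = "cpe:/o:microsoft:windows",
                  .attrs = c(name = "msrpc", method = "probed", conf = "10"))
noChild <- c(name = "netbios-ssn", method = "probed", conf = "10")

getListAttr(withChild)
getListAttr(noChild)
getListAttr(noChild, "product")
```

With the list from the example it would presumably be called as getListAttr(mylist[["ports"]][[2]][["service"]]), though I have only tried it on the hand-built values above.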
> If I use useInternalNodes=F I get
>
>> mylist[["ports"]][[2]][["service"]]$.attrs["name"]
> NULL
>
> The useInternalNodes clause has proven fatally dangerous for me
> before. If I parse a tree with useInternalNodes=T, save the workspace,
> close R and reopen it, load the workspace and try to read the tree, it
> will completely crash my computer, which has already cost me too many
> lost days of work. On the other hand, useInternalNodes=F will result
> in any xml operation being ridiculously slow. So the intention was to
> move everything to a more R-friendly object like a list.
My experience with the XML package seems to be quite different from yours regarding useInternalNodes = TRUE/FALSE. I get satisfactory and stable performance with useInternalNodes = TRUE, so your experience is very puzzling to me. I never save workspaces - heck, I'm not sure what XML does with the external pointers in that case. Can you save an address and expect to get the same address later? Instead I save the XML data using saveXML, which dumps it to a nicely formed text file.
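A minimal sketch of that save-and-reload habit, assuming a throwaway document and a temporary file (the XML snippet and the variable names are invented for illustration):

```r
library(XML)

# Small stand-in document, parsed into internal (C-level) nodes
xmlText <- '<host><ports><port protocol="tcp" portid="135"><service name="msrpc"/></port></ports></host>'
doc <- xmlTreeParse(xmlText, useInternalNodes = TRUE, asText = TRUE)

# Write the tree to a text file instead of relying on a saved workspace
xmlFile <- tempfile(fileext = ".xml")
saveXML(doc, file = xmlFile)

# Later, or in a fresh session: re-parse the file to get a new internal tree
doc2 <- xmlTreeParse(xmlFile, useInternalNodes = TRUE)
service <- xmlRoot(doc2)[["ports"]][["port"]][["service"]]
reloadedName <- xmlGetAttr(service, "name")
reloadedName
```

The point is that the file holds plain text, so nothing in it depends on pointers from the session that wrote it.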
I guess I'm not much help! You might want to contact the maintainer of XML with a small example, such as the one you posted. He has been very responsive and helpful to me in the past.
Cheers,
Ben
> Best,
>
>
> Santiago
>
> 2013/4/16 Ben Tupper <btupper at bigelow.org>:
>> Hi,
>>
>> On Apr 16, 2013, at 2:49 PM, santiago gil wrote:
>>>
>>> 2013/4/14 santiago gil <sg.ccnr at gmail.com>:
>>>> Hello all,
>>>>
>>>> I have a problem with the way attributes are dealt with in the
>>>> function xmlToList(), and I haven't been able to figure it out for
>>>> days now.
>>>>
>>
>> I have not used xmlToList(), but I find what you try below works if you specify useInternalNodes = TRUE in your invocation of xmlTreeParse. Often that is the solution for many issues with xml. Also, I have found it best to write a relatively generic getter style function. So, in the example below I have written a function called getPortAttr - it will get attributes for the child node you name. I used your example as the defaults: "service" is the child to query and "name" is the attribute to retrieve from that child. It's a heck of a lot easier to write a function than building the longish parse strings with lots of [[this]][[and]][[that]] stuff, and it is reusable to boot.
>>
>> Cheers,
>> Ben
>>
>> library(XML)
>>
>> mydoc <- '<host starttime="1365204834" endtime="1365205860">
>> <status state="up" reason="echo-reply" reason_ttl="127"/>
>> <address addr="XXX.XXX.XXX.XXX" addrtype="ipv4"/>
>> <ports>
>> <port protocol="tcp" portid="135">
>> <state state="open" reason="syn-ack" reason_ttl="127"/>
>> <service name="msrpc" product="Microsoft Windows RPC" ostype="Windows" method="probed" conf="10">
>> <cpe>cpe:/o:microsoft:windows</cpe>
>> </service>
>> </port>
>> <port protocol="tcp" portid="139">
>> <state state="open" reason="syn-ack" reason_ttl="127"/>
>> <service name="netbios-ssn" method="probed" conf="10"/>
>> </port>
>> </ports>
>> <times srtt="647" rttvar="71" to="100000"/>
>> </host>'
>>
>> mytree<-xmlTreeParse(mydoc, useInternalNodes = TRUE)
>> myroot<-xmlRoot(mytree)
>>
>> myports <- myroot[["ports"]]["port"]
>>
>>
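>> # Return one attribute (attr) of the named child node (child) of node x.
>> getPortAttr <- function(x, child = "service", attr = "name") {
>>   kid <- x[[child]]             # pick the child node by name
>>   att <- xmlAttrs(kid)[[attr]]  # pull the requested attribute from it
>>   att
>> }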
>> portNames <- sapply(myports, getPortAttr)
>> #> portNames
>> # port port
>> # "msrpc" "netbios-ssn"
>> portReason <- sapply(myports, getPortAttr, child = "state", attr = "reason")
>> #> portReason
>> # port port
>> #"syn-ack" "syn-ack"
>>
>>>> Say I have a document (produced by nmap) like this:
>>>>
>>>>> mydoc <- '<host starttime="1365204834" endtime="1365205860"><status state="up" reason="echo-reply" reason_ttl="127"/>
>>>> <address addr="XXX.XXX.XXX.XXX" addrtype="ipv4"/>
>>>> <ports><port protocol="tcp" portid="135"><state state="open"
>>>> reason="syn-ack" reason_ttl="127"/><service name="msrpc"
>>>> product="Microsoft Windows RPC" ostype="Windows" method="probed"
>>>> conf="10"><cpe>cpe:/o:microsoft:windows</cpe></service></port>
>>>> <port protocol="tcp" portid="139"><state state="open"
>>>> reason="syn-ack" reason_ttl="127"/><service name="netbios-ssn"
>>>> method="probed" conf="10"/></port>
>>>> </ports>
>>>> <times srtt="647" rttvar="71" to="100000"/>
>>>> </host>'
>>>>
>>>> I want to store this as a list of lists, so I do:
>>>>
>>>> mytree<-xmlTreeParse(mydoc)
>>>> myroot<-xmlRoot(mytree)
>>>> mylist<-xmlToList(myroot)
>>>>
>>>> Now my problem is that when I want to fetch the attributes of the
>>>> services running on each port, the behavior is not consistent:
>>>>
>>>>> mylist[["ports"]][[1]][["service"]]$.attrs["name"]
>>>> name
>>>> "msrpc"
>>>>> mylist[["ports"]][[2]][["service"]]$.attrs["name"]
>>>> Error in trash_list[["ports"]][[2]][["service"]]$.attrs :
>>>> $ operator is invalid for atomic vectors
>>>>
>>>> I understand that the way they are defined in the document is not the
>>>> same, but I think there still should be a consistent behavior. I've
>>>> tried many combinations of parameters for xmlTreeParse() but nothing
>>>> has helped me. I can't find a way to call up the name of the service
>>>> consistently regardless of whether the node has children or not. Any
>>>> tips?
>>>>
>>>> All the best,
>>>>
>>>>
>>>> S.G.
>>>>
>>>> --
>>>> -------------------------------------------------------------------------------
>>>> http://barabasilab.neu.edu/people/gil/
>>
>> Ben Tupper
>> Bigelow Laboratory for Ocean Sciences
>> 60 Bigelow Drive, P.O. Box 380
>> East Boothbay, Maine 04544
>> http://www.bigelow.org
>>
Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org