[R] Failure to understand namespaces in XML::getNodeSet

Mark Sharp msharp at txbiomed.org
Wed Feb 1 05:42:32 CET 2017


Hadley,

It’s sometimes amazing the mistakes I can make. No, it did not do what I wanted, which was
read_xml(str_c(with_ns_xml, collapse = ""))
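
For anyone following along, here is a minimal sketch of what the collapse argument of str_c() is for: it takes the separator string used to join the elements into a single string, so an empty string is what glues the lines together before parsing. The vector name xml_lines is only an illustration and is not part of the original example.

```r
library(stringr)

# Hypothetical three-line document, standing in for the output of readLines()
xml_lines <- c("<a>", "<b/>", "</a>")

# Without collapse, str_c() returns one string per input element
length(str_c(xml_lines))                 # 3

# With collapse = "", the elements are joined into a single string
joined <- str_c(xml_lines, collapse = "")
joined                                   # "<a><b/></a>"
```

The single joined string is what read_xml() expects when it is given XML as text.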

Reproducible example follows:
library(stringr)
library(xml2)
## Given the correct argument value for collapse, the next two lines work
no_ns <- read_xml(str_c(no_ns_xml, collapse = ""))
with_ns <- read_xml(str_c(with_ns_xml, collapse = ""))
## The next line finds the node in the XML without a namespace
xml_find_all(no_ns, "//WorkSet//Description")
## With a namespace designated in the XML
## Neither of the next two work, though I thought the second should
xml_find_all(with_ns, "//WorkSet//Description")
xml_find_all(with_ns, "/WorkSet//Description", ns = xml_ns(with_ns))
## Using xml_ns_strip() works as predicted
xml_find_all(xml_ns_strip(with_ns), "//WorkSet//Description")
## I was surprised to find the incorrect namespace value did not matter
xml_find_all(no_ns, "//WorkSet//Description", ns = xml_ns(with_ns))
## This also seems to ignore the namespace argument value
xml_find_all(xml_ns_strip(with_ns), "/WorkSet//Description", ns = xml_ns(with_ns))
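
A likely reason the second call above finds nothing: in XPath 1.0 an unprefixed name such as WorkSet only matches elements that are in no namespace, so passing ns does not help unless the XPath itself uses a prefix. The sketch below assumes that xml_ns() reports the default namespace under the generated prefix d1, which is what xml2 documents for default namespaces; check the printed value of xml_ns(with_ns) before relying on that prefix.

```r
library(xml2)

with_ns_xml <- c("<?xml version=\"1.0\" ?>",
                 "<WorkSet xmlns=\"http://labkey.org/etl/xml\">",
                 "<Description>MFIA 9-Plex (CharlesRiver)</Description>",
                 "</WorkSet>")
with_ns <- read_xml(paste(with_ns_xml, collapse = ""))

# The default namespace is listed under a generated prefix (assumed here: d1)
ns <- xml_ns(with_ns)

# Every element name in the path carries the prefix
found <- xml_find_all(with_ns, "/d1:WorkSet//d1:Description", ns = ns)
xml_text(found)   # "MFIA 9-Plex (CharlesRiver)"
```

This keeps the namespace information in the document, whereas xml_ns_strip() discards it.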


Full output follows:
> ## Given the correct argument value for collapse, the next two lines work
> no_ns <- read_xml(str_c(no_ns_xml, collapse = ""))
> with_ns <- read_xml(str_c(with_ns_xml, collapse = ""))
> ## The next line finds the node in the XML without a namespace
> xml_find_all(no_ns, "//WorkSet//Description")
{xml_nodeset (1)}
[1] <Description>MFIA 9-Plex (CharlesRiver)</Description>
> ## With a namespace designated in the XML
> ## Neither of the next two work, though I thought the second should
> xml_find_all(with_ns, "//WorkSet//Description")
{xml_nodeset (0)}
> xml_find_all(with_ns, "/WorkSet//Description", ns = xml_ns(with_ns))
{xml_nodeset (0)}
> ## Using xml_ns_strip() works as predicted
> xml_find_all(xml_ns_strip(with_ns), "//WorkSet//Description")
{xml_nodeset (1)}
[1] <Description>MFIA 9-Plex (CharlesRiver)</Description>
> ## I was surprised to find the incorrect namespace value did not matter
> xml_find_all(no_ns, "//WorkSet//Description", ns = xml_ns(with_ns))
{xml_nodeset (1)}
[1] <Description>MFIA 9-Plex (CharlesRiver)</Description>
> ## This also seems to ignore the namespace argument value
> xml_find_all(xml_ns_strip(with_ns), "/WorkSet//Description", ns = xml_ns(with_ns))
{xml_nodeset (1)}
[1] <Description>MFIA 9-Plex (CharlesRiver)</Description>
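
For the original XML::getNodeSet question further down the thread, the same idea appears to apply: the name given to the namespace in the namespaces argument has to be used as a prefix in the XPath. In this sketch the prefix lk is an arbitrary choice of mine, not something defined by the file or the package.

```r
library(XML)

with_ns_xml <- c("<?xml version=\"1.0\" ?>",
                 "<WorkSet xmlns=\"http://labkey.org/etl/xml\">",
                 "<Description>MFIA 9-Plex (CharlesRiver)</Description>",
                 "</WorkSet>")
doc <- xmlTreeParse(paste(with_ns_xml, collapse = ""), asText = TRUE,
                    useInternalNodes = TRUE)

# Map an arbitrary prefix (lk, chosen for illustration) to the namespace URI
ns <- c(lk = "http://labkey.org/etl/xml")

# Use that prefix on every element name in the XPath
nodes <- getNodeSet(doc, "/lk:WorkSet//lk:Description", namespaces = ns)
values <- sapply(nodes, xmlValue)
values   # "MFIA 9-Plex (CharlesRiver)"
```

Naming the namespace "xmlns" while leaving the XPath unprefixed, as in my first attempt, still asks for elements in no namespace.
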
R. Mark Sharp, Ph.D.
msharp at TxBiomed.org





> On Jan 31, 2017, at 5:52 PM, Hadley Wickham <h.wickham at gmail.com> wrote:
>
> I think you want
>
> x <- read_xml('<?xml version="1.0" ?>
>  <WorkSet xmlns="http://labkey.org/etl/xml">
>  <Description>MFIA 9-Plex (CharlesRiver)</Description>
> </WorkSet>')
>
> The collapse argument doesn't do what you think it does.
>
> Hadley
>
> On Tue, Jan 31, 2017 at 5:36 PM, Mark Sharp <msharp at txbiomed.org> wrote:
>> Hadley,
>>
>> Thank you. I am able to get the xml_ns_strip() function to work with my file directly so I will likely be able to reach my immediate goal.
>>
>> However, I still have had no success with understanding the namespace problem. I am not able to use read_xml() using the object I generated for the reproducible example, which is simply a character vector of length 4 having the contents of the XML file as produced by readLines(). I then used dput() to define the structure. The resulting structure apparently is not to the liking of read_xml(). I have reproduced the necessary code here for your convenience. The error is below.
>>
>> ##
>> library(xml2)
>> library(stringr)
>> with_ns_xml <- c("<?xml version=\"1.0\" ?>",
>>                 "<WorkSet xmlns=\"http://labkey.org/etl/xml\">",
>>                 "<Description>MFIA 9-Plex (CharlesRiver)</Description>",
>>                 "</WorkSet>")
>> ## without str_c() collapse it complains of a vector of length > 1 also.
>> read_xml(str_c(with_ns_xml, collapse = TRUE))
>>
>> ## produces the following error message.
>> Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  :
>>  Start tag expected, '<' not found [4]
>>
>> I have similar issues with xml2::xml_find_all
>> xml_find_all(str_c(with_ns_xml, collapse = TRUE), "/WorkSet//Description")
>>
>> ## Produces the following error message.
>> Error in UseMethod("xml_find_all") :
>>  no applicable method for 'xml_find_all' applied to an object of class "character"
>>
>>
>>
>> R. Mark Sharp, Ph.D.
>> msharp at TxBiomed.org
>>
>>
>>
>>
>>
>>> On Jan 31, 2017, at 4:27 PM, Hadley Wickham <h.wickham at gmail.com> wrote:
>>>
>>> See the last example in ?xml2::xml_find_all or use xml2::xml_ns_strip()
>>>
>>> Hadley
>>>
>>> On Tue, Jan 31, 2017 at 9:43 AM, Mark Sharp <msharp at txbiomed.org> wrote:
>>>> I am trying to read a series of XML files that use a namespace and I have failed, thus far, to discover the proper syntax. I have a reproducible example below. I have two XML character strings defined: one without a namespace and one with. I show that I can successfully extract the node using the XML string without the namespace and fail when using the XML string with the namespace.
>>>>
>>>> Mark
>>>> PS I am having the same problem with the xml2 package and am hoping understanding one will help with the other.
>>>>
>>>> ##
>>>> library(XML)
>>>> ## The first XML text (no_ns_xml) does not have a namespace defined
>>>> no_ns_xml <- c("<?xml version=\"1.0\" ?>", "<WorkSet>",
>>>>              "<Description>MFIA 9-Plex (CharlesRiver)</Description>",
>>>>              "</WorkSet>")
>>>> l_no_ns_xml <-xmlTreeParse(no_ns_xml, asText = TRUE, getDTD = FALSE,
>>>>                          useInternalNodes = TRUE)
>>>> ## The node is found
>>>> getNodeSet(l_no_ns_xml, "/WorkSet//Description")
>>>>
>>>> ## The second XML text (with_ns_xml) has a namespace defined
>>>> with_ns_xml <- c("<?xml version=\"1.0\" ?>",
>>>>                "<WorkSet xmlns=\"http://labkey.org/etl/xml\">",
>>>>                "<Description>MFIA 9-Plex (CharlesRiver)</Description>",
>>>>                "</WorkSet>")
>>>>
>>>> l_with_ns_xml <-xmlTreeParse(with_ns_xml, asText = TRUE, getDTD = FALSE,
>>>>                              useInternalNodes = TRUE)
>>>> ## The node is not found
>>>> getNodeSet(l_with_ns_xml, "/WorkSet//Description")
>>>> ## I attempt to provide the namespace, but fail.
>>>> ns <-  "http://labkey.org/etl/xml"
>>>> names(ns)[1] <- "xmlns"
>>>> getNodeSet(l_with_ns_xml, "/WorkSet//Description", namespaces = ns)
>>>>
>>>> R. Mark Sharp, Ph.D.
>>>> Director of Data Science Core
>>>> Southwest National Primate Research Center
>>>> Texas Biomedical Research Institute
>>>> P.O. Box 760549
>>>> San Antonio, TX 78245-0549
>>>> Telephone: (210)258-9476
>>>> e-mail: msharp at TxBiomed.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> CONFIDENTIALITY NOTICE: This e-mail and any files and/or...{{dropped:10}}
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>>
>>> --
>>> http://hadley.nz
>>
>
>
>
> --
> http://hadley.nz
