[R] Failure to understand namespaces in XML::getNodeSet

Wed Feb 1 00:52:16 CET 2017

I think you want

x <- read_xml('<?xml version="1.0" ?>
  <WorkSet xmlns="http://labkey.org/etl/xml">
  <Description>MFIA 9-Plex (CharlesRiver)</Description>
</WorkSet>')

The collapse argument doesn't do what you think it does: it takes the separator string used to join the elements, not TRUE.
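
If you would rather keep the character vector from readLines(), a minimal sketch (assuming a reasonably recent xml2 and stringr) is to join the lines with a newline separator before parsing. The d1 prefix below is the name xml2 assigns to the document's default namespace, as reported by xml_ns(). Note also that xml_find_all() needs the parsed document, not the character string.

```r
library(xml2)
library(stringr)

with_ns_xml <- c("<?xml version=\"1.0\" ?>",
                 "<WorkSet xmlns=\"http://labkey.org/etl/xml\">",
                 "<Description>MFIA 9-Plex (CharlesRiver)</Description>",
                 "</WorkSet>")

# collapse is the separator placed between elements, so pass a string
xml_string <- str_c(with_ns_xml, collapse = "\n")

# Parse first; xml_find_all() has no method for character vectors
doc <- read_xml(xml_string)

# The default namespace is exposed under the prefix d1
xml_ns(doc)
desc <- xml_find_all(doc, "/d1:WorkSet//d1:Description", ns = xml_ns(doc))
xml_text(desc)
```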

Hadley

On Tue, Jan 31, 2017 at 5:36 PM, Mark Sharp <msharp at txbiomed.org> wrote:
> Hadley,
>
> Thank you. I am able to get the xml_ns_strip() function to work with my file directly so I will likely be able to reach my immediate goal.
>
> However, I still have had no success with understanding the namespace problem. I am not able to use read_xml() with the object I generated for the reproducible example, which is simply a character vector of length 4 holding the contents of the XML file as produced by readLines(). I then used dput() to define the structure. The resulting structure apparently is not to the liking of read_xml(). I have reproduced the necessary code here for your convenience. The error is below.
>
> ##
> library(xml2)
> library(stringr)
> with_ns_xml <- c("<?xml version=\"1.0\" ?>",
>                  "<WorkSet xmlns=\"http://labkey.org/etl/xml\">",
>                  "<Description>MFIA 9-Plex (CharlesRiver)</Description>",
>                  "</WorkSet>")
> ## Without the str_c() collapse, it also complains of a vector of length > 1.
> read_xml(str_c(with_ns_xml, collapse = TRUE))
>
> ## produces the following error message.
> Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  :
>   Start tag expected, '<' not found [4]
>
> I have similar issues with xml2::xml_find_all
> xml_find_all(str_c(with_ns_xml, collapse = TRUE), "/WorkSet//Description")
>
> ## Produces the following error message.
> Error in UseMethod("xml_find_all") :
>   no applicable method for 'xml_find_all' applied to an object of class "character"
>
>
>
> R. Mark Sharp, Ph.D.
> msharp at TxBiomed.org
>
>> On Jan 31, 2017, at 4:27 PM, Hadley Wickham <h.wickham at gmail.com> wrote:
>>
>> See the last example in ?xml2::xml_find_all or use xml2::xml_ns_strip()
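
For readers following along, here is a minimal sketch of the xml_ns_strip() route, assuming xml2 1.1.0 or later. It removes the default namespace from the parsed document in place, so the unprefixed XPath from the original question then matches.

```r
library(xml2)

doc <- read_xml('<WorkSet xmlns="http://labkey.org/etl/xml">
  <Description>MFIA 9-Plex (CharlesRiver)</Description>
</WorkSet>')

# Before stripping, the unprefixed path matches nothing, because
# WorkSet and Description live in the default namespace
before <- xml_find_all(doc, "/WorkSet//Description")

# xml_ns_strip() modifies doc in place
xml_ns_strip(doc)
after <- xml_find_all(doc, "/WorkSet//Description")
xml_text(after)
```

Stripping is convenient when the files use a single default namespace; if several namespaces are in play, keeping them and using prefixed XPath is probably the safer choice.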
>>
>> Hadley
>>
>> On Tue, Jan 31, 2017 at 9:43 AM, Mark Sharp <msharp at txbiomed.org> wrote:
>>> I am trying to read a series of XML files that use a namespace and I have failed, thus far, to discover the proper syntax. I have a reproducible example below. I have two XML character strings defined: one without a namespace and one with. I show that I can successfully extract the node using the XML string without the namespace and fail when using the XML string with the namespace.
>>>
>>> Mark
>>> PS I am having the same problem with the xml2 package and am hoping that understanding one will help with the other.
>>>
>>> ##
>>> library(XML)
>>> ## The first XML text (no_ns_xml) does not have a namespace defined
>>> no_ns_xml <- c("<?xml version=\"1.0\" ?>", "<WorkSet>",
>>>               "<Description>MFIA 9-Plex (CharlesRiver)</Description>",
>>>               "</WorkSet>")
>>> l_no_ns_xml <- xmlTreeParse(no_ns_xml, asText = TRUE, getDTD = FALSE,
>>>                             useInternalNodes = TRUE)
>>> ## The node is found
>>> getNodeSet(l_no_ns_xml, "/WorkSet//Description")
>>>
>>> ## The second XML text (with_ns_xml) has a namespace defined
>>> with_ns_xml <- c("<?xml version=\"1.0\" ?>",
>>>                 "<WorkSet xmlns=\"http://labkey.org/etl/xml\">",
>>>                 "<Description>MFIA 9-Plex (CharlesRiver)</Description>",
>>>                 "</WorkSet>")
>>>
>>> l_with_ns_xml <- xmlTreeParse(with_ns_xml, asText = TRUE, getDTD = FALSE,
>>>                               useInternalNodes = TRUE)
>>> ## The node is not found
>>> getNodeSet(l_with_ns_xml, "/WorkSet//Description")
>>> ## I attempt to provide the namespace, but fail.
>>> ns <-  "http://labkey.org/etl/xml"
>>> names(ns)[1] <- "xmlns"
>>> getNodeSet(l_with_ns_xml, "/WorkSet//Description", namespaces = ns)
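
The attempt above registers a prefix but the XPath never uses it. XPath 1.0 has no notion of a default namespace, so an unprefixed name such as WorkSet only matches elements in no namespace. A minimal sketch of one way this can work with the XML package follows; the prefix lk is an arbitrary name chosen for illustration, and only the URI has to match the document.

```r
library(XML)

with_ns_xml <- c("<?xml version=\"1.0\" ?>",
                 "<WorkSet xmlns=\"http://labkey.org/etl/xml\">",
                 "<Description>MFIA 9-Plex (CharlesRiver)</Description>",
                 "</WorkSet>")

# Join the lines explicitly and parse into an internal document
doc <- xmlParse(paste(with_ns_xml, collapse = "\n"), asText = TRUE)

# Map an arbitrary prefix (lk, hypothetical) to the document's namespace URI
ns <- c(lk = "http://labkey.org/etl/xml")

# Every element step in the path carries the prefix
nodes <- getNodeSet(doc, "/lk:WorkSet//lk:Description", namespaces = ns)
values <- vapply(nodes, xmlValue, character(1))
values
```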
>>>
>>> R. Mark Sharp, Ph.D.
>>> Director of Data Science Core
>>> Southwest National Primate Research Center
>>> Texas Biomedical Research Institute
>>> P.O. Box 760549
>>> San Antonio, TX 78245-0549
>>> Telephone: (210)258-9476
>>> e-mail: msharp at TxBiomed.org
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>> --
>> http://hadley.nz
>

-- 
http://hadley.nz