[R] Parsing XML?

@vi@e@gross m@iii@g oii gm@ii@com
Thu Jul 28 19:27:38 CEST 2022


Spencer,

You have lots of learning to do if you want to be able to properly play around inside XML. The list of main nodes you see is an example of the many things you can do, BUT it is very incomplete. This is an R forum, not an XML forum, so the focus you need is on which R packages not only let you read in XML properly but also let you navigate it and make queries using an underlying xpath mechanism or anything else. You must find which package meets your needs AND also be able to specify, in some logical way, what you want it to find.
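
As a rough illustration of that kind of navigation, here is a minimal sketch of how a tally of element names, like the one quoted further down, might be produced with the xml2 package. The small XML string is only a stand-in I made up from pieces of your file so the example runs without a download, and the variable names are mine.

```r
library(xml2)

# Tiny stand-in for the Library of Congress file (assumption: same overall
# nesting as the real file, which is much larger).
xml_txt <- '<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <records>
    <record>
      <recordData>
        <record xmlns="http://www.loc.gov/MARC21/slim">
          <datafield ind1="0" ind2="0" tag="245">
            <subfield code="a">Selma sun.</subfield>
          </datafield>
          <datafield ind1="1" ind2=" " tag="362">
            <subfield code="a">Began in 2015.</subfield>
          </datafield>
        </record>
      </recordData>
    </record>
  </records>
</searchRetrieveResponse>'

doc <- read_xml(xml_txt)

# "//*" selects every element, whatever its namespace.
all_nodes <- xml_find_all(doc, "//*")

# Count how often each element name occurs.
element_counts <- table(xml_name(all_nodes))
print(sort(element_counts))
```

On the real file you would pass the URL or a local path to read_xml() instead of the string.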

But if the XML data is not set up for you to find, you are wasting your time and ours.

I looked at your raw XML file and searched for Dates by looking for things like "Nov" and my suspicion is they are NOT set up with fields specifying things like First Date, Last Date or anything like that. There may be other dates formatted in ways that may not be obvious, like number of seconds since 1970. Do you have access to a description of what the file should look like?

The dates I found seem to often be parts of comments.

If you were using xpath, you might look carefully at something like "//recordData/record/datafield/subfield", as I found dates such as:

    <datafield ind1=" " ind2=" " tag="588">
      <subfield code="a">Description based on: Nov. 4, 2010 (surrogate); title from caption.</subfield>
    </datafield>


    <datafield ind1="0" ind2=" " tag="362">
      <subfield code="a">Vol. 1, no. 1 (Nov. 16, 1994)-</subfield>
    </datafield>

    <datafield ind1="1" ind2=" " tag="362">
      <subfield code="a">Began in May 1992; ceased with Jan. 10, 2013.</subfield>
    </datafield>

Note this entry tells when it ceased. The next tells when something began:

    <datafield ind1="1" ind2=" " tag="362">
      <subfield code="a">Began in Jan. 1990.</subfield>
    </datafield>

BUT looking for patterns, they are NOT all found with code="a" as in this case:

    <datafield ind1=" " ind2=" " tag="321">
      <subfield code="a">Semiweekly,</subfield>
      <subfield code="b"><Apr. 4, 1990-></subfield>
    </datafield>
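
The examples above could be pulled out in R with something like the following sketch. It assumes the xml2 package, and it calls xml_ns_strip() because the real file declares default namespaces, which would otherwise keep a plain path like the one I wrote from matching. The year pattern is just a crude screen I chose for illustration, not a real date parser.

```r
library(xml2)

# Two of the datafields shown above, wrapped so the string parses on its own.
# In real XML the angle brackets inside the text are escaped as &lt; and &gt;.
xml_txt <- '<recordData>
  <record xmlns="http://www.loc.gov/MARC21/slim">
    <datafield ind1="1" ind2=" " tag="362">
      <subfield code="a">Began in May 1992; ceased with Jan. 10, 2013.</subfield>
    </datafield>
    <datafield ind1=" " ind2=" " tag="321">
      <subfield code="a">Semiweekly,</subfield>
      <subfield code="b">&lt;Apr. 4, 1990-&gt;</subfield>
    </datafield>
  </record>
</recordData>'

doc <- read_xml(xml_txt)
xml_ns_strip(doc)  # drop default namespaces so unprefixed names match

subfields <- xml_find_all(doc, "//recordData/record/datafield/subfield")

# One row per subfield: the parent datafield tag, the subfield code, the text.
hits <- data.frame(
  tag  = vapply(subfields, function(n) xml_attr(xml_parent(n), "tag"),
                character(1)),
  code = xml_attr(subfields, "code"),
  text = xml_text(subfields),
  stringsAsFactors = FALSE
)

# Crude screen: keep text containing a four-digit year from 1600 to 2099.
has_year <- grepl("\\b(1[6-9]|20)[0-9]{2}\\b", hits$text, perl = TRUE)
dated <- hits[has_year, ]
print(dated)
```

What comes back is still free text, so deciding what each date means remains a job for a human or for further pattern matching.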

My guess is this XML pretty much has the same info from your perspective as the JSON version, albeit a challenge to search. It likely does NOT have the info you need for your speculation about when a newspaper or magazine STOPPED publishing, except maybe in a note here and there that needs a human.

You can write R software that, after reading the XML, performs searches and lets you print selected results. For example, a subfield with the attribute code="t" is clearly the title of a periodical.
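
A search by attribute might look like this sketch, again assuming the xml2 package. The tag="245" entry is copied from your file. The tag="780" entry and its title are invented purely so the code="t" query has something to find.

```r
library(xml2)

# Stand-in record: the 245 field is from the file, the 780 field is made up.
xml_txt <- '<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield ind1="0" ind2="0" tag="245">
    <subfield code="a">Selma sun.</subfield>
  </datafield>
  <datafield ind1="0" ind2="0" tag="780">
    <subfield code="t">Example gazette</subfield>
  </datafield>
</record>'

doc <- read_xml(xml_txt)
xml_ns_strip(doc)  # drop the default namespace so unprefixed names match

# Every subfield carrying code="t", wherever it sits in the document.
t_titles <- xml_text(xml_find_all(doc, "//subfield[@code='t']"))

# A narrower query: code="a" inside the datafield with tag="245".
main_title <- xml_text(
  xml_find_first(doc, "//datafield[@tag='245']/subfield[@code='a']")
)

print(t_titles)
print(main_title)
```

Combining a tag test on the datafield with a code test on the subfield is usually safer than matching on the code alone, since the same code letter is reused under many tags.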

Again, well-designed XML might have had the data you want. If this was info about people and contained fields identifiable as a date of birth and a date of death, you could play games. But it does not seem to be and is more like notes about what issues something is in than when a periodical stopped publishing.

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Spencer Graves
Sent: Thursday, July 28, 2022 6:53 AM
To: Richard O'Keefe <raoknz using gmail.com>
Cc: R-help <r-help using r-project.org>
Subject: Re: [R] Parsing XML?

Hi, Richard et al.:


On 7/28/22 1:50 AM, Richard O'Keefe wrote:
> What do you mean by "a list that I can understand"?
> A quick tally of the number of XML elements by identifier:
> 1 echoedSearchRetrieveRequest
> 1 frbrGrouping
> 1 maximumRecords
> 1 nextRecordPosition
> 1 numberOfRecords
> 1 query
> 1 records
> 1 resultSetIdleTime
> 1 searchRetrieveResponse
> 1 servicelevel
> 1 sortKeys
> 1 startRecord
> 1 wskey
> 2 version
> 50 leader
> 50 recordData
> 51 recordPacking
> 51 recordSchema
> 100 record
> 105 controlfield
> 923 datafield
> 1900 subfield


	  How did you get that?


	  Please forgive me for being so dense.  I've done several web searches 
and tried to work several tutorials, etc., without so far seeing what I 
might do that could be informative.


	  Even this list of "XML elements by identifiers" STILL does not 
include things like the name of the newspaper and publisher plus start 
and end dates.  I believe these fields are there, but I can't see how to 
parse them.  I earlier parsed a JSON version of essentially the same 
dataset.  However, the JSON version seemed not to distinguish between 
newspapers that were still publishing and those for which the end date 
was unknown.  My contact at the Library of Congress then suggested I 
parse the XML version.


	  Thanks,
	  Spencer

> 
> What of this information do you actually want?
> The elements of the list should be what?
> 
> 
> On Thu, 28 Jul 2022 at 08:52, Spencer Graves 
> <spencer.graves using effectivedefense.org 
> <mailto:spencer.graves using effectivedefense.org>> wrote:
> 
>     Hello, All:
> 
> 
>                What would you suggest I do to parse the following XML
>     file into a
>     list that I can understand:
> 
> 
>     XMLfile <-
>     "https://chroniclingamerica.loc.gov/data/bib/worldcat_titles/bulk5/ndnp_Alabama_all-yrs_e_0001_0050.xml"
> 
> 
> 
> 
>                This is the first of 6666 XML files containing "U.S.
>     Newspaper
>     Directory" maintained by the US Library of Congress discussed in the
>     thread below.  I've tried various things using the XML and xml2.
> 
> 
>     XMLdata <- xml2::read_xml(XMLfile)
>     str(XMLdata)
>     XMLdat <- XML::xmlParse(XMLdata)
>     str(XMLdat)
>     XMLtxt <- xml2::xml_text(XMLdata)
>     nchar(XMLtxt)
>     #[1] 29415
> 
> 
>                Someplace there's a schema for this.  I don't know if
>     it's embedded
>     in this XML file or in a separate file.  If it's in a separate file,
>     how
>     could I describe it to my contacts with the Library of Congress so they
>     would understand what I needed and could help me get it.
> 
> 
>                Thanks,
>                Spencer Graves
> 
> 
>     p.s.  All 29415 characters in XMLtext appear in the thread below.
> 
> 
>     -------- Forwarded Message --------
>     Subject:        [Newspapers and Current Periodicals] How can I get
>     counts of
>     the numbers of newspapers by year in the US, and preferably also
>     elsewhere? A search of "U.S. Newspaper Directory,
>     Date:   Wed, 27 Jul 2022 14:59:03 +0000
>     From:   Kerry Huller <serials using ask.loc.gov <mailto:serials using ask.loc.gov>>
>     To:     Spencer Graves <spencer.graves using effectivedefense.org
>     <mailto:spencer.graves using effectivedefense.org>>
>     CC: twes using loc.gov <mailto:twes using loc.gov>
> 
> 
> 
>     --# Type your reply above this line #--
> 
>     ------------------------------------------------------------------------
> 
>     Newspapers and Current Periodicals Reference Librarian
> 
>     Jul 27 2022, 10:59am via System
> 
>     Hello Spencer,
> 
>     So, when I view the xml, I'm actually looking at it in XML editor
>     software, so I can view the tags and it's structured neatly. I've
>     copied
>     and pasted the text from the beginning of the file and the first
>     newspaper title below from my XML editor:
> 
>     <?xml version="1.0" encoding="UTF-8" standalone="no"?>
>     <?xml-stylesheet type='text/xsl'
>     href='/webservices/catalog/xsl/searchRetrieveResponse.xsl'?>
> 
>     <searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/"
>     xmlns:oclcterms="http://purl.org/oclc/terms/"
>     xmlns:dc="http://purl.org/dc/elements/1.1/"
>     xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/"
>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
>     <version>1.1</version>
>     <numberOfRecords>2250</numberOfRecords>
>     <records>
>     <record>
>     <recordSchema>info:srw/schema/1/marcxml</recordSchema>
>     <recordPacking>xml</recordPacking>
>     <recordData>
>     <record xmlns="http://www.loc.gov/MARC21/slim">
>            <leader>00000nas a22000007i 4500</leader>
>            <controlfield tag="001">1030438981</controlfield>
>            <controlfield tag="008">180404c20159999aluwr n       0   a0eng
>        </controlfield>
>            <datafield ind1=" " ind2=" " tag="010">
>              <subfield code="a">  2018200464</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="040">
>              <subfield code="a">DLC</subfield>
>              <subfield code="e">rda</subfield>
>              <subfield code="c">DLC</subfield>
>              <subfield code="b">eng</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="012">
>              <subfield code="m">1</subfield>
>            </datafield>
>            <datafield ind1="0" ind2=" " tag="022">
>              <subfield code="a">2577-5316</subfield>
>              <subfield code="2">1</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="032">
>              <subfield code="a">021110</subfield>
>              <subfield code="b">USPS</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="037">
>              <subfield code="b">711 Alabama Avenue, Selma, AL
>     36701</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="042">
>              <subfield code="a">nsdp</subfield>
>              <subfield code="a">pcc</subfield>
>            </datafield>
>            <datafield ind1="1" ind2="0" tag="050">
>              <subfield code="a">ISSN RECORD</subfield>
>            </datafield>
>            <datafield ind1="1" ind2="0" tag="082">
>              <subfield code="a">071</subfield>
>              <subfield code="2">15</subfield>
>            </datafield>
>            <datafield ind1=" " ind2="0" tag="222">
>              <subfield code="a">Selma sun</subfield>
>            </datafield>
>            <datafield ind1="0" ind2="0" tag="245">
>              <subfield code="a">Selma sun.</subfield>
>            </datafield>
>            <datafield ind1=" " ind2="1" tag="264">
>              <subfield code="a">Selma, AL :</subfield>
>              <subfield code="b">North Shore Press, LLC</subfield>
>              <subfield code="c">2016-</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="310">
>              <subfield code="a">Weekly</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="336">
>              <subfield code="a">text</subfield>
>              <subfield code="b">txt</subfield>
>              <subfield code="2">rdacontent</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="337">
>              <subfield code="a">unmediated</subfield>
>              <subfield code="b">n</subfield>
>              <subfield code="2">rdamedia</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="338">
>              <subfield code="a">volume</subfield>
>              <subfield code="b">nc</subfield>
>              <subfield code="2">rdacarrier</subfield>
>            </datafield>
>            <datafield ind1="1" ind2=" " tag="362">
>              <subfield code="a">Began in 2015.</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="588">
>              <subfield code="a">Description based on: Volume 2, Issue 40
>     (October 5, 2017) (surrogate); title from caption.</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="588">
>              <subfield code="a">Latest issue consulted: Volume 2, Issue 40
>     (October 5, 2017).</subfield>
>            </datafield>
>            <datafield ind1=" " ind2=" " tag="752">
>              <subfield code="a">United States</subfield>
>              <subfield code="b">Alabama</subfield>
>              <subfield code="c">Dallas</subfield>
>              <subfield code="d">Selma.</subfield>
>            </datafield>
>          </record>
>     </recordData>
>     </record>
> 
>     When I view the records in the XML editor, these 2 lines below do begin
>     each of the records for each individual title, but of course this is
>     including the xml tags:
> 
>     <recordSchema>info:srw/schema/1/marcxml</recordSchema>
>     <recordPacking>xml</recordPacking>
> 
>     Hopefully this helps you decide where to break or parse each record.
> 
>     On another note, I just noticed as well that at the top of this first
>     file it lists the total number of records for the Alabama grouping -
>     2250. This also appeared to be the case for the Alaska records when I
>     took a look at the first one for that state. I imagine that should be
>     consistent throughout each "grouping" of records.
> 
>     Let me know if you have follow-up questions!
> 
>     Best wishes,
> 
>     Kerry Huller
>     Newspaper & Current Periodical Reading Room
>     Serial & Government Publications Division
>     Library of Congress
> 
>     ------------------------------------------------------------------------
> 
>     Newspapers and Current Periodicals Reference Librarian
> 
>     Jul 27 2022, 10:21am via Email
> 
>     Hi, Kerry:
> 
> 
>     Thanks. I understand the chunking in files of at most 50. I've read
>     the first file "ndnp_Alabama_all-yrs_e_0001_0050.xml" into a string of
>     29415 characters, copied below. Might you have any suggestions on the
>     next step in parsing this? Staring at it now, it looks like splitting on
>     "info:srw/schema/1/marcxmlxml" might convert the 29415 characters into
>     shorter chunks, each of which could then be parsed further.
> 
> 
>     This is not as bad as reading ancient Egyptian hieroglyphics without
>     the Rosetta Stone, but I wondered if you might have something that could
>     make this work easier and more reliable? I guess I could compare with
>     what I already read as JSON ;-)
> 
> 
>     Thanks,
>     Spencer Graves
> 
