[R] XML to CSV
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Wed Jan 4 23:08:59 CET 2017
Andrew... you really need to understand the outline/tree nature of your XML schema to understand why blanks might appear in your data when you try to squeeze it into a rectangular layout like CSV. Opening the file in a modern Web browser like Firefox can help you see the forest among the trees, since browsers can collapse/expand the subtrees.

Keep in mind that understanding how XML works is really not the purpose of this list, but there are lots of books and tutorials about it, such as the one mentioned by Ben. If your schema has many irregular subtrees it may be a poor match for fitting into one CSV, and you might need to resort to putting it into a relational group of CSV files... but relational schema design is another off-topic area of study for this list.

Once you know what you want to accomplish a little better (enough to make example input and output data sets) we can help you more with the R coding aspect of your problem... but guessing at your needs with no access to data is really not an effective use of anyone's time.
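
To make the "relational group of CSV files" idea concrete, here is a minimal sketch using xml2 on a made-up document. The element names (patients, patient, name, visit, date, code), the id attribute and the output file names are all invented for illustration and are only an assumption about what a nested schema might look like; your real file will differ. Each level of nesting gets its own table, and the child table carries the parent's id as a key, so a patient with no visits just contributes no rows instead of a row of blanks.

```r
library(xml2)

# Hypothetical toy document; every name in it is invented for this sketch.
doc <- read_xml('<patients>
  <patient id="p1">
    <name>Ann</name>
    <visit><date>2016-01-05</date><code>A10</code></visit>
    <visit><date>2016-03-12</date><code>B20</code></visit>
  </patient>
  <patient id="p2">
    <name>Bob</name>
  </patient>
</patients>')

patients <- xml_find_all(doc, "/patients/patient")

# Table 1: one row per patient.
# xml_find_first() returns a missing node (hence NA text) when a child is absent.
patient_tbl <- data.frame(
  patient_id = xml_attr(patients, "id"),
  name       = xml_text(xml_find_first(patients, "./name")),
  stringsAsFactors = FALSE
)

# Table 2: one row per visit, keyed back to the patient it belongs to.
visit_list <- vector("list", length(patients))
for (i in seq_along(patients)) {
  p <- patients[[i]]
  v <- xml_find_all(p, "./visit")
  visit_list[[i]] <- data.frame(
    patient_id = rep(xml_attr(p, "id"), length(v)),
    date       = xml_text(xml_find_first(v, "./date")),
    code       = xml_text(xml_find_first(v, "./code")),
    stringsAsFactors = FALSE
  )
}
visit_tbl <- do.call(rbind, visit_list)

# Two related CSV files instead of one wide, mostly blank one.
write.csv(patient_tbl, file.path(tempdir(), "patients.csv"), row.names = FALSE)
write.csv(visit_tbl, file.path(tempdir(), "visits.csv"), row.names = FALSE)
```

Whether two tables, thirteen, or some other split is right depends entirely on the real schema and on what the target database expects, which is why example input and output are needed.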
--
Sent from my phone. Please excuse my brevity.
On January 4, 2017 12:45:08 PM PST, Ben Tupper <btupper at bigelow.org> wrote:
>Hi,
>
>You should keep replies on the list - you never know when someone will
>swoop in with the right answer to make your life easier.
>
>Below is a simple example that uses xpath syntax to identify (and in
>this case retrieve) children that match your xpath expression. xpath
>expressions are sort of like /a/directory/structure/description so you
>can visualize elements of XML like nested folders or subdirectories.
>
>Hopefully this will get you started. There is a lot more on xpath here:
>http://www.w3schools.com/xml/xml_xpath.asp . There are other extraction
>tools in xml2 - just type ?xml2 at the command prompt to see more.
>
>Since you have more deeply nested elements you'll need to play with
>this a bit first.
>
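>library(xml2)
>uri = 'http://www.w3schools.com/xml/simple.xml'
>x = read_xml(uri)  # download and parse the example document
>
># "//name" matches every <name> element anywhere in the tree
>name_nodes = xml_find_all(x, "//name")
>name = xml_text(name_nodes)
>
># price is kept as text
>price_nodes = xml_find_all(x, "//price")
>price = xml_text(price_nodes)
>
># calories is converted to numeric
>calories_nodes = xml_find_all(x, "//calories")
>calories = xml_double(calories_nodes)
>
># one row per item; this assumes every item has all three children
>X = data.frame(name, price, calories, stringsAsFactors = FALSE)
>write.csv(X, file = 'foo.csv')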
>
>Cheers,
>Ben
>
>> On Jan 4, 2017, at 2:13 PM, Andrew Lachance <alachanc at bates.edu>
>wrote:
>>
>> Hello Ben,
>>
>> Thank you for the advice. I am extremely new to any sort of coding so
>I have learned a lot already. Essentially, I was given an XML file and
>was told to convert all of it to a csv so that it could be uploaded
>into a database. Unfortunately the information I am working with is
>medical information and I can't really share it. I initially tried to
>convert it using online programs, however that ended up with a large
>number of blank spaces that weren't useful for uploading into the
>database.
>>
>> So essentially, my goal is to parse all the data in the XML to a
>coherent, succinct CSV that could be uploaded. In the document, there
>are 361 patient files with 13 subcategories for each patient, which
>further branch off to around 150 categories total. Since I am so new,
>I have been having a hard time seeing the bigger picture or knowing if
>there are any intermediary steps that will prevent all the blank spaces
>that the online conversion programs created.
>>
>> I will look through the information on the xml2 package. Any advice
>or recommendations would be greatly appreciated as I have felt fairly
>stuck. Once again, thank you very much for your help.
>>
>> Best,
>> Andrew
>>
>> On Tue, Jan 3, 2017 at 2:29 PM, Ben Tupper <btupper at bigelow.org> wrote:
>> Hi,
>>
>> It's hard to know what to advise - much depends upon the XML data you
>have and what you want to extract from it. Without knowing about those
>two things there is little anyone could do to help. Can you post
>example data to the internet and provide the link here? Then state
>explicitly what you want to have in hand at the end.
>>
>> If you are just starting out I suggest that you try xml2 package
>( https://cran.r-project.org/web/packages/xml2/ ) rather than XML
>package ( https://cran.r-project.org/web/packages/XML/ ). I have been
>using it much more since the authors added the ability to create xml
>nodes (rather than just extracting data from existing xml nodes).
>>
>> Cheers,
>> Ben
>>
>> P.S. Hello to my niece Olivia S on the Bates EMS team.
>>
>>
>> > On Jan 3, 2017, at 11:27 AM, Andrew Lachance <alachanc at bates.edu> wrote:
>> >
>> > http://stats.stackexchange.com/questions/254328/how-to-convert-a-large-xml-file-to-a-csv-file-using-r
>> >
>> > I am completely new to R and have tried to use several functions
>within the
>> > xml packages to convert an XML to a csv and have had little
>success. Since
>> > I am so new, I am not sure what the necessary steps are to complete
>this
>> > conversion without a lot of NA.
>> >
>> > --
>> > Andrew D. Lachance
>> > Chief of Service, Bates Emergency Medical Service
>> > Residence Coordinator, Hopkins House
>> > Bates College Class of 2017
>> > alachanc at bates.edu
>> > (207) 620-4854
>> >
>> > [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>> Ben Tupper
>> Bigelow Laboratory for Ocean Sciences
>> 60 Bigelow Drive, P.O. Box 380
>> East Boothbay, Maine 04544
>> http://www.bigelow.org
>>
>