[R] Extracting complete information from XML data file using R-Nested Lists

Oliver Keyes okeyes at wikimedia.org
Sun Jan 24 21:19:28 CET 2016


Hey Sowmiyan,

I would recommend taking a look at the xml2 package, rather than the XML
package, for a start. It's a lot more structured, and traversing between
elements is far easier :)
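
As a rough sketch of what that could look like (I am guessing at the
layout of Dummy.xml from your column names, so the inline document, the
XPath expressions and the helper name field() are all assumptions, not
your real file): pick out the records with xml_find_all(), then pull one
field per column with xml_find_first(), which yields NA where a record
has no such element instead of shifting the other values along.

```r
library(xml2)

# Hypothetical stand-in for Dummy.xml; the real file's layout is assumed.
doc <- read_xml('<Nominations>
  <Nomination>
    <DateCreated>2007-11-25T17:01:32</DateCreated>
    <Creator referenceNumber="15351">
      <UserAccountName>mkolker</UserAccountName>
    </Creator>
    <AdditionalComment>Good work</AdditionalComment>
    <DateIssued>2007-11-25</DateIssued>
  </Nomination>
  <Nomination>
    <DateCreated>2007-11-25T17:18:01</DateCreated>
    <Creator referenceNumber="15351">
      <UserAccountName>mkolker</UserAccountName>
    </Creator>
    <AdditionalComment>Nicely Performed</AdditionalComment>
  </Nomination>
</Nominations>')

# One node per record.
records <- xml_find_all(doc, "//Nomination")

# Hypothetical helper: text of the first match per record, NA if absent.
field <- function(nodes, xpath) xml_text(xml_find_first(nodes, xpath))

out <- data.frame(
  DateCreated             = field(records, "./DateCreated"),
  Creator.UserAccountName = field(records, "./Creator/UserAccountName"),
  Creator.referenceNumber =
    xml_attr(xml_find_first(records, "./Creator"), "referenceNumber"),
  AdditionalComment       = field(records, "./AdditionalComment"),
  DateIssued              = field(records, "./DateIssued"),
  stringsAsFactors = FALSE
)
```

From there, write.csv(out, "dummy.csv", row.names = FALSE) should give one
row per record, with every column present in every row.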

On 24 January 2016 at 12:27, sowmiyan <sowmiyan0508 at gmail.com> wrote:
> I am working with an XML file, which can be found at this link: Sample XML file
> <https://www.dropbox.com/s/8kn9g8xev2u5n8o/Dummy.xml?dl=0&preview=Dummy.xml>
>
> I am trying to extract every field's information to a CSV file. I
> want my output to look like the following. Required output:
> *Total of 20 columns and 2 rows*
> DateCreated DateModified Creator.UserAccountName Creator.PersonName
> Creator..attrs.referenceNumber Modifier.UserAccountName Modifier.PersonName
> Modifier..attrs.referenceNumber AdditionalEmailStr AdditionalComment
> DateIssued DocumentaryInstructions NominationParcel.attr.Referencenumber
> NominationParcel.SecondContractNumber
> NominationParcel.Coordinator.RefernceNumber
> NominationParcel.Coordinator.Username NominationParcel.Coordinator.Email
> NominationParcel.Coordinator.Office.Name
> NominationParcel.Coordinator.Office.Email
> NominationParcel.Coordinator.Office.attrs.referenceNumber
> Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker Merryn Kolker
> 15351 mkolker Merryn Kolker 15351 Good work   7 sam
> Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker Merryn Kolker
> 15351 mkolker Merryn Kolker 15351 Nicely Performed   10 107 102
>
> But I am not able to get my output in the required format. I have tried
> two different approaches.
>
> 1. Below is my first attempt. The problem with it is that NULL fields are
> not captured correctly, so data spills over into the wrong columns. I am
> also not able to capture all the fields of the nested lists in the XML.
>
> *Code 1*
>
>   library(XML)
>   doc <- xmlParse("Dummy.xml")
>   lst <- xmlToList(doc)
>   cols <- c("DateCreated", "DateModified", "Creator", "Modifier",
>             "AdditionalEmailStr", "AdditionalComment", "DateIssued",
>             "DocumentaryInstructions", "NominationParcel")
>   # Note: f() indexes with the global `cols`, not its argument `col`.
>   f <- function(col) do.call(rbind, lapply(lst, function(x) unlist(x[cols])))
>   res <- setNames(lapply(cols, f), cols)
>   list2env(res, .GlobalEnv)
> *Output 1*
>
>
> DateCreated DateModified Creator.UserAccountName Creator.PersonName
> Creator..attrs.referenceNumber Modifier.UserAccountName Modifier.PersonName
> Modifier..attrs.referenceNumber AdditionalComment
> NominationParcel.Coordinator.UserAccountName
> NominationParcel.Coordinator.Office..attrs.referenceNumber
> NominationParcel.Coordinator..attrs.referenceNumber
> NominationParcel..attrs.referenceNumber
> Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker Merryn Kolker
> 15351 mkolker Merryn Kolker 15351 Good Work sam 7
> Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker Merryn Kolker
> 15351 mkolker Merryn Kolker 15351 Nicely performed 102 107 10
> 2007-11-25T17:18:01
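
The spillover in Output 1 looks consistent with rbind() recycling: when
the vectors it is given have different lengths, the shorter ones are
recycled to fill the row, so values land under the wrong column. Here is
a minimal base-R sketch with made-up records (the names recs, all_names
and align are mine, purely for illustration) that pads each record by
name before binding:

```r
# Two made-up records; the second one lacks field "b".
recs <- list(
  c(a = "1", b = "2", c = "3"),
  c(a = "4", c = "6")
)

# rbind() recycles the shorter vector (with a warning), so "6" ends up
# in column "b" of the second row.
spilled <- suppressWarnings(do.call(rbind, recs))

# Pad every record to the union of field names; absent fields become NA.
all_names <- unique(unlist(lapply(recs, names)))
align <- function(x) setNames(x[all_names], all_names)
fixed <- do.call(rbind, lapply(recs, align))
```

Indexing a named vector by a name it does not have returns NA, which is
what keeps the columns lined up in `fixed`.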
>
> 2. To avoid information spilling from one cell into another because of
> NULL values, I used for loops to replace the NULL cells with NA. With this
> I was able to capture the correct data, but I could not get all of the
> fields present in the XML.
>
> *Code 2*
>
>   library(XML)
>   doc <- xmlParse("Dummy.xml")
>   lstsub <- xmlToList(doc)
>   # Replace NULL entries with NA, up to three levels deep.
>   for (i in 1:length(lstsub)) {
>     for (j in 1:length(lstsub[[i]])) {
>       lstsub[[i]][[j]] <-
>         ifelse(is.null(lstsub[[i]][[j]]), NA, lstsub[[i]][[j]])
>       if (length(lstsub[[i]][[j]]) > 1) {
>         for (k in 1:length(lstsub[[i]][[j]])) {
>           lstsub[[i]][[j]][[k]] <-
>             ifelse(is.null(lstsub[[i]][[j]][[k]]), NA, lstsub[[i]][[j]][[k]])
>           if (length(lstsub[[i]][[j]][[k]]) > 1) {
>             for (l in 1:length(lstsub[[i]][[j]][[k]])) {
>               lstsub[[i]][[j]][[k]][[l]] <-
>                 ifelse(is.null(lstsub[[i]][[j]][[k]][[l]]), NA,
>                        lstsub[[i]][[j]][[k]][[l]])
>             }
>           }
>         }
>       }
>     }
>   }
>   cols <- c("DateCreated", "DateModified", "Creator", "Modifier",
>             "AdditionalEmailStr", "AdditionalComment", "DateIssued",
>             "DocumentaryInstructions", "NominationParcel")
>   # Note: f() indexes with the global `cols`, not its argument `col`.
>   f <- function(col) do.call(rbind, lapply(lstsub, function(x) unlist(x[cols])))
>   res <- setNames(lapply(cols, f), cols)
>   list2env(res, .GlobalEnv)
>   write.csv(Creator, "dummy_2.csv")
>
> *Output 2*
>
>             DateCreated DateModified    Creator Modifier
>  AdditionalEmailStr  AdditionalComment   DateIssued  DocumentaryInstructions
>
> Nomination  2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker mkolker NA
>  Good Work   NA  NA
> Nomination  2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker mkolker NA
>  Nicely performed    NA  NA
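
I suspect the missing nested fields in Output 2 come from ifelse(): with a
length-one test it returns a length-one result, so a nested list such as
Creator is cut down to its first element before the inner loops ever see
it. A small recursive function avoids that and works at any depth. This
is only a sketch on a made-up record (null_to_na and rec are names I
have invented; rec stands in for one element of the xmlToList() output):

```r
# Recursively replace NULL leaves with NA, at any depth.
null_to_na <- function(x) {
  if (is.null(x)) return(NA)
  if (is.list(x)) return(lapply(x, null_to_na))
  x
}

# Made-up nested record with NULL fields at two levels.
rec <- list(
  DateCreated        = "2007-11-25T17:01:32",
  AdditionalEmailStr = NULL,
  Creator            = list(UserAccountName = "mkolker", PersonName = NULL)
)

# What the ifelse() approach does to a nested list: only one element is kept.
truncated <- ifelse(is.null(rec$Creator), NA, rec$Creator)

# The recursive version keeps every field, so unlist() yields a full row.
flat <- unlist(null_to_na(rec))
```

Applying null_to_na() to each record before unlist() gives every record
the same set of names, provided the NULL fields are present in the list
to begin with.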
>
> Could somebody please help me get the required output?
>
> I have posted the same question on Stack Overflow; the link is below (it
> might help give a clearer picture):
>
> http://stackoverflow.com/questions/34963724/extracting-complete-information-from-nested-lists-in-xml-to-a-data-frame-using-r/34963821#34963821
>
>
> Regards,
> Sowmiyan



-- 
Oliver Keyes
Count Logula
Wikimedia Foundation


