[R] Relational Databases or XML?

Martin Morgan mtmorgan at fhcrc.org
Thu Apr 10 23:31:00 CEST 2008


Harold -- you'll really want to check out the XML package. xmlTreeParse 
+ xpathApply provides a very flexible solution. As a recent example, 
parsing 189 XML files to extract 4 attributes from deeply nested 
elements into a data frame:

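library(XML)  # provides xmlTreeParse, xpathApply, xmlValue

fls <- list.files('~/runBrowser', pattern=".*xml", full.names=TRUE)
f <- function(fl) {
     # run XPath query q and return the matches as a plain vector
     xq <- function(xml, q)
         unlist(xpathApply(xml, q, xmlValue, namespaces="xsi"))
     # internal nodes keep the parsed tree in C, as XPath queries require
     xml <- xmlTreeParse(fl, useInternalNodes=TRUE)
     # tile-level values are repeated 4 times to line up with the
     # image-level rows
     data.frame(idx=rep(as.numeric(xq(xml, "//xsi:tile/@idx")), each=4),
         lane=rep(as.numeric(xq(xml, "//xsi:tile/@lane")), each=4),
         base=xq(xml, '//xsi:image/@base'),
         medSigInt=as.numeric(xq(xml, "//xsi:sgnInt/@median")))
}
# one data frame per file, stacked into a single data frame
res <- do.call('rbind', lapply(fls, f))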

'res' has 54800 rows and 4 columns. The XML stays in C, so this is fast. 
The data can be effectively (your mileage may vary) visualized with 
lattice, e.g.,

library(lattice)
xyplot(log(medSigInt)~idx|lane*base, res, strip=FALSE, pch=".", cex=2)
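
To try the xmlTreeParse + XPath pattern without a directory of files, here is a minimal sketch on an inline document. Everything in it is made up for illustration: the XML string, the element names (run, tile, image, sgnInt), the values, and the name 'toy'. It has no namespace, so the namespaces argument is dropped, and it swaps in xmlGetAttr on the matched elements rather than querying the attributes directly, so treat it as a variant of the approach rather than the same code.

```r
library(XML)

# Hypothetical document: 2 tiles, each holding 2 images (made-up values)
txt <- '<run>
  <tile idx="1" lane="1">
    <image base="A"><sgnInt median="10.5"/></image>
    <image base="C"><sgnInt median="12.0"/></image>
  </tile>
  <tile idx="2" lane="1">
    <image base="A"><sgnInt median="9.5"/></image>
    <image base="C"><sgnInt median="11.0"/></image>
  </tile>
</run>'

# Parse from a string; internal nodes are needed for XPath queries
xml <- xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)

# Find all elements matching XPath q and pull one attribute from each
xa <- function(xml, q, attr) xpathSApply(xml, q, xmlGetAttr, attr)

# Tile-level values are repeated twice to line up with the 2 images per tile
toy <- data.frame(
    idx=rep(as.numeric(xa(xml, "//tile", "idx")), each=2),
    lane=rep(as.numeric(xa(xml, "//tile", "lane")), each=2),
    base=xa(xml, "//image", "base"),
    medSigInt=as.numeric(xa(xml, "//sgnInt", "median")))
```

'toy' should end up with one row per image; the each= value has to match the number of images per tile in whatever files you actually have.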

Martin

Doran, Harold wrote:
> I'm not sure it is possible to parse an XML file in R directly. Well, I
> guess it's *possible*, but may not be the best way to do it. ElementTree
> in Python is an easy-to-use parser that you might use to first parse
> your XML file (or other hierarchically structured data), organize it
> any way you want, and then bring those data into R for subsequent
> analysis.
> 
> In fact, I have recently done just this. I have another statistical
> program that outputs data as an XML file. So, I wrote a python program
> that parses that XML file, pulls out the data of interest into a text
> file, and then I bring those data into R for analysis.
> 
>> -----Original Message-----
>> From: r-help-bounces at r-project.org 
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Keith Alan 
>> Chamberlain
>> Sent: Thursday, April 10, 2008 4:14 PM
>> To: r-help at r-project.org
>> Subject: [R] Relational Databases or XML?
>>
>> Dear R-Help,
>>
>> I am working on a paper in an R course for large file support 
>> in R using scan(), relational databases, and XML. I have 
>> never used SQL or hierarchical document formats such as XML 
>> (except where it occurs without user interaction), and 
>> knowledge in RDBs and XML is lacking in my program. I have 
>> tried finding a working example for the novice's novice on the 
>> topic, read many postings, the R Data Import/Export manual several 
>> times, and descriptions of packages RODBC, DBI, XML, among 
>> others. I understand that RDBs are (assumed at least) used 
>> widely among the R community. I have not been able to put all 
>> of the pieces together, but assuming that RDB use is actually 
>> quite widespread, it should be quite easy to fill me in 
>> and/or correct my understanding where necessary.
>>
>> For a cross-platform solution (PC/OSX at least, or in part) 
>> my questions/problems are about what preliminary steps are 
>> needed to get an SQL or XML query "to work" in R to begin 
>> with, what the appropriate data-file formats are, and how to 
>> convert to them if starting out with data in, say, a 
>> delimited ASCII text file. Very basic examples should 
>> suffice, say, a table with 20 random observations, a grouping 
>> variable with 2 levels, and a factor with 2 levels.
>>
>> ## untested code
>> set.seed(1024)
>> write.table(
>>   data.frame(Subj=c(rep(1,10),rep(2,10)),
>>              block=rep(c(rep(-1,5),rep(1,5)),2),
>>              obs=rnorm(20,0,1)),
>>   file="junk.txt")
>>
>> Specifically,
>>
>> 1- what are the minimum required non R components that are 
>> needed to support SQL or XML functionality, which may or may 
>> not need to be installed?
>>
>> 2- what R packages need to be installed, at a minimum (also 
>> as a cross-PC/Mac solution if possible or at least as much as 
>> possible)
>>
>> 3- I keep seeing reference to connections of a given name "if 
>> previously setup". What kind of setup is needed outside of R, if any?
>>
>> 4- what steps are needed in R to then connect to a file and 
>> import a subset based on a query?
>>
>> 5- Do I then use standard R routines (e.g. write()) to export 
>> as a DB, or an RDB/XML specific function?
>>
>> Sincerely,
>> KeithC. [U.S]
>>
>> 1/k^c
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


