[R] Loading problem with XML_1.9

Duncan Temple Lang duncan at wald.ucdavis.edu
Mon Jul 9 21:54:23 CEST 2007


Weijun and I corresponded off-list so that I could
get a copy of the data.

On a relatively modest machine with 2G of RAM, 10G of swap,
and dual-core 1GHz 64-bit AMDs, the code below takes approximately
100 seconds. It is not optimized in any particular way, so
there is room for improvement.

library(XML)

doc <- xmlTreeParse("mi1.txt.gz", useInternalNodes = TRUE)
mols <- getNodeSet(doc, "//molecule")

ans =
lapply(mols,
        function(node, targets) {
           # ".//" keeps each sub-query relative to this <molecule> node
           names = as.character(xpathApply(node, ".//name/text()",
                                             xmlValue))
           if(any(names %in% targets))
             xpathApply(node, ".//moleculeName", xmlValue)
           else
             character()
        }, c("Vps4b", "SKD1", "frm-1"))

# keep only the molecules that matched at least one target name
ans = ans[sapply(ans, length) > 0]


We can read the file without uncompressing it, which probably
slows things down slightly.
Parsing the tree takes about 20 seconds and occupies
approximately 1G (very roughly).
Then we find all the <molecule> nodes, of which there
are 25452.  Then we loop over each of these,
do a sub-query using XPath, and see if
the text child of any <name> node is in the set
of interest (entirely made up for my test), and if so
fetch the content of any <moleculeName> within this
<molecule> node.

It would be nice if we could build the hash for targets
just once.
And we could get clever with the XPath query to try to do
the matching and selection in one query. This might actually slow things
down.
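
As a rough sketch of the single-query idea (untested on the real file, so
no claim is made about speed): the set of targets is pasted into one XPath
predicate, which is built only once, and the matching and selection happen
in a single pass. The tiny document and its <molecules> root element are
made up for illustration and only mimic the structure of the sample
records; the sketch also assumes no target name contains a single quote.

```r
library(XML)

# Made-up miniature document mimicking the structure of the sample records.
txt <- "<molecules>
<molecule>
  <name>SKD1<prov><im><imid>20</imid></im></prov></name>
  <name>Vps4b<prov><im><imid>20</imid></im></prov></name>
  <interaction><moleculeName>SBP1</moleculeName></interaction>
  <interaction><moleculeName>mVps2</moleculeName></interaction>
</molecule>
<molecule>
  <name>GDS1<prov><im><imid>30</imid></im></prov></name>
  <interaction><moleculeName>RAC1</moleculeName></interaction>
</molecule>
</molecules>"
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

targets <- c("Vps4b", "SKD1", "frm-1")

# Build the predicate once:
#   name/text() = 'Vps4b' or name/text() = 'SKD1' or ...
# text() is compared rather than the whole <name> node because each
# <name> also contains a <prov> child whose content would be appended.
cond  <- paste(sprintf("name/text() = '%s'", targets), collapse = " or ")
query <- sprintf("//molecule[%s]//moleculeName", cond)

# One query does both the matching and the selection.
hits <- xpathSApply(doc, query, xmlValue)
```

Note that this returns one flat vector of <moleculeName> values, so the
grouping by <molecule> that the lapply() version keeps is lost.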

(There are garbage collection issues with XPath sub-queries
for which I am still deciding on the optimal strategy.)

So perhaps the lesson here is that for those working with XML,
XPath is worth trying before more specialized approaches,
and that large XML data files can fit into memory. The tree
does not use contiguous memory, so nodes can be squeezed into
available spaces.

D.


Luo Weijun wrote:
> Thanks, Dr. Lang,
> I used xmlEventParse() + branches concept as you
> suggested, it really works, and the memory issue is
> gone. Now I can query large XML files from within R.
> But here is another problem: it is too slow (a simple
> query has not finished after 1.5h), even though the
> number of relevant records is very limited. The
> whole XML file has more than 500 thousand
> similarly-structured records, and the parser has to go
> through all of them to find the matches. Attached
> is part of the XML files with two records. I am trying
> to retrieve the content of <moleculeName> nodes from
> <molecule> records where <name> nodes bear specific
> gene names.
> Is it possible to locate based on node content (or
> xmlValue) rather than node names (since they are the
> same in all records) first and then parse the xml
> record locally? Would query based on XPath be faster
> in this case? I understand that we do have the
> facility in the XML package for XPath based queries,
> called getNodeSet(). But that requires reading the
> whole XML tree into the memory first, which is not
> feasible for my large XML file. Or can I call
> XML::XPath statements using your R-Perl interface
> package? Any suggestions/thoughts? Thank you!
> Weijun
> 
> 
> Part of my XML file: 
> 
> <molecule>
> <prov><im><imid>20</imid></im></prov><moleculeID>119043</moleculeID>
> <moleculeType>protein<prov><im><imid>20</imid></im></prov></moleculeType>
> <organismID>10090<prov><im><imid>20</imid></im></prov></organismID>
> <id><prov><im><imid>20</imid></im></prov><idType>GI</idType><idValue>6677981</idValue></id>
> <name>SKD1<prov><im><imid>20</imid></im></prov></name>
> <name>Vps4b<prov><im><imid>20</imid></im></prov></name>
> <name>8030489C12Rik<prov><im><imid>20</imid></im></prov></name>
> <description><distribution><value>Mouse homologue of yeast Vacuolar protein sorting 4 (Vps4); Suppressor of potassium transport defect 1. Member of mammalian class E Vps proteins involved in endosomal transport; AAA-type ATPase.<prov><im><imid>20</imid></im></prov></value>
> <value>Mouse homologue of yeast  Vacuolar protein sorting 4 (Vps4); Suppressor of potassium  transport defect 1. Member of  mammalian class E Vps proteins involved in endosomal transport; AAA-type ATPase.<prov><im><imid>20</imid></im></prov></value></distribution></description>
> <orthologue>
> <method><methodID>337974</methodID><methodName>miClust80</methodName></method>
> </orthologue>
> <variant>
> <prov><im><imid>20</imid></im></prov><variantID>0</variantID>
> </variant>
> <interaction><interactionRef>201581</interactionRef><moleculeRef>89434</moleculeRef><moleculeName>SBP1</moleculeName>
> <selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
> <interaction><interactionRef>201582</interactionRef><moleculeRef>17953</moleculeRef><moleculeName>mVps2</moleculeName>
> <selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
> </molecule>
> 
> <molecule>
> <prov><im><imid>30</imid></im></prov><moleculeID>116226</moleculeID>
> <moleculeType>protein<prov><im><imid>30</imid></im></prov></moleculeType>
> <organismID>9606<prov><im><imid>30</imid></im></prov></organismID>
> <id><prov><im><imid>30</imid></im></prov><idType>HGNC</idType><idValue>9859</idValue></id>
> <name>RAP1GDS1<prov><im><imid>30</imid></im></prov></name>
> <name>GDS1<prov><im><imid>30</imid></im></prov></name>
> <name>MGC118859<prov><im><imid>30</imid></im></prov></name>
> <name>MGC118861<prov><im><imid>30</imid></im></prov></name>
> <variant>
> <prov><im><imid>30</imid></im></prov><variantID>0</variantID>
> </variant>
> <interaction><interactionRef>93569</interactionRef><moleculeRef>116280</moleculeRef><moleculeName>RAC1</moleculeName>
> <selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
> <interaction><interactionRef>104132</interactionRef><moleculeRef>103040</moleculeRef><moleculeName>RHOA</moleculeName>
> <selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
> <interaction><interactionRef>121818</interactionRef><moleculeRef>74726</moleculeRef><moleculeName>MBIP</moleculeName>
> <selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
> </molecule>
> 
> --- Duncan Temple Lang <duncan at wald.ucdavis.edu>
> wrote:
> 
>> Well, as you mention at the end of the mail,
>> several people have given you suggestions about
>> how to solve the problem using different approaches.
>> You might search on the Web for how to install a 64
>> bit version of libxml2?
>> Using xmlTreeParse(, useInternalNodes = TRUE) is an approach
>> to reducing the memory consumption as is using the handlers
>> argument. And if size is really the issue, you should consider
>> the SAX model which is very memory efficient and made available
>> via the xmlEventParse() function in the XML package.
>> And it even provides the concepts of branches to provide a
>> hybrid of SAX and DOM-style parsing together.
>>
>> However, to solve the problem of the xmlMemDisplay
>> symbol not being found, you can look for where
>> that is used and remove it.    It is in src/DocParse.c
>> in the routine RS_XML_MemoryShow().  You can remove the line
>>   xmlMemDisplay(stderr)
>> or indeed the entire routine.  Then re-install and
>> reload the package.
>>
>>  D.
>>
>>
>> Luo Weijun wrote:
>>> Hello Dr. Lang and all,
>>> I posted this message in R-help mail list, but haven’t
>>> solved my problem so far. Therefore, could you help me
>>> look at it?
>>> I have loading problem with XML_1.9 under 64 bit
>>> R2.3.1 for Mac OS X, which I got from
>>> http://R.research.att.com/. XML_1.9 works fine under
>>> 32 bit R2.5.0. I thought that could be installation
>>> problem, and I tried install.packages or biocLite,
>>> every time the package installed fine, except some
>>> warning messages below:
>>> ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
>>> ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
>>> ld64 warning: in /usr/lib/libiconv.dylib, file does not contain requested architecture
>>> ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
>>> ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
>>>
>>> Here is the error messages I got, when XML is loaded:
>>>> library(XML)
>>> Error in dyn.load(x, as.logical(local), as.logical(now)) : 
>>>         unable to load shared library '/usr/local/lib64/R/library/XML/libs/XML.so':
>>>   dlopen(/usr/local/lib64/R/library/XML/libs/XML.so, 6): Symbol not found: _xmlMemDisplay
>>>   Referenced from: /usr/local/lib64/R/library/XML/libs/XML.so
>>>   Expected in: flat namespace
>>> Error: .onLoad failed in 'loadNamespace' for 'XML'
>>> Error: package/namespace load failed for 'XML'
>>>
>>> Session information
>>>> sessionInfo()
>>> Version 2.3.1 Patched (2006-06-27 r38447) 
>>> powerpc64-apple-darwin8.7.0 
>>>
>>> attached base packages:
>>> [1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets" 
>>> [7] "base"     
>>>
>>> Prof Brian Ripley also suggested that this could be
>>> that I don’t have a 64-bit version of libxml2
>>> installed. Where I get it and where/how to install it,
>>> if that’s the problem? 
>>> The reason I need to use R64 is that I have memory
>>> limitation issue with R 32 bit version when I load
>>> some very large XML trees (the data file is about
>>> 800M). And Martin suggested me to use 'handler'
>>> argument of xmlTreeParse, tried 'handler' with
>>> useInternalNodes=T, but I still got this memory
>>> problem with R 32 bit version. Please tell me what I
>>> can do now. Thank you so much!
>>> Weijun
>>>
>>>
>>> ______________________________________________
>>> R-help at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
> 


