[R] Loading problem with XML_1.9

Luo Weijun luo_weijun at yahoo.com
Sun Jul 8 21:30:11 CEST 2007


Thanks, Dr. Lang,
I used xmlEventParse() with the branches concept as you
suggested. It really works, and the memory issue is
gone: I can now query large XML files from within R.
But there is another problem: it is too slow (a simple
query had not finished after 1.5 hours). The number of
relevant records is very small, but the whole XML file
has more than 500 thousand similarly structured records,
and the parser has to go through all of them to find
the matches. Attached is part of the XML file, with two
records. I am trying to retrieve the content of the
<moleculeName> nodes from <molecule> records whose
<name> nodes bear specific gene names.
Is it possible to locate records based on node content
(or xmlValue) rather than node names (since they are
the same in all records) first, and then parse the
matching XML record locally? Would a query based on
XPath be faster in this case? I understand that the XML
package has a facility for XPath-based queries,
getNodeSet(), but that requires reading the whole XML
tree into memory first, which is not feasible for my
large XML file. Or can I call XML::XPath statements
using your R-Perl interface package? Any
suggestions/thoughts? Thank you!
Weijun
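
To make the XPath route mentioned above concrete, here is a minimal sketch that runs such a query on a cut-down copy of the attached records. The <molecules> root element and the variable names are assumptions made for the example, since the real root element is not shown. The whole tree is held in memory, so this only suits files that fit in RAM.

```r
library(XML)

# Cut-down sample; the <molecules> root is an assumed wrapper element.
sample_xml <- paste0(
  "<molecules>",
  "<molecule>",
  "<name>SKD1<prov><im><imid>20</imid></im></prov></name>",
  "<name>Vps4b<prov><im><imid>20</imid></im></prov></name>",
  "<interaction><moleculeName>SBP1</moleculeName></interaction>",
  "<interaction><moleculeName>mVps2</moleculeName></interaction>",
  "</molecule>",
  "<molecule>",
  "<name>RAP1GDS1<prov><im><imid>30</imid></im></prov></name>",
  "<interaction><moleculeName>RAC1</moleculeName></interaction>",
  "</molecule>",
  "</molecules>"
)

# Parse into libxml2's internal (C-level) tree so XPath can be used.
doc <- xmlTreeParse(sample_xml, asText = TRUE, useInternalNodes = TRUE)

# Select <moleculeName> under every <molecule> that has a <name> whose
# own text (not the nested <prov> content) equals the gene name.
query <- "//molecule[name/text() = 'SKD1']/interaction/moleculeName"
partners <- as.character(xpathSApply(doc, query, xmlValue))
print(partners)
```

The predicate compares name/text() rather than name itself because the string value of <name> also includes the text of the nested <prov> element.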


Part of my XML file: 

<molecule>
<prov><im><imid>20</imid></im></prov><moleculeID>119043</moleculeID>
<moleculeType>protein<prov><im><imid>20</imid></im></prov></moleculeType>
<organismID>10090<prov><im><imid>20</imid></im></prov></organismID>
<id><prov><im><imid>20</imid></im></prov><idType>GI</idType><idValue>6677981</idValue></id>
<name>SKD1<prov><im><imid>20</imid></im></prov></name>
<name>Vps4b<prov><im><imid>20</imid></im></prov></name>
<name>8030489C12Rik<prov><im><imid>20</imid></im></prov></name>
<description><distribution><value>Mouse homologue of
yeast Vacuolar protein sorting 4 (Vps4); Suppressor of
potassium transport defect 1. Member of mammalian
class E Vps proteins involved in endosomal transport;
AAA-type ATPase.<prov><im><imid>20</imid></im></prov></value><value>Mouse
homologue of yeast  Vacuolar protein sorting 4
(Vps4); Suppressor of potassium  transport defect 1.
Member of  mammalian class E Vps proteins involved in
endosomal transport; AAA-type
ATPase.<prov><im><imid>20</imid></im></prov></value></distribution></description>
<orthologue>
<method><methodID>337974</methodID><methodName>miClust80</methodName></method>
</orthologue>
<variant>
<prov><im><imid>20</imid></im></prov><variantID>0</variantID>
</variant>
<interaction><interactionRef>201581</interactionRef><moleculeRef>89434</moleculeRef><moleculeName>SBP1</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
<interaction><interactionRef>201582</interactionRef><moleculeRef>17953</moleculeRef><moleculeName>mVps2</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
</molecule>

<molecule>
<prov><im><imid>30</imid></im></prov><moleculeID>116226</moleculeID>
<moleculeType>protein<prov><im><imid>30</imid></im></prov></moleculeType>
<organismID>9606<prov><im><imid>30</imid></im></prov></organismID>
<id><prov><im><imid>30</imid></im></prov><idType>HGNC</idType><idValue>9859</idValue></id>
<name>RAP1GDS1<prov><im><imid>30</imid></im></prov></name>
<name>GDS1<prov><im><imid>30</imid></im></prov></name>
<name>MGC118859<prov><im><imid>30</imid></im></prov></name>
<name>MGC118861<prov><im><imid>30</imid></im></prov></name>
<variant>
<prov><im><imid>30</imid></im></prov><variantID>0</variantID>
</variant>
<interaction><interactionRef>93569</interactionRef><moleculeRef>116280</moleculeRef><moleculeName>RAC1</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
<interaction><interactionRef>104132</interactionRef><moleculeRef>103040</moleculeRef><moleculeName>RHOA</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
<interaction><interactionRef>121818</interactionRef><moleculeRef>74726</moleculeRef><moleculeName>MBIP</moleculeName>
<selfVariantRef>0</selfVariantRef><partnerVariantRef>0</partnerVariantRef></interaction>
</molecule>
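
For readers following along, the branches approach applied to records like the two above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code used for the timings in this message: find_partners(), own_text(), sample_xml and the <molecules> root element are hypothetical names, and the sample is cut down from the attached records. A streaming parser still has to read every record, so this sketch only keeps the per-record work small; it is not claimed to fix the speed problem.

```r
library(XML)

# Hypothetical helper: text sitting directly inside a node, ignoring
# text that belongs to nested elements such as <prov>.
own_text <- function(node) {
  out <- ""
  for (kid in xmlChildren(node)) {
    if (xmlName(kid) == "text") out <- paste0(out, xmlValue(kid))
  }
  trimws(out)
}

# Hypothetical function: stream through the input and, for each
# <molecule> whose <name> matches one of `genes`, collect the
# <moleculeName> values of its <interaction> children.
find_partners <- function(source, genes, asText = FALSE) {
  hits <- list()
  on_molecule <- function(node, ...) {
    names_found <- character()
    partners <- character()
    for (kid in xmlChildren(node)) {
      tag <- xmlName(kid)
      if (tag == "name") {
        names_found <- c(names_found, own_text(kid))
      } else if (tag == "interaction") {
        for (g in xmlChildren(kid)) {
          if (xmlName(g) == "moleculeName") {
            partners <- c(partners, xmlValue(g))
          }
        }
      }
    }
    matched <- intersect(names_found, genes)
    # Store results under the first matching gene name.
    if (length(matched) > 0) hits[[matched[1]]] <<- partners
    NULL
  }
  # SAX parse; each complete <molecule> is handed to on_molecule().
  xmlEventParse(source, handlers = NULL,
                branches = list(molecule = on_molecule),
                asText = asText)
  hits
}

# Cut-down sample; the <molecules> root is an assumed wrapper element.
sample_xml <- paste0(
  "<molecules>",
  "<molecule>",
  "<name>SKD1<prov><im><imid>20</imid></im></prov></name>",
  "<name>Vps4b<prov><im><imid>20</imid></im></prov></name>",
  "<interaction><moleculeName>SBP1</moleculeName></interaction>",
  "<interaction><moleculeName>mVps2</moleculeName></interaction>",
  "</molecule>",
  "<molecule>",
  "<name>RAP1GDS1<prov><im><imid>30</imid></im></prov></name>",
  "<interaction><moleculeName>RAC1</moleculeName></interaction>",
  "</molecule>",
  "</molecules>"
)

res <- find_partners(sample_xml, c("SKD1"), asText = TRUE)
print(res)
```

For a real file, the first argument would be the file name and asText left at FALSE. Only character vectors are kept from each record, so nothing refers to a node once its handler has returned.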

--- Duncan Temple Lang <duncan at wald.ucdavis.edu>
wrote:

> 
> Well, as you mention at the end of the mail,
> several people have given you suggestions about
> how to solve the problem using different approaches.
> You might search on the Web for how to install a 64
> bit version of libxml2?
> Using xmlTreeParse(, useInternalNodes = TRUE) is an
> approach to reducing the memory consumption as is
> using the handlers argument. And if size is really the
> issue, you should consider the SAX model which is very
> memory efficient and made available via the
> xmlEventParse() function in the XML package.
> And it even provides the concepts of branches to
> provide a hybrid of SAX and DOM-style parsing together.
> 
> However, to solve the problem of the xmlMemDisplay
> symbol not being found, you can look for where
> that is used and remove it. It is in src/DocParse.c
> in the routine RS_XML_MemoryShow(). You can remove
> the line
>   xmlMemDisplay(stderr)
> or indeed the entire routine. Then re-install and
> reload the package.
> 
>  D.
> 
> 
> Luo Weijun wrote:
> > Hello Dr. Lang and all,
> > I posted this message to the R-help mailing list, but
> > haven’t solved my problem so far. Therefore, could you
> > help me look at it?
> > I have a loading problem with XML_1.9 under 64-bit
> > R 2.3.1 for Mac OS X, which I got from
> > http://R.research.att.com/. XML_1.9 works fine under
> > 32-bit R 2.5.0. I thought it could be an installation
> > problem, so I tried install.packages and biocLite;
> > every time the package installed fine, except for the
> > warning messages below:
> > ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
> > ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
> > ld64 warning: in /usr/lib/libiconv.dylib, file does not contain requested architecture
> > ld64 warning: in /usr/lib/libz.dylib, file does not contain requested architecture
> > ld64 warning: in /usr/lib/libxml2.dylib, file does not contain requested architecture
> > 
> > Here are the error messages I got when XML is loaded:
> >> library(XML)
> > Error in dyn.load(x, as.logical(local), as.logical(now)) :
> >         unable to load shared library '/usr/local/lib64/R/library/XML/libs/XML.so':
> >   dlopen(/usr/local/lib64/R/library/XML/libs/XML.so, 6): Symbol not found: _xmlMemDisplay
> >   Referenced from: /usr/local/lib64/R/library/XML/libs/XML.so
> >   Expected in: flat namespace
> > Error: .onLoad failed in 'loadNamespace' for 'XML'
> > Error: package/namespace load failed for 'XML'
> > 
> > Session information
> >> sessionInfo()
> > Version 2.3.1 Patched (2006-06-27 r38447) 
> > powerpc64-apple-darwin8.7.0 
> > 
> > attached base packages:
> > [1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"
> > [7] "base"     
> > 
> > Prof Brian Ripley also suggested that the cause could
> > be that I don’t have a 64-bit version of libxml2
> > installed. Where do I get it, and where/how do I
> > install it, if that’s the problem?
> > The reason I need to use R64 is that I hit a memory
> > limitation with the 32-bit version of R when I load
> > some very large XML trees (the data file is about
> > 800M). Martin suggested that I use the 'handlers'
> > argument of xmlTreeParse. I tried 'handlers' with
> > useInternalNodes=T, but I still got this memory
> > problem with the 32-bit version of R. Please tell me
> > what I can do now. Thank you so much!
> > Weijun



       