[R] xmlEventParse returning trimmed content?

Duncan Temple Lang duncan at wald.ucdavis.edu
Fri Apr 10 18:46:15 CEST 2009


Just to hopefully complete this for the record.
Johannes' attachment didn't make it to the list, but
after an off-list conversation, I believe the problem
is not trimming of text (i.e. removing leading and trailing whitespace)
but apparent truncation of the text in an XML node.
The xmlEventParse() function is intended for very large documents
for which memory consumption may be problematic. As a result,
it passes the text in an XML node to the event handler functions in
segments, e.g. of up to approximately 3000 characters.
If one expects all the text for a node to arrive in
the first call to the text handler, this may not be the case
for large strings. Instead, one should concatenate the segments
across calls to the text handler and process the complete
string when the end of the node is encountered.
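
The approach described above can be sketched roughly as follows. This is
only a minimal illustration, not the code from the original script: the
in-memory document, the "state" environment and the variable names are
made up for the example, and the element name "peaks" is borrowed from
the function quoted below. The text handler only collects segments; the
endElement handler joins them once the node is complete.

```r
library(XML)

# Hypothetical test document: one <peaks> node holding 8000 characters,
# long enough that the parser may deliver its text in several segments.
doc <- paste0("<run><scan num='1' msLevel='2'><peaks>",
              paste(rep("ABCD", 2000), collapse = ""),
              "</peaks></scan></run>")

# Parser state is kept in its own environment (an assumption of this
# sketch) rather than in .GlobalEnv.
state <- new.env()
state$inPeaks <- FALSE
state$chunks  <- character()   # segments seen so far for the current node
state$peaks   <- character()   # one complete string per <peaks> node

handlers <- list(
  startElement = function(name, attrs, ...) {
    if (name == "peaks") {
      state$inPeaks <- TRUE
      state$chunks  <- character()   # start a fresh buffer
    }
  },
  text = function(content, ...) {
    # May be called several times for one node: only accumulate here.
    if (state$inPeaks)
      state$chunks <- c(state$chunks, content)
  },
  endElement = function(name, ...) {
    if (name == "peaks") {
      # The node is complete: join the segments and store the result.
      state$peaks   <- c(state$peaks, paste(state$chunks, collapse = ""))
      state$inPeaks <- FALSE
    }
  }
)

invisible(xmlEventParse(doc, handlers = handlers, asText = TRUE, trim = FALSE))
peaks_text <- state$peaks
```

Any processing of the node's content (e.g. decoding the peak list) would
then go in the endElement handler, or be applied to state$peaks after the
parse has finished.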

   D.

Johannes Graumann wrote:
> Hi Duncan,
> 
> Thanks for your thoughts. "trim=FALSE" does not fix my issues, so I attach 
> pared down versions of my script and data file. Thanks for any further hint.
> 
> Joh
> 
> Duncan Temple Lang wrote:
> 
>> Hi Johannes
>>
>>   I would "guess" that the trimming of the text occurs because
>> you do not specify trim = FALSE in the call to xmlEventParse().
>> If you specify this, you might well get the results you expect.
>> If not, can you post the actual file you are reading so we can
>> reproduce your results.
>>
>>    D.
>>
>> Johannes Graumann wrote:
>>> Hello,
>>>
>>> I wrote the function below and have the problem, that the "text" bit
>>> returns only a trimmed version (686 chars as far as I can see) of the
>>> content under the "fetchPeaks" condition.
>>> Any hunches why that might be?
>>>
>>> Thanks for pointer, Joh
>>>
>>> xmlEventParse(fileName,
>>>     list(
>>>       startElement=function(name, attrs){
>>>         if(name == "scan"){
>>>           if(.GlobalEnv$ms2Scan == TRUE & .GlobalEnv$scanDone == TRUE){
>>>             cat(.GlobalEnv$scanNum,"\n")
>>>             MakeSpektrumEntry()
>>>           }
>>>           .GlobalEnv$scanDone <- FALSE
>>>           .GlobalEnv$fetchPrecMz <- FALSE
>>>           .GlobalEnv$fetchPeaks <- FALSE
>>>           .GlobalEnv$ms2Scan <- FALSE
>>>           if(attrs[["msLevel"]] == "2"){
>>>             .GlobalEnv$ms2Scan <- TRUE
>>>             .GlobalEnv$scanNum <- as.integer(attrs[["num"]])
>>>           }
>>>         } else if(name == "precursorMz" & .GlobalEnv$ms2Scan == TRUE){
>>>           .GlobalEnv$fetchPrecMz <- TRUE
>>>         } else if(name == "peaks" & .GlobalEnv$ms2Scan == TRUE){
>>>           .GlobalEnv$fetchPeaks <- TRUE
>>>         }
>>>       },
>>>       text=function(text){
>>>         if(.GlobalEnv$fetchPrecMz == TRUE){
>>>           .GlobalEnv$precursorMz <- as.numeric(text)
>>>           .GlobalEnv$fetchPrecMz <- FALSE
>>>         }
>>>         if(.GlobalEnv$fetchPeaks == TRUE){
>>>           .GlobalEnv$peaks <- text
>>>           .GlobalEnv$fetchPeaks <- FALSE
>>>           .GlobalEnv$scanDone <- TRUE
>>>         }
>>>       }
>>>     )
>>>   )
>>>
>>> > sessionInfo()
>>> R version 2.9.0 beta (2009-04-03 r48277)
>>> x86_64-pc-linux-gnu
>>>
>>> locale:
>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=en_US.UTF-8;LC_ADDRESS=en_US.UTF-8;LC_TELEPHONE=en_US.UTF-8;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] splines   stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>>  [1] caMassClass_1.6 MASS_7.2-46     digest_0.3.1    caTools_1.9
>>>  [5] bitops_1.0-4.1  rpart_3.1-43    nnet_7.2-46     e1071_1.5-19
>>>  [9] class_7.2-46    PROcess_1.19.1  Icens_1.15.2    survival_2.35-4
>>> [13] RCurl_0.94-1    XML_2.3-0       rkward_0.5.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] tools_2.9.0
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html and provide commented,
>>> minimal, self-contained, reproducible code.



