[BioC] silent data corruption in flowFlowJo, and fix
htl10 at users.sourceforge.net
Mon Mar 15 03:20:00 CET 2010
Commit r41352 from j.gosink broke flowFlowJo's BioC nightly check for most of summer/autumn 2009. Just before the BioC 2.5 code freeze, p.aboyoun committed r42419, which uses iconv() to strip multibyte data so that the nightly check passes. Unfortunately it "fixes" some FlowJo workspace files but breaks others. I finally found the time to look at it - it is actually fairly serious and causes silent data corruption. Here is the fix - please review and commit.
The underlying issue is this: FlowJo workspace files are, in most (all?) cases, XML with iso8859-1 encoding (a.k.a. 'latin1'). With win32 R, which defaults to codepage 1252 (a superset of latin1), R check passes - everything is in latin1 and the data stripping has no effect. On Linux and other "modern" unix systems, which default to UTF-8, R check fails - not all iso8859-1 text is valid UTF-8 text and vice versa. In addition, the multibyte data strip causes data corruption.
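
To illustrate the failure mode at the byte level, here is a minimal sketch in Python (not the package's R code). The sample string is made up, and errors="ignore" is only a rough stand-in for the iconv()-based strip. It shows that a latin1 byte sequence can be invalid UTF-8, that dropping the offending bytes loses data without any error, and that decoding with the correct encoding keeps everything.

```python
# Hypothetical keyword value containing a micro sign, encoded as ISO-8859-1.
latin1_bytes = "CD4 5\u00b5l".encode("latin-1")  # the micro sign becomes byte 0xB5

# In a UTF-8 locale the same bytes are not valid text:
# 0xB5 is a continuation byte with no lead byte before it.
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Dropping the undecodable bytes "works" without error, but the value has changed.
stripped = latin1_bytes.decode("utf-8", errors="ignore")

# Decoding with the encoding the file was actually written in preserves the value.
correct = latin1_bytes.decode("latin-1")

print(valid_utf8, repr(stripped), repr(correct))
```

The stripped value differs from the original by one character and nothing signals the loss, which is why the corruption goes unnoticed.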
The proper fix is to query libxml2 about the xml encoding and set the encoding explicitly - it is a substantial rewrite. As a side-effect, the code may run faster as well - most of the gsub() calls do not need to be 'g'. The regular expressions are only concerned with manipulating the header and only need to match the first instance.
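
The two ideas above can be sketched as follows, again in Python rather than R and not as the actual patch. The helper declared_encoding() and the element names in the sample document are hypothetical, and a regular expression on the XML declaration stands in for asking libxml2. The sketch reads the declared encoding, decodes with it explicitly, and rewrites only the first match, which is the equivalent of sub() instead of gsub().

```python
import re

def declared_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or `default` if none.

    Hypothetical helper: a stand-in for querying the XML parser itself.
    """
    m = re.match(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return m.group(1).decode("ascii") if m else default

# Made-up workspace snippet; the element and attribute names are illustrative only.
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<Workspace><Keyword value="5\u00b5l"/><Sample encoding="raw"/></Workspace>'
).encode("latin-1")

enc = declared_encoding(raw)   # taken from the file, not from the locale
text = raw.decode(enc)         # explicit decode, nothing is stripped

# Header manipulation only needs the first match: count=1 stops after it,
# like sub() rather than gsub() in R.
header_fixed = re.sub(r'encoding="[^"]*"', 'encoding="UTF-8"', text, count=1)

print(enc)
print(header_fixed)
```

Stopping at the first match avoids scanning the rest of a large workspace file, and in this sketch it also leaves the later encoding="raw" attribute untouched.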