[Bioc-devel] Invalid multibyte string

Sean Davis sdavis2 at mail.nih.gov
Sat Mar 4 03:04:32 CET 2006


I have a user of GEOquery that is having these problems.

> getGEO('GSE94')
trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/data/geo/by_series/GSE94_family.soft.gz'
ftp data connection made, file length 100291059 bytes
opened URL
=================================================
downloaded 97940Kb

File stored at:
/tmp/RtmpA18116/GSE94.soft.gz
Parsing....
^PLATFORM = GPL218
Error in substr(x, start = matches + patlen, stop = 1e+07) :
        invalid multibyte string
In addition: Warning messages:
1: input string 1 is invalid in this locale in: grep.perl(pattern, x,
ignore.case, value, useBytes)
2: input string 1 is invalid in this locale in: grep.perl(pattern, x,
ignore.case, value, useBytes)
3: input string 1 is invalid in this locale in: grep.perl(pattern, x,
ignore.case, value, useBytes)
4: input string 42 is invalid in this locale in: grep.perl(pattern, x,
ignore.case, value, useBytes)
5: input string 67 is invalid in this locale in: grep.perl(pattern, x,
ignore.case, value, useBytes)
6: input string 80 is invalid in this locale in: grep.perl(pattern, x,
ignore.case, value, useBytes)
7: input string 4 is invalid in this locale in: regexpr(pattern, text,
extended, fixed, useBytes)
8: input string 29 is invalid in this locale in: regexpr(pattern, text,
extended, fixed, useBytes)


Here is the output of Sys.getlocale():

> Sys.getlocale ()
 [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

Setting the locale to "C" results in the code working, except that he then
gets:

> Sys.setlocale("LC_ALL","C")
 [1] "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"
> gse<-getGEO('GSE94')
trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/data/geo/by_series/GSE94_family.soft.gz'
ftp data connection made, file length 100291059 bytes
opened URL
=================================================
downloaded 97940Kb

File stored at:
/tmp/RtmpA18116/GSE94.soft.gz
Parsing....
^PLATFORM = GPL218 
*** glibc detected *** double free or corruption (!prev): 0x08bf1d80 ***
Aborted
[root@phylo juan]#


And, finally, his sessionInfo():
> sessionInfo()
R version 2.2.1, 2005-12-20, i686-redhat-linux-gnu

attached base packages:
[1] "methods"   "stats"    "graphics"  "grDevices" "utils"     "datasets"
[7] "base"

other attached packages:
GEOquery
 "1.5.5"

I assume there is a bug lurking in one of the libraries or in R itself that
causes the double free.  It appears to happen only under the "C" locale.
However, it also appears that he needs to use the "C" locale rather than
UTF-8 to read the file as downloaded from GEO.  Any suggestions on what to
try?  Is there something I can do to fix the problem, or is this in the
user's court?
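
For reference, here is a minimal sketch of one workaround that might apply on
the package side: re-encode the lines to UTF-8 before any regular-expression
calls, so that they are valid in a UTF-8 locale.  This assumes the offending
bytes in the GEO file are Latin-1, which has not been checked.  The helper
name to_utf8 and the sample string are hypothetical, and none of this is
current GEOquery code.

```r
# Hypothetical helper: re-encode lines so that grep()/regexpr()/substr()
# accept them in a UTF-8 locale. Assumes Latin-1 input, which is a guess.
to_utf8 <- function(lines, from = "latin1") {
  # sub = "byte" replaces any non-convertible byte with its hex code
  # instead of turning the whole string into NA
  iconv(lines, from = from, to = "UTF-8", sub = "byte")
}

# Sample line containing a raw Latin-1 byte (0xE9, e-acute); it is built
# from raw bytes so the example does not depend on the script's encoding
raw_line <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))
fixed <- to_utf8(raw_line)

# Check that the converted string is valid UTF-8
# (validUTF8() requires a much newer R than 2.2.1)
validUTF8(fixed)
```

If the encoding guess is wrong, the conversion still succeeds but the
resulting characters will be wrong, so this only hides the error rather than
explaining it.  It also says nothing about the double free seen under the "C"
locale.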

Thanks,
Sean


