[BioC] Error with read.maimages

Wed Apr 7 18:20:25 CEST 2010

On 04/07/2010 08:33 AM, lapereir at uc.cl wrote:
> Hi!
> 
> Now I realize what Vincent was telling me (sorry!). I did it and it works!!!
> Thank you very much.
> But I don't have any idea what the problem was. If you would be so kind, could
> you send me a web page or tutorial that explains the problem? It doesn't
> matter if I have to read a lot.

Hi Luis -- glad that it works.

I'm not sure of a web page. The issue has to do with representing
different (human) languages. Remember that computers represent data as
'bytes'. At one point in computer history most programs expected an
ASCII character set, which maps for instance the letter A to the byte
with value 65, B to the byte with value 66, and so on. This is all well
and good if the (human) language needs no more than 256 characters (the
number of distinct byte values; ASCII itself defines only 128), but many
human languages need more. So various schemes have been developed to
encode larger character sets into groups of bytes. UTF-8, the encoding
used by the en_US.UTF-8 locale, is one such scheme.
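
To make the byte picture concrete, here is a small sketch in base R. It is
only an illustration: the variable names are made up, the example character
(n with tilde) is an arbitrary choice, and the conversion to latin1 assumes
that the iconv on your platform knows that encoding.

```r
# ASCII: one character, one byte. "A" is code point 65, stored as byte 0x41.
code_A  <- utf8ToInt("A")
bytes_A <- charToRaw("A")

# A character outside ASCII: n with tilde (U+00F1), built from its UTF-8
# bytes so the example does not depend on how this message is encoded.
n_tilde <- rawToChar(as.raw(c(0xc3, 0xb1)))
Encoding(n_tilde) <- "UTF-8"

# The same character needs two bytes in UTF-8 but one byte in latin1.
bytes_utf8   <- charToRaw(n_tilde)
bytes_latin1 <- charToRaw(iconv(n_tilde, from = "UTF-8", to = "latin1"))

print(code_A)
print(bytes_A)
print(bytes_utf8)
print(bytes_latin1)
```

The point is that the bytes on disk only become characters once an
encoding is chosen to interpret them.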

An impressive effort made R understand more than just ASCII, both when
reading files and within an R session. Your problem came up because
the R session expected one character encoding (UTF-8), whereas the file
was in another (simpler) encoding. Unfortunately, your file also
contained a byte sequence that is not valid in the encoding of your R
session.
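
A rough sketch of that situation in base R follows. I do not know which
bytes were actually in the .gpr file, so the label below (a latin1 micro
sign followed by "m") is purely hypothetical, as are the variable names.
validUTF8() needs R 3.3.0 or later.

```r
# Hypothetical two-byte label as a latin1 file would store it:
# 0xb5 is the micro sign in latin1, 0x6d is "m".
label_latin1 <- rawToChar(as.raw(c(0xb5, 0x6d)))

# A lone 0xb5 byte is not a valid UTF-8 sequence, which is the kind of
# input a UTF-8 session complains about.
is_valid_before <- validUTF8(label_latin1)

# Declaring the real source encoding and converting gives valid UTF-8.
label_utf8     <- iconv(label_latin1, from = "latin1", to = "UTF-8")
is_valid_after <- validUTF8(label_utf8)
bytes_after    <- charToRaw(label_utf8)

print(is_valid_before)
print(is_valid_after)
print(bytes_after)
```

Converting only helps when the source encoding is known; switching the
session to the "C" locale, as suggested in this thread, avoids the
question by treating the input as plain bytes.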

It is possible to guess at the encoding of the file, but the guess is
not always correct, and an incorrect guess can result in valid input of
incorrect data. So guessing is not the right strategy.
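
As an illustration of why a wrong guess can go unnoticed, the sketch below
decodes the same two bytes under two different assumptions. Both succeed
without any error, but they produce different text. The bytes and variable
names are invented for the example.

```r
# Two bytes as they might sit in a file.
raw_bytes <- as.raw(c(0xc3, 0xa9))

# Guess 1: the file is UTF-8. The two bytes form ONE character (U+00E9).
read_as_utf8 <- rawToChar(raw_bytes)
Encoding(read_as_utf8) <- "UTF-8"

# Guess 2: the file is latin1. Each byte is its own character, so the
# result is TWO characters (U+00C3 followed by U+00A9).
read_as_latin1 <- iconv(rawToChar(raw_bytes), from = "latin1", to = "UTF-8")

# Compare the decoded code points.
cp_utf8   <- utf8ToInt(read_as_utf8)
cp_latin1 <- utf8ToInt(read_as_latin1)

print(cp_utf8)
print(cp_latin1)
```

Neither call reports a problem, so nothing in the output warns you which
interpretation was the intended one.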

Perhaps one additional thing. The locale influences things other than
the interpretation of bytes as characters, and in particular the
'collation' (sort) order can be a problem for bioinformaticians. For
instance some locales (including en_US.UTF8) sort strings with '-' as
though it were not there, so DNA sequences with '-' representing missing
characters, or probe IDs with '-', sort differently in different locales.

> Sys.setlocale(locale="C")
> sort(c("A-C", "AAA", "AGT"))
[1] "A-C" "AAA" "AGT"
> Sys.setlocale(locale="en_US.UTF8")
> sort(c("A-C", "AAA", "AGT"))
[1] "AAA" "A-C" "AGT"

Apparently collation order can be quite exotic, including for instance
depending on the preceding character. This has had serious consequences
when people have assumed that the fourth probe in a sorted list, for
instance, is the same under different locales ('on my Windows machine'
versus 'on Linux', where really the difference is the default locale on
the different computers).

As a bioinformatician, it is likely that one deals almost exclusively in
'plain text', and a best practice might be to always have the locale set
to "C".
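
If changing the whole locale is not convenient, two narrower options are
sketched below, using only base R. The probe IDs and variable names are
made up. The first option relies on sort(method = "radix"), which is
documented to use C-locale collation for character vectors (R 3.3.0 or
later); the second changes only LC_COLLATE and restores it afterwards.

```r
ids <- c("AAA", "A-C", "AGT")

# Option 1: radix sort orders character vectors as in the C locale,
# whatever the session locale is.
sorted_radix <- sort(ids, method = "radix")

# Option 2: switch only the collation category to "C", sort, and then
# restore the previous setting.
old_collate <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
sorted_c <- sort(ids)
invisible(Sys.setlocale("LC_COLLATE", old_collate))

print(sorted_radix)
print(sorted_c)
```

Either way, '-' (byte 0x2d) sorts before the letters, so the order no
longer depends on the machine's default locale.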

Martin

> 
> Well, here is the result:
>> RG <- read.maimages(targets, source="genepix")
> Read GSM307461.gpr
> Read GSM307462.gpr
> Read GSM307464.gpr
> Read GSM307465.gpr
> Read GSM307466.gpr
> 
> Thanks for everything.
> Luis
> 
> Martin Morgan wrote:
>> Hi Luis --
>>
>> On 04/05/2010 09:10 AM, lapereir at uc.cl wrote:
>>> HI!!
>>>
>>> I am sorry, I read the posting guide but for some reason I missed that
>>> requirement. So here is the output of sessionInfo(), along with the
>>> traceback() output and Sys.getlocale().
>>>
>>>> Sys.getlocale()
>>> [1] "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8"
>>
>> I think the suggestion was to change this like
>>
>>> Sys.setlocale(locale="C")
>> [1]
>> "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
>>
>> I think this will work on your operating system, but it may be necessary
>> to start R in the correct locale.
>>
>> Martin
>>
>>
>>>> traceback()
>>> 6: gsub("\\.", "\\\\.", x)
>>> 5: protectMetachar(allcnames[i])
>>> 4: grep(protectMetachar(allcnames[i]), text.to.search)
>>> 3: read.columns(fullname, required.col, text.to.search, skip = skip,
>>>        sep = sep, quote = quote, stringsAsFactors = FALSE, fill = TRUE,
>>>        flush = TRUE, ...)
>>> 2: switch(source2, quantarray = {
>>>        firstfield <- scan(fullname, what = "", sep = "\t", flush = TRUE,
>>>            quiet = TRUE, blank.lines.skip = FALSE, multi.line = FALSE,
>>>            allowEscapes = FALSE)
>>>        skip <- grep("Begin Data", firstfield)
>>>        if (length(skip) == 0)
>>>            stop("Cannot find \"Begin Data\" in image output file")
>>>        nspots <- grep("End Data", firstfield) - skip - 2
>>>        obj <- read.columns(fullname, required.col, text.to.search,
>>>            skip = skip, sep = sep, quote = quote, stringsAsFactors = FALSE,
>>>            fill = TRUE, nrows = nspots, flush = TRUE, ...)
>>>    }, arrayvision = {
>>>        skip <- 1
>>>        cn <- scan(fullname, what = "", sep = sep, quote = quote,
>>>            skip = 1, nlines = 1, quiet = TRUE, allowEscape = FALSE)
>>>        fg <- grep(" Dens - ", cn)
>>>        if (length(fg) != 2)
>>>            stop(paste("Cannot find foreground columns in", fullname))
>>>        bg <- grep("^Bkgd$", cn)
>>>        if (length(bg) != 2)
>>>            stop(paste("Cannot find background columns in", fullname))
>>>        columns <- list(R = fg[1], Rb = bg[1], G = fg[2], Gb = bg[2])
>>>        obj <- read.columns(fullname, required.col, text.to.search,
>>>            skip = skip, sep = sep, quote = quote, stringsAsFactors = FALSE,
>>>            fill = TRUE, flush = TRUE, ...)
>>>        fg <- grep(" Dens - ", names(obj))
>>>        bg <- grep("^Bkgd$", names(obj))
>>>        columns <- list(R = fg[1], Rb = bg[1], G = fg[2], Gb = bg[2])
>>>        nspots <- nrow(obj)
>>>    }, bluefuse = {
>>>        skip <- readGenericHeader(fullname, columns = c(columns$G,
>>>            columns$R))$NHeaderRecords
>>>        obj <- read.columns(fullname, required.col, text.to.search,
>>>            skip = skip, sep = sep, quote = quote, stringsAsFactors = FALSE,
>>>            fill = TRUE, flush = TRUE, ...)
>>>        nspots <- nrow(obj)
>>>    }, genepix = {
>>>        h <- readGPRHeader(fullname)
>>>        if (verbose && source == "genepix.custom")
>>>            cat("Custom background:", h$Background, "\n")
>>>        skip <- h$NHeaderRecords
>>>        obj <- read.columns(fullname, required.col, text.to.search,
>>>            skip = skip, sep = sep, quote = quote, stringsAsFactors = FALSE,
>>>            fill = TRUE, flush = TRUE, ...)
>>>        nspots <- nrow(obj)
>>>    }, smd = {
>>>        skip <- readSMDHeader(fullname)$NHeaderRecords
>>>        obj <- read.columns(fullname, required.col, text.to.search,
>>>            skip = skip, sep = sep, quote = quote, stringsAsFactors = FALSE,
>>>            fill = TRUE, flush = TRUE, ...)
>>>        nspots <- nrow(obj)
>>>    }, {
>>>        skip <- readGenericHeader(fullname, columns = columns, sep = sep)$NHeaderRecords
>>>        obj <- read.columns(fullname, required.col, text.to.search,
>>>            skip = skip, sep = sep, quote = quote, stringsAsFactors = FALSE,
>>>            fill = TRUE, flush = TRUE, ...)
>>>        nspots <- nrow(obj)
>>>    })
>>> 1: read.maimages(targets, source = "genepix", wt.fun = f)
>>>
>>>> sessionInfo()
>>> R version 2.10.0 (2009-10-26)
>>> i386-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] limma_3.2.1
>>>
>>> Well, sorry for the length, but hopefully I can get an answer and if
>>> this not the
>>>
>>> Greets
>>>
>>>
>>>
>>> Vincent Carey wrote:
>>>> Please read the posting guide.  You did not provide the result of
>>>> sessionInfo().  You may be using an inconvenient locale.  Typically if the
>>>> following holds
>>>>
>>>>> Sys.getlocale()
>>>> [1] "C"
>>>>
>>>> you will not run into the error noted for this task.
>>>>
>>>> On Sun, Apr 4, 2010 at 9:31 PM, <lapereir at uc.cl> wrote:
>>>>
>>>>> Dear list
>>>>>
>>>>> I am getting a couple of errors when trying to import gpr files using the
>>>>> read.maimages function of limma.
>>>>>
>>>>>> targets<-readTargets("targets.txt")
>>>>>> RG <- read.maimages(targets, source="genepix", wt.fun=f)
>>>>> Error in gsub("\\.", "\\\\.", x) :
>>>>> input string 1 is invalid in this locale
>>>>>
>>>>> I searched the R help for the gsub function, but I cannot fix the error
>>>>> it gives me, so I can't import any GenePix (.gpr) files.
>>>>>
>>>>> Thanks
>>>>> Luis
>>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
> 
> 
> 
