[R] watch out for quotes in data files

Fri Jul 13 11:07:18 CEST 2001

I had exactly the same problem with some GenePix Results Data files. The
solution is to add an argument quote="" to read.table() and/or scan(). In
your case I believe you should use

  read.table(filename, sep = "\t", quote = "", header = TRUE)

instead. You don't have to modify the source files.
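
A minimal sketch of the effect, assuming a small made-up tab-delimited
file (the probe ids, descriptions and values below are invented for
illustration and are not taken from a GenePix or dChip file). It compares
the per-line field counts with the default quoting against those with
quoting switched off, then reads the file with quote = "":

```r
# Hypothetical five-line file; rows p2 and p4 each contain a lone apostrophe.
lines <- c(
  "id\tdescription\tvalue",
  "p1\tkinase, 5 prime end\t1.5",
  "p2\tkinase, 5' end\t2.5",
  "p3\tphosphatase\t3.5",
  "p4\tphosphatase, 3' end\t4.5"
)
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Default quoting: the ' on row p2 opens a quoted string that runs on
# until the ' on row p4, so the counts for those lines are not all 3.
fields_default <- count.fields(path, sep = "\t")

# Quoting disabled: every line is split on tabs only.
fields_noquote <- count.fields(path, sep = "\t", quote = "")

# Read the data with quoting disabled; apostrophes stay in the text.
dat <- read.table(path, sep = "\t", quote = "", header = TRUE,
                  stringsAsFactors = FALSE)
```

Comparing fields_default with fields_noquote should show which lines the
stray quotes affect.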

Henrik Bengtsson

>I have had similar problems with R 1.2.2. Every time a string contains a
>single quote ('), R reads up to a maximum of 8192 characters into the item,
>creating memory and parsing problems. I now make it a habit to remove all
>single quotes from the text or to replace them with double quotes.
>
>
>--
>Vele Samak, Vice President
>Global Quantitative Research
>Salomon Smith Barney
>7 WTC, New York, NY 10048, 212-783-7007
>>-----Original Message-----
>>From: Douglas Bates [mailto:bates at stat.wisc.edu]
>>Sent: Monday, July 09, 2001 11:53 PM
>>To: R-help at stat.math.ethz.ch
>>Subject: [R] watch out for quotes in data files
>>
>>I have just spent a day trying to determine why I seemed to be unable
>>to read a file of microarray expression results into R properly.  The
>>file was produced by the Dchip software developed by Li and Wong at
>>Harvard's Department of Biostatistics.  It contains rows of
>>tab-delimited fields in the order
>>  Probe set identifier
>>  Probe set description
>>  Array 1 expression
>>  Array 1 call
>>  Array 2 expression
>>  Array 2 call
>>  ...
>>plus an extra tab (which I think is due to a programming glitch).
>>There are 7130 rows, including the column headers, for results from
>>Affymetrix Hu6800 chips.
>>
>>When I read this file using read.table(filename, sep = "\t", head = TRUE)
>>I got only 3720 rows.  Furthermore count.fields(filename, sep = "\t")
>>gave a result of length 7130 but several of the rows were reported as
>>having only two fields instead of 15 like the other rows.
>>
>>It seemed to me that the important characteristic of these rows was
>>their having a very long "Probe set description" and I wasted quite a
>>bit of time looking for possible buffer overflows that might be
>>triggered by this.
>>
>>When I finally came to my senses and created a much smaller input file
>>that only contained a few rows, including one that was giving an
>>aberrant field count, I could directly examine the results of scan()
>>applied to it.  I noticed that the second field for the aberrant line
>>contained all the subsequent lines and then I saw that its description
>>included "5'" (as in the 5' end of the sequence versus the 3' end).
>>Other descriptions had this written as "5 prime" but this one used
>>"5'".  What was happening was that everything from there to the next
>>"'" character in the file was being included as part of that
>>description.
>>
>>I could read the file properly by adding the optional argument quote =
>>"" to the call to read.table.
>>
>>The moral of the story is to watch out for molecular biologists who
>>use unpaired quote characters in their descriptions.
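
One way to look for such lines before reading a file is to count the
quote characters on each line and flag the lines where the count is odd.
The following is only a sketch: find_unpaired_quotes() is a hypothetical
helper, not a function from R or any package, and the demo file is made
up.

```r
# Hypothetical helper: return the numbers of the lines that contain an
# odd number of occurrences of quote_char.
find_unpaired_quotes <- function(path, quote_char = "'") {
  lines <- readLines(path)
  hits <- gregexpr(quote_char, lines, fixed = TRUE)
  # gregexpr() returns -1 for "no match", so count only positive positions.
  n_quotes <- vapply(hits, function(h) sum(h > 0), integer(1))
  which(n_quotes %% 2 == 1)
}

# Made-up demo file: only the third line has an unpaired apostrophe.
demo <- tempfile(fileext = ".txt")
writeLines(c("id\tdescription", "p1\t5 prime end", "p2\t5' end"), demo)
bad <- find_unpaired_quotes(demo)
```

A line flagged this way is not necessarily wrong, but it is a candidate
for the kind of runaway field described above.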
