[R] Unexpected behaviour from read.table

peter dalgaard pdalgd at gmail.com
Mon Feb 5 10:57:06 CET 2018


This looks like a bug. Specifically, inside read.table

    lines <- .External(C_readtablehead, file, nlines, comment.char, 
        blank.lines.skip, quote, sep, skipNul)

returns "lines" as

[1] "ID\tValue"                         "=\"Total\"\t1000"                 
[3] "=\"CJ01   \"\t550\n=\"CF02\"\t450"

Notice the embedded \n in the 3rd element: it really holds two lines, so there are 4 lines of input in 3 elements. These get pushed back twice, and the first 3 (not 4) lines get read again as part of the header logic. Then, when it comes to reading the data proper, the 4th line has ended up duplicated as the top row...
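
A quick way to see the mismatch, as a minimal sketch using only base R (the names txt and physical are just for illustration): readLines() does no quote processing, so it counts the physical lines in the input, which can be compared with the 3 elements shown above.

```r
# Same input as in the report, with trailing spaces inside the quoted "CJ01   ".
txt <- "ID\tValue\n=\"Total\"\t1000\n=\"CJ01   \"\t550\n=\"CF02\"\t450"

# readLines() splits on newlines only and ignores quotes entirely.
con <- textConnection(txt)
physical <- readLines(con)
close(con)

# Number of physical lines in the input.
length(physical)
```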

As you suggest, it seems that something is up with the quote matching logic.
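
In the meantime, the workaround Michael describes (turning off quote handling and extracting the string content afterwards) can be sketched roughly as follows. This assumes the wrapper always has the form ="..." and that trimming the trailing spaces is acceptable; the names txt and raw are made up for illustration.

```r
# Same input as in the report, with trailing spaces inside the quoted "CJ01   ".
txt <- "ID\tValue\n=\"Total\"\t1000\n=\"CJ01   \"\t550\n=\"CF02\"\t450"

# quote = "" disables quote processing, so each physical line stays one record.
raw <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                  stringsAsFactors = FALSE)

# Remove the leading =" and the trailing " by hand, then trim whitespace.
raw$ID <- trimws(gsub('^="|"$', "", raw$ID))

raw
```

With quoting disabled, the header row is taken from the first line of the file as intended.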

-pd


> On 4 Feb 2018, at 23:45 , Michael <michael77allen at gmail.com> wrote:
> 
> I’ve been struggling with seemingly ‘corrupt’ data.frames for a few days, and believe I’ve narrowed the problem down to some odd behaviour from read.table
> 
> I receive a tab-delimited file from an external provider where strings are encoded as ="content". Not sure why, perhaps as most users open it in Excel. 
> My specific issue is that trailing spaces in any of the strings are causing strange results from read.table.
> 
> # No trailing spaces
> read.table(text="ID\tValue\n=\"Total\"\t1000\n=\"CJ01\"\t550\n=\"CF02\"\t450",header=FALSE,sep='\t')
>      V1    V2
> 1     ID Value
> 2 =Total  1000
> 3  =CJ01   550
> 4  =CF02   450
> 
> # Now with trailing spaces in line 3
> read.table(text="ID\tValue\n=\"Total\"\t1000\n=\"CJ01   \"\t550\n=\"CF02\"\t450",header=FALSE,sep='\t')
>        V1    V2
> 1    =CF02   450
> 2       ID Value
> 3   =Total  1000
> 4 =CJ01      550
> 5    =CF02   450
> 
> I solved my specific problem by setting quote='', and extracting the string content after calling read.table. As my original code had header=TRUE, I was finding random rows were being used as column names! 
> 
> Flagging a potential issue with read.table, although I can easily accept I'm missing something obvious here. 
> 
> Best,
> Michael
> 
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-apple-darwin15.6.0 (64-bit)  / x86_64-pc-linux-gnu (64-bit)
> Running under: macOS High Sierra 10.13.2 /  Ubuntu 16.04.3 LTS
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


