[R] count.fields inconsistent with read.table?

Sam Steingold sds at gnu.org
Fri Feb 24 16:51:39 CET 2012


> * peter dalgaard <cqnytq at tznvy.pbz> [2012-02-24 08:41:07 +0100]:
> On Feb 24, 2012, at 06:58 , Sam Steingold wrote:
>
>> batch is a vector of lines returned by readLines from a
>> NL-line-terminated file, here is the relevant section:
>> =========================================================
>> AA	BB	CC	DD			EE	FF
>> GG	H
>> 
>> H	JJ	KK			LL	MM
>> =========================================================
>> as you can see, a line is corrupt; two CRLF's are inserted.
>
> Actually, I don't see... (It's pretty hard to count TAB characters by eye.)

how about this?
>> =========================================================
>> AA^IBB^ICC^IDD^I^I^IEE^IFF
>> GG^IH^M
>> ^M
>> H^IJJ^IKK^I^I^ILL^IMM
>> =========================================================

I replaced TAB with ^I and CR with ^M.
is this better?

here I use <TAB> and <CR> instead:
>> =========================================================
>> AA<TAB>BB<TAB>CC<TAB>DD<TAB><TAB><TAB>EE<TAB>FF
>> GG<TAB>H<CR>
>> <CR>
>> H<TAB>JJ<TAB>KK<TAB><TAB><TAB>LL<TAB>MM
>> =========================================================

so, you see, there are two data lines here:
A..F is good, with 8 fields.
G..M is BAD: two CRLFs were inserted inside the 2nd field, turning one
line into 3 lines.
so I must drop all 3 of those lines from the input.
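
For anyone who wants to reproduce the <TAB>/<CR> rendering in R itself,
here is a minimal sketch. show_ctrl is a hypothetical helper name (it is
not part of base R), and the sample lines are made up to resemble the
excerpt above.

```r
# Hypothetical helper: make TAB and CR characters visible in a character vector.
show_ctrl <- function(x) {
  x <- gsub("\t", "<TAB>", x, fixed = TRUE)
  gsub("\r", "<CR>", x, fixed = TRUE)
}

# Made-up sample lines modelled on the corrupt record.
sample_lines <- c("GG\tH\r", "\r", "H\tJJ\tKK\t\t\tLL\tMM")
writeLines(show_ctrl(sample_lines))
```

With the control characters spelled out, counting the fields by eye
becomes feasible.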

>> This is okay, I drop the bad lines, at least I hope I do:
>> 
>>  conn <- textConnection(batch)
>>  field.counts <- count.fields(conn, sep="\t", comment.char="", quote="")
>>  close(conn)
>>  good <- field.counts == 8  # this should drop all bad lines
>>  if (!all(good))
>>    batch <- batch[good]
>>  conn <- textConnection(batch)
>>  ret <- read.table(conn, sep="\t", comment.char="", quote="")
>>  close(conn)
>> 
>> I get this error in read.table():
>> 
>> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
>>  line 7151 did not have 8 elements
>> 
>> how come?!
>
> You can do better than this in terms of providing clues for us:
> "batch" is a character vector, right? So recheck that count.fields
> returns all 8's after removal of bad lines. Also check that dimensions
> match -- is length(batch) actually the same as length(field.counts)?

 batch <- lines[807000:808000]
 conn <- textConnection(batch)
 field.counts <- count.fields(conn, sep="\t", comment.char="", quote="")
 close(conn)
 good <- field.counts == length(col.names)
 which(!good)
[1] 152 153

## WRONG: it should be 3 lines, 154 is also bad - see above

 batch[!good]
[1] "GG\tH" ""                     
 length(batch)
[1] 1001
 length(good)
[1] 1000

## WRONG: batch, field.counts and good should have the same length
 
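AHA! blank.lines.skip !!!
count.fields() skips blank lines by default (blank.lines.skip=TRUE), so
field.counts has one element fewer than batch, and the indices of
everything after the blank line are shifted by one.
I must set it to FALSE, and that does fix the problem.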
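
A minimal sketch of the fix on a toy batch, assuming 8 fields per good
line as in the excerpt above. The toy data and the variable names
fc.default, fc.fixed and n.fields are made up for illustration; the
is.na() guard is only a precaution.

```r
# Toy batch: good line, the 3 pieces of the corrupt line, another good line.
batch <- c(
  "AA\tBB\tCC\tDD\t\t\tEE\tFF",
  "GG\tH",
  "",
  "H\tJJ\tKK\t\t\tLL\tMM",
  "NN\tOO\tPP\tQQ\t\t\tRR\tSS"
)
n.fields <- 8  # assumed number of fields in a good line

# Default: the blank line is skipped, so the result is shorter than batch.
conn <- textConnection(batch)
fc.default <- count.fields(conn, sep = "\t", comment.char = "", quote = "")
close(conn)

# Fix: keep blank lines so that every element of batch gets a count.
conn <- textConnection(batch)
fc.fixed <- count.fields(conn, sep = "\t", comment.char = "", quote = "",
                         blank.lines.skip = FALSE)
close(conn)

stopifnot(length(fc.fixed) == length(batch))
good <- !is.na(fc.fixed) & fc.fixed == n.fields

# Only the good lines reach read.table() now.
conn <- textConnection(batch[good])
ret <- read.table(conn, sep = "\t", comment.char = "", quote = "")
close(conn)
```

Checking length(field.counts) == length(batch) before subsetting would
have caught the mismatch right away.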

> Finally, what is in line 7151?

that's the first line with a <CR>:

GG<TAB>H<CR>


>> also, is there some error recovery?
>
> Well you can try().

it appears that try gives me access to the error message, not the
erroneous data, i.e., I still have to reload the file to get the batch
string vector.
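
One possible way around that, as a sketch only: wrap the parsing in a
function that hands the input back together with the error message, so
the batch does not have to be re-read. parse_batch and the layout of its
return value are hypothetical; tryCatch(), conditionMessage() and
read.table() are base R.

```r
# Hypothetical wrapper: parse a batch of lines, and on failure return the
# error message together with the original lines instead of stopping.
parse_batch <- function(batch) {
  conn <- textConnection(batch)
  on.exit(close(conn))
  tryCatch(
    list(data = read.table(conn, sep = "\t", comment.char = "", quote = ""),
         error = NULL, batch = batch),
    error = function(e) {
      list(data = NULL, error = conditionMessage(e), batch = batch)
    }
  )
}

# Toy input with a short second line, which makes read.table() fail.
bad <- c("AA\tBB\tCC\tDD\t\t\tEE\tFF", "GG\tH", "", "H\tJJ\tKK\t\t\tLL\tMM")
res <- parse_batch(bad)
if (is.null(res$data)) {
  message("parse failed: ", res$error)
  # res$batch still holds the offending lines for inspection or filtering
}
```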


-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.childpsy.net/ http://www.memritv.org http://americancensorship.org
http://memri.org http://jihadwatch.org http://dhimmi.com http://iris.org.il
Democracy is like a car: you can ride it or you can run people over with it.


