[R] strange behavior when reading csv - line wraps

Sat May 30 22:15:20 CEST 2009

In a private correspondence with Martin Tomko, I think the reason
for the problem has been found.

The numbers of ";"-separated fields in the 82 successive lines of
his file are as follows:

  01:26   02:26   03:33   04:33   05:12   06:12   07:12   08:12,
  09:19   10:19   11:17   12:17   13:23   14:23   15:23   16:23,
  17:23   18:23   19:23   20:23   21:23   22:23   23:23   24:23,
  25:23   26:23   27:23   28:23   29:23   30:23   31:23   32:23,
  33:23   34:23   35:23   36:23   37:23   38:23   39:23   40:23,
  41:23   42:23   43:23   44:23   45:23   46:23   47:23   48:23,
  49:23   50:23   51:23   52:23   53:23   54:23   55:23   56:23,
  57:23   58:23   59:23   60:23   61:34   62:34   63:34   64:34,
  65:13   66:13   67:38   68:38   69:20   70:20   71:44   72:20,
  73:19   74:19   75:20   76:44   77:20   78:19   79:19   80:20,
  81:25   82:25

So in the first 5 lines there is a maximum of 33 fields. Hence, since
there is no header line, read.csv() decides to allocate 33 columns.
(See ?read.csv).

There are the following distinct numbers of fields in the lines:

  12 13 17 19 20 23 25 26 33 34 38 44

so there are lines with 34, 38 and 44 fields. All lines in the CSV
file end with ";", hence there is an implicit blank field at the
end of each line. The lines with 34 fields have the 34th field blank,
so after the break there is presumably a "quasi blank input line"
where the 34th (blank) field has spilled over. Such input will be
ignored with the default "blank.lines.skip = TRUE" option to read.csv().
The longer lines (2 with 38 fields, 2 with 44) will be split after
the 33rd field, the remainder being taken as an additional input
line. As a result, there are 86 (= 82+4) rows in the resulting
dataframe.
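
The splitting can be reproduced on a toy file. What follows is only
a sketch under assumed data (a temporary file with five 2-field lines
followed by one 4-field line; the names 'tmp' and 'wrapped' are made
up for the illustration), not Martin's file:

```r
# Toy input (an assumption, not Martin's data): five lines with
# 2 fields each, then one line with 4 fields.
tmp <- tempfile(fileext = ".csv")
writeLines(c(rep("a;b", 5), "c;d;e;f"), tmp)

# The column count is taken from the first five lines, so only
# 2 columns are allocated. With fill = TRUE the 4-field line is not
# rejected; its surplus fields are carried over into an extra row.
wrapped <- read.csv(tmp, header = FALSE, sep = ";", fill = TRUE)

dim(wrapped)
wrapped
```

Here the six input lines should come out as seven rows in two
columns, the last input line having been split after its 2nd field.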

This explanation is compatible with what Martin has observed.
The underlying forensic details were sniffed out with a couple
of passes through 'awk' scripts.
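
For those who would rather stay inside R, count.fields() gives much
the same information as the awk passes. The following is a sketch on
a small made-up file (the name 'tmp' and its contents are assumptions
for illustration), not the scripts actually used:

```r
# Toy input standing in for the real file (an assumption).
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b;c", "a;b", "a;b;c;d;e", "a;b", "a;b;c"), tmp)

# Number of ";"-separated fields on each line. Quoting and comment
# characters are switched off so that stray quotes or "#" in the
# data cannot change the counts.
nf <- count.fields(tmp, sep = ";", quote = "", comment.char = "")

nf                # field count for each line, in file order
sort(unique(nf))  # the distinct field counts
max(nf)           # the widest line
table(nf)         # how many lines have each count
```

It would be worth checking on a sample line how a trailing ";" is
counted, since that may not agree exactly with the awk figures above.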

One solution is to call read.csv() with option "col.names=Xnn"
where Xnn is a constructed character vector with elements such
as "X01" "X02" ... "X44" (once one has determined, as above, that
there is a maximum of 44 fields per line in the file).
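
A minimal sketch of this, again on a made-up file (the names 'tmp',
'nmax' and 'fixed' are mine, and the maximum is found here with
count.fields() rather than awk):

```r
# Toy input (an assumption): five 2-field lines, then a 4-field line.
tmp <- tempfile(fileext = ".csv")
writeLines(c(rep("a;b", 5), "c;d;e;f"), tmp)

# Maximum number of fields over the WHOLE file, not just the
# first five lines.
nmax <- max(count.fields(tmp, sep = ";", quote = "", comment.char = ""))

# Constructed names "X01", "X02", ... up to the maximum.
Xnn <- sprintf("X%02d", seq_len(nmax))

# Supplying col.names makes read.csv() allocate nmax columns, so
# the long line no longer spills over into an extra row.
fixed <- read.csv(tmp, header = FALSE, sep = ";", fill = TRUE,
                  col.names = Xnn)

dim(fixed)
names(fixed)
```

Since nothing here depends on the particular file, the same few lines
could be wrapped in a function and applied to each of the 500 files
in turn.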

Ted.

On 30-May-09 19:43:47, jim holtman wrote:
> It is still not clear to me exactly how you want to read the lines in.
> If the lines have a variable number of fields, and some of the lines
> might be wrapped, is there some way to determine where the start of
> each line is?
> 
> If you are reading them in with read.csv, then the system is assuming
> that each line starts a new row.  If this is not the case, then you
> will have to state the rules that determine where the lines start.
> You can always read the data in with 'scan' to separate each line and
> then do whatever processing is required to put together the rows in a
> data frame that you want.
> 
> In one of your examples, you indicated that the line was split
> starting at the word "kempten"; if this is in the middle of the line,
> then you would have to create the break after reading the line in
> with 'scan' and then create the rows in the data frame.  All of this
> can be done in R if you can state what the criteria are.
> 
> On Sat, May 30, 2009 at 4:32 AM, Martin Tomko
> <martin.tomko at geo.uzh.ch> wrote:
> 
>> Jim,
>> the two lines I put in are the actual problematic input lines.
>> In these examples, there are no quotes nor # signs, although I have no
>> means to make sure they do not occur in the inputs (any hints how I
>> could deal with that?).
>> I am trying to avoid as much pre-processing outside R as possible, and
>> I have to process about 500 files with up to 3000 records each, so I
>> need a more or less automated/batch solution; any string substitution
>> will have to occur in R. But for the moment, I do not see a reason for
>> substitution, and the wrapping still occurs.
>>
>> Cheers
>> Martin
>>
>>
>>
>> jim holtman wrote:
>>
>>> You need to supply the actual input line so we can see what is
>>> happening.  Are you sure you do not have unbalanced quotes in your
>>> input (try quote='') or do you have comment characters ("#") in your
>>> input?
>>>
>>>  On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
>>>  <martin.tomko at geo.uzh.ch> wrote:
>>>
>>>    Dear All,
>>>    I am observing a strange behavior and searching the archives and
>>>    help pages didn't help much.
>>>    I have a csv with a variable number of fields in each line.
>>>
>>>    I use
>>>    dataPoints <- read.csv(inputFile, head=FALSE, sep=";",fill =TRUE);
>>>
>>>    to read it in, and it works. But some lines are long and 'wrap',
>>>    or split and continue on the next line. So when I check the dim of
>>>    the frame, it is not correct, and I can see when I do a printout
>>>    that the line is split into two in the frame. I checked the input
>>>    file and all is good.
>>>
>>>    an example of the input is:
>>>    37;2175168475;13;8.522729;47.19537;16366682 at N00
>>> ;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switz
>>> erland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;touris
>>> mus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitri
>>> otnet;
>>>
>>>    where the last value occurs on the next line in the data frame.
>>>
>>>    It does not have to be the last value, as in the following example,
>>>    where the word "kempten" starts the next line:
>>>    39;167757703;12;10.309295;47.724545;21903142 at N00
>>> ;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavar
>>> ia;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss
>>> ;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kup
>>> pel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europ
>>> aeischeunion;germanio;
>>>
>>>    What could be the reason?
>>>
>>>    I was thinking about solving the issue by using a different
>>>    separator, that I would use for the first 7 fields and
>>>    concatenating all of the remaining values into a single string
>>>    value, but could not figure out how to do such a substitution in
>>>    R. Unfortunately, on my system I cannot specify a range for sed...
>>>
>>>    Thanks for any help/pointers
>>>    Martin
>>>
>>>    ______________________________________________
>>>    R-help at r-project.org mailing list
>>>    https://stat.ethz.ch/mailman/listinfo/r-help
>>>    PLEASE do read the posting guide
>>>    http://www.R-project.org/posting-guide.html
>>>    and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>>
>>
>>

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 30-May-09                                       Time: 21:15:13
------------------------------ XFMail ------------------------------