[R] strange behavior when reading csv - line wraps
(Ted Harding)
Ted.Harding at manchester.ac.uk
Sat May 30 22:15:20 CEST 2009
In a private correspondence with Martin Tomko, I think the reason
for the problem has been found.
The numbers of ";"-separated fields in the 82 successive lines of
his file are as follows:
01:26 02:26 03:33 04:33 05:12 06:12 07:12 08:12,
09:19 10:19 11:17 12:17 13:23 14:23 15:23 16:23,
17:23 18:23 19:23 20:23 21:23 22:23 23:23 24:23,
25:23 26:23 27:23 28:23 29:23 30:23 31:23 32:23,
33:23 34:23 35:23 36:23 37:23 38:23 39:23 40:23,
41:23 42:23 43:23 44:23 45:23 46:23 47:23 48:23,
49:23 50:23 51:23 52:23 53:23 54:23 55:23 56:23,
57:23 58:23 59:23 60:23 61:34 62:34 63:34 64:34,
65:13 66:13 67:38 68:38 69:20 70:20 71:44 72:20,
73:19 74:19 75:20 76:44 77:20 78:19 79:19 80:20,
81:25 82:25
So in the first 5 lines there is a maximum of 33 fields. Hence, since
there is no header line, read.csv() decides to allocate 33 columns.
(See ?read.csv).
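
The per-line field counts above came from awk, but they can also be
obtained inside R. The following is only a sketch on a toy file (the
data and the variable names are made up for illustration, not taken
from Martin's file), using count.fields() from the utils package:

```r
# Toy stand-in for the real file (hypothetical data, not Martin's).
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b;c", "a;b", "a;b;c;d", "a", "a;b", "a;b;c;d;e;f"), tmp)

# Number of ";"-separated fields on each line. Quote and comment
# handling are switched off so that stray quotes or "#" characters
# cannot change the counts.
n_fields <- count.fields(tmp, sep = ";", quote = "", comment.char = "")

# Largest count among the first five lines, the whole-file maximum,
# and the distinct counts that occur.
first5_max <- max(head(n_fields, 5))
file_max   <- max(n_fields)
distinct   <- sort(unique(n_fields))
```

If file_max exceeds first5_max, as it does in this toy file, some
lines have more fields than read.csv() will allocate columns for.
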
There are the following distinct numbers of fields in the lines:
12 13 17 19 20 23 25 26 33 34 38 44
so there are lines with 34, 38 and 44 fields. All lines in the CSV
file end with ";", hence there is an implicit blank field at the
end of each line. The lines with 34 fields have the 34th field blank,
so after the break there is presumably a "quasi blank input line"
where the 34th (blank) field has spilled over. Such input will be
ignored with the default "blank.lines.skip = TRUE" option to read.csv().
The longer lines (2 with 38 fields, 2 with 44) will be split after
the 33rd field, the remainder being taken as an additional input
line. As a result, there are 86 (= 82+4) rows in the resulting
dataframe.
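
The splitting can be reproduced on a small scale. This is a sketch
with made-up data, assuming the same kind of call as Martin's
(header = FALSE, sep = ";", fill = TRUE): the first five lines have
at most 3 fields, and the sixth line has 5.

```r
# Toy file (hypothetical data): six input lines, the last one longer
# than any of the first five.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b;c", "a;b", "a;b;c", "a", "a;b", "p;q;r;s;t"), tmp)

# Columns are allocated from the first five lines only, so the
# fields "s" and "t" of the sixth line no longer fit.
d <- read.csv(tmp, header = FALSE, sep = ";", fill = TRUE)

dim(d)     # compare the row count with the six lines in the file
d[6:7, ]   # the long line and its overflow
```

Here the data frame should come out with 3 columns and 7 rows
rather than 6, the overflow fields "s" and "t" forming a row of
their own.
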
This explanation is compatible with what Martin has observed.
The underlying forensic details were sniffed out with a couple
of passes through 'awk' scripts.
One solution is to call read.csv() with option "col.names=Xnn"
where Xnn is a constructed character vector with elements such
as "X01" "X02" ... "X44" (once one has determined, as above, that
there is a maximum of 44 fields per line in the file).
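
A sketch of that approach on a toy file (the data are made up; Xnn
is the name used above, n_max is a hypothetical helper variable).
The maximum is computed with count.fields() rather than typed in:

```r
# Toy file (hypothetical data): the last line is longer than any of
# the first five.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b;c", "a;b", "a;b;c", "a", "a;b", "p;q;r;s;t"), tmp)

# Maximum number of fields on any line of the file.
n_max <- max(count.fields(tmp, sep = ";", quote = "", comment.char = ""))

# Column names "X01", "X02", ... up to the maximum.
Xnn <- sprintf("X%02d", seq_len(n_max))

# With col.names supplied, read.csv() allocates that many columns,
# so no line overflows into an extra row.
d <- read.csv(tmp, header = FALSE, sep = ";", fill = TRUE,
              col.names = Xnn)
dim(d)
```

For a batch of files, the same three steps could be wrapped in a
function and applied to each file name in turn.
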
Ted.
On 30-May-09 19:43:47, jim holtman wrote:
> It is still not clear to me exactly how you want to read the lines in.
> If the lines have a variable number of fields, and some of the lines
> might be wrapped, is there some way to determine where the start of
> each line is?
>
> If you are reading them in with read.csv, then the system is assuming
> that each line starts a new row. If this is not the case, then you
> will have to state the rules that determine where the lines start.
> You can always read the data in with 'scan' to separate each line and
> then do whatever processing is required to put together the rows in
> the data frame that you want.
>
> In one of your examples, you indicated that the line was split
> starting at the word "kempten"; if this is in the middle of the line,
> then you would have to create the break after reading the line in
> with 'scan' and then create the rows in the data frame. All of this
> can be done in R if you can state what the criteria are.
>
> On Sat, May 30, 2009 at 4:32 AM, Martin Tomko
> <martin.tomko at geo.uzh.ch> wrote:
>
>> Jim,
>> the two lines I put in are the actual problematic input lines.
>> In these examples, there are no quotes nor # signs, although I have
>> no means to make sure they do not occur in the inputs (any hints on
>> how I could deal with that?).
>> I am trying to avoid as much pre-processing outside R as possible,
>> and I have to process about 500 files with up to 3000 records each,
>> so I need a more or less automated/batch solution, so any string
>> substitution will have to occur in R. But for the moment, I do not
>> see a reason for substitution, and the wrapping still occurs.
>>
>> Cheers
>> Martin
>>
>>
>>
>> jim holtman wrote:
>>
>>> You need to supply the actual input line so we can see what is
>>> happening. Are you sure you do not have unbalanced quotes in your
>>> input (try quote='') or do you have comment characters ("#") in
>>> your input?
>>>
>>> On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
>>> <martin.tomko at geo.uzh.ch> wrote:
>>>
>>> Dear All,
>>> I am observing a strange behavior and searching the archives and
>>> help pages didn't help much.
>>> I have a csv with a variable number of fields in each line.
>>>
>>> I use
>>> dataPoints <- read.csv(inputFile, head=FALSE, sep=";",fill =TRUE);
>>>
>>> to read it in, and it works. But some lines are long and 'wrap',
>>> or split and continue on the next line. So when I check the dim of
>>> the frame, it is not correct, and I can see when I do a printout
>>> that the line is split into two in the frame. I checked the input
>>> file and all is good.
>>>
>>> an example of the input is:
>>> 37;2175168475;13;8.522729;47.19537;16366682 at N00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet;
>>>
>>> where the last value occurs on the next line in the data frame.
>>>
>>> It does not have to be the last value; in the following example,
>>> the word "kempten" starts the next line:
>>> 39;167757703;12;10.309295;47.724545;21903142 at N00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio;
>>>
>>> What could be the reason?
>>>
>>> I was thinking about solving the issue by using a different
>>> separator, which I would use for the first 7 fields, and
>>> concatenating all of the remaining values into a single string
>>> value, but could not figure out how to do such a substitution in
>>> R. Unfortunately, on my system I cannot specify a range for sed...
>>>
>>> Thanks for any help/pointers
>>> Martin
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>>
>>
>>
>
>
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 30-May-09 Time: 21:15:13
------------------------------ XFMail ------------------------------