[R] strange behavior when reading csv - line wraps

(Ted Harding) Ted.Harding at manchester.ac.uk
Sun May 31 17:24:32 CEST 2009


Ah!!! It was count.fields() which we had overlooked! We discovered
a work-round which involved using 

  Data0 <- readLines(file)

to create a vector of strings, one for each line of the input file,
and then using

  NF <- unlist(lapply(Data0,function(x)
        length(unlist(gregexpr(";",x,fixed=TRUE,useBytes=TRUE)))))

to count the number of occurrences of ";" (the separator) in each line.
(NF+1) produces the same result as count.fields(file,sep=";"). 
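
As a quick check of the equivalence described above, here is a minimal
sketch on a small made-up file. The sample lines and the names tmp, lines,
n_sep and n_fields are assumptions for illustration, not from the thread.
It counts separators via regmatches() rather than taking the length of the
gregexpr() result directly, because gregexpr() returns -1 for a line with
no ";" at all, and that -1 would be counted as one separator.

```r
# Hypothetical sample data written to a temporary file (not from the thread).
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;a;b", "2;c", "3;d;e;f;g", "4"), tmp)

lines <- readLines(tmp)

# Count ";" per line. regmatches() yields character(0) for a line with no
# match, so a separator-free line is counted as 0 separators.
n_sep <- lengths(regmatches(lines, gregexpr(";", lines, fixed = TRUE)))

# Fields per line according to count.fields(); quote and comment handling
# are switched off so that only ";" matters.
n_fields <- count.fields(tmp, sep = ";", quote = "", comment.char = "")

print(n_sep + 1L)
print(n_fields)
```

count.fields() is called here with quote="" and comment.char="" because,
with its defaults, quote characters and "#" in the data are interpreted
and can change the count.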

Thanks for pointing out the existence of count.fields()!
Ted.

On 31-May-09 15:04:23, jim holtman wrote:
> You can do something like this: count the number of fields in each line
> of the file and use the max to determine the number of columns for
> read.table:
> 
> file <- '/tempxx.txt'
> maxFields <- max(count.fields(file))  # max
> # now set up read.table for the max number of fields
> input <- read.table(file, colClasses=rep(NA, maxFields), fill=TRUE,
>     col.names=paste("V", seq(maxFields), sep=''))
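
To illustrate this suggestion on ";"-separated data, the following sketch
uses a made-up sample file (tmp, wrapped and full are hypothetical names).
It first reads the file with the defaults, where read.table() sizes the
table from the first five lines, and then with column names generated from
the widest line. Unlike the snippet above, it passes sep=";" to
count.fields() and read.table(), which is an assumption about the data.

```r
# Hypothetical sample: five short lines followed by one long line.
tmp <- tempfile(fileext = ".csv")
writeLines(c(rep("1;a", 5), "2;b;c;d"), tmp)

# Default behaviour: the column count is taken from the first five lines,
# so the extra fields of the long sixth line spill into a further row.
wrapped <- read.table(tmp, sep = ";", fill = TRUE, stringsAsFactors = FALSE)

# Widest line in the whole file, counted with the same separator.
maxFields <- max(count.fields(tmp, sep = ";"))

# Supplying that many column names makes read.table() allocate enough columns.
full <- read.table(tmp, sep = ";", fill = TRUE, stringsAsFactors = FALSE,
                   col.names = paste0("V", seq_len(maxFields)))

print(dim(wrapped))
print(dim(full))
```

Comparing the two dim() results shows whether any line was wrapped: the
second read should have exactly one row per input line.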
> 
> 
> On Sun, May 31, 2009 at 6:06 AM, Martin Tomko
> <martin.tomko at geo.uzh.ch> wrote:
> 
>> Dear Jim,
>> with the help of Ted, we diagnosed that the cause is the extreme
>> variability in line length during reading in. As the table column
>> number is apparently determined from the first five lines, whatever
>> exceeds this length is automatically put on the next line.
>> I am now trying to find a way to read in the data despite this. I have
>> no control over the table extent; the only thing that would make sense
>> for my data would be to read in a fixed number of columns and merge
>> all remaining columns into one long string in the last one. No idea how
>> to do this, though.
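
One possible way to read a fixed number of columns and merge the rest, as
described above, is to split each line by hand. This is only a sketch under
assumptions: the sample lines are made up in the thread's layout, the cut
at 7 fields is taken from the original question, and n_fixed, parts, fixed,
rest and the column name tags are hypothetical names.

```r
# Hypothetical sample lines (";"-separated, with a trailing ";").
tmp <- tempfile(fileext = ".csv")
writeLines(c("37;2175168475;13;8.52;47.19;id1;30;sculpture;bird;statue;",
             "39;167757703;12;10.30;47.72;id2;36;white;"), tmp)

n_fixed <- 7  # assumed number of leading fields kept as separate columns

lines <- readLines(tmp)
parts <- strsplit(lines, ";", fixed = TRUE)

# First n_fixed fields of each line, padded with NA when a line is shorter.
fixed <- t(vapply(parts, function(p) p[seq_len(n_fixed)], character(n_fixed)))

# Everything after the fixed fields, glued back together with ";".
rest <- vapply(parts, function(p) paste(p[-seq_len(n_fixed)], collapse = ";"),
               character(1))

dataPoints <- data.frame(fixed, tags = rest, stringsAsFactors = FALSE)

# Convert the numeric-looking leading columns from character to numbers.
dataPoints[seq_len(n_fixed)] <- lapply(dataPoints[seq_len(n_fixed)],
                                       type.convert, as.is = TRUE)
print(dataPoints)
```

Because the lines are split with strsplit() rather than parsed by
read.table(), quotes and "#" characters in the data are not interpreted.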
>>
>> Thanks
>> Martin
>>
>>
>> jim holtman wrote:
>>
>>> It is still not clear to me exactly how you want to read the lines in.
>>> If the lines have a variable number of fields, and some of the lines
>>> might be wrapped, is there some way to determine where the start of
>>> each line is?
>>>  If you are reading them in with read.csv, then the system is assuming
>>> that each line starts a new row.  If this is not the case, then you
>>> will have to state the rules that determine where the lines start.
>>> You can always read the data in with 'scan' to separate each line and
>>> then do whatever processing is required to put together the rows in
>>> the data frame that you want.
>>>  In one of your examples, you indicated that the line was split
>>> starting at the word "kempten"; if this is in the middle of the line,
>>> then you would have to create the break after reading the line in with
>>> 'scan' and then create the rows in the data frame.  All of this can be
>>> done in R if you can state what the criteria are.
>>> On Sat, May 30, 2009 at 4:32 AM, Martin Tomko
>>> <martin.tomko at geo.uzh.ch> wrote:
>>>
>>>    Jim,
>>>    the two lines I put in are the actual problematic input lines.
>>>    In these examples, there are no quotes nor # signs, although I
>>>    have no means to make sure they do not occur in the inputs (any
>>>    hints how I could deal with that?).
>>>    I am trying to avoid as much pre-processing outside R as possible,
>>>    and I have to process about 500 files with up to 3000 records
>>>    each, so I need a more or less automated/batch solution. - so any
>>>    string substitution will have to occur in R. But for the moment, I
>>>    do not see a reason for substitution, and the wrapping still
>>>    occurs.
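
On the question of quotes and "#" signs turning up in the input, one
option is to switch off both kinds of processing when reading. The sketch
below uses a made-up two-line file (tmp and d are hypothetical names) in
which one field contains an apostrophe and another starts with "#".

```r
# Hypothetical sample containing an apostrophe and a "#" inside fields.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;it's;#tag", "2;plain;x"), tmp)

# quote = "" disables quote processing and comment.char = "" disables
# comment processing, so both characters are read as ordinary data.
d <- read.table(tmp, sep = ";", quote = "", comment.char = "",
                stringsAsFactors = FALSE)
print(d)
```

read.csv() already defaults to comment.char = "", but it still treats the
double quote as a quote character unless quote = "" is given.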
>>>
>>>    Cheers
>>>    Martin
>>>
>>>
>>>
>>>    jim holtman wrote:
>>>
>>>        You need to supply the actual input line so we can see what is
>>>        happening.  Are you sure you do not have unbalanced quotes in
>>>        your input (try quote='') or do you have comment characters
>>>        ("#") in your input?
>>>
>>>        On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
>>>        <martin.tomko at geo.uzh.ch> wrote:
>>>
>>>           Dear All,
>>>           I am observing a strange behavior, and searching the
>>>           archives and help pages didn't help much.
>>>           I have a csv with a variable number of fields in each line.
>>>
>>>           I use
>>>           dataPoints <- read.csv(inputFile, header=FALSE, sep=";", fill=TRUE)
>>>
>>>           to read it in, and it works. But some lines are long and
>>>           'wrap', or split and continue on the next line. So when I
>>>           check the dim of the frame, it is not correct, and I can see
>>>           when I do a printout that the line is split into two in the
>>>           frame. I checked the input file and all is good.
>>>
>>>           an example of the input is:
>>>                 37;2175168475;13;8.522729;47.19537;16366682 at N00
>>> ;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switz
>>> erland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;touris
>>> mus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitri
>>> otnet;
>>>
>>>           where the last value occurs on the next line in the data
>>>           frame.
>>>
>>>           It does not have to be the last value, as in the following
>>>           example, where the word "kempten" starts the next line:
>>>                 39;167757703;12;10.309295;47.724545;21903142 at N00
>>> ;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavar
>>> ia;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss
>>> ;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kup
>>> pel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europ
>>> aeischeunion;germanio;
>>>
>>>           What could be the reason?
>>>
>>>           I was thinking about solving the issue by using a different
>>>           separator for the first 7 fields and concatenating all of
>>>           the remaining values into a single string value, but could
>>>           not figure out how to do such a substitution in R.
>>>           Unfortunately, on my system I cannot specify a range for
>>>           sed...
>>>
>>>           Thanks for any help/pointers
>>>           Martin
>>>
>>>           ______________________________________________
>>>           R-help at r-project.org mailing list
>>>           https://stat.ethz.ch/mailman/listinfo/r-help
>>>           PLEASE do read the posting guide
>>>           http://www.R-project.org/posting-guide.html
>>>           and provide commented, minimal, self-contained,
>>>           reproducible code.
>>>
>>>
>>>
>>>
>>>        --        Jim Holtman
>>>        Cincinnati, OH
>>>        +1 513 646 9390
>>>
>>>        What is the problem that you are trying to solve?
>>>
>>>
>>
>>
>> --
>> Martin Tomko
>> Postdoctoral Research Assistant   Geographic Information Systems
>> Division
>> Department of Geography
>> University of Zurich - Irchel
>> Winterthurerstr. 190
>> CH-8057 Zurich, Switzerland
>>
>> email:  martin.tomko at geo.uzh.ch
>> site:   http://www.geo.uzh.ch/~mtomko
>> mob:    +41-788 629 558
>> tel:    +41-44-6355256
>> fax:    +41-44-6356848
>>
>>
> 
> 

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 31-May-09                                       Time: 16:24:27
------------------------------ XFMail ------------------------------



