[R] strange behavior when reading csv - line wraps
(Ted Harding)
Ted.Harding at manchester.ac.uk
Sun May 31 17:24:32 CEST 2009
Ah!!! It was count.fields() which we had overlooked! We discovered
a work-round which involved using
Data0 <- readLines(file)
to create a vector of strings, one for each line of the input file,
and then using
NF <- unlist(lapply(Data0,function(x)
length(unlist(gregexpr(";",x,fixed=TRUE,useBytes=TRUE)))))
to count the number of occurrences of ";" (the separator) in each line.
(NF+1) produces the same result as count.fields(file,sep=";").
Thanks for pointing out the existence of count.fields()!
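
For the record, here is a minimal sketch of that comparison on a small made-up file. The temporary file and its three lines are assumptions for illustration, not the real data. One caveat: gregexpr() returns -1 for a line that contains no ";" at all, so the sketch counts only positive match positions.

```r
# Hypothetical three-line input with a varying number of ";"-separated fields
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;a;b", "2;a;b;c;d", "3"), tmp)

Data0 <- readLines(tmp)

# Count the ";" in each line; gregexpr() gives -1 when there is no match,
# so keep only positive match positions before counting
NF <- vapply(gregexpr(";", Data0, fixed = TRUE),
             function(m) sum(m > 0L), integer(1))

NF + 1L                        # fields per line via the work-round
count.fields(tmp, sep = ";")   # the same counts from count.fields()
```
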
Ted.
On 31-May-09 15:04:23, jim holtman wrote:
> You can do something like this: count the number of fields in each line
> of
> the file and use the max to determine the number of columns for
> read.table:
>
> file <- '/tempxx.txt'
> maxFields <- max(count.fields(file)) # max
> # now set up read.table for max number
> input <- read.table(file, colClasses=rep(NA, maxFields), fill=TRUE,
> col.names=paste("V", seq(maxFields), sep=''))
>
>
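
Adapted to a ";"-separated file, the suggestion above might look like the following sketch. The sample lines are invented, and quote = "" and comment.char = "" are extra assumptions, added so that stray quotes or # signs in the data cannot change the field count.

```r
# Invented sample: the long line comes after the first five lines, which is
# where read.table would otherwise look to decide the number of columns
tmp <- tempfile(fileext = ".csv")
writeLines(c(rep("1;a;b", 5), "2;a;b;c;d;e"), tmp)

# Largest number of fields found on any line
maxFields <- max(count.fields(tmp, sep = ";", quote = "", comment.char = ""))

# Name that many columns up front so that short lines are padded with blanks
input <- read.table(tmp, sep = ";", quote = "", comment.char = "",
                    fill = TRUE, header = FALSE,
                    col.names = paste0("V", seq_len(maxFields)))

dim(input)   # one row per input line, maxFields columns
```
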
> On Sun, May 31, 2009 at 6:06 AM, Martin Tomko
> <martin.tomko at geo.uzh.ch> wrote:
>
>> Dear Jim,
>> with the help of Ted, we diagnosed that the cause is in the extreme
>> variability in line length during reading in. As the table column
>> number is
>> apparently determined from the first five lines, what exceeds this
>> length
>> gets automatically on the next line.
>> I am now trying to find a way to read in the data despite this. I have
>> no
>> control over the table extent, the only thing that would make sense
>> according to my data would be to read in a fixed number of columns and
>> merge
>> all remaining columns as a long string in the last one. No idea how to
>> do
>> this, though.
>>
>> Thanks
>> Martin
>>
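
One possible way to do what is asked above (keep a fixed number of leading columns and merge the rest into one string) is to split each line by hand. This is only a sketch: the two records are shortened, invented stand-ins, nFixed = 3 is an assumed value (the real data has 7 fixed fields), and in practice the records would come from readLines(inputFile).

```r
# Invented sample records: fixed leading fields followed by a variable
# number of tags
recs <- c("37;2175168475;13;sculpture;bird;tourism;",
          "39;167757703;12;white;building;")
nFixed <- 3   # assumed number of fixed leading fields

parts <- strsplit(recs, ";", fixed = TRUE)

# First nFixed fields of every record, one row per record
fixed <- t(vapply(parts, function(p) p[seq_len(nFixed)], character(nFixed)))

# Everything after the fixed fields, glued back into a single string
rest <- vapply(parts, function(p) paste(p[-seq_len(nFixed)], collapse = ";"),
               character(1))

result <- data.frame(fixed, tags = rest, stringsAsFactors = FALSE)
result
```

The fixed columns come back as character here; they can be converted afterwards with as.numeric() where appropriate.
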
>>
>> jim holtman wrote:
>>
>>> It is still not clear to me exactly how you want to read the lines
>>> in. If
>>> the lines have a variable number of fields, and some of the lines
>>> might be
>>> wrapped, is there some way to determine where the start of each line
>>> is?
>>> If you are reading them in with read.csv, then the system is
>>> assuming
>>> that each line starts a new row. If this is not the case, then you
>>> will
>>> have to state the rules that determine where the lines start. You
>>> can
>>> always read the data in with 'scan' to separate each line and then do
>>> whatever processing is required to put together the rows in a data
>>> frame
>>> that you want.
>>> In one of your examples, you indicated that the line was split
>>> starting
>>> at the word "kempten"; if this is in the middle of the line, then you
>>> would
>>> have to create the break after reading the line in with 'scan' and
>>> then
>>> creating the rows in the dataframe. All of this can be done in R if
>>> you can
>>> state what the criteria are.
>>> On Sat, May 30, 2009 at 4:32 AM, Martin Tomko
>>> <martin.tomko at geo.uzh.ch> wrote:
>>>
>>> Jim,
>>> the two lines I put in are the actual problematic input lines.
>>> In these examples, there are no quotes nor # signs, although I
>>> have no means to make sure they do not occur in the inputs (any
>>> hints how I could deal with that?).
>>> I am trying to avoid as much pre-processing outside R as possible,
>>> and I have to process about 500 files with up to 3000 records
>>> each, so I need a more or less automated/batch solution. - so any
>>> string substitution will have to occur in R. But for the moment, I
>>> do not see a reason for substitution, and the wrapping still
>>> occurs.
>>>
>>> Cheers
>>> Martin
>>>
>>>
>>>
>>> jim holtman wrote:
>>>
>>> You need to supply the actual input line so we can see what is
>>> happening. Are you sure you do not have unbalanced quotes in
>>> your input (try quote='') or do you have comment characters
>>> ("#") in your input?
>>>
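
When it is not possible to guarantee that the input is free of quotes and # signs, one option is to switch off both mechanisms while reading. The sketch below uses two invented lines; with quote = "" and comment.char = "" the apostrophe and the # are treated as ordinary characters.

```r
# Invented sample containing an apostrophe and a "#"
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;it's;a", "2;b#c;d"), tmp)

# Disable quote and comment processing so both characters are read literally
dataPoints <- read.table(tmp, sep = ";", header = FALSE, fill = TRUE,
                         quote = "", comment.char = "",
                         stringsAsFactors = FALSE)
dataPoints$V2
```
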
>>> On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
>>> <martin.tomko at geo.uzh.ch> wrote:
>>>
>>> Dear All,
>>> I am observing a strange behavior and searching the
>>> archives and
>>> help pages didn't help much.
>>> I have a csv with a variable number of fields in each line.
>>>
>>> I use
>>> dataPoints <- read.csv(inputFile, header=FALSE, sep=";", fill=TRUE)
>>>
>>> to read it in, and it works. But - some lines are long and
>>> 'wrap',
>>> or split and continue on the next line. So when I check the
>>> dim of
>>> the frame, they are not correct and I can see when I do a
>>> printout
>>> that the line is split into two in the frame. I checked
>>> the input
>>> file and all is good.
>>>
>>> an example of the input is:
>>> 37;2175168475;13;8.522729;47.19537;16366682 at N00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet;
>>>
>>> where the last value occurs on the next line in the data
>>> frame.
>>>
>>> It does not have to be the last value, as in the following
>>> example,
>>> the word "kempten" starts the next line:
>>> 39;167757703;12;10.309295;47.724545;21903142 at N00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio;
>>>
>>> What could be the reason?
>>>
>>> I was thinking about solving the issue by using a different
>>> separator, that I would use for the first 7 fields and
>>> concatenating all of the remaining values into a single
>>> string
>>> value, but could not figure out how to do such a
>>> substitution in
>>> R. Unfortunately, on my system I cannot specify a range for
>>> sed...
>>>
>>> Thanks for any help/pointers
>>> Martin
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>>
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>>
>>> and provide commented, minimal, self-contained,
>>> reproducible code.
>>>
>>>
>>>
>>>
>>> -- Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Martin Tomko
>> Postdoctoral Research Assistant Geographic Information Systems
>> Division
>> Department of Geography
>> University of Zurich - Irchel
>> Winterthurerstr. 190
>> CH-8057 Zurich, Switzerland
>>
>> email: martin.tomko at geo.uzh.ch
>> site: http://www.geo.uzh.ch/~mtomko
>> mob: +41-788 629 558
>> tel: +41-44-6355256
>> fax: +41-44-6356848
>>
>>
>
>
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 31-May-09 Time: 16:24:27
------------------------------ XFMail ------------------------------