[R] Problem reading mixed CSV file

Petr PIKAL petr.pikal at precheza.cz
Mon Mar 19 11:29:42 CET 2012


Hi
 
> This is quite a CPU-consuming process. My system hung on the
> big file I have.
> 
> Within the for loop that you suggested, can't I have a case
> statement on the value of nFields that specifies the format in
> which each line should be read?
> something like
> case
> # input format for 6 fields
> when nFields == 6
> read.csv as string, string, string, numeric, numeric, numeric
> into dataframe1
> #input format for 7 fields
> when nFields == 7
> read.csv as string, string, string, string, numeric, numeric, numeric
> into dataframe2
> end case
> # Output the two dataframes with some way of tracking the original
> line numbers of the input file, similar to _N_ in SAS.
> Dataframe1 is to be output as it is, while in dataframe2 the 3rd
> and the 4th strings are to be concatenated.
> 
> Could you please help with the format for the above?

I would follow Jim's suggestion: count the fields with
nFields <- count.fields(fileName, sep = ',')
and then read the chunks separately with scan, adjusting the skip and 
nlines parameters for each chunk. However, if only a few lines differ, 
it would be better to correct those lines manually in a suitable 
editor.
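
The chunked reading described above could be sketched roughly as follows. 
This is only an illustration under assumptions, not tested on the real 
data: the sample file contents are made up, every line is assumed to have 
either 6 or 7 fields with no blank lines, and the names what6, what7, 
pieces and the "line" column are hypothetical. Consecutive lines with the 
same field count are grouped with rle so that scan is called once per run 
rather than once per line.

```r
# Made-up sample file; the real file from the thread is not available.
fileName <- tempfile(fileext = ".csv")
writeLines(c(
  "A,x,Boston,1968,13,0",
  "B,y,Chicago,1967,44,0",
  "C,z,Raleigh, North Carol,1968,25,0",
  "D,w,Dally,1968,7,0"
), fileName)

nFields <- count.fields(fileName, sep = ",")
stopifnot(all(nFields %in% c(6, 7)))   # assumption: only these two layouts

# Runs of consecutive lines sharing the same field count
runs   <- rle(nFields)
starts <- cumsum(c(1, head(runs$lengths, -1)))

# Field types for each layout: "" = character, 0 = numeric
what6 <- list("", "", "", 0, 0, 0)
what7 <- list("", "", "", "", 0, 0, 0)

pieces <- vector("list", length(runs$lengths))
for (k in seq_along(runs$lengths)) {
  n <- runs$values[k]
  x <- scan(fileName, what = if (n == 7) what7 else what6, sep = ",",
            skip = starts[k] - 1, nlines = runs$lengths[k], quiet = TRUE)
  if (n == 7) {
    # Rejoin the 3rd and 4th strings that the extra comma split apart
    x[[3]] <- paste(x[[3]], x[[4]], sep = ",")
    x[[4]] <- NULL
  }
  names(x) <- paste0("V", 1:6)
  # Keep the original line numbers, similar in spirit to _N_ in SAS
  pieces[[k]] <- data.frame(
    line = seq(starts[k], length.out = runs$lengths[k]),
    x, stringsAsFactors = FALSE)
}
result <- do.call(rbind, pieces)
```

If the 6-field and 7-field lines alternate very often there will be many 
short runs and many scan calls, so for such a file fixing the lines in 
memory after readLines may be the cheaper route.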

Writing an all-purpose function for reading any kind of corrupted or 
nonstandard file seems worthwhile to me only if you expect to read such 
files many times.

Regards
Petr


> 
> 
> 
> On Sat, Mar 17, 2012 at 4:54 AM, jim holtman <jholtman at gmail.com> wrote:
> > Here is a solution that looks for the line with 7 elements and inserts
> > the quotes:
> >
> >
> >> fileName <- '/temp/text.txt'
> >> input <- readLines(fileName)
> >> # count the fields to find 7
> >> nFields <- count.fields(fileName, sep = ',')
> >> # now fix the data
> >> for (i in which(nFields == 7)){
> > +     # split on comma
> > +     z <- strsplit(input[i], ',')[[1]]
> > +     input[i] <- paste(z[1], z[2]
> > +         , paste('"', z[3], ',', z[4], '"', sep = '') # put on quotes
> > +         , z[5], z[6], z[7], sep = ','
> > +         )
> > + }
> >>
> >> # now read in the data
> >> result <- read.table(textConnection(input), sep = ',')
> >>
> >>         result
> >                         V1       V2                   V3   V4 V5 V6
> > 1                                                         1968 21  0
> > 2                                                  Boston 1968 13  0
> > 3                                                  Boston 1968 18  0
> > 4                                                 Chicago 1967 44  0
> > 5                                              Providence 1968 17  0
> > 6                                              Providence 1969 48  0
> > 7                                                   Binky 1968 24  0
> > 8                                                 Chicago 1968 23  0
> > 9                                                   Dally 1968  7  0
> > 10                                   Raleigh, North Carol 1968 25  0
> > 11 Addy ABC-Dogs Stars-W8.1                    Providence 1968 38  0
> > 12              DEF_REQPRF/                     Dartmouth 1967 31  1
> > 13                       PL                               1967 38  1
> > 14                       XY PopatLal                      1967  5  1
> > 15                       XY PopatLal                      1967  6  8
> > 16                       XY PopatLal                      1967  7  7
> > 17                       XY PopatLal                      1967  9  1
> > 18                       XY PopatLal                      1967 10  1
> > 19                       XY PopatLal                      1967 13  1
> > 20                       XY PopatLal               Boston 1967  6  1
> > 21                       XY PopatLal               Boston 1967  7 11
> > 22                       XY PopatLal               Boston 1967  9  2
> > 23                       XY PopatLal               Boston 1967 10  3
> > 24                       XY PopatLal               Boston 1967  7  2
> >>
> >
> >
> > On Fri, Mar 16, 2012 at 2:17 PM, Ashish Agarwal
> > <ashish.agarwala at gmail.com> wrote:
> >> I have a file of 5000 records, and editing that file is not easy.
> >> Is there any way to read line 10 differently to account for changes
> >> in the third field?
> >>
> >> On Fri, Mar 16, 2012 at 11:35 PM, Peter Ehlers <ehlers at ucalgary.ca> wrote:
> >>> On 2012-03-16 10:48, Ashish Agarwal wrote:
> >>>>
> >>>> Line 10 has city and state, and those too are separated by a comma.
> >>>> How can I read line 10 differently from the other lines?
> >>>
> >>>
> >>> Edit the file and put quotes around the city-state combination:
> >>>  "Raleigh, North Carol"
> >>>
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >
> >
> >
> > --
> > Jim Holtman
> > Data Munger Guru
> >
> > What is the problem that you are trying to solve?
> > Tell me what you want to do, not how you want to do it.
> 


