[R] Problem reading mixed CSV file

Tue Mar 20 17:07:22 CET 2012

Use 'count.fields' to determine which lines have 6 and which have 7 fields in them.

Then use 'readLines' to read in the entire file, and use the counts
from count.fields to write the lines out to separate files:

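x <- count.fields(...)   # number of fields on each line
input <- readLines(...)  # the raw lines, in the same order as x
# writeLines() takes the output file via 'con', not 'file'
writeLines(input[x == 6], con = '6fields.csv')
writeLines(input[x == 7], con = '7fields.csv')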
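
Here is a minimal end-to-end sketch of the same idea on a made-up three-line file. The sample rows, the temp-file paths and the object names (src, n, out6, out7, six, seven) are assumptions for illustration only, not taken from the original data. One thing to watch: by default count.fields skips blank lines and treats quote and '#' characters specially, so its result can get out of step with readLines; the sketch turns those off so the two vectors line up one-to-one.

```r
# Hypothetical sample data: rows 1 and 3 have 6 fields, row 2 has 7
# because of the unquoted comma in "Raleigh, North Carol".
src <- tempfile(fileext = ".csv")
writeLines(c(
  "A,B,Boston,1968,13,0",
  "A,B,Raleigh, North Carol,1968,25,0",
  "A,B,Chicago,1967,44,0"
), con = src)

# Disable quote, comment and blank-line handling so that count.fields()
# returns exactly one count per line returned by readLines().
n <- count.fields(src, sep = ",", quote = "",
                  blank.lines.skip = FALSE, comment.char = "")
input <- readLines(src)
stopifnot(length(n) == length(input))

# Write each group of lines to its own file.
out6 <- file.path(tempdir(), "6fields.csv")
out7 <- file.path(tempdir(), "7fields.csv")
writeLines(input[n == 6], con = out6)
writeLines(input[n == 7], con = out7)

# Each file is now rectangular and can be read on its own.
six   <- read.csv(out6, header = FALSE, stringsAsFactors = FALSE)
seven <- read.csv(out7, header = FALSE, stringsAsFactors = FALSE)
```

For the 7-column file you would still have to paste the two halves of the city/state field back together before combining it with the 6-column data.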

On Tue, Mar 20, 2012 at 11:43 AM, Ashish Agarwal
<ashish.agarwala at gmail.com> wrote:
> The file is 20MB, with 2 million rows.
> I understand that I have two different formats - 6 columns and 7 columns.
> How do I read chunks into different files by using scan and modifying
> the skip and nlines parameters?
>
> On Mon, Mar 19, 2012 at 3:59 PM, Petr PIKAL <petr.pikal at precheza.cz> wrote:
>>
>> I would follow Jim's suggestion: count the fields with
>> nFields <- count.fields(fileName, sep = ',')
>> and read chunks into different files by using scan, modifying the
>> skip and nlines parameters. However, if only a few lines differ,
>> it would be better to correct those few lines manually in
>> some suitable editor.
>>
>> Elaborating an all-purpose function for reading any kind of
>> corrupted/nonstandard file seems to me worthwhile only if you expect to read
>> such files many times.
>>
>> Regards
>> Petr
>>
>>> On Sat, Mar 17, 2012 at 4:54 AM, jim holtman <jholtman at gmail.com> wrote:
>>> > Here is a solution that looks for the line with 7 elements and inserts
>>> > the quotes:
>>> >
>>> >
>>> >> fileName <- '/temp/text.txt'
>>> >> input <- readLines(fileName)
>>> >> # count the fields to find 7
>>> >> nFields <- count.fields(fileName, sep = ',')
>>> >> # now fix the data
>>> >> for (i in which(nFields == 7)){
>>> > +     # split on comma
>>> > +     z <- strsplit(input[i], ',')[[1]]
>>> > +     input[i] <- paste(z[1], z[2]
>>> > +         , paste('"', z[3], ',', z[4], '"', sep = '') # put on quotes
>>> > +         , z[5], z[6], z[7], sep = ','
>>> > +         )
>>> > + }
>>> >>
>>> >> # now read in the data
>>> >> result <- read.table(textConnection(input), sep = ',')
>>> >>
>>> >>         result
>>> >                         V1       V2                   V3   V4 V5 V6
>>> > 1                                                         1968 21  0
>>> > 2                                                  Boston 1968 13  0
>>> > 3                                                  Boston 1968 18  0
>>> > 4                                                 Chicago 1967 44  0
>>> > 5                                              Providence 1968 17  0
>>> > 6                                              Providence 1969 48  0
>>> > 7                                                   Binky 1968 24  0
>>> > 8                                                 Chicago 1968 23  0
>>> > 9                                                   Dally 1968  7  0
>>> > 10                                   Raleigh, North Carol 1968 25  0
>>> > 11 Addy ABC-Dogs Stars-W8.1                    Providence 1968 38  0
>>> > 12              DEF_REQPRF/                     Dartmouth 1967 31  1
>>> > 13                       PL                               1967 38  1
>>> > 14                       XY PopatLal                      1967  5  1
>>> > 15                       XY PopatLal                      1967  6  8
>>> > 16                       XY PopatLal                      1967  7  7
>>> > 17                       XY PopatLal                      1967  9  1
>>> > 18                       XY PopatLal                      1967 10  1
>>> > 19                       XY PopatLal                      1967 13  1
>>> > 20                       XY PopatLal               Boston 1967  6  1
>>> > 21                       XY PopatLal               Boston 1967  7 11
>>> > 22                       XY PopatLal               Boston 1967  9  2
>>> > 23                       XY PopatLal               Boston 1967 10  3
>>> > 24                       XY PopatLal               Boston 1967  7  2
>>> >>
>>> >
>>> >
>>> > On Fri, Mar 16, 2012 at 2:17 PM, Ashish Agarwal
>>> > <ashish.agarwala at gmail.com> wrote:
>>> >> I have a file with 5000 records, and editing that file is not easy.
>>> >> Is there any way to read line 10 differently to account for changes in the
>>> >> third field?
>>> >>
>>> >> On Fri, Mar 16, 2012 at 11:35 PM, Peter Ehlers <ehlers at ucalgary.ca> wrote:
>>> >>> On 2012-03-16 10:48, Ashish Agarwal wrote:
>>> >>>>
>>> >>>> Line 10 has City and State, which are themselves separated by a comma. How
>>> >>>> can I read line 10 differently from the other lines?
>>> >>>
>>> >>>
>>> >>> Edit the file and put quotes around the city-state combination:
>>> >>>  "Raleigh, North Carol"
>>> >>>
>>> >>
>>> >> ______________________________________________
>>> >> R-help at r-project.org mailing list
>>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>>> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> >> and provide commented, minimal, self-contained, reproducible code.
>>> >
>>> >
>>> >
>>> > --
>>> > Jim Holtman
>>> > Data Munger Guru
>>> >
>>> > What is the problem that you are trying to solve?
>>> > Tell me what you want to do, not how you want to do it.
>>

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.