[R] reading in data with variable length

Tue Dec 6 19:08:57 CET 2005

On 06-Dec-05 John McHenry wrote:
> I have very large csv files (up to 1GB each of ASCII text). I'd like to
> be able to read them directly in to R. The problem I am having is with
> the variable length of the data in each record.
>    
>   Here's a (simplified) example:
>    
>   $ cat foo.csv
> Name,Start Month,Data
> Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
> Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114
>    
>   The records consist of rows with some set comma-separated fields
> (e.g. the "Name" & "Start Month" fields in the above) and then the data
> follow as a variable-length list of comma-separated values until a new
> line is encountered.

While you may well get a good R solution from the experts,
in such a situation (as in so many) I would be tempted to
pre-process the file with 'awk' (installed by default on
Unix/Linux systems, available also for Windows).

The following will give you a CSV file with a constant number
of fields per line. While this does not eliminate the NAs which
you apparently find unsightly, it should be a fast and clean way
of doing the basic job, since it is a line-by-line operation in
two passes, so there should be no question of choking the
system (unless you run out of HD space as a result of creating
the second file).

Two passes, on the lines of
Pass 1:

  cat foo.csv | awk '
    BEGIN{FS=","; n=0}
    {m=NF; if(m>n){n=m}}
    END{print n} '

which gives you the maximum number of fields in any line.
Suppose (for example) that this number is 37.
Then Pass 2:

  cat foo.csv | awk -v maxF=37 '
    BEGIN{FS=","; OFS=","}
    {if(NF<maxF){$maxF=""}}
    {print $0} ' > newfoo.csv

Tiny example:
1) See foo.csv

  cat foo.csv 
  1
  1,2
  1,2,3
  1,2,3,4
  1,2

2) Pass 1:

  cat foo.csv | awk '
     BEGIN{FS=","; n=0}
     {m=NF; if(m>n){n=m}}
     END{print n} '
  4

3) So we need 4 fields per line. With maxF=4, Pass 2:

  cat foo.csv | awk -v maxF=4 '
     BEGIN{FS=","; OFS=","}
     {if(NF<maxF){$maxF=""}}
     {print $0} ' > newfoo.csv

4) See newfoo.csv

  cat newfoo.csv
  1,,,
  1,2,,
  1,2,3,
  1,2,3,4
  1,2,,

So you now have a CSV file with a constant number of fields per line.
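
The two passes can also be chained so that the maximum field count is
not typed in by hand. What follows is only a sketch under assumptions:
it recreates the tiny foo.csv from above in the current directory, the
variable names maxF and counts are just illustrative, and the last step
is an added sanity check (not part of the two passes) that lists the
distinct field counts found in the result.

```shell
# Tiny example from this message, recreated so the sketch is self-contained.
printf '1\n1,2\n1,2,3\n1,2,3,4\n1,2\n' > foo.csv

# Pass 1: maximum number of comma-separated fields on any line.
maxF=$(awk -F, 'NF > n { n = NF } END { print n + 0 }' foo.csv)

# Pass 2: assigning to field number maxF pads short lines with empty fields.
awk -v maxF="$maxF" 'BEGIN { FS = ","; OFS = "," }
     NF < maxF { $maxF = "" }
     { print }' foo.csv > newfoo.csv

# Sanity check: distinct field counts in the result (a single value is expected).
counts=$(awk -F, '{ print NF }' newfoo.csv | sort -u)
echo "max fields: $maxF; field counts in newfoo.csv: $counts"
```

If the check prints more than one field count, something other than
short lines is going on (for instance commas inside quoted fields,
which this simple FS="," splitting does not handle).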

This doesn't make it into lists, though.
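
One possible step in that direction, still using awk, is to write one
line per data value ("long" format) instead of padding. This is only a
sketch under assumptions: it assumes one header line and two fixed
leading fields (Name and Start Month, as in the original example), the
sample values are shortened from that example, and the names nfixed,
idx, value and longfoo.csv are made up for illustration.

```shell
# Sample input modelled on the example in the original question (shortened).
printf 'Name,Start Month,Data\nFoo,10,-0.5615,2.3065\nBar,21,0.0880,0.5733,0.0081\n' > foo.csv

# One output line per data value: name, start month, position, value.
# nfixed (hypothetical parameter) is the number of leading fixed fields.
awk -v nfixed=2 'BEGIN { FS = ","; OFS = "," }
     NR == 1 { print "Name", "Start Month", "idx", "value"; next }  # new header
     { for (i = nfixed + 1; i <= NF; i++) print $1, $2, i - nfixed, $i }' foo.csv > longfoo.csv

cat longfoo.csv
```

Every line of the result has exactly four fields and none of them is
empty, so there are no NAs on reading it in. The price is a larger
file, since the fixed fields are repeated for every value, and the
grouping of values by record still has to be done on the R side.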

Hoping this helps,
Ted.

>    
>   Now I can use e.g.
>    
>   fileName="foo.csv"  
> ta<-read.csv(fileName, header=F, skip=1, sep=",", dec=".", fill=T)  
>    
>   which does the job nicely:
>    
>    V1 V2      V3     V4     V5      V6      V7     V8    V9    V10
> 1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA    NA     NA
> 2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765 0.281 1.8546
>      V11    V12    V13     V14     V15    V16     V17
> 1     NA     NA     NA      NA      NA     NA      NA
> 2 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114
> 
>    
>   but the problem is with files on the order of 1GB this either
> crunches for ever or runs out of memory trying ... plus having all
> those NAs isn't too pretty to look at. 
>    
>   (I have a MATLAB version that can read this stuff into an array of
> cells in about 3 minutes).
>    
>   I really want a fast way to read the data part into a list; that way
> I can access data in the array of lists containing the records by doing
> something like ta[[i]]$data.
>    
>   Ideas?
>    
>   Thanks,
>    
>   Jack.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Dec-05                                       Time: 18:08:54
------------------------------ XFMail ------------------------------