[R] reading in data with variable length

Gabor Grothendieck ggrothendieck at gmail.com
Tue Dec 6 23:51:20 CET 2005


On 12/6/05, John McHenry <john_d_mchenry at yahoo.com> wrote:
>
> Everything has slowed down with #1 and #3 by about 50%. Can't do #2 & #4 :
>
> > ta.num <- lapply(ta0, scan, sep = ",")
> Error in file(file, "r") : unable to open connection
> scan seems to want a file or a connection ...

Building on Andy's variation:

n <- length(ta)
ta.sub <- sub("^[^,]*,[^,]*,", "", ta)
ta.con <- textConnection(ta.sub)
out <- replicate(n, scan(ta.con, nlines = 1, sep = ","))
close(ta.con)

Also consider writing ta.sub back out to a file and defining ta.con as a
file connection to that file; you would need to time both to determine
which is faster.
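
A minimal sketch of that file-connection alternative, reusing ta.sub and n
from the snippet above (tempfile(), quiet = TRUE and the unlink() cleanup
are illustrative choices, not from the original post):

tmp <- tempfile()
writeLines(ta.sub, tmp)                  # write the stripped lines to a temporary file
ta.con <- file(tmp, open = "r")          # scan then reads from a file connection
out <- replicate(n, scan(ta.con, nlines = 1, sep = ",", quiet = TRUE))
close(ta.con)
unlink(tmp)                              # remove the temporary file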


>
>
> Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Could you time these and see how each of them does:
>
> # 1
> ta.split <- strsplit(ta, split = ",")
> ta.num <- lapply(ta.split, function(x) as.numeric(x[-(1:2)]))
>
> # 2
> ta0 <- sub("^[^,]*,[^,]*,", "", ta)
> ta.num <- lapply(ta0, scan, sep = ",")
>
> # 3 - loop version of #1
> n <- length(ta)
> ta.split <- strsplit(ta, split = ",")
> ta.num <- vector("list", n)
> for(i in 1:n) ta.num[[i]] <- as.numeric(ta.split[[i]][-(1:2)])
>
> # 4 - loop version of #2
> n <- length(ta)
> ta0 <- sub("^[^,]*,[^,]*,", "", ta)
> ta.num <- vector("list", n)
> for(i in 1:n) ta.num[[i]] <- scan(ta0[i], sep = ",")
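
A simple way to time each variant, assuming ta already holds the lines read
with readLines (only #1 is shown here; wrap #2-#4 the same way):

system.time({
  ta.split <- strsplit(ta, split = ",")
  ta.num <- lapply(ta.split, function(x) as.numeric(x[-(1:2)]))
})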
>
>
>
> On 12/6/05, John McHenry wrote:
> > I should have mentioned that I already tried the readLines() approach:
> >
> > ta<-readLines("foo.csv")
> > ptm<-proc.time()
> > f<-character(length(ta))
> > for (k in 2:length(ta)) {
> >   f[k-1] <- (strsplit(ta[k], ",")[[1]])[3]
> > } # <- PARSING EACH LINE AT THIS LEVEL IS WHERE THE REAL INEFFICIENCY IS
> > (proc.time()-ptm)[3]
> > [1] 102.75
> >
> > on a 62M file, so I'm guessing that on my 1GB files this will be about
> >
> > > (102.75*(1000/61))/60
> > [1] 28.07377
> >
> > minutes...which is way, way too long.
> >
> > I'm new to R but I'm kind of surprised that this problem isn't well known
> > (couldn't find anything after a long hunt).
> >
> > As I mentioned, MATLAB does it using textread which makes a call to its
> > dll dataread. The data are read using something like:
> >
> > [name, startMonth, data]=textread(fileName,'%s%n%[^\n]',
> >     'delimiter',',', 'bufsize', 1000000, 'headerlines',1);
> >
> > which is kind of fscanf-like. data in the above is then a cell array with
> > each cell being the variable-length data.
> >
> > "Liaw, Andy" wrote:
> > Using a file() connection in conjunction with readLines() and strsplit()
> > should do it. I would try to count the number of lines in the file first,
> > and create a list with that many components, then fill it in. I believe
> > the "array of cells" in Matlab is sort of equivalent to a list in R, but
> > that's beyond my knowledge of Matlab...
> >
> > Andy
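
A minimal sketch of Andy's suggestion, assuming the foo.csv layout shown
further down (the component names name, start and data are illustrative,
not from the original post):

con <- file("foo.csv", open = "r")
header <- readLines(con, n = 1)          # skip the header line
lines <- readLines(con)                  # one element per record
close(con)

n <- length(lines)
ta <- vector("list", n)                  # pre-allocate one component per record
for (i in seq_len(n)) {
  fields <- strsplit(lines[i], ",")[[1]]
  ta[[i]] <- list(name  = fields[1],
                  start = as.numeric(fields[2]),
                  data  = as.numeric(fields[-(1:2)]))
}
ta[[1]]$data                             # the variable-length numeric part of record 1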
> >
> > From: John McHenry
> > >
> > > I have very large csv files (up to 1GB each of ASCII text).
> > > I'd like to be able to read them directly in to R. The
> > > problem I am having is with the variable length of the data
> > > in each record.
> > >
> > > Here's a (simplified) example:
> > >
> > > $ cat foo.csv
> > > Name,Start Month,Data
> > > Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
> > > Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114
> > >
> > > The records consist of rows with a fixed set of comma-separated
> > > fields (e.g. the "Name" & "Start Month" fields in the above)
> > > and then the data follow as a variable-length list of
> > > comma-separated values until a new line is encountered.
> > >
> > > Now I can use e.g.
> > >
> > > fileName="foo.csv"
> > > ta<-read.csv(fileName, header=F, skip=1, sep=",", dec=".", fill=T)
> > >
> > > which does the job nicely:
> > >
> > >    V1 V2      V3     V4     V5      V6      V7     V8    V9
> > > 1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA    NA
> > > 2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765 0.281
> > >      V10    V11    V12    V13     V14     V15    V16     V17
> > > 1     NA     NA     NA     NA      NA      NA     NA      NA
> > > 2 1.8546 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114
> > >
> > >
> > > but the problem is that with files on the order of 1GB this
> > > either crunches forever or runs out of memory trying ...
> > > plus having all those NAs isn't too pretty to look at.
> > >
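
If the read.csv route is kept, one way to strip the NA padding afterwards is
to turn each row back into a plain numeric vector (a sketch, assuming ta is
the data frame produced by the read.csv call above):

ta.num <- lapply(seq_len(nrow(ta)), function(i) {
  x <- unlist(ta[i, -(1:2)], use.names = FALSE)  # drop the Name and Start Month columns
  x[!is.na(x)]                                   # drop the NA padding
})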
> > > (I have a MATLAB version that can read this stuff into an
> > > array of cells in about 3 minutes).
> > >
> > > I really want a fast way to read the data part into a list;
> > > that way I can access data in the array of lists containing
> > > the records by doing something like ta[[i]]$data.
> > >
> > > Ideas?
> > >
> > > Thanks,
> > >
> > > Jack.
> > >
> > >



