[R] select data from large CSV file
Stephen C. Upton
supton at referentia.com
Thu Jul 5 17:40:20 CEST 2007
Hi Lars,
I haven't tried this myself, but I believe a couple of recent messages on
the list dealt with reading large files: they used scan() on an open
connection and read the file in blocks.
See ?scan and ?connections.
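
A minimal, untested-on-your-data sketch of the block-reading idea follows. The function name read_location, its arguments, and the block size are hypothetical choices for illustration; it assumes every line has exactly 12 comma-separated fields and that column 2 holds the location code, as in the example rows in the quoted message.

```r
# Hypothetical helper (not from any package): read a CSV in blocks through
# one open connection and keep only the rows for a single location code.
# Assumes every line has exactly `ncols` comma-separated fields.
read_location <- function(path, code, block_lines = 10000L, ncols = 12L) {
  con <- file(path, open = "r")   # opened once; each scan() continues from
  on.exit(close(con))             # where the previous call stopped
  kept <- list()
  repeat {
    fields <- scan(con, what = "", sep = ",", strip.white = TRUE,
                   nlines = block_lines, quiet = TRUE)
    if (length(fields) == 0L) break            # nothing left to read
    block <- matrix(fields, ncol = ncols, byrow = TRUE)
    hit <- block[, 2] == as.character(code)    # column 2: location code
    kept[[length(kept) + 1L]] <- block[hit, , drop = FALSE]
  }
  do.call(rbind, kept)            # character matrix, one row per matching line
}
```

It could be called as rows <- read_location("et.csv", 25), where the file name is a placeholder. Because the connection stays open, the file is traversed only once, instead of being reopened and skipped through for every line. The distinct location codes could be gathered in the same pass by accumulating unique(block[, 2]) inside the loop.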
HTH
steve
Lars Modig wrote:
> Hello
>
>
> I’ve got a large CSV file (>500M) with statistical data. It’s divided into
> 12 columns, and I don’t know how many lines it has.
> The first column is the date and the second is a unique code for the
> location; the rest is, let’s say, different weather data. See the example
> below.
> 070704, 25, --,--,--,temperature, 22, --,--,30, 20,Y
> 070705, 25, --,--,--,temperature, 22, --,--,30, 20,Y
> 070705, 25, --,--,--,pressure, 1200, --,--,1000, 1100,N
> 070705, 26, --,--,--,temperature, 22, --,--,30, 20,Y
> …
> First I tried with data <- read.csv, and of course the memory filled up.
> Then I found in the archive that you can use scan. So I wrote the
> lines below to search for locations and store one location, with
> all its different data, in one variable.
>
> # collect the different pnc's
> b=2 #compare from second number
> alike=TRUE #Dim alike like a boolean
> stored = 910286609 #first number is known
> for(i in 1:100){ #start counting and scanning
> data_final <- matrix(unlist(scan(
>   "C:/Documents and Settings/modiglar/Desktop/temp/et.csv", sep=",",
>   what=list("","","","","","","","","","","",""), skip=i,
>   n=12)), ncol=12, byrow=TRUE)
>
>
> a=1 #compare from the 1:th stored
> while( a < b){ #---
> #
> if(as.numeric(data_final[2]) != stored[a]) #compare as numbers
> { a=a+1 #
> alike=FALSE } #
> else{ #
> alike=TRUE #
> break } #
> } # ---
>
> if (alike==FALSE){ #
> stored[b]=as.numeric(data_final[2]) # Store new data
> b=b+1 #
> }
> }
>
> #------------------------------------------------------------
> # save 1 pnc at the time
> d=1
> saved_data = 1:1200 ; dim(saved_data) <- c(12,100)
> save_data_nr = 1 #Stored number
> for(i in 1:100){ #start counting and scanning
> data_final <- matrix(unlist(scan(
>   "C:/Documents and Settings/modiglar/Desktop/temp/et.csv", sep=",",
>   what=list("","","","","","","","","","","",""), skip=i,
>   n=12)), ncol=12, byrow=TRUE)
>
>
> if(as.numeric(data_final[2]) == stored[save_data_nr]) #compare as numbers
> { saved_data[,d] <- matrix(unlist(data_final),ncol=12,
> byrow=TRUE) #Store new data
> d=d+1 } #
> #
> #
> }
> As you can see, I’m not very familiar with R, so I have probably
> done this the wrong way.
>
> As I understand it, each time scan is called it opens the file, counts
> down to the line that should be read, reads it, and then closes the file
> again. So by the time I get to around line 10000, it
> starts to take a long time. I let the computer run overnight, but it was still
> far from finished when I stopped the loop.
>
> So how should I do this? Maybe I also need to sort by date. The dates are
> hopefully already in order, so I should be able to cut the file every
> time I hit a new month, but that will also take time if I do it like
> this.
>
> Thank you in advance for your help.
>
> Lars