[R] select data from large CSV file
Lars Modig
eb99lamo at kth.se
Thu Jul 5 13:54:10 CEST 2007
Hello

I've got a large CSV file (>500 MB) with statistical data. It is divided
into 12 columns and I don't know how many lines.
The first column is the date and the second is a unique code for the
location; the rest is (let's say) different weather data. See the example
below.
070704, 25, --,--,--,temperature, 22, --,--,30, 20,Y
070705, 25, --,--,--,temperature, 22, --,--,30, 20,Y
070705, 25, --,--,--,pressure, 1200, --,--,1000, 1100,N
070705, 26, --,--,--,temperature, 22, --,--,30, 20,Y
First I tried data <- read.csv(...), and of course the memory got full.
Then I found in the archives that you can use scan(). So I wrote the
lines below to search for locations and store one location, with all its
different data, in one variable.
# collect the different pnc's
b <- 2                      # next free slot in 'stored'
alike <- TRUE               # flag: code already seen?
stored <- 910286609         # first code is known
for (i in 1:100) {          # scan one row per iteration
  data_final <- matrix(unlist(scan(
    "C:/Documents and Settings/modiglar/Desktop/temp/et.csv",
    sep = ",", what = as.list(rep("", 12)), skip = i, n = 12)),
    ncol = 12, byrow = TRUE)
  a <- 1                    # compare against stored codes, from the 1st
  while (a < b) {
    if (as.numeric(data_final[2]) != stored[a]) {  # closing ')' was misplaced
      a <- a + 1
      alike <- FALSE
    } else {
      alike <- TRUE
      break
    }
  }
  if (!alike) {
    stored[b] <- as.numeric(data_final[2])         # store new code
    b <- b + 1
  }
}
#------------------------------------------------------------
# save 1 pnc at a time
d <- 1
saved_data <- matrix(NA_character_, nrow = 12, ncol = 100)
save_data_nr <- 1           # which stored code to extract
for (i in 1:100) {          # scan one row per iteration
  data_final <- matrix(unlist(scan(
    "C:/Documents and Settings/modiglar/Desktop/temp/et.csv",
    sep = ",", what = as.list(rep("", 12)), skip = i, n = 12)),
    ncol = 12, byrow = TRUE)
  if (as.numeric(data_final[2]) == stored[save_data_nr]) {  # ')' was misplaced
    saved_data[, d] <- data_final                  # store the matching row
    d <- d + 1
  }
}
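The two loops above can be sketched much more compactly by reading a whole block of lines in one scan() call and letting R compare the column vectorised. This is only a sketch: I use a tiny stand-in file here instead of the real et.csv, and the "25" location code comes from the example rows above.

```r
# Tiny stand-in for et.csv, using the layout from the example above
csv <- tempfile(fileext = ".csv")
writeLines(c(
  "070704,25,--,--,--,temperature,22,--,--,30,20,Y",
  "070705,25,--,--,--,temperature,22,--,--,30,20,Y",
  "070705,26,--,--,--,temperature,22,--,--,30,20,Y"), csv)

# Read ALL lines in one scan() call instead of one call per row
rows  <- scan(csv, sep = ",", what = as.list(rep("", 12)), quiet = TRUE)
block <- matrix(unlist(rows), ncol = 12)    # columns fill in order

stored <- unique(as.numeric(block[, 2]))    # every location code seen
saved  <- block[block[, 2] == "25", ]       # all rows for one location
```

On the real file one would read, say, 10000 lines per call with scan()'s nlines argument and repeat until nothing comes back.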
As you can see I'm not so familiar with R, so I have probably done this
the wrong way.

As I understand it, each call to scan() opens the file, counts down to
the line that should be read, reads it, and then closes the file again. So
by the time I get to around line 10000 it starts to take a long time. I
let the computer run overnight, but it was still far from finished when I
stopped the loop.

So how should I do this? Maybe I also need to sort on the date; the file
is hopefully already in date order, so you should be able to cut it every
time you hit a new month, but that will also take time if I do it like
this.
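The re-opening can be avoided by passing scan() an open connection: the file is opened once and scan() continues from wherever the previous call stopped. A minimal sketch (again with a stand-in file; the real path and 12-column layout would go in its place):

```r
# Stand-in file so the sketch is runnable; the real file is et.csv
csv <- tempfile(fileext = ".csv")
writeLines(c("070704,25,x", "070705,26,y"), csv)

con <- file(csv, open = "r")        # open the file ONCE
n_rows <- 0
repeat {
  row <- scan(con, sep = ",", what = as.list(rep("", 3)),
              nlines = 1, quiet = TRUE)
  if (length(row[[1]]) == 0) break  # nothing came back: end of file
  n_rows <- n_rows + 1              # ...process the row here...
}
close(con)
```

Each pass through the loop then costs the same regardless of how deep into the file you are.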
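If the file really is in date order, the month cut could be done without sorting at all, by grouping row indices on the YYMM prefix of the date column. A sketch under that assumption (stand-in data again):

```r
# Stand-in rows spanning two months, dates in YYMMDD form as above
csv <- tempfile(fileext = ".csv")
writeLines(c("070629,25,x", "070704,25,y", "070705,26,z"), csv)

rows  <- scan(csv, sep = ",", what = as.list(rep("", 3)), quiet = TRUE)
block <- matrix(unlist(rows), ncol = 3)

month  <- substr(block[, 1], 1, 4)             # "0706", "0707", ... (YYMM)
chunks <- split(seq_len(nrow(block)), month)   # row indices per month
```

Each element of chunks then indexes one month's rows, ready to be written out or processed separately.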
Thank you for your help in advance.
Lars