[R] How long does skipping in read.table take
Dimitri Liakhovitski
dimitri.liakhovitski at gmail.com
Sat Oct 23 16:31:46 CEST 2010
Gabor, thanks a lot. So, I don't really need sql? That's great.
I'll try your code.
To finish with sql, I've run this:
(I wanted to skip the first 11 million rows)
mydata<-read.csv.sql("my.file.txt", sep="|", eol="\r\n", sql =
"select * from file limit 1000000, 10999999")
After 20 min (on a 4-core 64-bit Windows 7 PC with 6 GB RAM (I assume
only 4 can be used?)) I got this error:
Error: cannot allocate vector of size 42.0 Mb
So, I guess
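
A note on the LIMIT clause in the statement above: in SQLite, which read.csv.sql uses by default, "limit A, B" means skip A rows and then return at most B rows. As written, the query skips 1 million rows and asks for roughly 11 million, which may be why the allocation failed. Below is a minimal base-R sketch of building the statement the other way round. The helper name make_limit_sql and the chunk size of 1 million rows are assumptions for illustration; they are not part of sqldf.

```r
# Hypothetical helper: build a SQLite query that skips `skip` rows
# and then returns at most `n` rows (LIMIT <offset>, <count>).
make_limit_sql <- function(skip, n) {
  sprintf("select * from file limit %d, %d", as.integer(skip), as.integer(n))
}

# Skip the first 11 million rows, then read the next 1 million.
s <- make_limit_sql(11000000, 1000000)
print(s)
```

The resulting string would then be passed as the sql argument, e.g. read.csv.sql("my.file.txt", sep="|", eol="\r\n", sql = s).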
On Sat, Oct 23, 2010 at 10:19 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Sat, Oct 23, 2010 at 10:07 AM, Dimitri Liakhovitski
> <dimitri.liakhovitski at gmail.com> wrote:
>> I just tried it:
>>
>> for(i in 11:16){ #i<-11
>> start<-Sys.time()
>> print(start)
>> flush.console()
>> filename<-paste("skipped millions- ",i,".txt",sep="")
>> mydata<-read.csv.sql("myfilel.txt", sep="|", eol="\r\n", sql =
>> "select * from file limit 1000000, (1000000*i-1)")
>
> The SQL statement does not know anything about R variables. You would
> need something like this:
>
> > i <- 1
> > s <- sprintf("select * from file limit 10, %d", 10*i-1)
> > s
> [1] "select * from file limit 10, 9"
> > read.csv.sql(..., sql = s, ...)
>
> Also if you just want to read it in as chunks reading from a
> connection in R would be sufficient:
>
> k <- 5000 # no of rows per chunk
> first <- TRUE
> con <- file('myfile.csv', "r")
> repeat {
>
> # skip header
> if (first) hdgs <- readLines(con, 1)
> first <- FALSE
>
> x <- readLines(con, k)
> if (length(x) == 0) break
> DF <- read.csv(textConnection(x), header = FALSE)
>
> # process chunk -- we just print last row here
> print(tail(DF, 1))
>
> }
> close(con)
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
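
Putting the two ideas from this thread together (skipping rows and reading from a connection), here is a minimal, self-contained sketch that discards the first rows in small blocks and then parses only the rows that are wanted. The temporary file, the column names id and value, and the values of skip, k and the block size of 20 are all made up for the demo; for a real file they would need adjusting.

```r
# Self-contained demo: write a small pipe-delimited file, then skip the
# first `skip` data rows and read the next `k` rows from a connection.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id|value", paste(1:100, (1:100) * 2, sep = "|")), tmp)

skip <- 90   # data rows to skip (assumed value for the demo)
k    <- 5    # rows to read after skipping (assumed value)

con <- file(tmp, "r")
hdgs <- strsplit(readLines(con, 1), "|", fixed = TRUE)[[1]]  # header line

# Discard the skipped rows in blocks so they are never all in memory at once
remaining <- skip
while (remaining > 0) {
  got <- length(readLines(con, min(remaining, 20)))
  if (got == 0) break            # file ended before `skip` rows were seen
  remaining <- remaining - got
}

x <- readLines(con, k)           # the rows we actually want
close(con)

DF <- read.table(text = x, sep = "|", header = FALSE, col.names = hdgs)
print(DF)
```

Because the skipped lines are read and dropped without being parsed into a data frame, memory use stays bounded by the block size rather than by the number of rows skipped.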
--
Dimitri Liakhovitski
Ninah Consulting
www.ninah.com