[R] How long does skipping in read.table take

Sat Oct 23 00:41:35 CEST 2010

----------------------------------------
> From: ggrothendieck at gmail.com
> Date: Fri, 22 Oct 2010 18:28:14 -0400
> To: dimitri.liakhovitski at gmail.com
> CC: r-help at r-project.org
> Subject: Re: [R] How long does skipping in read.table take
>
> On Fri, Oct 22, 2010 at 5:17 PM, Dimitri Liakhovitski wrote:
> > I know I could figure it out empirically - but maybe based on your
> > experience you can tell me if it's doable in a reasonable amount of
> > time:
> > I have a table (in .txt) with 17,000,000 rows (and 30 columns).
> > I can't read it all in (there are many strings). So I thought I could
> > read it in parts (e.g., 1 million rows at a time) using nrows= and skip.
> > I was able to read in the first 1,000,000 rows no problem in 45 sec.
> > But then I tried to skip 16,999,999 rows and then read in things. Then
> > R crashed. Should I try again - or is it too many rows to skip for R?
> >
>
> You could try read.csv.sql in sqldf.
>
> library(sqldf)
> read.csv.sql("myfile.csv", skip = 1000, header = FALSE)
> or
> read.csv.sql("myfile.csv", sql = "select * from file limit 2000 offset 1000")
>
> The first skips the first 1000 lines including the header and the
> second one skips 1000 rows (but still reads in the header) and then
> reads 2000 rows. You may or may not need to specify other arguments
> as well. For example, you may need to specify eol = "\n" or other
> depending on your line endings.
>
> Unlike read.csv, read.csv.sql reads the data directly into an sqlite
> database (which it creates on the fly for you). The data does not go
> through R during this operation. From there it reads only the data
> you ask for into R so R never sees the skipped over data. After all
> that it automatically deletes the database.

The first time I saw this suggested I thought I would wait to
reply, because it seemed a bit of an odd suggestion and I thought
I was missing some R-speak and a reply would waste everyone's time. However,
I still don't see what I'm missing here. A database is generally a big table
of data with various indices and locks that facilitate concurrent updates and
responses to arbitrary queries. This is fine for hotel reservation systems
where you need "ACID" guarantees, but it makes little sense with constant
data that will be accessed sequentially. A fast DB could take milliseconds to
respond, whereas an anticipatory streaming system could always have data ready
in nanoseconds.
Is this thing really acting as a "DB" or is there something more to it?
Is there no well-buffered streaming system for data you will use in order?

It sounds like you are just building indices and then deleting them
but never really using random access. Is there not a better way?
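
One way to get this kind of sequential access in base R is to keep a single
file connection open and call read.table on it repeatedly: each call continues
from where the previous one stopped, so earlier rows are never skipped over
again. What follows is only a minimal sketch, assuming a tab-separated file
with a header line. The names read_in_chunks, chunk_rows, and process_chunk
are hypothetical, and column types are inferred separately for each chunk.

```r
# Sketch: stream a large delimited text file in consecutive chunks through one
# open connection. read_in_chunks, chunk_rows and process_chunk are
# hypothetical names; a tab-separated file with a header line is assumed.
read_in_chunks <- function(path, chunk_rows, process_chunk, sep = "\t") {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Consume the header line once and keep the column names.
  col_names <- scan(con, what = "", sep = sep, nlines = 1, quiet = TRUE)

  total <- 0
  repeat {
    chunk <- tryCatch(
      read.table(con, nrows = chunk_rows, sep = sep, header = FALSE,
                 col.names = col_names, stringsAsFactors = FALSE),
      error = function(e) NULL  # treat an error at end of input as "done"
    )
    if (is.null(chunk) || nrow(chunk) == 0) break

    process_chunk(chunk)        # caller decides what to keep from each chunk
    total <- total + nrow(chunk)

    # A short chunk means the end of the file has been reached.
    if (nrow(chunk) < chunk_rows) break
  }
  invisible(total)
}
```

A call such as read_in_chunks("myfile.txt", 1e6, function(chunk) print(nrow(chunk)))
would then walk the whole file one million rows at a time, holding only one
chunk in memory at once.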

Thanks

>

> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>

Mike Marchywka | V.P. Technology

415-264-8477
marchywka at phluant.com

Online Advertising and Analytics for Mobile
http://www.phluant.com