[R] read.table() 1Gb text dataframe

Sat Sep 20 20:03:19 CEST 2014

On Fri, Sep 19, 2014 at 10:07 AM, Stephen HK Wong <honkit at stanford.edu> wrote:
> Thanks Henrik. It seems to fit my needs. One question about the argument length.out=0.10*n: does it take out 10% of the rows "randomly"? I found that it basically takes every 10th row if I put length.out=0.1*n, and every 100th row if I put length.out=0.01*n, all the way to the end. I couldn't find this information in the documentation.

If you look at the call, argument 'rows' is just an integer (index)
vector that specifies which rows to read.  I used

  seq(from=1, to=n, length.out=0.10*n)

as an illustration.  It generates evenly spaced values, which is why
you see every 10th row; see ?seq for how that works.  If you want a
random sample, I recommend using sample() to generate that index
vector.
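
A minimal sketch of the difference between the two kinds of index
vector, in base R only.  The row count 'n' below is a hypothetical
stand-in (in practice it would come from nbrOfRows(db)), and sorting the
random indices is just a choice made for this illustration:

```r
n <- 1000L                 # hypothetical number of rows; use nbrOfRows(db) in practice
k <- round(0.10 * n)       # target size: 10% of the rows

# Evenly spaced indices: roughly every 10th row, not random.
# seq() can return non-integer values here, hence round() and unique().
rows_even <- unique(round(seq(from = 1, to = n, length.out = k)))

# Random 10% sample of the row indices, drawn without replacement.
set.seed(42)               # only to make the illustration reproducible
rows_rand <- sort(sample.int(n, size = k))

# Either vector can then be passed on as the 'rows' argument, e.g.
#   data <- readDataFrame(db, rows = rows_rand)
```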

If you're going to read the same data file many times, I also
recommend looking into what Greg suggested, particularly 'sqldf',
which does not take much to learn.
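
As a rough sketch of the sqldf route (not necessarily what Greg had in
mind): read.csv.sql() filters the file through SQLite before the data
reaches R.  The toy file, its column names and the 'subset10' variable
are made up for illustration, and the random() % 10 = 0 filter keeps
about, not exactly, 10% of the rows:

```r
if (requireNamespace("sqldf", quietly = TRUE)) {
  # Hypothetical stand-in for the real tab-delimited file
  pathname <- tempfile(fileext = ".txt")
  toy <- data.frame(id = sprintf("r%04d", 1:1000), x = 1:1000,
                    y = 1000:1, outcome = (1:1000) / 10)
  write.table(toy, file = pathname, sep = "\t",
              quote = FALSE, row.names = FALSE)

  # Inside the SQL statement the input is referred to as 'file'.
  # SQLite's random() returns an integer, so each row is kept with
  # probability of about 1/10.
  subset10 <- sqldf::read.csv.sql(
    pathname,
    sql = "select * from file where random() % 10 = 0",
    header = TRUE, sep = "\t"
  )
  str(subset10)
}
```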

/Henrik

>
> Stephen HK Wong
> Stanford, California 94305-5324
>
> ----- Original Message -----
> From: Henrik Bengtsson <hb at biostat.ucsf.edu>
> To: Stephen HK Wong <honkit at stanford.edu>
> Cc: r-help at r-project.org
> Sent: Thu, 18 Sep 2014 18:33:15 -0700 (PDT)
> Subject: Re: [R] read.table() 1Gb text dataframe
>
> As a start, make sure you specify the 'colClasses' argument.  BTW,
> using that you can even go to the extreme and read one column at a
> time, if it comes down to that.
>
> To read a 10% subset of the rows, you can use R.filesets as:
>
> library(R.filesets)
> db <- TabularTextFile(pathname)
> n <- nbrOfRows(db)
> data <- readDataFrame(db, rows=seq(from=1, to=n, length.out=0.10*n))
>
> It is also useful to specify 'colClasses' here. In addition to
> specifying them ordered by column, as for read.table(), you can also
> specify them by column names (or regular expressions of the column
> names), e.g.
>
> data <- readDataFrame(db, colClasses=c("*"="NULL", "(x|y)"="integer",
> outcome="numeric", "id"="character"), rows=seq(from=1, to=n,
> length.out=0.10*n))
>
> That 'colClasses' specifies that the default is to drop all columns,
> read columns 'x' and 'y' as integers, and so on.
>
> BTW, if you know 'n' upfront you can skip the setup of TabularTextFile
> and just do:
>
> data <- readDataFrame(pathname, rows=seq(from=1, to=n, length.out=0.10*n))
>
>
> Hope this helps
>
> Henrik
>
> On Thu, Sep 18, 2014 at 4:48 PM, Stephen HK Wong <honkit at stanford.edu> wrote:
>> Dear All,
>>
>> I have a tab-delimited table of 4 columns and many millions of rows. I don't have enough memory to read.table() that 1 Gb file, and I actually have 12 text files like that. Is there a way that I can just randomly read.table() 10% of the rows? I was able to do that using the colbycol package, but it is no longer available. Many thanks!!
>>
>>
>>
>> Stephen HK Wong
>> Stanford, California 94305-5324