[R] Big Data reading subsample csv

jim holtman jholtman at gmail.com
Thu Aug 16 18:12:43 CEST 2012


Why not put this into a database? Then you can easily extract the
records you want by specifying the record numbers.  You pay the one-time
expense of creating the database, but then have much faster access to
the data as you make subsequent runs.
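
A minimal sketch of the database idea, assuming the DBI and RSQLite
packages and a hypothetical column obs_id that identifies which rows
belong to the same observation (neither is named in the original post).
The toy file, table name, and chunk size are placeholders. As far as I
know, a default SQLite build caps a table at 2000 columns, so a
3000-column file would likely need to be split across tables or loaded
into a different database; the pattern stays the same.

```r
library(DBI)      # generic database interface
library(RSQLite)  # SQLite backend (an assumed choice; any DBI backend would do)

# Toy stand-in for the large file; obs_id (hypothetical) groups the rows
# that make up one observation, and observations differ in length.
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(obs_id = rep(1:5, times = c(2, 3, 1, 4, 2)),
                  x = letters[1:12],
                  y = 1:12)
write.csv(toy, csv_path, row.names = FALSE)

con <- dbConnect(SQLite(), tempfile(fileext = ".sqlite"))

# One-time cost: stream the CSV into the database in chunks so the whole
# file never has to be in memory at once.
chunk_size <- 5
input <- file(csv_path, open = "r")
# Naive header parsing: assumes no commas inside column names.
header <- gsub('"', "", strsplit(readLines(input, n = 1), ",")[[1]])
repeat {
  lines <- readLines(input, n = chunk_size)
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                    stringsAsFactors = FALSE)
  dbWriteTable(con, "records", chunk, append = TRUE)
}
close(input)

# An index makes the per-observation lookups below fast.
dbExecute(con, "CREATE INDEX idx_obs ON records (obs_id)")

# Subsequent runs: draw observation ids with replacement (a bootstrap
# draw) and fetch every row of each drawn observation.
set.seed(1)
all_ids <- dbGetQuery(con, "SELECT DISTINCT obs_id FROM records")$obs_id
picked <- sample(all_ids, size = 3, replace = TRUE)

fetch_obs <- function(id) {
  dbGetQuery(con, "SELECT * FROM records WHERE obs_id = ?",
             params = list(id))
}
subsample <- do.call(rbind, lapply(picked, fetch_obs))

dbDisconnect(con)
```

Sampling on obs_id rather than on raw record numbers keeps each drawn
observation complete; an id drawn twice contributes its rows twice.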

On Thu, Aug 16, 2012 at 9:44 AM, Tudor Medallion
<tudormedallion at googlemail.com> wrote:
> Hello,
>
> I'm most grateful for your time to read this.
>
> I have a very large 30GB file of 6 million records and 3000 (mostly
> categorical) columns in csv format. I want to bootstrap subsamples for
> multinomial regression, but it's proving difficult: even with 64GB of RAM
> in my machine and a swap file twice that size, the process becomes very
> slow and halts.
>
> I'm thinking about generating subsample indices in R and feeding them into
> a system command using sed or awk, but I don't know how to do this. If
> someone knew of a clean way to do this using just R commands, I would be
> really grateful.
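
One possible R-only route for the question above, sketched with base R
alone: stream the file through a connection and keep just the sampled
line numbers. The helper name read_rows and the toy file are
hypothetical, and the sketch assumes no quoted field contains an
embedded newline, so that one line is one record.

```r
# Hypothetical helper: return the data rows whose 1-based row numbers
# (not counting the header) appear in `rows`, reading `chunk_size`
# lines at a time instead of loading the whole file.
read_rows <- function(path, rows, chunk_size = 10000) {
  input <- file(path, open = "r")
  on.exit(close(input))
  header <- readLines(input, n = 1)
  kept <- character(0)
  offset <- 0  # number of data lines consumed so far
  repeat {
    lines <- readLines(input, n = chunk_size)
    if (length(lines) == 0) break
    # Row numbers that fall inside this chunk, shifted to chunk-local
    # positions; repeated row numbers are kept repeatedly.
    idx <- rows[rows > offset & rows <= offset + length(lines)] - offset
    kept <- c(kept, lines[idx])
    offset <- offset + length(lines)
  }
  read.csv(text = c(header, kept), stringsAsFactors = FALSE)
}

# Toy usage: pick 5 of 20 rows.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:20, value = (1:20)^2), path, row.names = FALSE)
set.seed(42)
rows <- sort(sample(20, size = 5))
sub <- read_rows(path, rows, chunk_size = 6)
```

This still scans the whole file on every call, so it trades speed for
simplicity; the repeated scans are the cost that a one-time database
load avoids.
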
>
> One problem is that I need to pick complete observations for each
> subsample; that is, I need all the rows of a particular multinomial
> observation, and the number of rows differs from observation to
> observation. I plan to use glmnet and then some fancy transforms to get an
> approximation to the multinomial case. One other point is that I don't know
> how to choose a sample size that fits within memory limits.
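
A rough way to approach the sample-size question, as a sketch only:
measure how much memory a small probe of rows occupies once loaded,
then divide a memory budget by the per-row cost. The helper name
rows_that_fit, the budget, and the toy file are all hypothetical.
object.size() is approximate, and model fitting usually needs several
times the size of the data itself, so the budget should be well below
the physical RAM.

```r
# Hypothetical helper: estimate how many rows fit in `budget_bytes` by
# measuring the in-memory size of the first `probe_rows` rows.
rows_that_fit <- function(path, budget_bytes, probe_rows = 1000) {
  probe <- read.csv(path, nrows = probe_rows, stringsAsFactors = FALSE)
  bytes_per_row <- as.numeric(object.size(probe)) / nrow(probe)
  floor(budget_bytes / bytes_per_row)
}

# Toy usage with a small file and a 1 MB budget.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:500,
                     g = sample(letters, 500, replace = TRUE)),
          path, row.names = FALSE)
n_rows <- rows_that_fit(path, budget_bytes = 1e6, probe_rows = 200)
```

Because observations span several rows, the row count would still need
converting into a number of observations, for example by dividing by
the average rows per observation.
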
>
> Appreciate your thoughts greatly.
>
>
>> R.version
>
> platform       x86_64-pc-linux-gnu
> arch           x86_64
> os             linux-gnu
> system         x86_64, linux-gnu
> status
> major          2
> minor          15.1
> year           2012
> month          06
> day            22
> svn rev        59600
> language       R
> version.string R version 2.15.1 (2012-06-22)
> nickname       Roasted Marshmallows
>
>
> tags: read.csv(), system(), awk, sed, sample(), glmnet, multinomial, MASS.
>
> Yoda
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


