[R] Importing random subsets of a data file

Greg Snow 538280 at gmail.com
Wed Jul 23 18:56:37 CEST 2014


For speed your best choice is probably to load your data into a
database, then pull your samples from the database.  A simple database
is SQLite and there are R packages that work directly with that
database.
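
As a rough sketch of the loading step (not a tested recipe for your file), something like the following reads a CSV in chunks and appends each chunk to an SQLite table, so the whole file never has to sit in memory at once.  It assumes the DBI and RSQLite packages are installed; the table name "bigdata", the chunk size, and the small temporary CSV standing in for the real file are all made up for illustration.

```r
library(DBI)   # assumes the DBI and RSQLite packages are installed

# Hypothetical stand-in for the large file: a small CSV written to a temp path.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, x = rnorm(1000), y = runif(1000)),
          csv_path, row.names = FALSE)

db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_path)

chunk_size <- 250                      # rows read per pass; tune to available memory
input <- file(csv_path, open = "r")
header <- readLines(input, n = 1)      # keep the header line to re-attach to each chunk
repeat {
  lines <- readLines(input, n = chunk_size)
  if (length(lines) == 0) break        # end of file reached
  chunk <- read.csv(text = c(header, lines))
  dbWriteTable(con, "bigdata", chunk, append = TRUE)  # creates the table on first pass
}
close(input)

n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM bigdata")$n
dbDisconnect(con)
```

The load is a one-time cost; after that every sample is a query against the database file rather than a fresh read of the CSV.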

Can the later samples contain some of the same rows as previous
samples?  Or once a row is used in a sample, can it never be used
again in a later sample?  If the former, you could use R to choose a
sample of "row numbers", then ask the database for those rows (some
databases have the concept of rows built in, others would need a
sequential column of "row numbers" added), then repeat for each
sample.  If the latter, then you could add a column to the database
based on randomly generated numbers and create an index (sort) by that
column, then select the 1st n observations as the 1st sample, the next
n observations as the 2nd sample, etc.
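
Both cases could look roughly like the sketch below.  It again assumes DBI and RSQLite, and uses a toy in-memory table in place of the real data; the names "bigdata", "row_num", "rand_key" and the two helper functions are hypothetical, and the random key is filled with SQLite's random() function as one possible way of generating the numbers.

```r
library(DBI)   # assumes the DBI and RSQLite packages are installed

set.seed(1)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Toy table standing in for the real data; row_num is the sequential column.
dbWriteTable(con, "bigdata",
             data.frame(row_num = 1:1000, x = rnorm(1000)))

n <- 50   # sample size (illustrative)

# Case 1: rows may reappear in later samples.
# R picks the row numbers, the database returns just those rows.
sample_with_reuse <- function(con, n, total) {
  ids <- sample.int(total, n)          # n distinct row numbers
  sql <- sprintf("SELECT * FROM bigdata WHERE row_num IN (%s)",
                 paste(ids, collapse = ","))
  dbGetQuery(con, sql)
}
s1 <- sample_with_reuse(con, n, 1000)

# Case 2: each row is used at most once.
# Add a random key, index it, then take consecutive blocks of n rows.
dbExecute(con, "ALTER TABLE bigdata ADD COLUMN rand_key REAL")
dbExecute(con, "UPDATE bigdata SET rand_key = random()")
dbExecute(con, "CREATE INDEX idx_rand ON bigdata (rand_key)")

sample_disjoint <- function(con, n, k) {
  # k-th block of n rows in random-key order; row_num breaks any ties
  sql <- sprintf(
    "SELECT * FROM bigdata ORDER BY rand_key, row_num LIMIT %d OFFSET %d",
    as.integer(n), as.integer((k - 1) * n))
  dbGetQuery(con, sql)
}
d1 <- sample_disjoint(con, n, 1)
d2 <- sample_disjoint(con, n, 2)

dbDisconnect(con)
```

Note that in the second case the number of samples times n cannot exceed the number of rows in the table.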

On Wed, Jul 23, 2014 at 9:33 AM, Khurram Nadeem <khurram.nadee at gmail.com> wrote:
> Hi R folks,
>
> Here is my problem.
>
> *1.* I have a large data file (say, in .csv or .txt format) containing 1
> million rows and 500 variables (columns).
>
> *2.* My statistical algorithm does not require the entire dataset but just
> a small random sample from the original 1 million rows.
>
> *3.* This algorithm needs to be applied 10000 times, each time generating a
> different random sample from the 'big' file as described in (2) above.
>
> Is there a way to 'import' only a (random) subset of rows from the .csv
> file without importing the entire dataset? A quick search on various R
> forums suggests that read.table() does not have this functionality.
> Obviously, I want to avoid importing the whole file because of memory
> issues. Looking forward to your help.
>
> Thanks,
> Khurram
> ------------------------
>  Khurram Nadeem
>  Postdoctoral Research Fellow
>  Department of Mathematics & Statistics
>  Acadia University, NS, Canada.
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com


