[R] Suggestion for big files [was: Re: A comment about R:]

Fri Jan 6 00:11:55 CET 2006

Ronggui,

I'm not familiar with SQLite, but using MySQL would solve your problem.

MySQL has a "LOAD DATA INFILE" statement that loads text/csv files rapidly.

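In R, assuming the target table exists in MySQL (an empty table is fine; it is 
called test3 below), something like this would load the data directly into MySQL.

library(DBI)
library(RMySQL)
# dbname, user and password are placeholders for your own settings
mycon <- dbConnect(MySQL(), dbname = "test", user = "user", password = "password")
# for csv files
dbSendQuery(mycon, "LOAD DATA INFILE 'C:/textfile.csv'
INTO TABLE test3 FIELDS TERMINATED BY ','")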

After that, an ordinary SQL query lets you pull a manageable subset of the 
data into R.
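
As a rough sketch of that last step: the query can be assembled in R and sent through the same connection. The column names and the filter are made up for illustration, and make_select is just a helper name I picked; only the table name test3 comes from the example above.

```r
# Build a SELECT that pulls only a few columns of a wide table.
# Column names and the WHERE clause are hypothetical placeholders.
make_select <- function(table, cols, where = NULL) {
  sql <- paste("SELECT", paste(cols, collapse = ", "), "FROM", table)
  if (!is.null(where)) sql <- paste(sql, "WHERE", where)
  sql
}

sql <- make_select("test3", c("v1", "v2", "v10"), where = "v2 > 0")
print(sql)

# With an open DBI connection (mycon, as above) the subset would come back
# as a data frame:
# dat <- dbGetQuery(mycon, sql)
```

Only the selected columns and rows are transferred, so the object that ends up in R is far smaller than the full table.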

>From: bogdan romocea <br44114 at gmail.com>
>To: ronggui.huang at gmail.com
>CC: r-help <R-help at stat.math.ethz.ch>
>Subject: Re: [R] Suggestion for big files [was: Re: A comment about R:]
>Date: Thu, 5 Jan 2006 15:26:51 -0500
>
>ronggui wrote:
> > If I were familiar with
> > database software, using a database (and R) would be the best choice, but
> > converting the file into database format is not an easy job for me.
>
>Good working knowledge of a DBMS is invaluable when it comes to
>working with very large data sets. In addition, learning SQL is a piece
>of cake compared to learning R. On top of that, knowledge of another
>(SQL) scripting language is not needed (except perhaps for special
>tasks): you can easily use R to generate the SQL syntax to import and
>work with arbitrarily wide tables. (I'm not familiar with SQLite, but
>MySQL comes with a command line tool that can run syntax files.)
>Better start learning SQL today.
>
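For what it's worth, generating the SQL from R could look roughly like this. It is a sketch only: the table name, the three-column header and the blanket DOUBLE type are assumptions for illustration, and a real file would need its own column types.

```r
# Generate a CREATE TABLE statement from a CSV header line.
# Every column is declared DOUBLE here, which is an assumption;
# character columns would need VARCHAR or similar.
make_create_table <- function(header_line, table, type = "DOUBLE") {
  cols <- strsplit(header_line, ",", fixed = TRUE)[[1]]
  defs <- paste(cols, type)
  paste0("CREATE TABLE ", table, " (", paste(defs, collapse = ", "), ");")
}

# In practice the header would come from readLines(<csv file>, n = 1)
header <- "v1,v2,v3"
sql <- make_create_table(header, "wvs")
cat(sql, "\n")

# writeLines(sql, "create_wvs.sql") would save it as a syntax file
# for the database's command line tool.
```

The same function works unchanged for a header with 807 names, which is the point of generating the statement rather than typing it.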
>
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch
> > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of ronggui
> > Sent: Thursday, January 05, 2006 12:48 PM
> > To: jim holtman
> > Cc: r-help at stat.math.ethz.ch
> > Subject: Re: [R] Suggestion for big files [was: Re: A comment
> > about R:]
> >
> >
> > 2006/1/6, jim holtman <jholtman at gmail.com>:
> > > If what you are reading in is numeric data, then it would require
> > > (807 * 118519 * 8) 760MB just to store a single copy of the object --
> > > more memory than you have on your computer.  If you were reading it in,
> > > then the problem is the paging that was occurring.
> > In fact, if I read it in 3 pieces, each is about 170M.
> >
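The estimate above is easy to redo in R, and the same arithmetic shows what a subset would cost. The 50 columns below are a made-up figure, not something taken from the data set, and 8 bytes per value assumes everything is stored as double.

```r
n_vars <- 807
n_obs  <- 118519
bytes_per_double <- 8

# One full copy of the data as doubles
full_bytes <- n_vars * n_obs * bytes_per_double
full_bytes / 1e6      # size in MB (10^6 bytes)
full_bytes / 2^20     # size in MiB (2^20 bytes)

# Hypothetical subset: keep only 50 of the 807 variables
sub_bytes <- 50 * n_obs * bytes_per_double
sub_bytes / 2^20
```

R often needs more than one copy of an object while reading and converting it, so the real requirement is higher than a single copy suggests.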
> > >
> > > You have to look at storing this in a database and working on a subset
> > > of the data.  Do you really need to have all 807 variables in memory at
> > > the same time?
> >
> > Yep, I don't need all the variables. But I don't know how to get the
> > necessary variables into R.
> >
> > In the end I read the data in pieces and used the RSQLite package to write
> > it to a database, and then did the analysis. If I were familiar with
> > database software, using a database (and R) would be the best choice, but
> > converting the file into database format is not an easy job for me. I asked
> > for help on the SQLite list, but the solution was not satisfying because it
> > required knowledge of a third scripting language. After searching
> > the internet, I arrived at this solution:
> >
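> > #begin
> > rm(list=ls())
> > library(RSQLite)
> > # forward slashes: "\w" and "\s" are not valid escapes in R strings
> > f <- file("D:/wvsevs_sb_v4.csv", "r")
> > con <- dbConnect(SQLite(), dbname = "c:/sqlite/database.db3")
> > tim1 <- Sys.time()
> > i <- 0
> > done <- FALSE
> >
> > while (!done) {
> >   i <- i + 1
> >   tt <- readLines(f, 2500)
> >   if (length(tt) < 2500) done <- TRUE
> >   if (length(tt) == 0) break   # line count was an exact multiple of 2500
> >   tt <- textConnection(tt)
> >   # only the first chunk contains the header line
> >   dat <- read.table(tt, header = (i == 1), sep = ",", quote = "")
> >   close(tt)
> >   # later chunks have no header, so reuse the names from the first one
> >   if (i == 1) vars <- names(dat) else names(dat) <- vars
> >   # the first call creates the table, later calls append to it
> >   dbWriteTable(con, "wvs", dat, append = dbExistsTable(con, "wvs"))
> > }
> > close(f)
> > dbDisconnect(con)
> > #end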
> > It's not the best solution, but it works.
> >
> >
> >
> > > If you use 'scan', you could specify that you do not want some of the
> > > variables read in, so it might make a more reasonably sized object.
> > >
> > >
> > > On 1/5/06, François Pinard <pinard at iro.umontreal.ca> wrote:
> > > > [ronggui]
> > > >
> > > > >R is weak when handling large data files.  I have a data file: 807 vars,
> > > > >118519 obs., in CSV format.  Stata can read it in 2 minutes, but on
> > > > >my PC R can hardly handle it. My PC's CPU is 1.7G; RAM is 512M.
> > > >
> > > > Just (another) thought.  I used to use SPSS, many, many years ago, on
> > > > CDC machines, where the CPU had limited memory and no kind of paging
> > > > architecture.  Files did not need to be very large for being too large.
> > > >
> > > > SPSS had a feature that was then useful, about the capability of
> > > > sampling a big dataset directly at file read time, quite before
> > > > processing starts.  Maybe something similar could help in R (that is,
> > > > instead of reading the whole data in memory, _then_ sampling it.)
> > > >
> > > > One can read records from a file, up to a preset amount of them.  If
> > > > the file happens to contain more records than that preset number (the
> > > > number of records in the whole file is not known beforehand), already
> > > > read records may be dropped at random and replaced by other records
> > > > coming from the file being read.  If the random selection algorithm is
> > > > properly chosen, it can be made so that all records in the original
> > > > file have equal probability of being kept in the final subset.
> > > >
> > > > If such a sampling facility was built right within usual R reading
> > > > routines (triggered by an extra argument, say), it could offer
> > > > a compromise for processing large files, and also sometimes accelerate
> > > > computations for big problems, even when memory is not at stake.
> > > >
> > > > --
> > > > François Pinard   http://pinard.progiciels-bpi.ca
> > > >
> > > > ______________________________________________
> > > > R-help at stat.math.ethz.ch mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide!
> > > > http://www.R-project.org/posting-guide.html
> > > >
> > >
> > >
> > >
> > > --
> > > Jim Holtman
> > > Cincinnati, OH
> > > +1 513 247 0281
> > >
> > > What is the problem you are trying to solve?
> >
> >
> > --
> > 黄荣贵
> > Department of Sociology
> > Fudan University