[R] Handling large dataset & dataframe
Liaw, Andy
andy_liaw at merck.com
Tue Apr 25 22:43:28 CEST 2006
Much easier to use colClasses in read.table, and in many cases just as fast
(or even faster).
Andy
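
A minimal sketch of the colClasses approach (the file name and column layout
below are assumed for illustration: 250 integer dummy columns followed by 16
numeric columns):

  ## Declaring every column's class up front lets read.table() skip its
  ## type-guessing pass, which saves both time and memory on a large file.
  coltypes <- c(rep("integer", 250), rep("numeric", 16))
  mydf <- read.table("bigdata.csv", header = TRUE, sep = ",",
                     colClasses = coltypes)
  object.size(mydf)   # size of the resulting data frame, in bytes
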
From: Mark Stephens
>
> From ?scan: "the *type* of what gives the type of data to be read".
> So: list(integer(), integer(), double(), raw(), ...). In your code all
> columns are being read as character, regardless of the contents of the
> character vector.
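>
> For example (column types illustrative only):
>
>   ## WRONG: the list elements are character strings, so every column is
>   ## read as character -- scan() looks only at the *type* of each element
>   what.wrong <- as.list(c("integer", "integer", "double"))
>
>   ## RIGHT: supply vectors of the desired types; their contents are ignored
>   what.right <- list(integer(), integer(), double())
>   x <- scan("data.csv", what = what.right, sep = ",", skip = 1, quiet = TRUE)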
>
> I have to admit that I have added the *'s in *type*. I have
> been caught out by this too. It's not the most convenient way
> to specify the types of a large number of columns either. As
> you have a lot of columns you might want to do something like
> this: as.list(rep(integer(1), 250)), assuming your dummies
> are together, to save typing. Also storage.mode() is useful
> for telling you the precise type (and therefore size) of an
> object, e.g. sapply(coltypes, storage.mode) gives the types
> scan() will actually use. Note that 'numeric' could be either
> 'double' or 'integer', which matters in your case for fitting
> inside the 1GB limit, because 'integer' (4 bytes) is half the
> size of 'double' (8 bytes).
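>
> Something along those lines, assuming the 250 dummies come first and the
> remaining 16 columns are doubles:
>
>   coltypes <- c(as.list(rep(integer(1), 250)),   # 250 integer dummy columns
>                 as.list(rep(double(1), 16)))     # 16 double columns
>   table(sapply(coltypes, storage.mode))          # the types scan() will use
>   ##  double integer
>   ##      16     250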
>
> Perhaps someone on r-devel could enhance the documentation to make
> "type" stand out in capitals and bold in help(scan)? Or maybe scan
> could be clever enough to accept a character vector 'what'. Or maybe
> I'm missing a good reason why this isn't possible - anyone? How about
> allowing a character vector of length one, with each character
> representing the type of that column, e.g. what="IIIIDDCD" would mean
> 4 integers followed by 2 doubles, followed by a character column,
> followed finally by a double column, 8 columns in total. Probably
> someone somewhere has done that already, but I'm not aware that anyone
> has wrapped it up conveniently.
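>
> A rough sketch of such a wrapper (purely illustrative -- this is not part
> of base R):
>
>   ## Hypothetical helper: "I" = integer, "D" = double, "C" = character
>   what.from.string <- function(spec) {
>     proto <- list(I = integer(), D = double(), C = character())
>     lapply(strsplit(spec, "")[[1]], function(ch) proto[[ch]])
>   }
>   str(what.from.string("IIIIDDCD"))
>   ## a list of 8: four integers, two doubles, one character, one double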
>
> On 25/04/06, Sachin J <sachinj.2006 at yahoo.com> wrote:
> >
> > Mark:
> >
> > Here is the information I didn't provide in my earlier post. The R
> > version is R 2.2.1 running on Windows XP. My dataset has 16 variables
> > with the following data types.
> > ColNumber: 1 2 3 .......16
> > Datatypes:
> > "numeric","numeric","numeric","numeric","numeric","numeric","character",
> > "numeric","numeric","character","character","numeric","numeric",
> > "numeric","numeric","numeric","numeric","numeric"
> >
> > Variable (2), which is numeric, and the variables denoted as character
> > are to be treated as dummy variables in the regression.
> >
> > A search of the R-help list suggested I can also use read.csv with the
> > colClasses option, instead of using scan() and then converting to a
> > data frame as you suggested. I am trying both these methods but am
> > unable to resolve a syntax error.
> >
> > >coltypes <- c("numeric","factor","numeric","numeric","numeric","numeric",
> >    "factor","numeric","numeric","factor","factor","numeric","numeric",
> >    "numeric","numeric","numeric","numeric","numeric")
> >
> > >mydf <- read.csv("C:/temp/data.csv", header=FALSE, colClasses=coltypes,
> >    strip.white=TRUE)
> >
> > ERROR: Error in scan(file = file, what = what, sep = sep, quote = quote,
> >   dec = dec, :
> >   scan() expected 'a real', got 'V1'
> >
> > No idea what the problem is.
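> > The "got 'V1'" part of that error hints that the first line of data.csv
> > holds column names (V1, V2, ...), which header=FALSE then tries to read
> > as numeric data. If that is the case, a likely fix is the same call with
> > header = TRUE:
> >
> >   mydf <- read.csv("C:/temp/data.csv", header = TRUE,
> >                    colClasses = coltypes, strip.white = TRUE)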
> >
> > AS PER YOUR SUGGESTION I TRIED scan() as follows:
> >
> > >coltypes <- c("numeric","factor","numeric","numeric","numeric","numeric",
> >    "factor","numeric","numeric","factor","factor","numeric","numeric",
> >    "numeric","numeric","numeric","numeric","numeric")
> > >x <- scan(file = "C:/temp/data.dbf", what = as.list(coltypes), sep = ",",
> >    quiet = TRUE, skip = 1)
> > >names(x) <- scan(file = "C:/temp/data.dbf", what = "", nlines = 1, sep = ",")
> > >x <- as.data.frame(x)
> >
> > This runs without error, but x has no data in it and contains:
> > > x
> >
> > [1]  X._.   NA.    NA..1  NA..2  NA..3  NA..4  NA..5  NA..6  NA..7  NA..8
> >      NA..9  NA..10 NA..11
> > [14] NA..12 NA..13 NA..14 NA..15 NA..16
> > <0 rows> (or 0-length row.names)
> >
> > Please let me know how to use scan() or the colClasses option properly.
> >
> > Sachin
> >
> >
> >
> >
> >
> > Mark Stephens <markjs1 at googlemail.com> wrote:
> >
> > Sachin,
> > With your dummies stored as integer, the size of your object would
> > appear to be 350000 * (4*250 + 8*16) bytes = 376MB. You said "PC" but
> > did not provide R version information, so assuming Windows then ...
> > With 1GB RAM you should be able to load a 376MB object into memory.
> > If you can store the dummies as 'raw' then the object size is only 126MB.
> > You don't say how you attempted to load the data. Assuming your input
> > data is in a text file (or can be), have you tried scan()? Set up the
> > 'what' argument with length 266 and make sure the dummy columns are set
> > to integer() or raw(). Then x = scan(...); class(x) = "data.frame".
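> > A rough sketch, with the file name and column order assumed (250 integer
> > dummy columns first, then 16 doubles, after a one-line header):
> >
> >   ## back-of-envelope object sizes for 350,000 rows
> >   350000 * (4*250 + 8*16) / 2^20   # ~376 MB with integer dummies
> >   350000 * (1*250 + 8*16) / 2^20   # ~126 MB with raw dummies
> >
> >   what <- c(as.list(rep(integer(1), 250)), as.list(rep(double(1), 16)))
> >   x <- scan("bigdata.csv", what = what, sep = ",", skip = 1, quiet = TRUE)
> >   names(x) <- scan("bigdata.csv", what = "", nlines = 1, sep = ",")  # header row
> >   x <- as.data.frame(x)   # safer than assigning class(x) <- "data.frame"
> >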
> > What is the result of memory.limit()? If it is 256MB or 512MB, then try
> > starting R with --max-mem-size=800M (I forget the syntax exactly). Leave
> > a bit of room below 1GB. Once the object is in memory R may need to copy
> > it once, or a few times. You may need to close all other apps in memory,
> > or send them to swap.
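> > For example (Windows-specific):
> >
> >   memory.limit()            # report the current limit, in MB
> >   memory.limit(size = 800)  # raise the limit to 800MB for this session
> >   ## or set it when starting R:  Rgui.exe --max-mem-size=800M
> >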
> > I don't really see why your data should not fit into the memory you
> > have. Purchasing an extra 1GB may help. Knowing the object size
> > calculation (as above) should help you gauge whether it is worth it.
> > Have you used a process monitor to see the memory growing as R loads
> > the data? This can be useful.
> > If all the above fails, then consider 64-bit and purchasing as much
> > memory as you can afford. R can use 64GB+ of RAM on 64-bit machines.
> > Maybe you can hire some time on a 64-bit server farm - I heard it's
> > quite cheap but have never tried it myself. You shouldn't need to go
> > that far with this data set though.
> > Hope this helps,
> > Mark
> >
> >
> > Hi Roger,
> >
> > I want to carry out regression analysis on this dataset. So I believe
> > I can't read the dataset in chunks. Any other solution?
> >
> > TIA
> > Sachin
> >
> >
> > Roger Koenker <rkoenker at uiuc.edu> wrote:
> > You can read chunks of it at a time and store it in sparse matrix form
> > using the packages SparseM or Matrix, but then you need to think about
> > what you want to do with it.... least squares sorts of things are ok,
> > but other options are somewhat limited...
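> >
> > A rough sketch of that idea (file name, chunk size, and column layout all
> > assumed; the response is taken to be column 1), accumulating the
> > crossproducts X'X and X'y chunk by chunk so the full design matrix never
> > has to sit in memory at once:
> >
> >   library(Matrix)   # sparse matrices, as suggested above
> >
> >   ## assumed layout: response in column 1, then 250 0/1 dummies,
> >   ## then 15 other numeric predictors (266 columns in total)
> >   coltypes <- c("numeric", rep("integer", 250), rep("numeric", 15))
> >   chunk.size <- 50000
> >   XtX <- 0
> >   Xty <- 0
> >   for (i in 0:6) {   # 7 chunks of 50,000 rows = 350,000 rows
> >     chunk <- read.csv("bigdata.csv", header = FALSE, colClasses = coltypes,
> >                       skip = 1 + i * chunk.size,   # also skips the header line
> >                       nrows = chunk.size)
> >     y <- chunk[[1]]
> >     X <- Matrix(cbind(1, as.matrix(chunk[-1])), sparse = TRUE)
> >     XtX <- XtX + crossprod(X)       # accumulate X'X
> >     Xty <- Xty + crossprod(X, y)    # accumulate X'y
> >     rm(chunk); gc()
> >   }
> >   beta <- solve(XtX, Xty)           # least-squares coefficients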
> >
> >
> > url: www.econ.uiuc.edu/~roger        Roger Koenker
> > email: rkoenker at uiuc.edu           Department of Economics
> > vox: 217-333-4558                    University of Illinois
> > fax: 217-244-6678                    Champaign, IL 61820
> >
> >
> > On Apr 24, 2006, at 12:41 PM, Sachin J wrote:
> >
> > > Hi,
> > >
> > > I have a dataset consisting of 350,000 rows and 266 columns. Out of
> > > the 266 columns, 250 are dummy variable columns. I am trying to read
> > > this data set into an R dataframe object but am unable to do it due
> > > to memory size limitations (the object created is too large to handle
> > > in R). Is there a way to handle such a large dataset in R?
> > >
> > > My PC has 1GB of RAM and 55GB of hard disk space, running Windows XP.
> > >
> > > Any pointers would be of great help.
> > >
> > > TIA
> > > Sachin
> > >
> >
More information about the R-help mailing list