[R] Handling large dataset & dataframe

Tue Apr 25 22:43:28 CEST 2006

It is much easier to use the colClasses argument of read.table(), and in many
cases it is just as fast as scan() (or even faster).
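
A minimal sketch of the colClasses approach, assuming a small comma-separated file with a header row. The file contents and the column names (y, region, d1, d2) are invented for illustration and are not the poster's data.

```r
# Hypothetical small file standing in for the real data; the column
# names and values below are invented for illustration only.
tmp <- tempfile(fileext = ".csv")
writeLines(c("y,region,d1,d2",
             "1.5,north,0,1",
             "2.25,south,1,0",
             "3.0,north,1,1"), tmp)

# One class per column, in file order. Declaring the dummies as
# "integer" rather than "numeric" stores them in 4 bytes instead of 8.
coltypes <- c("numeric", "factor", "integer", "integer")

# header = TRUE because the first line of this file holds column names.
mydf <- read.table(tmp, header = TRUE, sep = ",", colClasses = coltypes)

str(mydf)                    # structure of the resulting data frame
sapply(mydf, storage.mode)   # how each column is actually stored
unlink(tmp)
```

For the 266-column case in this thread, something like c(rep("numeric", 16), rep("integer", 250)) would build the class vector, assuming the 250 dummy columns come last in the file.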

Andy

From: Mark Stephens
> 
> From ?scan: "the *type* of what gives the type of data to be
> read". So use list(integer(), integer(), double(), raw(), ...).
> In your code all columns are being read as character, regardless
> of the contents of the character vector.
> 
> I have to admit that I added the *'s in *type*.  I have
> been caught out by this too.  It's not the most convenient way
> to specify the types of a large number of columns either.  As
> you have a lot of columns you might want to do something like
> this:  as.list(rep(integer(1),250)), assuming your dummies
> are together, to save typing.  Also storage.mode() is useful
> to tell you the precise type (and therefore size) of an
> object, e.g. sapply(coltypes, storage.mode) shows the types
> scan() will actually use.  Note that 'numeric' could be 'double'
> or 'integer'; the difference matters in your case for fitting
> inside the 1GB limit, because 'integer' (4 bytes) takes half
> the space of 'double' (8 bytes).
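
The point about types versus type names can be sketched as follows. The three-column file and its values are made up for the example; only scan(), storage.mode() and object.size() come from base R.

```r
# Hypothetical two-line data file with three columns (values invented).
tmp <- tempfile()
writeLines(c("1,0,2.5", "2,1,3.5"), tmp)

# 'what' must hold objects of the wanted types, not the names of types.
wrong <- as.list(c("integer", "integer", "double"))  # three character strings
right <- list(integer(), integer(), double())        # three typed vectors

sapply(wrong, storage.mode)   # every element is stored as character
sapply(right, storage.mode)   # integer, integer, double

# With the typed list, scan() returns one vector per column.
x <- scan(tmp, what = right, sep = ",", quiet = TRUE)
sapply(x, storage.mode)

# Integer vectors take roughly half the memory of double vectors.
object.size(integer(1e6))
object.size(double(1e6))
unlink(tmp)
```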
> 
> Perhaps someone on r-devel could enhance the documentation to
> make "type" stand out in bold capitals in help(scan)?  Or
> maybe scan could be clever enough to accept a character
> vector 'what'.  Or maybe I'm missing a good reason why this
> isn't possible - anyone? How about allowing a character
> vector of length one, with each character representing the type
> of that column, e.g. what="IIIIDDCD" would mean 4 integers
> followed by 2 doubles, followed by a character column,
> followed finally by a double column: 8 columns in total.
> Probably someone somewhere has done that already, but I'm not
> aware that anyone has wrapped it up conveniently.
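
One way the "IIIIDDCD" idea could be wrapped up is sketched below. The helper name what_from_codes and the letter codes I, D and C are hypothetical; nothing like this is claimed to exist in scan() itself.

```r
# Hypothetical helper: turn a string of one-letter codes into a list
# suitable for the 'what' argument of scan().
what_from_codes <- function(codes) {
  # Map each one-letter code to a zero-length vector of that type.
  types <- list(I = integer(), D = double(), C = character())
  letters_in <- strsplit(codes, "")[[1]]
  unknown <- setdiff(letters_in, names(types))
  if (length(unknown) > 0)
    stop("unknown type code(s): ", paste(unknown, collapse = ", "))
  # Index by name (repeats allowed), then drop the names.
  unname(types[letters_in])
}

what <- what_from_codes("IIIIDDCD")
length(what)                 # one element per column
sapply(what, storage.mode)   # the types scan() would use
```

The result could then be passed on as scan(file, what = what, sep = ","). A long run of identical columns would still be easier to write with rep(), so this is mainly a convenience for mixed layouts.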
> 
> On 25/04/06, Sachin J <sachinj.2006 at yahoo.com> wrote:
> >
> >  Mark:
> >
> > Here is the information I didn't provide in my earlier post. R
> > version is 2.2.1, running on Windows XP.  My dataset has 16
> > variables with the following data types.
> > ColNumber:   1              2              3  .......16
> > Datatypes:
> >
> > "numeric","numeric","numeric","numeric","numeric","numeric","character",
> > "numeric","numeric","character","character","numeric","numeric","numeric",
> > "numeric","numeric","numeric","numeric"
> >
> > Variable (2), which is numeric, and the variables denoted as
> > character are to be treated as dummy variables in the regression.
> >
> > A search of the R-help list suggested that I can also use read.csv
> > with the colClasses option, instead of using scan() and then
> > converting the result to a data frame as you suggested. I am trying
> > both methods but am unable to resolve a syntax error.
> >
> > >coltypes <- c("numeric","factor","numeric","numeric","numeric","numeric",
> >     "factor","numeric","numeric","factor","factor","numeric","numeric",
> >     "numeric","numeric","numeric","numeric","numeric")
> >
> > >mydf <- read.csv("C:/temp/data.csv", header=FALSE, colClasses = coltypes,
> >     strip.white=TRUE)
> >
> > ERROR: Error in scan(file = file, what = what, sep = sep, quote = 
> > quote, dec = dec,  :
> >         scan() expected 'a real', got 'V1'
> >
> > No idea what the problem is.
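
One plausible reading of this error is that the first line of the file holds column names (the later scan() call skips one line and reads names from it), so with header=FALSE the name 'V1' is parsed as data. The sketch below reproduces that situation on an invented three-column file; the file contents are assumptions, not the poster's data.

```r
# Hypothetical file whose first line is a row of column names.
tmp <- tempfile(fileext = ".csv")
writeLines(c("V1,V2,V3", "1.5,a,10", "2.5,b,20"), tmp)
coltypes <- c("numeric", "factor", "numeric")

# header = FALSE treats the name row as data, so "V1" cannot be
# parsed as a number and the read fails.
res_bad <- try(read.csv(tmp, header = FALSE, colClasses = coltypes),
               silent = TRUE)
inherits(res_bad, "try-error")

# header = TRUE consumes the first line as column names instead.
mydf <- read.csv(tmp, header = TRUE, colClasses = coltypes,
                 strip.white = TRUE)
str(mydf)
unlink(tmp)
```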
> >
> > AS PER YOUR SUGGESTION I TRIED scan() as follows:
> >
> > >coltypes <- c("numeric","factor","numeric","numeric","numeric","numeric",
> >     "factor","numeric","numeric","factor","factor","numeric","numeric",
> >     "numeric","numeric","numeric","numeric","numeric")
> > >x <- scan(file = "C:/temp/data.dbf", what=as.list(coltypes), sep=",",
> >     quiet=TRUE, skip=1)
> >
> > >names(x) <- scan(file = "C:/temp/data.dbf", what="", nlines=1, sep=",")
> > >x <- as.data.frame(x)
> >
> > This is working fine but x has no data in it and contains:
> > > x
> >
> >  [1] X._.   NA.    NA..1  NA..2  NA..3  NA..4  NA..5  NA..6  NA..7  NA..8  NA..9  NA..10 NA..11
> > [14] NA..12 NA..13 NA..14 NA..15 NA..16
> > <0 rows> (or 0-length row.names)
> >
> > Please let me know how to properly use scan or colClasses option.
> >
> > Sachin
> >
> >
> >
> >
> >
> > *Mark Stephens <markjs1 at googlemail.com>* wrote:
> >
> > Sachin,
> > With your dummies stored as integer, the size of your object would
> > appear to be 350000 * (4*250 + 8*16) bytes = 376MB. You said "PC" but
> > did not provide R version information, so I am assuming Windows.
> > With 1GB RAM you should be able to load a 376MB object into memory.
> > If you can store the dummies as 'raw' then the object size is only
> > 126MB.
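
The size estimate above can be reproduced with a few lines of arithmetic. The variable names are illustrative, and the all-double case is added only for comparison; it is not a figure from the thread.

```r
rows    <- 350000
n_dummy <- 250   # dummy variable columns
n_other <- 16    # remaining columns, stored as double

bytes_int <- rows * (4 * n_dummy + 8 * n_other)  # dummies as integer (4 bytes)
bytes_raw <- rows * (1 * n_dummy + 8 * n_other)  # dummies as raw (1 byte)
bytes_dbl <- rows * 8 * (n_dummy + n_other)      # everything as double (8 bytes)

# Convert to megabytes of 1024^2 bytes, truncated to whole megabytes.
mb <- floor(c(integer = bytes_int, raw = bytes_raw, double = bytes_dbl) / 1024^2)
mb
```

This counts only the column data and ignores the small per-object overhead, and it does not account for any temporary copies R makes while reading.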
> >
> > You don't say how you attempted to load the data. Assuming your
> > input data is in a text file (or can be), have you tried scan()?
> > Set up the 'what' argument with length 266 and make sure the dummy
> > columns are set to integer() or raw(). Then x = scan(...);
> > class(x) = "data.frame".
> >
> > What is the result of memory.limit()? If it is 256MB or 512MB, then
> > try starting R with --max-mem-size=800M (I forget the syntax
> > exactly). Leave a bit of room below 1GB. Once the object is in
> > memory R may need to copy it once, or a few times. You may need to
> > close all other apps in memory, or send them to swap.
> >
> > I don't really see why your data should not fit into the memory you
> > have. Purchasing an extra 1GB may help. Knowing the object size
> > calculation (as above) should help you gauge whether it is worth it.
> > Have you used a process monitor to watch the memory grow as R loads
> > the data? This can be useful.
> >
> > If all the above fails, then consider 64-bit and purchasing as much
> > memory as you can afford. R can use more than 64GB of RAM on 64-bit
> > machines. Maybe you can hire some time on a 64-bit server farm - I
> > heard it's quite cheap but have never tried it myself. You shouldn't
> > need to go that far with this data set though.
> > Hope this helps,
> > Mark
> >
> >
> > Hi Roger,
> >
> > I want to carry out regression analysis on this dataset, so I
> > believe I can't read the dataset in chunks. Any other solution?
> >
> > TIA
> > Sachin
> >
> >
> > roger koenker < rkoenker at uiuc.edu> wrote:
> > You can read chunks of it at a time and store it in sparse matrix
> > form using the packages SparseM or Matrix, but then you need to
> > think about what you want to do with it.... least squares sorts of
> > things are ok, but other options are somewhat limited...
> >
> >
> > url:    www.econ.uiuc.edu/~roger        Roger Koenker
> > email   rkoenker at uiuc.edu            Department of Economics
> > vox:    217-333-4558                    University of Illinois
> > fax:    217-244-6678                    Champaign, IL 61820
> >
> >
> > On Apr 24, 2006, at 12:41 PM, Sachin J wrote:
> >
> > > Hi,
> > >
> > > I have a dataset consisting of 350,000 rows and 266 columns. Out
> > > of 266 columns, 250 are dummy variable columns. I am trying to
> > > read this data set into an R dataframe object but am unable to do
> > > so due to memory size limitations (the object created is too
> > > large to handle in R). Is there a way to handle such a large
> > > dataset in R?
> > >
> > > My PC has 1GB of RAM and 55 GB of hard disk space, running Windows XP.
> > >
> > > Any pointers would be of great help.
> > >
> > > TIA
> > > Sachin
> > >
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list 
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
> >
> >
> 
>