[R] What is the best package for large data cleaning (not statistical analysis)?
rechtsteiner at bgki.net
Sun Mar 15 15:15:01 CET 2009
You should think about storing the data externally in an SQL database.
This makes you very flexible, and you can do a lot of the manipulation
directly in the database. With the help of stored procedures, for
example in a PostgreSQL database, you can use almost any preferred
language to manipulate the data before loading it into R. There is also
a procedural language based on R (PL/R) with which you can do a lot of
things already inside PostgreSQL databases.

And keep in mind: learning SQL isn't more difficult than learning R.
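
For instance, here is a minimal sketch of that round trip from R,
assuming the DBI and RSQLite packages (RPostgreSQL exposes the same
DBI interface); the file names, the "raw" table, and the "value"
column are made up for illustration:

    library(DBI)
    library(RSQLite)   # RPostgreSQL works the same way through DBI

    con <- dbConnect(SQLite(), dbname = "mydata.db")  # file-backed database

    # one-time import: load the raw files into a table
    dbWriteTable(con, "raw", read.csv("part1.csv"), overwrite = TRUE)
    dbWriteTable(con, "raw", read.csv("part2.csv"), append = TRUE)

    # do the heavy lifting in SQL and pull only the cleaned result into R
    clean <- dbGetQuery(con,
        "SELECT id, AVG(value) AS mean_value
           FROM raw
          WHERE value IS NOT NULL
          GROUP BY id")

    dbDisconnect(con)

Only the aggregated result ever has to fit in R's memory; the raw
records stay in the database.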
On 15.03.2009, at 13:13, Sean Zhang wrote:
> Dear Jim:
> Thanks for your reply.
> It looks to me as if you were using batching.
> I used batching to digest large data in Matlab before.
> I still wonder about the answers to the two specific questions below,
> without resorting to SQL.
> On Sat, Mar 14, 2009 at 10:13 PM, jim holtman <jholtman at gmail.com> wrote:
>> Exactly what type of cleaning do you want to do on them? Can you
>> read in the data a block at a time (e.g., 1M records), clean them up,
>> and then write them back out? You would have the choice of putting
>> them back as a text file or possibly storing them using 'filehash'. I
>> used that technique to segment a year's worth of data that was
>> probably 3GB of text into monthly objects that were about 70MB
>> dataframes that I stored using filehash. These I then read back in to
>> do processing where I could summarize by month. So it all depends on
>> what you want to do.
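
For concreteness, a rough sketch of the block-at-a-time approach Jim
describes, using filehash's dbCreate/dbInit/dbInsert; the file name,
the tab delimiter, and the 1M-line block size are assumptions, not
Jim's actual code:

    library(filehash)

    dbCreate("clean.db")                # one-time: create the on-disk store
    db <- dbInit("clean.db")

    con <- file("big_file.txt", open = "r")
    cols <- strsplit(readLines(con, n = 1), "\t")[[1]]  # keep the header row

    i <- 1
    repeat {
        lines <- readLines(con, n = 1e6)                # one block of ~1M records
        if (length(lines) == 0) break
        block <- read.table(text = lines, sep = "\t",
                            col.names = cols, stringsAsFactors = FALSE)
        # ... clean 'block' here (fix types, drop bad rows, etc.) ...
        dbInsert(db, paste0("block", i), block)         # store as a binary object
        i <- i + 1
    }
    close(con)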
>> You could read in the chunks, clean them, and then reshape them into
>> dataframes that you could process later. You will still probably have
>> the problem that all the data still won't fit in memory. Now one
>> thing I did was that since the dataframes were stored as binary
>> objects in filehash, it was pretty fast to retrieve them, pick out
>> the data I needed from each month, and create a subset of just the
>> data I needed that would now fit in memory.
>> So it all depends ...........
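
The retrieval step Jim mentions could then look roughly like this: pull
each stored block back out of filehash, keep just the rows and columns
needed, and combine the pieces into one data frame that fits in memory.
The key names, the 'value' filter, and the selected columns are again
made up:

    library(filehash)
    db <- dbInit("clean.db")

    pieces <- lapply(dbList(db), function(key) {
        block <- dbFetch(db, key)
        subset(block, value > 0, select = c("id", "date", "value"))
    })
    small <- do.call(rbind, pieces)   # the combined subset is now memory-sized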
>> On Sat, Mar 14, 2009 at 8:46 PM, Sean Zhang <seanecon at gmail.com> wrote:
>>> Dear R helpers:
>>> I am a newbie to R and have a question related to cleaning large
>>> data sets in R.
>>> So far, I have been using SAS for data cleaning because my data
>>> sets are relatively large (handling multiple files, each could be
>>> as large as 5-10).
>>> I am not a fan of SAS at all and am eager to move data cleaning
>>> into R.
>>> It seems to me there are three options: SQL, ff, or filehash. I do
>>> not want to learn SQL, so my question is more related to ff and
>>> filehash.
>>> Specifically,
>>> (1) for merging two large data frames, which one is better, ff vs.
>>> filehash?
>>> (2) for reshaping a large data frame (say from long to wide or the
>>> other way around), which one is better, ff vs. filehash?
>>> If you can provide examples, that will be even better.
>>> Many thanks in advance.
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>> What is the problem that you are trying to solve?