[R] What is the best package for large data cleaning (not statistical analysis)?

Josuah Rechtsteiner rechtsteiner at bgki.net
Sun Mar 15 15:15:01 CET 2009


Hi Sean,

You should think about storing the data externally in an SQL database.
This makes you very flexible, and you can do a lot of the manipulation
directly in the database. With stored procedures, for example in a
PostgreSQL database, you can use almost any language you prefer to
manipulate the data before loading it into R. There is also a
procedural language based on R (PL/R) with which you can already do a
lot of things inside PostgreSQL databases.
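
A rough sketch of the idea, under some assumptions: it needs the DBI
and RSQLite packages, and it uses an in-memory SQLite database as a
stand-in for PostgreSQL so that it runs without a server. The tables
"customers" and "orders" and their columns are made up for
illustration. The cleaning (trimming whitespace, dropping missing
values) and the merge are done in SQL, and only the small result is
loaded into R.

```r
library(DBI)

# In-memory SQLite database as a stand-in for a PostgreSQL server (assumption).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical raw tables; in practice these would be bulk-loaded
# from the large source files rather than built in R.
customers <- data.frame(id = c(1, 2, 3),
                        name = c(" Ann", "Bob ", NA),
                        stringsAsFactors = FALSE)
orders <- data.frame(id = c(1, 1, 2, 4),
                     amount = c(10.5, 4.0, NA, 7.25))
dbWriteTable(con, "customers", customers)
dbWriteTable(con, "orders", orders)

# Clean and merge inside the database: trim whitespace, drop rows with
# missing values, join the two tables and sum the amounts per customer.
cleaned <- dbGetQuery(con, "
  SELECT c.id, TRIM(c.name) AS name, SUM(o.amount) AS total
  FROM customers AS c
  JOIN orders AS o ON o.id = c.id
  WHERE c.name IS NOT NULL AND o.amount IS NOT NULL
  GROUP BY c.id, TRIM(c.name)
  ORDER BY c.id
")

dbDisconnect(con)
print(cleaned)
```

With a real PostgreSQL server, mainly the dbConnect() call should need
to change (to a PostgreSQL DBI driver such as RPostgreSQL); the SQL
itself is fairly portable.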

And keep in mind: learning SQL isn't more difficult than learning R.

best,

josuah


On 15.03.2009 at 13:13, Sean Zhang wrote:

> Dear Jim:
>
> Thanks for your reply.
> It looks to me like you were using batching.
> I have used batching to digest large data in Matlab before.
> I am still wondering about the answers to my two specific questions
> without resorting to batching.
>
> Thanks.
>
> -Sean
>
>
>
>
> On Sat, Mar 14, 2009 at 10:13 PM, jim holtman <jholtman at gmail.com> wrote:
>
>> Exactly what type of cleaning do you want to do on them?  Can you read
>> in the data a block at a time (e.g., 1M records), clean them up and
>> then write them back out?  You would have the choice of putting them
>> back as a text file or possibly storing them using 'filehash'.  I have
>> used that technique to segment a year's worth of data that was
>> probably 3GB of text into monthly objects that were about 70MB
>> dataframes that I stored using filehash.  These I then read back in to
>> do processing where I could summarize by month.  So it all depends on
>> what you want to do.
>>
>> You could read in the chunks, clean them and then reshape them into
>> dataframes that you could process later.  You will still probably have
>> the problem that all the data still won't fit in memory.  Now one
>> thing I did was that since the dataframes were stored as binary
>> objects in filehash, it was pretty fast to retrieve them, pick out the
>> data I needed from each month and create a subset of just the data I
>> needed that would now fit in memory.
>>
>> So it all depends ...........
>>
>> On Sat, Mar 14, 2009 at 8:46 PM, Sean Zhang <seanecon at gmail.com> wrote:
>>> Dear R helpers:
>>>
>>> I am a newbie to R and have a question related to cleaning large data
>>> frames in R.
>>>
>>> So far, I have been using SAS for data cleaning because my data sets are
>>> relatively large (handling multiple files, each could be as large as
>>> 5-10 GB).
>>> I am not a fan of SAS at all and am eager to move data cleaning tasks
>>> into R completely.
>>>
>>> It seems to me there are 3 options: using SQL, ff or filehash. I do not
>>> want to learn SQL, so my question is more related to ff and filehash.
>>>
>>> Specifically:
>>>
>>> (1) for merging two large data frames, which one is better, ff or
>>> filehash?
>>> (2) for reshaping a large data frame (say from long to wide or the
>>> opposite), which one is better, ff or filehash?
>>>
>>> If you can provide examples, that will be even better.
>>>
>>> Many thanks in advance.
>>>
>>> -Sean
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>>
>
