[R] Large data sets with R (binding to hadoop available?)

Thu Aug 21 20:40:51 CEST 2008

The RSQLite package can read files into an SQLite database without the data
going through R. The sqldf package provides a front end that makes this
particularly easy to use: basically you need only a couple of lines of code.
Other databases have similar facilities.  See:

http://sqldf.googlecode.com
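
As a rough sketch of what those couple of lines might look like (assuming a
reasonably current sqldf with its read.csv.sql function; the file, the column
names and the filter below are made up purely for illustration):

```r
library(sqldf)

# Hypothetical input: a small CSV written to a temp file so the sketch is
# self-contained. In practice this would be the large file already on disk.
csv_path <- tempfile(fileext = ".csv")
df <- data.frame(id = 1:1000, value = (1:1000) %% 7)
write.csv(df, csv_path, row.names = FALSE, quote = FALSE)

# read.csv.sql() loads the file into a temporary SQLite database and runs the
# query there; only the selected rows come back to R as a data frame.
# Inside the SQL statement the input is referred to as "file".
subset_df <- read.csv.sql(csv_path,
                          sql = "select * from file where value = 0")

unlink(csv_path)
```

The point of the design is that the filtering happens inside SQLite, so R only
ever holds the rows the query returns rather than the whole file.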

On Thu, Aug 21, 2008 at 2:32 PM, Avram Aelony <aavram at mac.com> wrote:
>
> Dear R community,
>
> I find R fantastic and use R whenever I can for my data analytic needs.
>  Certain data sets, however, are so large that other tools seem to be needed
> to pre-process data such that it can be brought into R for further analysis.
>
> Questions I have for the many expert contributors on this list are:
>
> 1. How do others handle situations of large data sets (gigabytes, terabytes)
> for analysis in R?
>
> 2. Are there existing ways, or plans to devise ways, to use the R language to
> interact with Hadoop or PIG?  The Hadoop project by Apache has been
> successful at processing data on a large scale using the map-reduce
> algorithm.  A sister project uses an emerging language called "PIG-latin" or
> simply "PIG" for using the Hadoop framework in a manner reminiscent of the
> look and feel of R.  Is there an opportunity here to create a conceptual
> bridge since these projects are also open-source?  Does it already exist?
>
>
> Thanks in advance for your comments.
>
> -Avram
>
>
>
>
> ---------------------------
> Information about Hadoop:
> http://wiki.apache.org/hadoop/
> http://en.wikipedia.org/wiki/Hadoop
>
> "Apache Hadoop is a free Java software framework that supports data
> intensive distributed applications running on large clusters of commodity
> computers.[1] It enables applications to work with thousands of nodes and
> petabytes of data. Hadoop was inspired by Google's MapReduce and Google File
> System (GFS) papers."
>
>
>
> ---------------------------
> Information about PIG:
>
> http://incubator.apache.org/pig/
>
> "Pig is a platform for analyzing large data sets that consists of a
> high-level language for expressing data analysis programs, coupled with
> infrastructure for evaluating these programs. The salient property of Pig
> programs is that their structure is amenable to substantial parallelization,
> which in turn enables them to handle very large data sets.
> At the present time, Pig's infrastructure layer consists of a compiler that
> produces sequences of Map-Reduce programs, for which large-scale parallel
> implementations already exist (e.g., the Hadoop subproject). Pig's language
> layer currently consists of a textual language called Pig Latin, which has
> the following key properties:
>
> * Ease of programming. It is trivial to achieve parallel execution of
> simple, "embarrassingly parallel" data analysis tasks. Complex tasks
> comprised of multiple interrelated data transformations are explicitly
> encoded as data flow sequences, making them easy to write, understand, and
> maintain.
> * Optimization opportunities. The way in which tasks are encoded permits the
> system to optimize their execution automatically, allowing the user to focus
> on semantics rather than efficiency.
> * Extensibility. Users can create their own functions to do special-purpose
> processing."
>
> ---------------------------