[R] Large data sets with R (binding to hadoop available?)
Avram Aelony
aavram at mac.com
Thu Aug 21 20:32:22 CEST 2008
Dear R community,
I find R fantastic and use R whenever I can for my data analytic
needs. Certain data sets, however, are so large that other tools
seem to be needed to pre-process data such that it can be brought
into R for further analysis.
Questions I have for the many expert contributors on this list are:
1. How do others handle large data sets (gigabytes, terabytes) for
analysis in R?
2. Are there existing ways, or plans to devise ways, to use the R
language to interact with Hadoop or PIG? The Hadoop project by
Apache has been successful at processing data on a large scale using
the map-reduce programming model. A sister project uses an emerging
language called “Pig Latin” or simply “PIG” for using the Hadoop
framework in a manner reminiscent of the look and feel of R. Is there
an opportunity here to create a conceptual bridge, since these
projects are also open source? Does one already exist?
Thanks in advance for your comments.
-Avram
---------------------------
Information about Hadoop:
http://wiki.apache.org/hadoop/
http://en.wikipedia.org/wiki/Hadoop
“Apache Hadoop is a free Java software framework that supports data
intensive distributed applications running on large clusters of
commodity computers.[1] It enables applications to work with
thousands of nodes and petabytes of data. Hadoop was inspired by
Google's MapReduce and Google File System (GFS) papers.”
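
For readers unfamiliar with the map-reduce pattern mentioned above, here
is a minimal sketch in base R. It is only an illustration of the idea on
one machine, not a Hadoop binding: the toy `chunks` list and the helper
`merge_counts` are hypothetical names made up for this example, and the
chunks stand in for the file splits a cluster would hand to separate
workers.

```r
# Hypothetical toy data: three "chunks" standing in for splits of a
# large file that would normally live on different nodes.
chunks <- list(
  c("a", "b", "a"),
  c("b", "c"),
  c("a", "c", "c")
)

# Map step: count the values within each chunk independently.
# Each call touches only its own chunk, so the calls could run in parallel.
map_counts <- lapply(chunks, table)

# Reduce step: merge two partial count vectors by key.
# (merge_counts is an illustrative helper, not part of any package.)
merge_counts <- function(x, y) {
  keys <- union(names(x), names(y))
  out <- setNames(numeric(length(keys)), keys)
  out[names(x)] <- out[names(x)] + as.numeric(x)
  out[names(y)] <- out[names(y)] + as.numeric(y)
  out
}

# Fold the partial results into one set of totals.
total <- Reduce(merge_counts, map_counts)
print(total)
```

Only the small per-chunk summaries are combined at the end, which is why
the pattern scales to data that does not fit in memory as a whole.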
---------------------------
Information about PIG:
http://incubator.apache.org/pig/
“Pig is a platform for analyzing large data sets that consists of a
high-level language for expressing data analysis programs, coupled
with infrastructure for evaluating these programs. The salient
property of Pig programs is that their structure is amenable to
substantial parallelization, which in turn enables them to handle
very large data sets.
At the present time, Pig's infrastructure layer consists of a
compiler that produces sequences of Map-Reduce programs, for which
large-scale parallel implementations already exist (e.g., the Hadoop
subproject). Pig's language layer currently consists of a textual
language called Pig Latin, which has the following key properties:
* Ease of programming. It is trivial to achieve parallel execution of
simple, "embarrassingly parallel" data analysis tasks. Complex tasks
comprised of multiple interrelated data transformations are
explicitly encoded as data flow sequences, making them easy to write,
understand, and maintain.
* Optimization opportunities. The way in which tasks are encoded
permits the system to optimize their execution automatically,
allowing the user to focus on semantics rather than efficiency.
* Extensibility. Users can create their own functions to do special-
purpose processing.”
---------------------------