[R] Tools For Preparing Data For Analysis
Robert Duval
rduval at gmail.com
Fri Jun 8 03:23:35 CEST 2007
An additional option for Windows users is MicrOsiris:
http://www.microsiris.com/
best
robert
On 6/7/07, Robert Wilkins <irishhacker at gmail.com> wrote:
> As noted on the R-project web site itself ( www.r-project.org ->
> Manuals -> R Data Import/Export ), it can be cumbersome to prepare
> messy and dirty data for analysis with the R tool itself. I've also
> seen at least one S programming book (one of the yellow Springer ones)
> that says, more briefly, the same thing.
> The R Data Import/Export page recommends examples using SAS, Perl,
> Python, and Java. It takes a bit of courage to say that ( when you go
> to a corporate software web site, you'll never see a page saying "This
> is the type of problem that our product is not the best at, here's
> what we suggest instead" ). I'd like to provide a few more
> suggestions, especially for volunteers who are willing to evaluate new
> candidates.
>
> SAS is fine if you're not paying for the license out of your own
> pocket. But maybe one reason you're using R is you don't have
> thousands of spare dollars.
> Using Java for data cleaning is an exercise in sado-masochism: Java
> has a learning curve (almost) as difficult as that of C++.
>
> There are different types of data transformation, and for some data
> preparation problems an all-purpose programming language is a good
> choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
> excellent regular expression facilities.
>
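As a small illustration of the regular-expression style of cleanup mentioned above, here is a minimal sketch in Python (not Perl, and not taken from the original post). The function name `clean_lab_value` and the sample strings are hypothetical. The sketch assumes a messy field holds one number, possibly with surrounding whitespace, a leading "<" or ">" sign, a comma as decimal mark, or a trailing unit.

```python
import re

# Hypothetical pattern: optional "<" or ">", then a number that may use
# "." or "," as the decimal mark. Anything after the number is ignored.
_VALUE_RE = re.compile(r"^\s*([<>]?)\s*(-?\d+(?:[.,]\d+)?)")


def clean_lab_value(raw):
    """Return (qualifier, value) from a messy field, or None if no number is found."""
    match = _VALUE_RE.match(raw)
    if match is None:
        return None
    qualifier, number = match.groups()
    # Normalize a decimal comma to a decimal point before converting.
    return qualifier, float(number.replace(",", "."))


if __name__ == "__main__":
    for field in [" 5.2 mg/dL", "<0,3", "n/a"]:
        print(repr(field), "->", clean_lab_value(field))
```

Real clinical data will need many more rules than this; the point is only that a general-purpose language handles this kind of field-level cleanup comfortably.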
> However, for some types of complex, demanding data preparation
> problems, an all-purpose programming language is a poor choice. For
> example: cleaning up and preparing clinical lab data and adverse event
> data - you could do it in Perl, but it would take way, way too much
> time. A specialized programming language is needed. And since data
> transformation is quite different from data query, SQL is not the
> ideal solution either.
>
> There are only three statistical programming languages that are
> well-known, all dating from the late 1960s or the 1970s: SPSS, SAS,
> and S. SAS is more popular than S for data cleaning.
>
> If you're an R user with difficult data preparation problems, frankly
> you are out of luck, because the products I'm about to mention are
> new, unknown, and therefore regarded as immature. And while the
> founders of these products would be very happy if you kicked the
> tires, most people don't like to look at brand new products. Most
> innovators and inventors don't realize this; I've learned it the hard
> way.
>
> But if you are a volunteer who likes to help out by evaluating,
> comparing, and reporting upon new candidates, well you could certainly
> help out R users and the developers of the products by kicking the
> tires of these products. And there is a huge need for such volunteers.
>
> 1. DAP
> This is an open source implementation of SAS.
> The founder: Susan Bassein
> Find it at: directory.fsf.org/math/stats (GNU GPL)
>
> 2. PSPP
> This is an open source implementation of SPSS.
> The relatively early version number might not give a good idea of how
> mature the data transformation features are; it reflects the fact that
> the developer has only just started on the statistical tests.
> The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept.
> Also at : directory.fsf.org/math/stats (GNU GPL)
>
> 3. Vilno
> This uses a programming language similar to SPSS and SAS, but quite unlike S.
> Essentially, it's a substitute for the SAS datastep, and also
> transposes data and calculates averages and such. (No t-tests or
> regressions in this version.) I created this, mainly during the years
> 2001-2006. It's version 0.85, and has a fairly low bug rate, in
> my opinion. The tarball includes about 100 test cases used for
> debugging; they check for logical calculation errors, but not for
> behavior under extremely high volumes of data.
> The maintenance of Vilno has slowed down, because I am currently
> (desperately) looking for employment. But once I've found new
> employment and living quarters and settled in, I will continue to
> enhance Vilno in my spare time.
> The founder: that would be me, Robert Wilkins
> Find it at: code.google.com/p/vilno ( GNU GPL )
> ( In particular, the tarball at code.google.com/p/vilno/downloads/list
> , since I have yet to figure out how to use Subversion ).
>
>
> 4. Who knows?
> It was not easy to find out about the existence of DAP and PSPP. So
> who knows what else is out there. However, I think you'll find a lot
> more statistics software ( regression , etc ) out there, and not so
> much data transformation software. Not many people work on data
> preparation software. In fact, the category is so obscure that there
> isn't one agreed term: data cleaning , data munging , data crunching ,
> or just getting the data ready for analysis.
>
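The kind of work described under Vilno above (transposing data and calculating averages) can be sketched in plain Python. This is not Vilno syntax, which does not appear in this post. The record layout, the names `transpose_long_to_wide` and `mean_by_test`, and the sample values are all hypothetical, chosen only to show the two operations.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format lab records: (subject, test name, result).
RECORDS = [
    ("S01", "ALT", 30.0),
    ("S01", "AST", 20.0),
    ("S02", "ALT", 50.0),
    ("S02", "AST", 40.0),
]


def transpose_long_to_wide(records):
    """Turn one row per (subject, test) into one dict of tests per subject."""
    wide = defaultdict(dict)
    for subject, test, result in records:
        wide[subject][test] = result
    return dict(wide)


def mean_by_test(records):
    """Average the results separately for each test."""
    grouped = defaultdict(list)
    for _subject, test, result in records:
        grouped[test].append(result)
    return {test: mean(values) for test, values in grouped.items()}


if __name__ == "__main__":
    print(transpose_long_to_wide(RECORDS))
    print(mean_by_test(RECORDS))
```

A specialized data preparation language would express the same steps more compactly; the sketch only shows what the operations amount to.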
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>