[R] Tools For Preparing Data For Analysis

(Ted Harding) ted.harding at nessie.mcc.ac.uk
Fri Jun 8 11:43:14 CEST 2007


On 08-Jun-07 08:27:21, Christophe Pallier wrote:
> Hi,
> 
> Can you provide examples of data formats that are problematic
> to read and clean with R?
> 
> The only problematic cases I have encountered were cases with
> multiline and/or varying-length records (optional information).
> Then, it is sometimes a good idea to preprocess the data to
> present it in a tabular format (one record per line).
> 
> For this purpose, I use awk (e.g.
> http://www.vectorsite.net/tsawk.html),
> which is very adept at processing ASCII data files (awk is
> much simpler to learn than perl, spss, sas, ...).

I want to join in with an enthusiastic "Me too!!". For anything
which has to do with basic checking for the kind of messes that
people can get data into when they "put it on the computer",
I think awk is ideal. It is very flexible (far more so than
many, even long-time, awk users suspect), very transparent
in its programming language (as opposed to, say, perl), fast,
and light on system resources (a rare delight these days,
when upgrading your software may require upgrading your
hardware).
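
For instance, Christophe's multiline-record case above can often
be handled in a few lines. A minimal sketch, assuming a made-up
layout in which each record spans several lines and records are
separated by blank lines; awk's "paragraph mode" then collapses
each record to a single tab-separated line:

  # RS="" makes blank-line-separated blocks the records;
  # FS="\n" makes each line within a block a field.
  awk 'BEGIN { RS = ""; FS = "\n"; OFS = "\t" }
       { $1 = $1; print }' multiline.dat > tabular.dat

(The file names and layout are invented for illustration; a real
file will dictate its own record and field separators.)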

Although it may seem on the surface that awk is "two-dimensional"
in its view of data (line by line, and field by field within a
line), it has flexible internal data structures (arrays indexed
by arbitrary strings) and a recursive function capability, which
allow a lot more to be done with the data that have been read in.
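
To give the flavour (a small invented example): because the arrays
are associative, you can build per-group summaries in a single
pass, with the group labels themselves serving as the index:

  # mean of field 3 for each group label in field 1
  awk '{ sum[$1] += $3; n[$1]++ }
       END { for (g in sum) print g, sum[g]/n[g] }' data.dat

(data.dat and its layout are hypothetical, of course.)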

For example, I've used awk to trace ancestry through a genealogy,
given a data file where each line includes the identifier of an
individual and the identifiers of its male and female parents
(where known). And that was for pedigree dogs, where what happens
in real life makes Oedipus look trivial.
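
The core of such a trace is short. A sketch of the idea only
(the identifier and the three-column layout -- id, sire, dam,
with "0" for unknown -- are invented for illustration):

  awk -v ANC=dog42 '
      { sire[$1] = $2; dam[$1] = $3 }
      function trace(id, depth) {
          if (id == "0" || id == "") return
          print depth, id            # generation, individual id
          trace(sire[id], depth + 1)
          trace(dam[id],  depth + 1)
      }
      END { trace(ANC, 0) }' pedigree.dat

With inbreeding, the same ancestor is printed once for every path
by which it enters the pedigree -- which, with the dogs, was
rather the point.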

> I have never encountered a data file in ASCII format that I
> could not reformat with Awk.  With binary formats, it is
> another story...

But then it is a good idea to process the binary file using an
instance of the software that created it, to produce an ASCII
file (say in CSV format).

> But, again, this is my limited experience; I would like to
> know if there are situations where using SAS/SPSS is really
> a better approach.

The main thing useful for data cleaning that awk does not have
is any associated graphics. It is -- by design -- a line-by-line
text-file processor. While, for instance, you could use awk to
accumulate numerical histogram counts, you would have to use
something else to display the histogram. And for scatter-plots
there's probably not much point in bringing awk into the picture
at all (unless a preliminary filtering of the mess is needed
anyway).
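
For example (a made-up sketch, using unit-width bins on field 1),
the counting side is easy in awk, and the display can then be
left to R:

  # accumulate counts per unit-width bin of field 1
  awk '{ count[int($1)]++ }
       END { for (b in count) print b, count[b] }' data.dat |
      sort -n > counts.dat

after which something like barplot() on the result of
read.table("counts.dat") does the drawing.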

That being said, though, awk can still be useful for extracting
data fields from a file for submission to other software.
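
For instance, pulling a couple of columns out of a (hypothetical)
comma-separated file for R to read:

  awk -F, '{ print $2, $5 }' messy.csv > fields.dat

and read.table("fields.dat") picks it up on the R side.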

Another area where awk would not have much to offer is where,
as part of your preliminary data inspection, you want to look
at the results of some standard statistical analyses.

As a final comment, utilities like awk can be used far more
fruitfully on operating systems (the unixoid family) which
incorporate at ground level the infrastructure for "plumbing"
together streams of data output from different programs.
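
A typical piece of such plumbing (file name invented): tabulate
the distinct values of field 3 and look at the rarest first,
which shows up miscodings very quickly:

  awk -F, '{ print $3 }' records.csv | sort | uniq -c | sort -n | head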

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Jun-07                                       Time: 10:43:05
------------------------------ XFMail ------------------------------


