[R] preprocessing data
Gabor Grothendieck
ggrothendieck at gmail.com
Tue Aug 16 17:27:45 CEST 2005
On 8/16/05, Jean Eid <jeaneid at chass.utoronto.ca> wrote:
> Dear all,
>
> My question is concerning the line
> "This is adequate for small files, but for anything more complicated we
> recommend using the facilities of a language like perl to pre-process
> the file."
>
> in the import/export manual.
>
> I have a large fixed-width file that I would like to preprocess in Perl or
> awk. The problem is that I do not know where to start. Does anyone have a
> simple example on how to turn a fixed-width file in any of these
> facilities into csv or tab delimited file. I guess I am looking for
> somewhat a perl for dummies or awk for dummies that does this. any
> pointers for website will be greatly appreciated
>
Try to do it in R first. I have found that I rarely need to go to
an outside language to massage my data.
# fixed width fields of 10 and 5
Lines <- readLines("mydata.dat")
data.frame( field1 = as.numeric(substring(Lines, 1, 10)),
            field2 = as.numeric(substring(Lines, 11, 15)) )
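R also has read.fwf in the utils package, which takes the field
widths directly. The following is only a minimal sketch: the sample
lines, the temporary file names and the column names field1 and
field2 are made up for illustration, and it assumes the same two
fields of widths 10 and 5.

```r
# Hypothetical sample data: two fixed-width fields of 10 and 5 characters,
# written to a temporary file that stands in for mydata.dat.
dat_file <- tempfile(fileext = ".dat")
writeLines(c("      1.50   42",
             "     23.75    7"), dat_file)

# read.fwf splits each line by the given widths and converts the columns.
dat <- read.fwf(dat_file, widths = c(10, 5),
                col.names = c("field1", "field2"))

# Write the result as csv, which is what the original question asked for.
csv_file <- tempfile(fileext = ".csv")
write.csv(dat, csv_file, row.names = FALSE)
```

For a tab delimited file, write.table with sep = "\t" can be used in
place of write.csv.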
If you do find that you have speed or memory problems that
require you to go outside of R to preprocess your data,
then the gawk version of awk has a FIELDWIDTHS variable that
makes handling fixed fields very easy. The gawk program below
assumes two fields of widths 10 and 5, respectively, which
are set in the first line. The second line is then executed
for each input line: the dummy assignment $1 = $1 makes gawk
rebuild the line from its fields, and print writes it out,
the default being to print the entire line with a space
between successive fields:
BEGIN { FIELDWIDTHS = "10 5" }
{ $1 = $1; print }
In R, do the following assuming the above two lines are in
split.awk:
read.table(pipe("gawk -f split.awk mydata.dat"))
or else run gawk outside of R then read in the output file
created:
gawk -f split.awk mydata.dat > mydata2.dat
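Once the output file exists, reading it back and saving it as a tab
delimited file might look like the sketch below. The temporary file
here only stands in for mydata2.dat, its two lines are made-up
examples of what the gawk program would print, and the column names
are hypothetical. Note that this relies on no field being blank or
containing embedded spaces, since read.table splits on white space
by default.

```r
# Stand-in for mydata2.dat: each line is a 10-character field, a single
# space, and a 5-character field, as the gawk program would print them.
out_file <- tempfile(fileext = ".dat")
writeLines(c("      1.50    42",
             "     23.75     7"), out_file)

# The default separator in read.table is any run of white space.
dat2 <- read.table(out_file, col.names = c("field1", "field2"))

# Save as a tab delimited file with a header line and no row names.
tab_file <- tempfile(fileext = ".txt")
write.table(dat2, tab_file, sep = "\t", quote = FALSE, row.names = FALSE)
```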
For more information, google for
FIELDWIDTHS gawk
for that portion of the manual on FIELDWIDTHS -- it includes
an example and, of course, the whole manual is there too. The
book by Kernighan et al is also good.
I have used both awk and perl and I think it's unlikely you
would need perl given that you have R at your disposal for
the hard parts and awk is easier to learn, better designed
and more focused on this sort of task.