[R] queer data set
(Ted Harding)
Ted.Harding at nessie.mcc.ac.uk
Tue Aug 16 01:45:53 CEST 2005
On 15-Aug-05 S.O. Nyangoma wrote:
> I have a dataset that is basically structureless. Its dimension
> varies from row to row and sep(s) are a mixture of tab and semi
> colon (;) an example is
>
HEADER1 HEADER2 HEADER3 HEADER3
A1 B1 C1 X11;X12;X13
A2 B2 C2 X21;X22;X23;X24;X25
A3 B3 C3
A4 B4 C4 X41;X42;X43
A5 B5 C5 X51
>
> etc., say. Note that a blank under HEADER3 corresponds to non
> occurance and all semi colon (;) delimited variables are under
> HEADER3. These values run into tens of thousands. I want to give some
> order to this queer matrix to something like:
>
> HEADER1 HEADER2 HEADER3 HEADER3
> A1 B1 C1 X11
> A1 B1 C1 X12
> A1 B1 C1 X13
> A1 B1 C1 X14
> A2 B2 C2 X21
> A2 B2 C2 X22
> A2 B2 C2 X23
> A2 B2 C2 X24
> A2 B2 C2 X25
> A2 B2 C2 X26
> A3 B3 C3 NA
> A4 B4 C4 X41
> A4 B4 C4 X42
> A4 B4 C4 X43
>
> Is there a brilliant R-way of doing such task?
>
> Goodday. Stephen.
I don't know about a brilliant R trick (though I'm sure others do).
But (on my usual hobby-horse) if you have 'awk' available (and
don't mind using it) then it will do the job:
First create an 'awk' program file as follows:
{for(i in A) delete A[i]}
{if($4==""){A[1]="NA"}
else {split($4,A,";")}}
{B = $1 "\t" $2 "\t" $3}
{for(i in A) print B "\t" A[i]}
and call this say split.awk
Then run
awk -f split.awk
and feed it the lines of your primary dataset as above. Here's
a cut&paste from my Linux session, where the first block of
lines after "awk -f split.awk" are the lines being input to
the program, starting with the header, followed by the output
of the program starting with the header again:
awk -f split.awk
HEADER1 HEADER2 HEADER3 HEADER3
A1 B1 C1 X11;X12;X13
A2 B2 C2 X21;X22;X23;X24;X25
A3 B3 C3
A4 B4 C4 X41;X42;X43
A5 B5 C5 X51
HEADER1 HEADER2 HEADER3 HEADER3
A1 B1 C1 X11
A1 B1 C1 X12
A1 B1 C1 X13
A2 B2 C2 X24
A2 B2 C2 X25
A2 B2 C2 X21
A2 B2 C2 X22
A2 B2 C2 X23
A3 B3 C3 NA
A4 B4 C4 X41
A4 B4 C4 X42
A4 B4 C4 X43
A5 B5 C5 X51
In unixoid systems, with a large file of such lines, the command
would be
cat yourdatafile | awk -f split.awk
and then you would only see the output, not the input as shown
above, and you can of course redirect it into a new file with
cat yourdatafile | awk -f split.awk > newdatafile
Note, however, that the order of the lines output for the third
line of input (the one with X21;X22;X23;X24;X25) is not the same
as the order of the X21;X22;X23;X24;X25 though they are all there.
This is a "feature" of the way 'awk' handles arrays (which are
"associative arrays" indexed by values, not by position).
This may not matter for your application; but if it does matter
then I'm not sure how to force the correct order.
Hoping this helps,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 16-Aug-05 Time: 00:45:49
------------------------------ XFMail ------------------------------
More information about the R-help
mailing list