[R] queer data set

Tue Aug 16 01:45:53 CEST 2005

On 15-Aug-05 S.O. Nyangoma wrote:
> I have a dataset that is basically structureless. Its dimension
> varies from row to row and sep(s) are a mixture of tab and semi
> colon (;) an example is
> 
HEADER1 HEADER2 HEADER3   HEADER3
A1       B1      C1       X11;X12;X13
A2       B2      C2       X21;X22;X23;X24;X25
A3       B3      C3       
A4       B4      C4       X41;X42;X43
A5       B5      C5       X51
> 
> etc., say. Note that a blank under HEADER3 corresponds to non 
> occurance and all semi colon (;) delimited variables are under 
> HEADER3. These values run into tens of thousands. I want to give some 
> order to this queer matrix to something like:
> 
> HEADER1 HEADER2 HEADER3   HEADER3
> A1       B1      C1       X11
> A1       B1      C1       X12
> A1       B1      C1       X13
> A1       B1      C1       X14
> A2       B2      C2       X21
> A2       B2      C2       X22
> A2       B2      C2       X23
> A2       B2      C2       X24
> A2       B2      C2       X25
> A2       B2      C2       X26
> A3       B3      C3       NA
> A4       B4      C4       X41
> A4       B4      C4       X42
> A4       B4      C4       X43
> 
> Is there a brilliant R-way of doing such task?
> 
> Goodday. Stephen.

I don't know about a brilliant R trick (though I'm sure others do).

But (on my usual hobby-horse) if you have 'awk' available (and
don't mind using it) then it will do the job:

First create an 'awk' program file as follows:

  {for(i in A) delete A[i]}
  {if($4==""){A[1]="NA"}
    else {split($4,A,";")}}
  {B = $1 "\t" $2 "\t" $3}
  {for(i in A) print B "\t" A[i]}

and call this say split.awk

Then run

  awk -f split.awk

and feed it the lines of your primary dataset as above. Here's
a cut&paste from my Linux session, where the first block of
lines after "awk -f split.awk" are the lines being input to
the program, starting with the header, followed by the output
of the program starting with the header again:

awk -f split.awk
HEADER1 HEADER2 HEADER3   HEADER3
A1       B1      C1       X11;X12;X13
A2       B2      C2       X21;X22;X23;X24;X25
A3       B3      C3       
A4       B4      C4       X41;X42;X43
A5       B5      C5       X51
HEADER1 HEADER2 HEADER3 HEADER3
A1      B1      C1      X11
A1      B1      C1      X12
A1      B1      C1      X13
A2      B2      C2      X24
A2      B2      C2      X25
A2      B2      C2      X21
A2      B2      C2      X22
A2      B2      C2      X23
A3      B3      C3      NA
A4      B4      C4      X41
A4      B4      C4      X42
A4      B4      C4      X43
A5      B5      C5      X51

In unixoid systems, with a large file of such lines, the command
would be

  cat yourdatafile | awk -f split.awk

and then you would only see the output, not the input as shown
above, and you can of course redirect it into a new file with

  cat yourdatafile | awk -f split.awk > newdatafile

Note, however, that the order of the lines output for the third
line of input (the one with X21;X22;X23;X24;X25) is not the same
as the order of the X21;X22;X23;X24;X25 though they are all there.

This is a "feature" of the way 'awk' handles arrays (which are
"associative arrays" indexed by values, not by position).

This may not matter for your application; but if it does matter
then I'm not sure how to force the correct order.

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 16-Aug-05                                       Time: 00:45:49
------------------------------ XFMail ------------------------------