[R] A really simple data manipulation example
Robert Wilkins
irishhacker at gmail.com
Wed Jun 27 01:59:44 CEST 2007
In response to those who asked for a better explanation of what the
Vilno software does, here's a simple example that gives some idea of
what it does.
LABRESULTS is a dataset with multiple rows per patient, with lab
sodium measurements. It has columns: PATIENT_ID, VISIT_NUM, and
SODIUM.
DEMO is a dataset with one row per patient, with demographic data.
It has columns: PATIENT_ID, GENDER.
Here's a simple example. The following paragraph of code is a
data processing function (dpf):
inlist LABRESULTS DEMO ;
mergeby PATIENT_ID ;
if (SODIUM == -9) SODIUM = NULL ;
if (VISIT_NUM != 2) deleterow ;
select AVERAGE_SODIUM = avg(SODIUM) by GENDER ;
sendoff(RESULTS_DATASET) GENDER AVERAGE_SODIUM ;
turnoff; // just marks end-of-paragraph; version 1.0 won't need this.
Can you guess what it does? The lab result rows are merged with the
demographic rows, just to get the gender information merged in.
Obviously, they are merged by patient. The code -9 is used to denote
"missing", so convert that to NULL. I'm about to take a statistic for
visit 2, so rows for any other visit (here, 0 or 1) must be deleted. I'm
assuming, for visit 2, each patient has at most one row. Now, for each gender group,
take the average sodium level. After the select statement, I have just
two rows, for male and female, with the average sodium level in the
AVERAGE_SODIUM column. Now the sendoff statement just stores the
current data table into a datafile, called RESULTS_DATASET.
So you have a sequence of data tables, each calculation reading in the
current table and leaving a new data table for the next calculation.
So you have input datasets, a bunch of intermediate calculations, and
one or more output datasets. Pretty simple idea.
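
For readers who do not know Vilno, the same sequence of steps can be sketched in plain Python. This is only an analogy, not how Vilno works internally. The data values are invented, and two behaviours are assumptions of mine: that mergeby keeps every lab row and attaches the matching GENDER, and that avg() skips NULL values (as SQL and SAS do).

```python
# Invented sample data; only the column names come from the post.
LABRESULTS = [
    {"PATIENT_ID": 1, "VISIT_NUM": 1, "SODIUM": 139.0},
    {"PATIENT_ID": 1, "VISIT_NUM": 2, "SODIUM": 140.0},
    {"PATIENT_ID": 2, "VISIT_NUM": 2, "SODIUM": -9},     # -9 means "missing"
    {"PATIENT_ID": 3, "VISIT_NUM": 2, "SODIUM": 136.0},
    {"PATIENT_ID": 4, "VISIT_NUM": 2, "SODIUM": 142.0},
]
DEMO = [
    {"PATIENT_ID": 1, "GENDER": "F"},
    {"PATIENT_ID": 2, "GENDER": "F"},
    {"PATIENT_ID": 3, "GENDER": "M"},
    {"PATIENT_ID": 4, "GENDER": "M"},
]


def merge_by_patient(lab, demo):
    """Attach GENDER to each lab row (assumed meaning of 'mergeby PATIENT_ID')."""
    gender = {row["PATIENT_ID"]: row["GENDER"] for row in demo}
    return [dict(row, GENDER=gender.get(row["PATIENT_ID"])) for row in lab]


def recode_missing(rows):
    """Turn the missing-value code -9 into None (the NULL of this sketch)."""
    return [dict(row, SODIUM=None if row["SODIUM"] == -9 else row["SODIUM"])
            for row in rows]


def keep_visit_2(rows):
    """Drop every row that is not visit 2."""
    return [row for row in rows if row["VISIT_NUM"] == 2]


def average_sodium_by_gender(rows):
    """One output row per gender; None values are skipped (an assumption)."""
    groups = {}
    for row in rows:
        groups.setdefault(row["GENDER"], []).append(row["SODIUM"])
    result = []
    for gender in sorted(groups, key=str):
        values = [v for v in groups[gender] if v is not None]
        avg = sum(values) / len(values) if values else None
        result.append({"GENDER": gender, "AVERAGE_SODIUM": avg})
    return result


# Each step reads the current table and leaves a new one for the next step.
table = merge_by_patient(LABRESULTS, DEMO)
table = recode_missing(table)
table = keep_visit_2(table)
RESULTS_DATASET = average_sodium_by_gender(table)

for row in RESULTS_DATASET:
    print(row["GENDER"], row["AVERAGE_SODIUM"])
```

Each function takes a table and returns a new one, which mirrors the "sequence of data tables" idea: nothing is modified in place, and the final table is what sendoff would write out.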
*****************************************
Some caveats:
LABRESULTS and DEMO are binary datasets. The asciitobinary and
binarytoascii statements are used to convert between binary datasets
and comma-separated ascii data files. (You can use any delimiter:
comma, vertical bar, etc.) An asciitobinary statement is typically
just two lines of code.
The dpf begins with the inlist statement and, for the moment,
needs "turnoff ;" as the last line. Version 1.0 won't require
"turnoff;", but version 0.85 does. It only means that this paragraph of
code ends here (a program can, of course, contain many paragraphs:
data processing functions, print statements, asciitobinary statements,
etc.).
If you've worked with lab data, you know real lab data is never this
simple, but I needed a simple example.
Vilno has a lot of functionality: many-to-many joins, adding columns,
firstrow() and lastrow() flags, and so forth. A fair number of complex
data manipulations have already been tested with test programs (in
the tarball). Of course, a simple example cannot show you that; it's
just a small taste.
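
To give a rough idea of what first-row and last-row flags are for, here is a small Python sketch. It is not Vilno code, and the semantics are an assumption on my part: I am treating firstrow() and lastrow() as working like FIRST. and LAST. in a SAS datastep, marking the first and last row of each by-group. The function name and the FIRSTROW / LASTROW column names are hypothetical.

```python
from itertools import groupby


def flag_first_last(rows, key):
    """Return copies of rows, grouped by `key`, with FIRSTROW / LASTROW booleans.

    Assumed semantics (SAS-like FIRST./LAST.), not taken from Vilno itself.
    """
    # sorted() is stable, so rows keep their original order within each group.
    ordered = sorted(rows, key=lambda r: r[key])
    flagged = []
    for _, group in groupby(ordered, key=lambda r: r[key]):
        members = list(group)
        last_index = len(members) - 1
        for i, row in enumerate(members):
            flagged.append(dict(row, FIRSTROW=(i == 0), LASTROW=(i == last_index)))
    return flagged


# Invented sample rows using the column names from the post.
visits = [
    {"PATIENT_ID": 1, "VISIT_NUM": 0},
    {"PATIENT_ID": 1, "VISIT_NUM": 1},
    {"PATIENT_ID": 1, "VISIT_NUM": 2},
    {"PATIENT_ID": 2, "VISIT_NUM": 0},
]

for row in flag_first_last(visits, "PATIENT_ID"):
    print(row)
```

Flags like these are what let you pick, say, each patient's baseline or final visit without a separate aggregation step.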
*********************************************
If you've never used SPSS or SAS before, you won't care, but this
programming language falls in the same family as the SPSS and SAS
programming languages. All three programming languages have a fair
amount in common, but are quite different from the S programming
language. The Vilno data processing function can replace the SAS
datastep. (It can also replace PROC TRANSPOSE and much of PROC MEANS,
except that standard deviation calculations still need to be included in
the select statement.)
********************************************
I hope that helps.
http://code.google.com/p/vilno