[R] Tools For Preparing Data For Analysis

Gabor Grothendieck ggrothendieck at gmail.com
Sun Jun 10 04:16:46 CEST 2007


That can be elegantly handled in R through R's object-oriented programming
by defining a class for the fancy input.  See this post:
  https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html
for a simple example of that style.
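
As a minimal sketch of that style (not a copy of the linked post), one
could register a conversion for a made-up class and let read.csv apply it
through colClasses. The class name "labValue" and the sample data are
assumptions for illustration only:

```r
library(methods)

# Hypothetical class that exists only to name the conversion.
setClass("labValue")

# Conversion from the raw text: strip a leading "<", ">", "<=" or ">=",
# then convert to numeric. Non-numeric text such as "Trace" becomes NA.
setAs("character", "labValue", function(from) {
  suppressWarnings(as.numeric(sub("^[<>]=?[[:space:]]*", "", from)))
})

# Made-up input with numeric, quasi-numeric and non-numeric results.
txt <- "test,result\nPLT,1.54\nPLT,<1\nPLT,Trace\n"
labs <- read.csv(text = txt, colClasses = c("character", "labValue"))
print(labs)
```

Note that this simple conversion discards the "<" qualifier; if it matters
for the listing, keep the raw text in a second column.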


On 6/9/07, Robert Wilkins <irishhacker at gmail.com> wrote:
> Here are some examples of the type of data crunching you might have to do.
>
> In response to the requests by Christophe Pallier and Martin Stevens.
>
> Before I started developing Vilno, some six years ago, I had been working in
> pharmaceuticals for eight years (it's not easy to show you actual data
> though, because it's all confidential of course).
>
> Lab data can be especially messy, especially if one clinical trial allows
> the physicians to use different labs. So let's consider lab data.
>
> Merge the normal ranges into the lab data. This has to be done by lab-site
> and lab testcode (PLT for platelets, etc.), obviously. I've seen cases where
> you also need to match by sex and age. The sex column in the normal ranges
> could be: blank, F, M, or B ( B meaning for Both sexes). The age column in
> the normal ranges could be: blank, or something like "40 <55". Even worse,
> you could have an ageunits column in the normal ranges dataset: usually "Y",
> but if there are children in the clinical trial, you will have "D" or "M",
> for Days and Months. If the clinical trial is for adults, all rows with "D"
> or "M" should be tossed out at the start. Clearly the statistical programmer
> has to spend time looking at the data, before writing the program. Remember,
> all of these details can change any time you move to a new clinical trial.
>
> So for the lab data, you have to merge in the patient's date of birth,
> calculate age, and somehow relate that to the age-group column in the normal
> ranges dataset.
>
> (By the way, in clinical trial data preparation, the SAS datastep is much
> more useful and convenient, in my opinion, than the SQL SELECT syntax, at
> least 97% of the time. But in the middle of this program, when you merge the
> normal ranges into the lab data, you get a better solution with PROC SQL
> (just the SQL SELECT statement implemented inside SAS). This is because of
> the trickiness of the age match-up, and because the SAS datastep does not do
> well with many-to-many joins.)
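
A rough sketch of this merge in base R, where merge() does the
many-to-many join and a logical filter does the sex and age match-up.
All data and column names are made up, age is assumed to be already
computed in years, and "40 <55" is assumed to mean 40 <= age < 55:

```r
# Hypothetical lab records.
labs <- data.frame(
  patient  = c(101, 102),
  site     = c("A", "A"),
  testcode = c("PLT", "PLT"),
  sex      = c("F", "M"),
  age      = c(45, 60),
  value    = c(120, 500),
  stringsAsFactors = FALSE
)

# Hypothetical normal ranges; blank or "B" in range_sex means both sexes.
ranges <- data.frame(
  site      = c("A", "A", "A"),
  testcode  = c("PLT", "PLT", "PLT"),
  range_sex = c("B", "F", "M"),
  age_group = c("40 <55", "55 <99", "55 <99"),
  low       = c(150, 140, 130),
  high      = c(400, 390, 380),
  stringsAsFactors = FALSE
)

# Split the age-group text into numeric bounds; blank means no restriction.
bounds <- strsplit(ranges$age_group, "[[:space:]]*<[[:space:]]*")
ranges$age_min <- sapply(bounds, function(b) if (length(b) == 2) as.numeric(b[1]) else -Inf)
ranges$age_max <- sapply(bounds, function(b) if (length(b) == 2) as.numeric(b[2]) else Inf)

# Many-to-many join on site and test code, then keep rows where sex and age fit.
cand <- merge(labs, ranges, by = c("site", "testcode"))
ok <- (cand$range_sex %in% c("", "B") | cand$range_sex == cand$sex) &
  cand$age >= cand$age_min & cand$age < cand$age_max
matched <- cand[ok, ]
print(matched[c("patient", "value", "low", "high")])
```
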
>
> Merge in various study drug administration dates into the lab data. Now, for
> each lab record, calculate treatment period ( or cycle number ), depending
> on the statistician's specifications and the way the clinical trial is
> structured.
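
For the cycle number, findInterval() may be enough in simple cases. This
sketch handles one patient and assumes a new cycle starts on each dose
date, which is only one of many possible specifications; the dates are
made up:

```r
# Hypothetical dose dates (one per cycle) and lab draw dates for one patient.
dose_dates <- as.Date(c("2007-01-01", "2007-01-22", "2007-02-12"))
lab_dates  <- as.Date(c("2006-12-28", "2007-01-01", "2007-01-30", "2007-03-01"))

# Number of dose dates on or before each lab date; 0 means pre-treatment.
cycle <- findInterval(as.numeric(lab_dates), as.numeric(dose_dates))
print(data.frame(lab_dates, cycle))
```

With several patients the same call would be applied within each patient,
for example via split() and lapply().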
>
> Different clinical sites chose to use different lab providers. So, for
> example, for Monocytes, you have 10 different units ( essentially 6 units,
> but spelling inconsistencies as well). The statistician has requested that
> you use standardized units in some of the listings ( % units, and only one
> type of non-% unit, for example ). At the same time, lab values need to be
> converted ( *1.61 , divide by 1000, etc. ). This can be very time consuming
> no matter what software you use, and, in my experience, when the SAS
> programmer asks for more clinical information or lab guidebooks, the
> response is incomplete, so he does a lot of guesswork. SAS programmers do
> not have expertise in lab science, hence the guesswork.
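
The bookkeeping part of this (not the guesswork) can be kept in a lookup
table. A sketch with made-up unit spellings: the only factor used is that
1 cell/uL equals 0.001 x 10^9/L, and any unit missing from the table
comes out as NA so that it gets looked at rather than silently converted:

```r
# Hypothetical map from raw unit spellings to a standard unit and factor.
unit_map <- data.frame(
  raw    = c("10^9/L", "10E9/L", "x10^9/L", "/uL", "cells/uL"),
  std    = "10^9/L",
  factor = c(1, 1, 1, 0.001, 0.001),
  stringsAsFactors = FALSE
)

# Made-up lab values with inconsistent units; "mg" is deliberately unknown.
labs <- data.frame(
  value = c(0.5, 450, 0.6),
  unit  = c("10E9/L", "/uL", "mg"),
  stringsAsFactors = FALSE
)

# Look up each raw unit; unmatched units give NA for value and unit.
idx <- match(labs$unit, unit_map$raw)
labs$std_value <- labs$value * unit_map$factor[idx]
labs$std_unit  <- unit_map$std[idx]
print(labs)
```
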
>
> Your program has to accommodate numeric values, "1.54", quasi-numeric values
> "<1", and non-numeric values "Trace".
>
> Your data listing is tight for space, so print "PROLONGED CELL CONT" as
> "PRCC".
>
> Once normal ranges are merged in, figure out which values are out-of-range
> and high , which are low, and which are within normal range. In the data
> listing, you may have "H" or "L" appended to the result value being printed.
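
Flagging is a nested ifelse() once the ranges are in place. In this
sketch flag_range is a hypothetical helper, values exactly on a boundary
are assumed to count as normal, and missing values get no flag:

```r
# Return "H", "L" or "" for each value given its normal range.
flag_range <- function(value, low, high) {
  ifelse(is.na(value) | is.na(low) | is.na(high), "",
         ifelse(value > high, "H",
                ifelse(value < low, "L", "")))
}

# Made-up values against a made-up range of 150 to 400.
value <- c(120, 250, 500, NA)
flag  <- flag_range(value, low = 150, high = 400)
print(data.frame(value, flag))
```
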
>
> For each treatment period, you may need a unique lab record selected, in
> case there are two or three for the same treatment period. The statistician
> will tell the SAS programmer how. Maybe the average of the results for that
> treatment period, maybe the lab record closest to the mid-point of the
> treatment period. This isn't for the data listing, but for a summary table.
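
Both selection rules mentioned here can be sketched in a few lines. The
data, the study-day column and the period midpoints are made up, and ties
in distance are resolved by taking whichever record sorts first:

```r
# Hypothetical lab records: two draws in each of two treatment periods.
labs <- data.frame(
  patient = c(1, 1, 1, 1),
  period  = c(1, 1, 2, 2),
  day     = c(3, 12, 24, 40),
  value   = c(10, 14, 20, 30)
)

# Rule 1: average of the results within each patient and period.
avg <- aggregate(value ~ patient + period, data = labs, FUN = mean)

# Rule 2: the record closest to the (assumed) midpoint of each period.
mid <- data.frame(period = c(1, 2), midpoint = c(10.5, 31.5))
m <- merge(labs, mid, by = "period")
m$dist <- abs(m$day - m$midpoint)
m <- m[order(m$patient, m$period, m$dist), ]
closest <- m[!duplicated(m[c("patient", "period")]), ]

print(avg)
print(closest[c("patient", "period", "day", "value")])
```
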
>
> For the differentials ( monocytes, lymphocytes, etc) , merge in the WBC
> (total white blood cell count) values , to convert values between % units
> and absolute count units.
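
The arithmetic is absolute = percent / 100 * WBC, and the reverse for
records reported as counts. A sketch with made-up data, assuming WBC and
the absolute counts share the same unit (here 10^9/L) and are matched by
patient and visit:

```r
# Hypothetical WBC values, one per patient and visit, in 10^9/L.
wbc <- data.frame(patient = c(1, 2), visit = c(1, 1), wbc = c(6.0, 8.0))

# Hypothetical monocyte results, one reported in %, one as an absolute count.
diffs <- data.frame(
  patient  = c(1, 2),
  visit    = c(1, 1),
  testcode = "MONO",
  value    = c(5, 0.64),
  unit     = c("%", "10^9/L"),
  stringsAsFactors = FALSE
)

# Merge in WBC, then fill in whichever representation is missing.
d <- merge(diffs, wbc, by = c("patient", "visit"))
is_pct <- d$unit == "%"
d$abs_count <- ifelse(is_pct, d$value / 100 * d$wbc, d$value)
d$pct       <- ifelse(is_pct, d$value, d$value / d$wbc * 100)
print(d[c("patient", "abs_count", "pct")])
```
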
>
> When printing the values in the data listing, you need "H" or "L" to the
> right of the value. But you also need the values to be well lined up ( the
> decimal place ). This can be stupidly time consuming.
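
One way to line up the decimal point is to pad the integer part on the
left and the fractional part on the right. In this sketch align_decimal
is a hypothetical helper that expects non-empty numeric strings, and the
values and flags are made up:

```r
# Pad so that the decimal points of all values fall in the same column.
align_decimal <- function(x) {
  parts <- strsplit(x, ".", fixed = TRUE)
  int   <- sapply(parts, `[`, 1)
  frac  <- sapply(parts, function(p) if (length(p) > 1) paste0(".", p[2]) else "")
  paste0(format(int, justify = "right"), format(frac, justify = "left"))
}

result  <- c("1.54", "120", "12.5")
flag    <- c("", "L", "H")
aligned <- align_decimal(result)

# Append the flag to the right of the aligned value.
writeLines(paste(aligned, flag))
```
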
>
>
>
> AND ON AND ON AND ON .....
>
> I think you see why clinical trials statisticians and SAS programmers enjoy
> lots of job security.

This could be readily handled in R using object oriented programming.
You would specify a class for the strange input,


