[R] Tools For Preparing Data For Analysis

Fri Jun 22 15:56:20 CEST 2007

I am posting to this thread that has been quiet for some time because I
remembered the following question.

Christophe Pallier wrote:
> Hi,
> 
> Can you provide examples of data formats that are problematic to read and
> clean with R ?

Today I had a data manipulation problem that I did not know how to solve
in R, so I solved it with Perl.  Since I'm always interested in learning
more about complex data manipulation in R, I am posting my problem in the
hope of receiving some hints for doing this in R.

If anyone has nothing better to do than play with other people's data,
I would be happy to send the raw files off-list.

Background:

I have been given data that contains two measurements of left
ventricular ejection fraction.  One of the methods is echocardiogram
which sometimes gives a true quantitative value and other times a
semi-quantitative value.  The desire is to compare echo with the
other method (MUGA).  In most cases, patients had either quantitative
or semi-quantitative.  Some patients had both.  The data came
to me in Excel files with, basically, no patient identifiers to link
the "both" with the semi-quantitative patients (the "both" patients
were in multiple data sets).

What I wanted to do was extract from the semi-quantitative data file
those patients with only a semi-quantitative value.  All I have to link
with are the semi-quantitative echo and the MUGA, and these pairs of
values are not unique, so each "both" patient should remove exactly one
matching row from the semi-quantitative file.

To make this more concrete, here are some portions of the raw data.

"Both"

"ID NUM","ECHO","MUGA","Semiquant","Quant"
"B",12,37,10,12
"D",13,13,10,13
"E",13,26,10,15
"F",13,31,10,13
"H",15,15,10,15
"I",15,21,10,15
"J",15,22,10,15
"K",17,22,10,17
"N",17.5,4,10,17.5
"P",18,25,10,18
"R",19,25,10,19

Semi-quantitative

"echo","muga","quant"
10,20,0      <-- keep
10,20,0      <-- keep
10,21,0      <-- remove
10,21,0      <-- keep
10,24,0      <-- keep
10,25,0      <-- remove
10,25,0      <-- remove
10,25,0      <-- keep
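
To make the keep/remove rule concrete, here is a minimal sketch in Python,
used only for illustration.  The function name semi_only and the hard-coded
rows are assumptions; the rows are copied from the two excerpts above.  Each
(Semiquant, MUGA) pair in the "Both" data cancels one row with the same
(echo, muga) values in the semi-quantitative data, and whatever is left over
is kept.

```python
from collections import Counter

def semi_only(both_pairs, semi_rows):
    """Return the semi-quantitative rows left after each pair in
    both_pairs has cancelled at most one matching (echo, muga) row."""
    remaining = Counter(both_pairs)  # tally of pairs still to be cancelled
    kept = []
    for echo, muga, quant in semi_rows:
        if remaining[(echo, muga)] > 0:
            # Matches a "both" patient: use up one tally and drop the row.
            remaining[(echo, muga)] -= 1
        else:
            kept.append((echo, muga, quant))
    return kept

# (Semiquant, MUGA) pairs from the "Both" excerpt, rows B through R.
both_pairs = [(10, 37), (10, 13), (10, 26), (10, 31), (10, 15), (10, 21),
              (10, 22), (10, 22), (10, 4), (10, 25), (10, 25)]

# (echo, muga, quant) rows from the semi-quantitative excerpt.
semi_rows = [(10, 20, 0), (10, 20, 0), (10, 21, 0), (10, 21, 0),
             (10, 24, 0), (10, 25, 0), (10, 25, 0), (10, 25, 0)]

kept = semi_only(both_pairs, semi_rows)
for row in kept:
    print(",".join(str(value) for value in row))
```

For these excerpts the sketch keeps five rows: both 10,20 rows, one of the
10,21 rows, the 10,24 row and one of the 10,25 rows, which agrees with the
keep/remove marks above.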

Here is the perl program I wrote for this.

#!/usr/bin/perl

open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
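# Discard the header row.
$_ = <BOTH>;
# Count how often each (semi-quantitative echo, MUGA) pair occurs among
# the "both" patients; the pairs are not unique, so keep a tally.
while(<BOTH>) {
    chomp;
    ($id, $e, $m, $sq, $qu) = split(/,/);
    $both{$sq,$m}++;
}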
close(BOTH);

open(OUT, "> qual_echo_only.csv") || die "Can't open qual_echo_only.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 2001;

open(QUAL, "qual_echo.csv") || die "Can't open qual_echo.csv";
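# Discard the header row.
$_ = <QUAL>;
while(<QUAL>) {
    chomp;
    ($echo, $muga, $quant) = split(/,/);
    if ($both{$echo,$muga} > 0) {
        # This row matches a "both" patient: use up one tally and skip it.
        $both{$echo,$muga}--;
    }
    else {
        # No unmatched "both" patient is left for this pair: keep the row.
        print OUT "$pid,$echo,$muga,$quant\n";
        $pid++;
    }
}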
close(QUAL);
close(OUT);

open(OUT, "> both_echo.csv") || die "Can't open both_echo.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 3001;

open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
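# Discard the header row.
$_ = <BOTH>;
while(<BOTH>) {
    chomp;
    ($id, $e, $m, $sq, $qu) = split(/,/);
    # Write two rows per "both" patient under one pid:
    # quant = 0 for the semi-quantitative echo, 1 for the quantitative echo.
    print OUT "$pid,$sq,$m,0\n";
    print OUT "$pid,$qu,$m,1\n";
    $pid++;
}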
close(BOTH);
close(OUT);

-- 
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Department of Public Health Sciences
Faculty of Medicine, University of Toronto
email: kevin.thorpe at utoronto.ca  Tel: 416.864.5776  Fax: 416.864.6057