[R] Tools For Preparing Data For Analysis

Chris Evans chrishold at psyctc.org
Fri Jun 8 19:26:51 CEST 2007


Martin Henry H. Stevens sent the following  at 08/06/2007 15:11:
> Is there an example available of this sort of problematic data that  
> requires this kind of data screening and filtering? For many of us,  
> this issue would be nice to learn about, and deal with within R. If a  
> package could be created, that would be optimal for some of us. I  
> would like to learn a tad more, if it were not too much effort for  
> someone else to point me in the right direction?
> Cheers,
> Hank
> On Jun 8, 2007, at 8:47 AM, Douglas Bates wrote:
> 
>> On 6/7/07, Robert Wilkins <irishhacker at gmail.com> wrote:
>>> As noted on the R-project web site itself ( www.r-project.org ->

... rest snipped ...

OK, I can't resist that invitation.  I think there are many kinds of
problematic data.  I handle some nasty textish things in perl (and I
loved the purgatory quote) and I'm afraid I do some things in Excel and
some cleaning I can handle in R, but I never enter data directly into R.

However, one very common scenario I have faced all my working life is
psych data from questionnaires or interviews in low budget work, mostly
student research or routine entry of therapists' data.  Typically you
have an identifier, a date, some demographics and then a lot of item
data.  There's little money (usually zero) involved for data entry and
cleaning but I've produced a lot of good(ish) papers out of this sort of
very low budget work over the last 20 years.  (Right at the other end of
a financial spectrum from the FDA/validated s'ware thread but this is
about validation again!)

The problem I often face is that people are lousy data entry machines
(well, actually, they vary ... enormously) and if they mess up the data
entry we all know how horrible this can be.

SPSS (boo hiss) used to have an excellent "module", actually a
standalone PC/Windoze program, that allowed you to define variables so
they had allowed values and it would refuse to accept out-of-range or
otherwise unacceptable entries.  It also allowed you to create checking
rules and rules that would, in the light of earlier entries, set later
values and not ask about them.  In a rudimentary way you could also lay things
out on the screen so that it paginated where the q'aire or paper data
record did etc.  The final nice touch was that you could define some
variables as invariant and then set the thing so an independent data
entry person could re-enter the other data (i.e. pick up q'aire, see if
ID fits the one showing on screen, if so, enter the rest of the data).
It would bleep and not move on if you entered a value other than that
entered by the first person and you had to confirm that one of you was
right.
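
The allowed-values and skip-rule checks described above can at least be
approximated after the event in base R.  What follows is only a rough
sketch: the variables, the ranges and the `rules` list are invented for
illustration, and it says nothing about how the SPSS program worked
internally.

```r
# Hypothetical questionnaire data: id, sex, age, one item scored 0-4,
# and a question that only applies to women (NA = not asked).
dat <- data.frame(
  id       = c(1, 2, 3, 4),
  sex      = c("F", "M", "X", "M"),
  age      = c(34, 210, 28, 45),
  item1    = c(0, 4, 7, 2),
  pregnant = c("no", NA, "no", "yes"),
  stringsAsFactors = FALSE
)

# Allowed values: one logical test per variable (TRUE = entry acceptable).
# These ranges are assumptions made up for the example.
rules <- list(
  sex   = function(x) x %in% c("F", "M"),
  age   = function(x) !is.na(x) & x >= 16 & x <= 100,
  item1 = function(x) x %in% 0:4
)

# Collect the id and variable of every entry that fails its rule.
bad <- do.call(rbind, lapply(names(rules), function(v) {
  rows <- which(!rules[[v]](dat[[v]]))
  if (length(rows) == 0) return(NULL)
  data.frame(id = dat$id[rows], variable = v,
             value = as.character(dat[[v]][rows]),
             stringsAsFactors = FALSE)
}))

# Skip rule: men should have no answer to the pregnancy question.
skip_violation <- dat$id[dat$sex %in% "M" & !is.na(dat$pregnant)]

print(bad)
print(skip_violation)
```

This only flags problems once the data are already in a data frame; it
does not stop a bad value being typed in the first place, which is what
the data entry program did.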

I'm sure that saved me weeks I would otherwise have wasted analysing
data that turned out to be awful, and I'd love to see someone build
something to replace it.

Currently I tend to use (boo hiss) Excel for this as everyone I work
with seems to have it (and not all can install OpenOffice and anyway I
haven't had time to learn that properly yet either ...) and I set up
spreadsheets with validation rules set.  That doesn't get the branching
rules and checks (e.g. if male, skip questions about periods, PMT and
pregnancies), or at least, with my poor Excel skills it doesn't.  I just
skip a column to indicate page breaks in the q'aire, and I get, when I
can, two people to enter the data separately and then use R to compare
the two spreadsheets having yanked them into data frames.
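
As a minimal sketch of that comparison step, assuming both entries have
already been read into data frames that share an `id` column (the names
`entry1`, `entry2` and the item columns are invented for illustration),
something along these lines in base R lists the cells where the two
entries disagree:

```r
# Two hypothetical entries of the same questionnaires by different people.
entry1 <- data.frame(id    = c(101, 102, 103),
                     item1 = c(1, 3, 2),
                     item2 = c(4, 0, NA))
entry2 <- data.frame(id    = c(103, 101, 102),
                     item1 = c(2, 1, 4),
                     item2 = c(NA, 4, 0))

# Line the second entry up with the first by identifier.
entry2 <- entry2[match(entry1$id, entry2$id), ]

# Cells disagree if the values differ or only one of them is missing.
vars <- setdiff(names(entry1), "id")
a <- as.matrix(entry1[vars])
b <- as.matrix(entry2[vars])
differs <- (is.na(a) != is.na(b)) | (!is.na(a) & !is.na(b) & a != b)

# One row per disagreement, ready to check against the paper record.
hits <- which(differs, arr.ind = TRUE)
mismatches <- data.frame(id       = entry1$id[hits[, "row"]],
                         variable = vars[hits[, "col"]],
                         first    = a[hits],
                         second   = b[hits])
print(mismatches)
```

The sketch assumes every identifier appears exactly once in both
entries; identifiers missing from one side would need checking
separately before the comparison means anything.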

I would really, really love someone to develop (and perhaps replace) the
rather buggy edit() and fix() routines (seem to hang on big data frames
in Rcmdr which is what I'm trying to get students onto) with something
that did some or all of what SPSS/DE used to do for me or I bodge now in
Excel.  If any generous coding whiz is willing to do this, I'll try to
alpha and beta test and write help etc.

There _may_ be good open source things out there that do what I need but
something that really integrated into R would be another huge step
forward in being able to phase out SPSS in my work settings and phase in R.

Very best all,

Chris



-- 
Chris Evans <chris at psyctc.org> Skype: chris-psyctc
Professor of Psychotherapy, Nottingham University;
Consultant Psychiatrist in Psychotherapy, Notts PDD network;
Research Programmes Director, Nottinghamshire NHS Trust;
*If I am writing from one of those roles, it will be clear. Otherwise*
*my views are my own and not representative of those institutions    *


