[R] Tools For Preparing Data For Analysis

Stephen Tucker brown_emu at yahoo.com
Sun Jun 10 21:27:50 CEST 2007


Since R is supposed to be a complete programming language, I wonder
why these tools couldn't be implemented in R (unless speed is the
issue). Of course, it's a naive desire to have a single language that
does everything, but it seems that R currently has most of the
functions necessary to do the type of data cleaning described.

For instance, Gabor and Peter showed some snippets of ways to do this
elegantly; my [physical science] data is usually not as horrendously
structured, so I can often get away with a program containing this
type of code:

txtin <- scan(filename,what="",sep="\n")
filteredList <- lapply(strsplit(txtin,delimiter),FUN=filterfunction)
   # filterfunction() returns the selected (and possibly transformed)
   # elements if present, and NULL otherwise;
   # it may include calls to grep(), regexpr(), gsub(), substring(),
   # nchar(), sscanf(), type.convert(), paste(), etc.
mydataframe <- do.call(rbind,filteredList)
   # then match(), subset(), aggregate(), etc.
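
As a minimal, self-contained sketch of the pattern above: the input file,
its "name value unit" layout, and the body of filterfunction() are all
hypothetical, made up only to give the skeleton something to run on.

```r
# Hypothetical input: "name value unit" lines mixed with junk lines.
tmp <- tempfile()
writeLines(c("# run 1",
             "temp 21.5 C",
             "",
             "pres 1013 hPa",
             "bad line",
             "temp 22.1 C"), tmp)

# Hypothetical filter: keep records with exactly three fields whose
# second field looks numeric; return NULL for everything else.
filterfunction <- function(fields) {
  if (length(fields) != 3 || !grepl("^[0-9.]+$", fields[2])) return(NULL)
  data.frame(name  = fields[1],
             value = as.numeric(fields[2]),
             unit  = fields[3],
             stringsAsFactors = FALSE)
}

txtin <- scan(tmp, what = "", sep = "\n", quiet = TRUE)
filteredList <- lapply(strsplit(txtin, " +"), FUN = filterfunction)
# rbind() ignores the NULL entries, so only the kept records remain
mydataframe <- do.call(rbind, filteredList)
print(mydataframe)
```

Returning a one-row data frame from the filter keeps the column types
intact when the pieces are bound together; returning a character vector
instead would give a character matrix that needs type.convert() afterwards.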

When the file is large, I open a file connection and, in a for loop,
scan one line at a time and apply filterfunction() to it instead of
using lapply(). Of course, the devil is in the details of the
filtering function, but I believe most of the required text
processing facilities are already provided by R.
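
A rough sketch of the line-at-a-time variant follows. It is an assumption
about how such a loop might look, not the exact code described: it uses
readLines(con, n = 1) inside a repeat loop rather than scan() in a for loop,
because the number of lines is not known in advance. The file contents and
filterfunction() are again hypothetical.

```r
# Hypothetical input file and filter, for illustration only.
tmp <- tempfile()
writeLines(c("temp 21.5 C", "bad line", "pres 1013 hPa"), tmp)

filterfunction <- function(fields) {
  if (length(fields) != 3) return(NULL)
  data.frame(name = fields[1], value = as.numeric(fields[2]),
             unit = fields[3], stringsAsFactors = FALSE)
}

con <- file(tmp, open = "r")
kept <- list()
repeat {
  line <- readLines(con, n = 1)          # one line per iteration
  if (length(line) == 0) break           # end of file reached
  rec <- filterfunction(strsplit(line, " +")[[1]])
  if (!is.null(rec)) kept[[length(kept) + 1]] <- rec
}
close(con)

mydataframe <- do.call(rbind, kept)
print(mydataframe)
```

Only the records that pass the filter are held in memory, which is the point
of reading from an open connection instead of loading the whole file.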

I often have tasks that involve a combination of shell-scripting and
text processing to construct the data frame for analysis; I started
out using Python+NumPy to do the front-end work, but R has
progressively taken over that portion (frankly, all of it), since I
generally prefer the data structures and methods in R.


--- Peter Dalgaard <p.dalgaard at biostat.ku.dk> wrote:

> Douglas Bates wrote:
> > Frank Harrell indicated that it is possible to do a lot of difficult
> > data transformation within R itself if you try hard enough but that
> > sometimes means working against the S language and its "whole object"
> > view to accomplish what you want and it can require knowledge of
> > subtle aspects of the S language.
> >   
> Actually, I think Frank's point was subtly different: It is *because* of 
> the differences in view that it sometimes seems difficult to find the 
> way to do something in R that is apparently straightforward in SAS. 
> I.e. the solutions exist and are often elegant, but may require some 
> lateral thinking.
> 
> Case in point: Finding the first or the last observation for each 
> subject when there are multiple records for each subject. The SAS way 
> would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that 
> you can compare the subject ID with the one from the previous record, 
> working with data that are sorted appropriately.
> 
> You can do the same thing in R with a for loop, but there are better 
> ways e.g.
> subset(df,!duplicated(ID)), and subset(df, rev(!duplicated(rev(ID)))), or 
> maybe
> do.call("rbind",lapply(split(df,df$ID), head, 1)), resp. tail. Or 
> something involving aggregate(). (The latter approaches generalize 
> better to other within-subject functionals like cumulative doses, etc.).
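
The expressions quoted above can be tried on a small example. The data
frame below and its columns are hypothetical, and the use of ave() for the
cumulative dose is one possible choice that the quoted text does not name.

```r
# Hypothetical data: several records per subject, already sorted by visit.
df <- data.frame(ID    = c(1, 1, 2, 2, 2, 3),
                 visit = c(1, 2, 1, 2, 3, 1),
                 dose  = c(10, 20, 5, 5, 10, 30))

# First and last record per subject via duplicated()
first_obs <- subset(df, !duplicated(ID))
last_obs  <- subset(df, rev(!duplicated(rev(ID))))

# The same via split() / lapply() / rbind()
first_split <- do.call("rbind", lapply(split(df, df$ID), head, 1))
last_split  <- do.call("rbind", lapply(split(df, df$ID), tail, 1))

# A within-subject functional: cumulative dose per subject
df$cumdose <- ave(df$dose, df$ID, FUN = cumsum)
print(df)
```

Both the duplicated() and the split() versions rely on the records being
in the desired order within each subject.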
> 
> The hardest cases that I know of are the ones where you need to turn one 
> record into many, such as occurs in survival analysis with 
> time-dependent, piecewise constant covariates. This may require 
> "transposing the problem", i.e. for each  interval you find out which 
> subjects contribute and with what, whereas the SAS way would be a 
> within-subject loop over intervals containing an OUTPUT statement.
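
One way to read "transposing the problem" is sketched below: loop over the
intervals rather than over the subjects, and for each interval collect the
subjects still under observation. The data, the cut points, and all column
names are hypothetical; this is an illustration of the idea, not a general
survival-data splitter.

```r
# Hypothetical data: one record per subject.
subj <- data.frame(ID = 1:3, time = c(2.5, 7, 4), event = c(1, 0, 1))
cuts <- c(0, 3, 6, Inf)          # hypothetical interval boundaries

pieces <- lapply(seq_len(length(cuts) - 1), function(i) {
  lo <- cuts[i]
  hi <- cuts[i + 1]
  d <- subj[subj$time > lo, ]    # subjects contributing to (lo, hi]
  data.frame(ID       = d$ID,
             interval = rep(i, nrow(d)),
             start    = rep(lo, nrow(d)),
             stop     = pmin(d$time, hi),
             # the event is counted only in the interval where it occurs
             event    = as.numeric(d$event == 1 & d$time <= hi))
})

long <- do.call(rbind, pieces)
long <- long[order(long$ID, long$start), ]
print(long)
```

Each subject now has one row per interval survived, and interval-specific
covariate values could be attached afterwards, for example with merge() on
the interval column.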
> 
> Also, there are some really weird data formats, where e.g. the input 
> format is different in different records. Back in the 80's, when 
> punched-card input was still common, it was quite popular to have one 
> card with background information on a patient plus several cards 
> detailing visits, and you'd get a stack of cards containing both kinds. 
> In R you would most likely split on the card type using grep() and then 
> read the two kinds separately and merge() them later.
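
A small sketch of that split-then-merge approach, under stated assumptions:
the card layout ("P" for patient background, "V" for visits), the fields,
and the column names are all invented for illustration. It uses grepl(), the
logical counterpart of grep(), so that negating the selection is safe.

```r
# Hypothetical deck: "P" cards hold background data, "V" cards hold visits.
cards <- c("P 001 M 1950",
           "V 001 1985-03-01 120",
           "V 001 1985-06-01 118",
           "P 002 F 1962",
           "V 002 1985-04-15 135")

is_patient <- grepl("^P", cards)

# Read each kind of card with its own format
patients <- read.table(text = cards[is_patient],
                       col.names  = c("type", "ID", "sex", "birthyear"),
                       colClasses = c("character", "character",
                                      "character", "integer"))
visits <- read.table(text = cards[!is_patient],
                     col.names  = c("type", "ID", "date", "sbp"),
                     colClasses = c("character", "character",
                                    "character", "integer"))

# Drop the card-type column and join background data onto each visit
both <- merge(patients[-1], visits[-1], by = "ID")
print(both)
```

Reading ID and sex as character keeps the leading zeros and stops "F" from
being converted to FALSE.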
> 


