[R] Creating a file with reusable functions accessible throughout a computational biology cancer project

Tue Jun 7 18:53:19 CEST 2011

On 07/06/2011 12:41 PM, Ben Ganzfried wrote:
> Hi,
>
> My project is set up the following way:
> root directory contains the following folders:
>    folders: "Breast_Cancer" AND "Colorectal_Cancer" AND "Lung_Cancer" AND
> "Prostate_Cancer"
>
> I want to create a file, call it: "repeating_functions.R" and place it in
> the root directory such that I can call these functions from within the
> sub-folders in each type of cancer.  My confusion is that I'm not sure of
> the syntax to make this happen.  For example:
>
> Within the "Prostate_Cancer" folder, I have the following folders:
> "curated" AND "src" AND "uncurated"
>
> Within "uncurated" I have a ton of files, one of which could be:
> PMID5377_fullpdata.csv
>
> within "src" I have my R scripts, the one corresponding to the above
> "uncurated" file would be:
> PMID5377_curation.R
>
> Here's the problem I'm trying to address:
> Many of the uncurated files will require the same R code to curate them and
> I find myself spending a lot of time copying and pasting the same code over
> and over. I've spent at least 40 hours copying code I've already written and
> pasting it into a new dataset.  There has simply got to be a better way to
> do this.

There is:  you should put your common functions in a package.  Packages
are a good way to organize your own code; you don't need to publish
them.  (You will get a warning if you put "Not for distribution" into
the License field in the DESCRIPTION file, but it's just a warning.)
You can also put datasets in a package; this makes sense if they are
relatively static.  If you get new data every day, you probably
wouldn't want to.

> A common example of the code I'll write in an "uncurated" file is the
> following (let's call the following snippet of code UNCURATED_EXAMPLE1):
> ##characteristics_ch1.2 ->  G
> tmp<- uncurated$characteristics_ch1.2
> tmp<- sub("grade: ","",tmp,fixed=TRUE)
> tmp[tmp=="I"]<- "low"
> tmp[tmp=="II"]<- "low"
> tmp[tmp=="III"]<- "high"
> curated$G<- tmp
>
> The thing that changes depending on the dataset is *typically* the column
> header (ie "uncurated$characteristics_ch1.2" might be
> "uncurated$description" or "uncurated_characteristics_ch1.7" depending on
> the dataset), although sometimes I want to substitute different words (ie
> "grade" can be referred to in many different ways).
>
> What's the easiest way to automate this?  I'd like, at a minimum, to make
> UNCURATED_EXAMPLE1 look like the following:
> tmp<- uncurated$characteristics_ch1.2
> insert_call_to_repeating_functions.R_and_access_("grade")_function
> curated$G<- tmp
>
> It would be even better if I could say, for Prostate_Cancer, write one R
> script that standardizes all the "uncurated" datasets; rather than writing
> 100 different R scripts.  Although I don't know how feasible this is.

Both of those sound very easy.   For example,

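# Strip a label such as "grade: " from a column, then recode
# grades I and II as "low" and grade III as "high".
curate <- function(characteristic, word="grade: ") {
   # fixed=TRUE treats 'word' as a literal string, not a regular expression
   tmp <- sub(word, "", characteristic, fixed=TRUE)
   tmp[tmp=="I"] <- "low"
   tmp[tmp=="II"] <- "low"
   tmp[tmp=="III"] <- "high"
   tmp
}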

Then your script would just need one line

curated$G <- curate(uncurated$characteristics_ch1.2)
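
As a quick check of how this behaves when the label differs between
datasets, here is a small sketch on made-up data.  The two data frames,
and the "tumor grade: " wording in the second one, are assumptions for
illustration only; they are not taken from your files.

```r
# curate() repeated here so the example runs on its own
curate <- function(characteristic, word = "grade: ") {
  tmp <- sub(word, "", characteristic, fixed = TRUE)
  tmp[tmp == "I"] <- "low"
  tmp[tmp == "II"] <- "low"
  tmp[tmp == "III"] <- "high"
  tmp
}

# Made-up data: two hypothetical datasets that label grade differently
uncurated_a <- data.frame(characteristics_ch1.2 = c("grade: I", "grade: III"),
                          stringsAsFactors = FALSE)
uncurated_b <- data.frame(description = c("tumor grade: II", "tumor grade: III"),
                          stringsAsFactors = FALSE)

# Default label
grade_a <- curate(uncurated_a$characteristics_ch1.2)

# Different column and different label, same function
grade_b <- curate(uncurated_b$description, word = "tumor grade: ")
```

Only the column and the word argument change from one dataset to the
next; the recoding itself lives in one place.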

I don't know where you'll find the names of all the datasets, but if you 
can get them into a vector, it's pretty easy to write a loop that calls 
curate() for each one.
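
A minimal sketch of such a loop, assuming every uncurated file is a CSV
with a characteristics_ch1.2 column.  The temporary directory, the file
contents, and the second file name (PMID0000_fullpdata.csv) are made up
so the example runs anywhere; in practice you would point list.files()
at your real "uncurated" folder.

```r
# curate() repeated here so the example runs on its own
curate <- function(characteristic, word = "grade: ") {
  tmp <- sub(word, "", characteristic, fixed = TRUE)
  tmp[tmp == "I"] <- "low"
  tmp[tmp == "II"] <- "low"
  tmp[tmp == "III"] <- "high"
  tmp
}

# Stand-in for an "uncurated" folder: write two small made-up CSV files
uncurated_dir <- file.path(tempdir(), "uncurated")
dir.create(uncurated_dir, showWarnings = FALSE)
write.csv(data.frame(characteristics_ch1.2 = c("grade: I", "grade: III")),
          file.path(uncurated_dir, "PMID5377_fullpdata.csv"), row.names = FALSE)
write.csv(data.frame(characteristics_ch1.2 = c("grade: II", "grade: II")),
          file.path(uncurated_dir, "PMID0000_fullpdata.csv"), row.names = FALSE)

# Get the dataset names into a vector
files <- list.files(uncurated_dir, pattern = "_fullpdata\\.csv$",
                    full.names = TRUE)

# Call curate() for each dataset, collecting results in a named list
curated_all <- list()
for (f in files) {
  uncurated <- read.csv(f, stringsAsFactors = FALSE)
  curated <- data.frame(G = curate(uncurated$characteristics_ch1.2),
                        stringsAsFactors = FALSE)
  curated_all[[basename(f)]] <- curated
}
```

If the grade column has a different name in some datasets, one option
is to keep a small lookup (file name to column name) and index into it
inside the loop.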

Deciding how much goes in the package and how much is one-off code that 
stays with a particular dataset is a judgment call.  I'd guess based on 
your description that curate() belongs in the package but the rest 
doesn't, but you know a lot more about the details than I do.
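
If you haven't made a package before, one way to get started is
package.skeleton() from the utils package, which builds the directory
layout around an existing code file.  The sketch below is only an
illustration: the package name "curationTools" is made up, and it
writes into a temporary directory rather than your project root.

```r
# Write the shared function to a file (stand-in for repeating_functions.R)
code_file <- file.path(tempdir(), "repeating_functions.R")
writeLines(c(
  'curate <- function(characteristic, word = "grade: ") {',
  '  tmp <- sub(word, "", characteristic, fixed = TRUE)',
  '  tmp[tmp == "I"] <- "low"',
  '  tmp[tmp == "II"] <- "low"',
  '  tmp[tmp == "III"] <- "high"',
  '  tmp',
  '}'
), code_file)

# Create the package skeleton; "curationTools" is a hypothetical name
pkg_parent <- tempdir()
utils::package.skeleton(name = "curationTools", code_files = code_file,
                        path = pkg_parent, force = TRUE)

# Inspect what was generated (DESCRIPTION, R/, man/, ...)
pkg_dir <- file.path(pkg_parent, "curationTools")
list.files(pkg_dir)
```

After editing the generated DESCRIPTION and help-file stubs (the
skeleton includes a Read-and-delete-me file describing the remaining
steps), you would install it with R CMD INSTALL and then each curation
script only needs library(curationTools) at the top.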

Duncan Murdoch

> I'm sorry if this sounds confusing.  Basically, I have thousands of
> "uncurated" datasets with clinical information and I'm trying to standardize
> all the datasets via R scripts so that all the information is standardized
> for statistical analysis.  Not all of the datasets contain the same
> information, but many of them do contain similar data (ie age, stage, grade,
> days_to_recurrence, and many others).  Furthermore, in many cases the
> standardization code is very similar across datasets (ie I'll want to delete
> the words "Age: " before the actual number).  But this is not always the
> case (ie sometimes a dataset will not put the different patient data (ie
> age, stage, grade) in separate columns, instead putting it all in one
> column, so I have to write a different function to split it by the ";" and
> make a new table that is separated by column).  Anyway, I would be forever
> grateful for any advice to make this quicker and am happy to provide any
> clarifications.
>
> Thank you very much.
>
> Ben