[R] Creating a file with reusable functions accessible throughout a computational biology cancer project
Duncan Murdoch
murdoch.duncan at gmail.com
Tue Jun 7 18:53:19 CEST 2011
On 07/06/2011 12:41 PM, Ben Ganzfried wrote:
> Hi,
>
> My project is set up the following way:
> root directory contains the following folders:
> folders: "Breast_Cancer" AND "Colorectal_Cancer" AND "Lung_Cancer" AND
> "Prostate_Cancer"
>
> I want to create a file, call it: "repeating_functions.R" and place it in
> the root directory such that I can call these functions from within the
> sub-folders in each type of cancer. My confusion is that I'm not sure of
> the syntax to make this happen. For example:
>
> Within the "Prostate_Cancer" folder, I have the following folders:
> "curated" AND "src" AND "uncurated"
>
> Within "uncurated" I have a ton of files, one of which could be:
> PMID5377_fullpdata.csv
>
> within "src" I have my R scripts, the one corresponding to the above
> "uncurated" file would be:
> PMID5377_curation.R
>
> Here's the problem I'm trying to address:
> Many of the uncurated files will require the same R code to curate them and
> I find myself spending a lot of time copying and pasting the same code over
> and over. I've spent at least 40 hours copying code I've already written and
> pasting it into a new dataset. There has simply got to be a better way to
> do this.
There is: you should put your common functions in a package. Packages
are a good way to organize your own code; you don't need to publish
them. (You will get a warning if you put "Not for distribution" into
the License field in the DESCRIPTION file, but it's just a warning.)
You can also put datasets in a package; this makes sense if they are
relatively static. If you get new data every day, you probably wouldn't
want to.
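As a rough sketch of getting started, utils::package.skeleton() can
generate the package directory from an existing file of functions. The
package name "curationTools", the helper strip_prefix() and the temporary
directory are made up for illustration; in practice you would point
code_files at your own repeating_functions.R.

```r
# Hypothetical file of shared functions, written to a temporary directory
# so the example does not touch a real project.
work_dir <- file.path(tempdir(), "pkg_demo")
dir.create(work_dir, showWarnings = FALSE)
code_file <- file.path(work_dir, "repeating_functions.R")
writeLines(c(
  "# Remove a fixed prefix such as 'Age: ' from each element of x",
  "strip_prefix <- function(x, prefix) {",
  "  sub(prefix, \"\", x, fixed = TRUE)",
  "}"
), code_file)

# Create the package skeleton (DESCRIPTION, NAMESPACE, R/, man/).
# "curationTools" is an assumed name, not one from the original question.
utils::package.skeleton(name = "curationTools",
                        code_files = code_file,
                        path = work_dir,
                        force = TRUE)

# Inspect what was generated
list.files(file.path(work_dir, "curationTools"), recursive = TRUE)
```

After editing the generated DESCRIPTION and help files, the package can
be installed with R CMD INSTALL and loaded from any script in any of the
cancer folders with library(curationTools).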
> A common example of the code I'll write in an "uncurated" file is the
> following (let's call the following snippet of code UNCURATED_EXAMPLE1):
> ##characteristics_ch1.2 -> G
> tmp<- uncurated$characteristics_ch1.2
> tmp<- sub("grade: ","",tmp,fixed=TRUE)
> tmp[tmp=="I"]<- "low"
> tmp[tmp=="II"]<- "low"
> tmp[tmp=="III"]<- "high"
> curated$G<- tmp
>
> The thing that changes depending on the dataset is *typically* the column
> header (i.e. "uncurated$characteristics_ch1.2" might be
> "uncurated$description" or "uncurated$characteristics_ch1.7" depending on
> the dataset), although sometimes I want to substitute different words (i.e.
> "grade" can be referred to in many different ways).
>
> What's the easiest way to automate this? I'd like, at a minimum, to make
> UNCURATED_EXAMPLE1 look like the following:
> tmp<- uncurated$characteristics_ch1.2
> insert_call_to_repeating_functions.R_and_access_("grade")_function
> curated$G<- tmp
>
> It would be even better if I could say, for Prostate_Cancer, write one R
> script that standardizes all the "uncurated" datasets; rather than writing
> 100 different R scripts. Although I don't know how feasible this is.
Both of those sound very easy. For example,
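    # Strip the label given in 'word' and recode grades I/II as "low"
    # and grade III as "high".
    curate <- function(characteristic, word = "grade: ") {
        tmp <- sub(word, "", characteristic, fixed = TRUE)
        tmp[tmp == "I"] <- "low"
        tmp[tmp == "II"] <- "low"
        tmp[tmp == "III"] <- "high"
        tmp
    }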
Then your script would just need one line:

    curated$G <- curate(uncurated$characteristics_ch1.2)
I don't know where you'll find the names of all the datasets, but if you
can get them into a vector, it's pretty easy to write a loop that calls
curate() for each one.
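Here is one way such a loop might look. It is only a sketch: the two
small CSV files, the PMID0001/PMID0002 names and the grade_column lookup
table are invented for the example, and the real files will need their
own entries for the column name and the word to strip.

```r
# curate() as defined earlier in this message
curate <- function(characteristic, word = "grade: ") {
  tmp <- sub(word, "", characteristic, fixed = TRUE)
  tmp[tmp == "I"] <- "low"
  tmp[tmp == "II"] <- "low"
  tmp[tmp == "III"] <- "high"
  tmp
}

# Hypothetical stand-in for an "uncurated" folder with two datasets
uncurated_dir <- file.path(tempdir(), "uncurated_demo")
dir.create(uncurated_dir, showWarnings = FALSE)
write.csv(data.frame(characteristics_ch1.2 = c("grade: I", "grade: III")),
          file.path(uncurated_dir, "PMID0001_fullpdata.csv"), row.names = FALSE)
write.csv(data.frame(description = c("grade: II", "grade: III")),
          file.path(uncurated_dir, "PMID0002_fullpdata.csv"), row.names = FALSE)

# Assumed lookup: which column holds the grade in each dataset
grade_column <- c("PMID0001_fullpdata.csv" = "characteristics_ch1.2",
                  "PMID0002_fullpdata.csv" = "description")

# Get the dataset names into a vector, then loop over them
files <- list.files(uncurated_dir, pattern = "_fullpdata\\.csv$",
                    full.names = TRUE)

curated_list <- list()
for (f in files) {
  uncurated <- read.csv(f, stringsAsFactors = FALSE)
  column <- grade_column[[basename(f)]]
  curated_list[[basename(f)]] <-
    data.frame(G = curate(uncurated[[column]]), stringsAsFactors = FALSE)
}
```

Each element of curated_list is then a data frame with the standardized
G column for one dataset. Keeping the per-dataset differences in a small
lookup table means the loop itself never has to change.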
Deciding how much goes in the package and how much is one-off code that
stays with a particular dataset is a judgment call. I'd guess based on
your description that curate() belongs in the package but the rest
doesn't, but you know a lot more about the details than I do.
Duncan Murdoch
> I'm sorry if this sounds confusing. Basically, I have thousands of
> "uncurated" datasets with clinical information and I'm trying to standardize
> all the datasets via R scripts so that all the information is standardized
> for statistical analysis. Not all of the datasets contain the same
> information, but many of them do contain similar data (ie age, stage, grade,
> days_to_recurrence, and many others). Furthermore, in many cases the
> standardization code is very similar across datasets (ie I'll want to delete
> the words "Age: " before the actual number). But this is not always the
> case (ie sometimes a dataset will not put the different patient data (ie
> age, stage, grade) in separate columns, instead putting it all in one
> column, so I have to write a different function to split it by the ";" and
> make a new table that is separated by column). Anyway, I would be forever
> grateful for any advice to make this quicker and am happy to provide any
> clarifications.
>
> Thank you very much.
>
> Ben