[R] Problem with parsing a dataset - help earnestly sought

Fri Apr 23 11:13:07 CEST 2010

We can use strapply in the gsubfn package. It extracts
fields matching regular expressions.

strapply extracts the parenthesized part of the regular
expression (or the entire match if nothing is
parenthesized), applies the function to it and returns
the result.  See http://gsubfn.googlecode.com

This follows the rules stated in the code comments below and
works on your example, but the general rules may only become
apparent with more data, in which case you may need to adjust
the regular expressions.

Notes on the regular expressions:

\w refers to a word character and must be written \\w when within quotes.
+ means one or more occurrences in a row
$ means end of string
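
A quick check of these three pieces of notation, using one of the strings from your example. This is only a sketch in base R (regmatches, regexpr and grepl) to show what the patterns match; it is not part of the gsubfn solution.

```r
x <- "Y; DX AGE: 28; COLON"

# \\w+ : the first run of one or more word characters
first_word <- regmatches(x, regexpr("\\w+", x))

# \\w\\w+$ : two or more word characters anchored at the end of the string
last_word <- regmatches(x, regexpr("\\w\\w+$", x))

# the end-of-string anchor fails when the string ends in a semicolon
ends_in_word <- grepl("\\w\\w+$", "Y; DX AGE: 27;")

print(first_word)
print(last_word)
print(ends_in_word)
```

The two-character minimum in the last pattern is what keeps a bare status such as "Y" from being picked up as a tissue.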

library(gsubfn)

# strapply returns NULL where there is no match; turn that into NA
null2NA <- function(x) if (is.null(x)) NA else x

extract <- function(x) {

	# age is "word" that comes after DX AGE:
	age <- strapply(x, "DX AGE: (\\w+)", c)
	age <- sapply(age, null2NA)

	# tissue is 2 or more word characters at end
	tissue <- strapply(x, "\\w\\w+$", c)
	tissue <- sapply(tissue, null2NA)

	data.frame(age, tissue)
}

extract(data[,2])
extract(data[,3])
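
If you would rather not install a package, roughly the same extraction can be sketched in base R with regexec and regmatches. This is an alternative under the same assumed rules, not the gsubfn approach above: the helper names capture1 and extract_base are made up for illustration, and the status column (first word of the string) is my guess at what you want for Status.

```r
# Hypothetical helper: first captured group of `pattern` in each element
# of x, or NA where the pattern does not match.
capture1 <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x))
  sapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_)
}

# Hypothetical base-R counterpart of extract(), with a status column added.
extract_base <- function(x) {
  data.frame(
    status = capture1(x, "^(\\w+)"),          # first word in the string
    age    = capture1(x, "DX AGE: (\\w+)"),   # word after "DX AGE: "
    tissue = capture1(x, "(\\w\\w+)$"),       # 2+ word characters at end
    stringsAsFactors = FALSE
  )
}

# The cancer.problems values from the original post
cancer <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;", "Y; LIVER", "",
            "Y", "Y; DX AGE: 24;", "Y", "Y;DX AGE: 44;", "Y;DX AGE: 39; TESTIS")
res <- extract_base(cancer)
print(res)
```

Note that with either version an age range such as "80-89" is captured only up to the hyphen, since - is not a word character; a pattern like "DX AGE: ([0-9-]+)" would be one way to keep the whole range if that matters.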

On Fri, Apr 23, 2010 at 12:24 AM, Min-Han Tan <minhan.science at gmail.com> wrote:
> Dear fellow R-help members,
>
> I hope to seek your advice on how to parse/manage a dataset with hundreds of
> columns. Two examples of these columns, 'cancer.problems', and
> 'neuro.problems' are depicted below. Essentially, I need to parse this into
> a useful dataset, and unfortunately, I am not familiar with perl or any such
> language.
>
> data <- data.frame(id=c(1:10))
> data$cancer.problems <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;",
> "Y; LIVER","","Y","Y; DX AGE: 24;","Y","Y;DX AGE: 44;","Y;DX AGE: 39; TESTIS")
> data$neuro.problems <- c("Y: DX AGE: 80-89;","Y","",
> "Y; DX AGE: 74; STROKE","Y; DEMENTIA","Y","","Y; DX AGE: 33; CHOREA", "Y", "Y; WEAKNESS")
>
> As can be seen, the semi-colon delimiter follows its own set of rules, which
> are internally consistent - with all 3 elements of data, it should be
> "Status; Age; Tissue Type". However, if there is only tissue type, it is
> "Status; Tissue Type", without the trailing semi-colon. However, if there is
> Age available, it is "Status; Age;".
>
> The main challenge for me is how to parse/convert this dataset into a useful
> and consistent data.frame, or list, where I can capture Status, Age and
> Tissue Type as separate fields. Due to the varying application of the
> delimiter, I cannot use strsplit consistently. I have tried a convoluted
> method by identifying "AGE" as the character string identifying 3 element
> fields per below, but faced problems with unlist, given the empty fields.
>
> age.present <- grepl("AGE",data[,2])
> data.3column <- strsplit(data[age.present,2],";")
> data.2column <- strsplit(data[!age.present,2],";")
> data$cancer.status[age.present] <- unlist(data.3column)[(1:sum(age.present)*3)-2]
> ...
>
> Your advice is earnestly sought.
>
> Thanks.
>
> Min-Han