[R] Changing cell value for MANY unique pairings of values in 2 columns
David Winsemius
dwinsemius at comcast.net
Sat Mar 18 17:52:26 CET 2017
> On Mar 17, 2017, at 11:33 AM, Alicia Ellis <alicia.m.ellis at gmail.com> wrote:
>
> am cleaning some very messy health record lab data. Several of the rows
> in the VALUE column have text entries and they need to be converted to
> numeric in the NUMERIC_VALUE column based on the values in VALUE and
> DESCRIPTION. For example:
>
> df <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than 0.30",
> "12%", "<0.2", "Unknown"),
> DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
> NUMERIC_VALUE=c(9, 9,9,9,9,9,9))
> df
>
> df$NUMERIC_VALUE[df$VALUE == "Positive" & df$DESCRIPTION == "A"]=999999999
>
>
> However, I need to do this for ~500 unique pairings of VALUE and
> DESCRIPTION entries. I'm trying to find an easy way to do this without
> having to have 500 lines of code for each unique pairing. Some of the
> pairings will be changed to the same value (e.g., 99999999, or -999999999)
> but many will be unique numeric values.
I'm not convinced that you have the necessary scientific background to do this job properly. There are different sorts of lab tests: enzymatic activity, solute concentrations, and viral titers come to mind immediately for which the handling would be materially different. If you have a serum sodium level of less than 100 mEq/ml, then that is a value inconsistent with human life and the value should be set to NA. If you have a value of alkaline phosphatase that is 0 it is most suggestive of specimen mishandling. If you have a hepatitis B antigen level of "Positive", then it seems a perfectly informative value that should not be changed.
Furthermore the attempt to change vlaues that you think are missing to 99999999, or -999999999 is simply wrongheaded. Learn to use the missing value indicator NA rather thna setting these to a numeric value.
You should seek advice within your organization before you charge ahead with this strategy.
--
David.
>
>
> I've started by creating a new object called rules where a SUBSET of df
> rows are included with the new value they should be changed to.
>
>
> rules <- data.frame(VALUE = c("<60", "Negative", "Less than 0.30", "<0.2",
> "Unknown"),
> DESCRIPTION = c("A", "B", "C", "E", "E"),
> NEW_VALUE=c(60, -999999,0.29,0.1,777777))
> rules
>
>
> I tried doing a loop to change the values in df based on the suggested
> value in rules:
>
> for (i in (1 : nrow(rules))) {
> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
> [i,2]]=rules[i,3]
> }
> df
>
> This gives the error:
>
> Error in Ops.factor(df$VALUE, rules[i, 1]) :
> level sets of factors are differentwork and I think because when I write
>
> If I create rules using the exact same levels as df it works:
>
> rules <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than
> 0.30", "12%", "<0.2", "Unknown"),
> DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
> NEW_VALUE=c(60, 999999,-999999,0.29,12,0.1,777777))
> rules
>
> for (i in (1 : nrow(rules))) {
> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
> [i,2]]=rules[i,3]
> }
> df
>
>
> Can anyone suggest a way to modify my for loop so that it works for a
> subset of rows in df and accomplish what I want? Or suggest a completely
> different method that works?
>
>
> Thanks!
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius
Alameda, CA, USA
More information about the R-help
mailing list