[R] Changing cell value for MANY unique pairings of values in 2 columns

Sat Mar 18 17:52:26 CET 2017

> On Mar 17, 2017, at 11:33 AM, Alicia Ellis <alicia.m.ellis at gmail.com> wrote:
> 
> am cleaning some very messy health record lab data.  Several of the rows
> in the VALUE column have text entries and they need to be converted to
> numeric in the NUMERIC_VALUE column based on the values in VALUE and
> DESCRIPTION.  For example:
> 
> df <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than 0.30",
> "12%", "<0.2", "Unknown"),
>                 DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
>                 NUMERIC_VALUE=c(9, 9,9,9,9,9,9))
> df
> 
> df$NUMERIC_VALUE[df$VALUE == "Positive" & df$DESCRIPTION == "A"]=999999999
> 
> 
> However, I need to do this for ~500 unique pairings of VALUE and
> DESCRIPTION entries.  I'm trying to find an easy way to do this without
> having to have 500 lines of code for each unique pairing.  Some of the
> pairings will be changed to the same value (e.g., 99999999, or -999999999)
> but many will be unique numeric values.

I'm not convinced that you have the necessary scientific background to do this job properly. There are different sorts of lab tests: enzymatic activity, solute concentrations, and viral titers come to mind immediately for which the handling would be materially different. If you have a serum sodium level of less than 100 mEq/ml, then that is a value inconsistent with human life and the value should be set to NA. If you have a value of alkaline phosphatase that is 0 it is most suggestive of specimen mishandling. If you have a hepatitis B antigen level of "Positive", then it seems a perfectly informative value that should not be changed.

Furthermore the attempt to change vlaues that you think are missing to 99999999, or -999999999 is simply wrongheaded. Learn to use the missing value indicator NA rather thna setting these to a numeric value.

You should seek advice within your organization before you charge ahead with this strategy.

-- 
David.
> 
> 
> I've started by creating a new object called rules where a SUBSET of df
> rows are included with the new value they should be changed to.
> 
> 
> rules <- data.frame(VALUE = c("<60",  "Negative", "Less than 0.30", "<0.2",
> "Unknown"),
>                    DESCRIPTION = c("A", "B", "C", "E", "E"),
>                    NEW_VALUE=c(60, -999999,0.29,0.1,777777))
> rules
> 
> 
> I tried doing a loop to change the values in df based on the suggested
> value in rules:
> 
> for (i in (1 : nrow(rules))) {
>  df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
> [i,2]]=rules[i,3]
> }
> df
> 
> This gives the error:
> 
> Error in Ops.factor(df$VALUE, rules[i, 1]) :
>  level sets of factors are differentwork and I think because when I write
> 
> If I create rules using the exact same levels as df it works:
> 
> rules <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than
> 0.30", "12%", "<0.2", "Unknown"),
>                    DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
>                    NEW_VALUE=c(60, 999999,-999999,0.29,12,0.1,777777))
> rules
> 
> for (i in (1 : nrow(rules))) {
>  df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
> [i,2]]=rules[i,3]
> }
> df
> 
> 
> Can anyone suggest a way to modify my for loop so that it works for a
> subset of rows in df and accomplish what I want?  Or suggest a completely
> different method that works?
> 
> 
> Thanks!
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA