[R] Changing cell value for MANY unique pairings of values in 2 columns

David Winsemius dwinsemius at comcast.net
Sat Mar 18 18:33:32 CET 2017


> On Mar 18, 2017, at 9:52 AM, David Winsemius <dwinsemius at comcast.net> wrote:
> 
> 
>> On Mar 17, 2017, at 11:33 AM, Alicia Ellis <alicia.m.ellis at gmail.com> wrote:
>> 
>> am cleaning some very messy health record lab data.  Several of the rows
>> in the VALUE column have text entries and they need to be converted to
>> numeric in the NUMERIC_VALUE column based on the values in VALUE and
>> DESCRIPTION.  For example:
>> 
>> df <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than 0.30",
>> "12%", "<0.2", "Unknown"),
>>                DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
>>                NUMERIC_VALUE=c(9, 9,9,9,9,9,9))
>> df
>> 
>> df$NUMERIC_VALUE[df$VALUE == "Positive" & df$DESCRIPTION == "A"]=999999999
>> 
>> 
>> However, I need to do this for ~500 unique pairings of VALUE and
>> DESCRIPTION entries.  I'm trying to find an easy way to do this without
>> having to have 500 lines of code for each unique pairing.  Some of the
>> pairings will be changed to the same value (e.g., 99999999, or -999999999)
>> but many will be unique numeric values.
> 
> I'm not convinced that you have the necessary scientific background to do this job properly. There are different sorts of lab tests: enzymatic activity, solute concentrations, and viral titers come to mind immediately for which the handling would be materially different. If you have a serum sodium level of less than 100 mEq/ml, then that is a value inconsistent with human life and the value should be set to NA. If you have a value of alkaline phosphatase that is 0 it is most suggestive of specimen mishandling. If you have a hepatitis B antigen level of "Positive", then it seems a perfectly informative value that should not be changed.
> 
> Furthermore the attempt to change vlaues that you think are missing to 99999999, or -999999999 is simply wrongheaded. Learn to use the missing value indicator NA rather thna setting these to a numeric value.
> 
> You should seek advice within your organization before you charge ahead with this strategy.

I've been informed that my concerns are misplaced and that the scientific concerns I raised were inflated. The "answer" then might be:

merge( df, rules, by=1:2, all.x=TRUE)  # Or

merge( df, rules, by=c("VALUE", "DESCRIPTION") , all.x=TRUE)


Which returns in this example:


           VALUE DESCRIPTION NUMERIC_VALUE  NEW_VALUE
1           <0.2           E         9e+00       0.10
2            <60           A         9e+00      60.00
3            12%           D         9e+00         NA
4 Less than 0.30           C         9e+00       0.29
5       Negative           B         9e+00 -999999.00
6       Positive           A         1e+09         NA
7        Unknown           E         9e+00  777777.00

I'm leaving it in this form because I remain worried. I have no particular concerns about setting a value of "<0.2" to 0.1 but worry about setting values of  "<60" to 60.00 or "Less than 0.30" to 0.29 don't really accord with practice that I have seen in analysis of laboratory values. And I remain concerned that using flagrantly false numeric values as indicators will create serious errors down the line if this data is ever used by someone who has not been involved with its manipulation or has even forgotten months later how it was massaged. 

-- 
David.
> -- 
> David.
>> 
>> 
>> I've started by creating a new object called rules where a SUBSET of df
>> rows are included with the new value they should be changed to.
>> 
>> 
>> rules <- data.frame(VALUE = c("<60",  "Negative", "Less than 0.30", "<0.2",
>> "Unknown"),
>>                   DESCRIPTION = c("A", "B", "C", "E", "E"),
>>                   NEW_VALUE=c(60, -999999,0.29,0.1,777777))
>> rules
>> 
>> 
>> I tried doing a loop to change the values in df based on the suggested
>> value in rules:
>> 
>> for (i in (1 : nrow(rules))) {
>> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
>> [i,2]]=rules[i,3]
>> }
>> df
>> 
>> This gives the error:
>> 
>> Error in Ops.factor(df$VALUE, rules[i, 1]) :
>> level sets of factors are differentwork and I think because when I write
>> 
>> If I create rules using the exact same levels as df it works:
>> 
>> rules <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than
>> 0.30", "12%", "<0.2", "Unknown"),
>>                   DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
>>                   NEW_VALUE=c(60, 999999,-999999,0.29,12,0.1,777777))
>> rules
>> 
>> for (i in (1 : nrow(rules))) {
>> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
>> [i,2]]=rules[i,3]
>> }
>> df
>> 
>> 
>> Can anyone suggest a way to modify my for loop so that it works for a
>> subset of rows in df and accomplish what I want?  Or suggest a completely
>> different method that works?
>> 
>> 
>> Thanks!
>> 
>> 	[[alternative HTML version deleted]]
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> David Winsemius
> Alameda, CA, USA
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA



More information about the R-help mailing list