[R] Changing cell value for MANY unique pairings of values in 2 columns

Jim Lemon drjimlemon at gmail.com
Tue Mar 21 10:21:24 CET 2017


Hi Alicia,
If I understand this, perhaps:

df<-data.frame(VALUE=c("<60","Positive","Negative","Less than 0.30",
 "12%","<0.2","Unknown"),
 DESCRIPTION = c("A","A", "B","C","D","E","E"),
 NUMERIC_VALUE=c(9,9,9,9,9,9,9))

df$NUMERIC_VALUE[df$VALUE == "Positive" & df$DESCRIPTION == "A"]<-
 999999999

rules<-data.frame(VALUE = c("<60","Negative","Less than 0.30",
 "<0.2","Unknown"),
 DESCRIPTION = c("A", "B", "C", "E", "E"),
 NEW_VALUE=c(60, -999999,0.29,0.1,777777))

for(row in 1:dim(df)[1]) {
 if(df$VALUE %in% rules$VALUE && df$DESCRIPTION %in% rules$DESCRIPTION) {
  which_value<-which(as.character(df$VALUE[row]) ==
   as.character(rules$VALUE))
  which_desc<-which(as.character(df$DESCRIPTION[row]) ==
   as.character(rules$DESCRIPTION))
  cat(which_value,which_desc,"\n")
  if(length(which_value) && length(which_desc) &&
   which_value == which_desc)
   df$NUMERIC_VALUE[row]<-rules$NEW_VALUE[which_value]
 }
}

Jim

On Tue, Mar 21, 2017 at 9:51 AM, Alicia Ellis <alicia.m.ellis at gmail.com> wrote:
> The solution proposed below does not accomplish my goal.  In the column
> labeled NEW_VALUE, there are two NAs where the value in NUMERIC_VALUE for
> that row should be.
>
> Can anyone else suggest a solution to my problem?
>
> Thanks!
>
> On Sat, Mar 18, 2017 at 1:33 PM, David Winsemius <dwinsemius at comcast.net>
> wrote:
>
>>
>> > On Mar 18, 2017, at 9:52 AM, David Winsemius <dwinsemius at comcast.net>
>> wrote:
>> >
>> >
>> >> On Mar 17, 2017, at 11:33 AM, Alicia Ellis <alicia.m.ellis at gmail.com>
>> wrote:
>> >>
>> >> am cleaning some very messy health record lab data.  Several of the rows
>> >> in the VALUE column have text entries and they need to be converted to
>> >> numeric in the NUMERIC_VALUE column based on the values in VALUE and
>> >> DESCRIPTION.  For example:
>> >>
>> >> df <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than
>> 0.30",
>> >> "12%", "<0.2", "Unknown"),
>> >>                DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
>> >>                NUMERIC_VALUE=c(9, 9,9,9,9,9,9))
>> >> df
>> >>
>> >> df$NUMERIC_VALUE[df$VALUE == "Positive" & df$DESCRIPTION ==
>> "A"]=999999999
>> >>
>> >>
>> >> However, I need to do this for ~500 unique pairings of VALUE and
>> >> DESCRIPTION entries.  I'm trying to find an easy way to do this without
>> >> having to have 500 lines of code for each unique pairing.  Some of the
>> >> pairings will be changed to the same value (e.g., 99999999, or
>> -999999999)
>> >> but many will be unique numeric values.
>> >
>> > I'm not convinced that you have the necessary scientific background to
>> do this job properly. There are different sorts of lab tests: enzymatic
>> activity, solute concentrations, and viral titers come to mind immediately
>> for which the handling would be materially different. If you have a serum
>> sodium level of less than 100 mEq/ml, then that is a value inconsistent
>> with human life and the value should be set to NA. If you have a value of
>> alkaline phosphatase that is 0 it is most suggestive of specimen
>> mishandling. If you have a hepatitis B antigen level of "Positive", then it
>> seems a perfectly informative value that should not be changed.
>> >
>> > Furthermore the attempt to change vlaues that you think are missing to
>> 99999999, or -999999999 is simply wrongheaded. Learn to use the missing
>> value indicator NA rather thna setting these to a numeric value.
>> >
>> > You should seek advice within your organization before you charge ahead
>> with this strategy.
>>
>> I've been informed that my concerns are misplaced and that the scientific
>> concerns I raised were inflated. The "answer" then might be:
>>
>> merge( df, rules, by=1:2, all.x=TRUE)  # Or
>>
>> merge( df, rules, by=c("VALUE", "DESCRIPTION") , all.x=TRUE)
>>
>>
>> Which returns in this example:
>>
>>
>>            VALUE DESCRIPTION NUMERIC_VALUE  NEW_VALUE
>> 1           <0.2           E         9e+00       0.10
>> 2            <60           A         9e+00      60.00
>> 3            12%           D         9e+00         NA
>> 4 Less than 0.30           C         9e+00       0.29
>> 5       Negative           B         9e+00 -999999.00
>> 6       Positive           A         1e+09         NA
>> 7        Unknown           E         9e+00  777777.00
>>
>> I'm leaving it in this form because I remain worried. I have no particular
>> concerns about setting a value of "<0.2" to 0.1 but worry about setting
>> values of  "<60" to 60.00 or "Less than 0.30" to 0.29 don't really accord
>> with practice that I have seen in analysis of laboratory values. And I
>> remain concerned that using flagrantly false numeric values as indicators
>> will create serious errors down the line if this data is ever used by
>> someone who has not been involved with its manipulation or has even
>> forgotten months later how it was massaged.
>>
>> --
>> David.
>> > --
>> > David.
>> >>
>> >>
>> >> I've started by creating a new object called rules where a SUBSET of df
>> >> rows are included with the new value they should be changed to.
>> >>
>> >>
>> >> rules <- data.frame(VALUE = c("<60",  "Negative", "Less than 0.30",
>> "<0.2",
>> >> "Unknown"),
>> >>                   DESCRIPTION = c("A", "B", "C", "E", "E"),
>> >>                   NEW_VALUE=c(60, -999999,0.29,0.1,777777))
>> >> rules
>> >>
>> >>
>> >> I tried doing a loop to change the values in df based on the suggested
>> >> value in rules:
>> >>
>> >> for (i in (1 : nrow(rules))) {
>> >> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
>> >> [i,2]]=rules[i,3]
>> >> }
>> >> df
>> >>
>> >> This gives the error:
>> >>
>> >> Error in Ops.factor(df$VALUE, rules[i, 1]) :
>> >> level sets of factors are differentwork and I think because when I write
>> >>
>> >> If I create rules using the exact same levels as df it works:
>> >>
>> >> rules <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than
>> >> 0.30", "12%", "<0.2", "Unknown"),
>> >>                   DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
>> >>                   NEW_VALUE=c(60, 999999,-999999,0.29,12,0.1,777777))
>> >> rules
>> >>
>> >> for (i in (1 : nrow(rules))) {
>> >> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
>> >> [i,2]]=rules[i,3]
>> >> }
>> >> df
>> >>
>> >>
>> >> Can anyone suggest a way to modify my for loop so that it works for a
>> >> subset of rows in df and accomplish what I want?  Or suggest a
>> completely
>> >> different method that works?
>> >>
>> >>
>> >> Thanks!
>> >>
>> >>      [[alternative HTML version deleted]]
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide http://www.R-project.org/
>> posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >
>> > David Winsemius
>> > Alameda, CA, USA
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/
>> posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>> David Winsemius
>> Alameda, CA, USA
>>
>>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list