[R] Updating a data frame based on if condition
arun
smartpink111 at yahoo.com
Tue Feb 18 21:17:58 CET 2014
Hi,
I don't know whether the 'mydata' object was updated before you ran table(). The result of within() has to be assigned back:
mydata <- within(mydata, FNAME_SUSPECT <- FNAME_TOKEN_COUNT > 10 | FNAME_LENGTH > 45 | regexpr("9", FNAME_PATTERN) == 0)
table(mydata$FNAME_SUSPECT)
#
# FALSE
#    50
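In case it helps, here is a small sketch of why the assignment matters: within() returns a modified copy and leaves the original object untouched unless the result is assigned back. The data frame 'toy' and its values are made up for illustration; they are not taken from your data.

```r
# Hypothetical toy data, not the poster's data
toy <- data.frame(FNAME_TOKEN_COUNT = c(1L, 5L, 4L),
                  FNAME_SUSPECT = FALSE)

# Calling within() without assigning back to toy leaves toy unchanged
res <- within(toy, FNAME_SUSPECT <- FNAME_TOKEN_COUNT > 3)
toy$FNAME_SUSPECT
# [1] FALSE FALSE FALSE

# Assigning the result back updates the column
toy <- within(toy, FNAME_SUSPECT <- FNAME_TOKEN_COUNT > 3)
toy$FNAME_SUSPECT
# [1] FALSE  TRUE  TRUE
```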
Now, for your second condition (from your reply to David):
indx <- with(mydata, FNAME_TOKEN_COUNT > 3 | FNAME_LENGTH > 55 | regexpr("9", FNAME_PATTERN) == 0)
indx1 <- ifelse(mydata$FNAME_TOKEN_COUNT > 3, TRUE,
           ifelse(mydata$FNAME_LENGTH > 55, TRUE,
             ifelse(regexpr("9", mydata$FNAME_PATTERN) == 0, TRUE,
               FALSE
             )
           )
         )
identical(indx,indx1)
#[1] TRUE
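One thing to keep in mind: regexpr() returns -1 (not 0) when there is no match, so the '== 0' test is FALSE for every row. If the intent is "does not contain a 9", something along these lines might be closer. This is only a sketch; the vector 'pat' is made up for illustration, and grepl() is my suggestion rather than something from your code.

```r
# Hypothetical pattern strings in the same style as FNAME_PATTERN
pat <- c("9999999", "AAAA9_A", "AAAA_A_A_A")

# Position of the first "9", or -1 when there is none
as.integer(regexpr("9", pat))
# [1]  1  5 -1

# A match position is never 0, so this test never fires
any(regexpr("9", pat) == 0)
# [1] FALSE

# TRUE where the pattern contains no "9"
no_digit <- !grepl("9", pat, fixed = TRUE)
no_digit
# [1] FALSE FALSE  TRUE
```

The same result can be had with regexpr("9", pat) == -1 if you prefer to stay with regexpr().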
A.K.
On Tuesday, February 18, 2014 12:57 PM, Jeff Johnson <mrjefftoyou at gmail.com> wrote:
Hmm, I don't think as constructed the within clause is yielding the desired results. The test case you suggested works. However, if I try another test case:
within(mydata,FNAME_SUSPECT <- FNAME_TOKEN_COUNT >10|FNAME_LENGTH>45|regexpr("9",FNAME_PATTERN)==0)
which I read as: if any row has more than 10 tokens, is longer than 45 characters, OR does not have a number (9), it should assign the result (FALSE in this case) to FNAME_SUSPECT.
table(mydata$FNAME_SUSPECT)
TRUE
50
On Tue, Feb 18, 2014 at 9:38 AM, arun <smartpink111 at yahoo.com> wrote:
>
>I think it doesn't even need ifelse()
>
> within(mydata,FNAME_SUSPECT <- FNAME_TOKEN_COUNT >3|FNAME_LENGTH>35|regexpr("9",FNAME_PATTERN)>0)
>A.K.
>
>
>
>On , arun <smartpink111 at yahoo.com> wrote:
>Hi,
>Try ?ifelse()
>A.K.
>
>
>
>
>
>
>On Tuesday, February 18, 2014 12:26 PM, Jeff Johnson <mrjefftoyou at gmail.com> wrote:
>I have a subset of data that I have identified as "suspect" (for example,
>the first name has excessive spaces, is longer than 35 characters or has a
>number).
>
>What I want to do is update the FNAME_SUSPECT field in "mydata" to TRUE if
>any of those conditions are met.
>
>Here's my data:
>> dput(mydata)
>structure(list(PERSON_FIRST_NAME = c("1298530", "JULIA, TAYLOR, CS AND JEFF",
>"88", "4465891170098562", "1124211", "LEWIS & MARY KAY", "KARL R O S",
>"5466181820076010", "JULI0 C", "WAYNE T.", "1124211", "1124211",
>"ROBERT B & VIONA D", "DENNIS and MARY SUE", "BRIAN JOANNE",
>"1124211", "RONALD and GAIL", "Mike and Mary Lou", "31763006",
>"7", "11460735", "Paul and Mary Beth", "JIMMY and RUTH MARIE",
>"1124211", "WAYNE & LU ANN", "SCOTT & ANNA MARIE", "1124211",
>"1124211", "952714", "DAVID, RHONDA and NATALIE", "VIRGINIA S",
>"707069", "4397836190001917", "MARIA DE LA LUZ", "MARIA DE LA LUZ",
>"G & S COMPUTERIZED GRADING", "1124211", "1124211", "1124211",
>"1124211", "MARIA DE LA LUZ", "ED AND JANICE KISHI", "1124211",
>"Garrett A. and Jenny E.", "1124211", "1124211", "Hiram T. and A. Judith",
>"MA DE LA LUZ", "STEVE, Bev, and Caleb", "MR AND MRS EVER"),
> FNAME_SUSPECT = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
> FNAME_LENGTH = c(7L, 26L, 2L, 16L, 7L, 16L, 10L, 16L, 7L,
> 10L, 7L, 7L, 18L, 19L, 14L, 7L, 16L, 17L, 8L, 1L, 8L, 18L,
> 20L, 7L, 14L, 18L, 7L, 7L, 6L, 25L, 12L, 6L, 16L, 15L, 15L,
> 26L, 7L, 7L, 7L, 7L, 15L, 19L, 7L, 23L, 7L, 7L, 22L, 12L,
> 21L, 15L), FNAME_PATTERN = c("9999999", "AAAAA,_AAAAAA,_AA_AAA_AAAA",
> "99", "9999999999999999", "9999999", "AAAAA_&_AAAA_AAA",
> "AAAA_A_A_A", "9999999999999999", "AAAA9_A", "AAAAA___A.",
> "9999999", "9999999", "AAAAAA_A_&_AAAAA_A", "AAAAAA_AAA_AAAA_AAA",
> "AAAAA___AAAAAA", "9999999", "AAAAAA_AAA__AAAA", "AAAA_AAA_AAAA_AAA",
> "99999999", "9", "99999999", "AAAA_AAA_AAAA_AAAA",
>"AAAAA_AAA_AAAA_AAAAA",
> "9999999", "AAAAA_&_AA_AAA", "AAAAA_&_AAAA_AAAAA", "9999999",
> "9999999", "999999", "AAAAA,_AAAAAA_AAA_AAAAAAA", "AAAAAAAA___A",
> "999999", "9999999999999999", "AAAAA_AA_AA_AAA", "AAAAA_AA_AA_AAA",
> "A_&_A_AAAAAAAAAAAA_AAAAAAA", "9999999", "9999999", "9999999",
> "9999999", "AAAAA_AA_AA_AAA", "AA_AAA_AAAAAA_AAAAA", "9999999",
> "AAAAAAA_A._AAA_AAAAA_A.", "9999999", "9999999",
>"AAAAA_A._AAA_A._AAAAAA",
> "AA_AA_AA_AAA", "AAAAA,_AAA,_AAA_AAAAA", "AA_AAA_AAA_AAAA"
> ), FNAME_TOKEN_COUNT = c(1L, 5L, 1L, 1L, 1L, 4L, 4L, 1L,
> 2L, 4L, 1L, 1L, 5L, 4L, 4L, 1L, 4L, 4L, 1L, 1L, 1L, 4L, 4L,
> 1L, 4L, 4L, 1L, 1L, 1L, 4L, 4L, 1L, 1L, 4L, 4L, 5L, 1L, 1L,
> 1L, 1L, 4L, 4L, 1L, 5L, 1L, 1L, 5L, 4L, 4L, 4L)), .Names =
>c("PERSON_FIRST_NAME",
>"FNAME_SUSPECT", "FNAME_LENGTH", "FNAME_PATTERN", "FNAME_TOKEN_COUNT"
>), row.names = c(6717L, 11035L, 11626L, 14965L, 17874L, 24341L,
>25582L, 25834L, 26851L, 30134L, 36385L, 45244L, 46947L, 61449L,
>67564L, 71465L, 73782L, 75278L, 78977L, 79037L, 80577L, 81644L,
>84427L, 86286L, 89963L, 91208L, 94054L, 99518L, 114658L, 128305L,
>129082L, 137492L, 137573L, 138556L, 139489L, 148757L, 153956L,
>155546L, 160533L, 162386L, 162681L, 165220L, 168063L, 173003L,
>175322L, 179935L, 180991L, 181215L, 183787L, 184573L), class = "data.frame")
>
>Note I defaulted all of the FNAME_SUSPECT to FALSE. I plan to change that
>later.
>
>I've tried running this:
>if(mydata$FNAME_TOKEN_COUNT > 3 | mydata$FNAME_LENGTH > 35 | regexpr("9",
>mydata$FNAME_PATTERN) > 0)
> mydata$FNAME_SUSPECT <- TRUE
>
>however I get the warning:
>Warning message:
>In if (mydata$FNAME_TOKEN_COUNT > 3 | mydata$FNAME_LENGTH > 35 | :
> the condition has length > 1 and only the first element will be used
>
>Would I be better doing this in a for loop? I had once heard that if you're
>doing a for loop in R, you're doing something wrong.
>--
>Jeff
>
> [[alternative HTML version deleted]]
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
>
>
--
Jeff