[R] Help with factor column replacement value issue

Jeff Newmiller
Fri Nov 16 17:26:22 CET 2018


My suggestion is to avoid converting the column to a factor until it is cleaned up the way you want it. There is also the forcats package, but I still prefer to work with character data for cleaning. The stringsAsFactors=FALSE argument to read.table and friends helps with this.
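
A minimal sketch of that workflow, using made-up CPT-like codes rather than the real r1 data. The vectors `cpt` and `b1` and their values are assumptions for illustration only, and base R's substr() stands in for str_sub():

```r
# Hypothetical CPT-like codes standing in for r1$CPT, kept as character.
cpt <- c("A4649", "C9359", "00140", "l8699", "L8699", "11042")

# Take the first character of each code.
b1 <- substr(cpt, 1, 1)

# Clean while the data are still character: digits become "Z" ...
b1[grepl("^[0-9]$", b1)] <- "Z"
# ... and lower-case letters (here the stray "l") become upper case.
b1 <- toupper(b1)

# Convert to a factor only once the values are final.
b1 <- factor(b1)
levels(b1)
```

Because the factor is built after the clean-up, it never has an "l" level in the first place, so there is nothing left to drop later.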

On November 16, 2018 8:16:22 AM PST, Michael Dewey <lists using dewey.myzen.co.uk> wrote:
>Dear Bill
>
>When you do your step of replacing lower case l with upper case L, the
>level still stays in the factor even though it is empty. If that is a
>nuisance, x <- factor(x) will drop the unused levels. There are other
>ways of doing this.
>
>Michael
>
>On 16/11/2018 15:38, Bill Poling wrote:
>> Hello:
>> 
>> I am running windows 10 -- R3.5.1 -- RStudio Version 1.1.456
>> 
>> I would like to know why, when I replace a column value, it still appears in subsequent routines:
>> 
>> My example:
>> 
>> r1$B1 is a factor. It is created from the first character of each entry in a list of CPT codes, r1$CPT.
>> 
>> head(r1$CPT, N= 25)
>> [1] A4649 A4649 C9359 C1713 A0394 A0398
>> 903 Levels: 00000 00001 00140 00160 00670 00810 00940 01400 01470 01961 01968 10160 11000 11012 11042 11043 11044 11045 11100 11101 11200 11201 11401 11402 ... l8699
>> 
>> str(r1$CPT)
>>   Factor w/ 903 levels "00000","00001",..: 773 773 816 783 739 741 743 739 739 741 ...
>> 
>> 
>> I want only the CPT codes with a leading alphabetic character in this column, so I set the codes with a leading digit to Z:
>> 
>> r1$B1 <- str_sub(r1$CPT,1,1)
>> 
>> r1$B1 <- as.factor(r1$B1) #Redundant
>> levels(r1$B1)[levels(r1$B1) %in% c('1','2','3','4','5','6','7','8','9','0')] <- 'Z'
>> 
>> When I check what I have done I find l & L
>> 
>> unique(r1$B1)
>> #[1] A C Z L G Q U J V E S l D P
>> #Levels: Z A C D E G J l L P Q S U V
>> 
>> So I change l to L
>> r1$B1[r1$B1 == 'l'] <- 'L'
>> 
>> When I check again I have l & L but l = 0
>> table(r1$B1)
>> #     Z     A     C     D     E     G     J     l     L     P     Q     S     U     V
>> # 19639  1673   546     2     8   147   281     0   664     1    64    36   114    14
>> 
>> When I go to find those rows as if they existed, they are not accounted for?
>> 
>> tmp <- subset(r1, B1 == "l")
>> print(tmp)
>> Empty data.table (0 rows) of 9 cols: SavingsReversed,productID,ProviderID,PatientGender,ModCnt,Editnumber2...
>> 
>> And I have actually visually inspected the whole darn column, sheesh!
>> 
>> So I ignore it temporarily.
>> 
>> Later on, it resurfaces in a tutorial I am following for the caret package.
>> 
>> preProcess(r1b, method = c("center", "scale"),
>>             thresh = 0.95, pcaComp = NULL, na.remove = TRUE, k = 5,
>>             knnSummary = mean, outcome = NULL, fudge = 0.2, numUnique = 3,
>>             verbose = FALSE, freqCut = 95/5, uniqueCut = 10, cutoff = 0.9,
>>             rangeBounds = c(0, 1))
>> # Warning in preProcess.default(r1b, method = c("center", "scale"), thresh = 0.95,  :
>> #   These variables have zero variances: B1l  <------------- yes, this is a remnant of the r1$B1 clean-up
>> # Created from 23141 samples and 22 variables
>> #
>> # Pre-processing:
>> #   - centered (22)
>> #   - ignored (0)
>> #   - scaled (22)
>> 
>> 
>> So my questions are, in consideration of regression modelling accuracy:
>> 
>> Why is this happening?
>> How do I remove it?
>> Or is it irrelevant and leave it be?
>> 
>> As always, thank you for your support.
>> 
>> WHP
>> 
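
To illustrate the empty level that Michael describes above, here is a small sketch with a made-up factor. The objects `f`, `g` and `h` are hypothetical and are not the poster's data. Assigning 'L' to the elements equal to 'l' changes the values but leaves the set of levels alone; droplevels() (or factor(), as Michael suggests) removes the unused level afterwards, and renaming the level itself merges it into 'L' in one step.

```r
# Hypothetical factor containing both "l" and "L".
f <- factor(c("A", "l", "L", "Z", "L"))

# Replace values only: the level "l" stays, now with a count of zero.
g <- f
g[g == "l"] <- "L"
table(g)

# Drop the unused level afterwards; factor(g) has the same effect here.
g <- droplevels(g)
levels(g)

# Alternative: rename the level itself, which merges "l" into "L".
h <- f
levels(h)[levels(h) == "l"] <- "L"
levels(h)
```

An unused level like this is presumably what appears as the zero-variance B1l column in the preProcess warning, since an empty level can still produce an all-zero indicator column.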

-- 
Sent from my phone. Please excuse my brevity.


