[R] Help with factor column replacement value issue
Bert Gunter
bgunter@4567 @ending from gm@il@com
Fri Nov 16 17:09:47 CET 2018
As usual, careful reading of the relevant Help page would resolve the confusion.
from ?factor:
"factor(x, exclude = NULL) applied to a factor without NAs is a
no-operation unless there are unused levels: in that case, a factor
with the reduced level set is returned. If exclude is used, since R
version 3.4.0, excluding non-existing character levels is equivalent
to excluding nothing, and when exclude is a character vector, that is
applied to the levels of x. Alternatively, exclude can be factor with
the same level set as x and will exclude the levels present in
exclude."
In R, subsetting a factor does not change the levels attribute, even if
some levels are not present. One must explicitly remove them, e.g.:
> f <- factor(letters[1:3])
## 3 levels, all present
> f[1:2]
[1] a b
Levels: a b c
## 3 levels, but one empty
> factor(f[1:2], exclude = NULL)
[1] a b
Levels: a b
## Now only two levels
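
The same thing happens when values are recoded rather than subset. Below is a minimal sketch of the situation in the quoted question; the vector b1 is a made-up stand-in for r1$B1 (the real data are not shown), so the values are illustrative only. It suggests why the empty level can still surface as an all-zero dummy column, which is presumably what the caret warning about B1l reports, and shows two base R ways to get rid of it: droplevels(), or merging the levels instead of recoding the values.

```r
# Made-up stand-in for r1$B1 (an assumption; the real data are not shown)
b1 <- factor(c("A", "C", "1", "l", "L", "0"))

# Collapse the digit levels into "Z", as in the original post
levels(b1)[levels(b1) %in% as.character(0:9)] <- "Z"

# Recoding the values leaves "l" behind as an empty level
b1[b1 == "l"] <- "L"
counts <- table(b1)
counts["l"]            # the level exists but has no observations

# model.matrix() still creates a dummy column for the empty level
mm <- model.matrix(~ b1)
all(mm[, "b1l"] == 0)  # every entry is zero, i.e. zero variance

# Fix 1: drop unused levels after the fact
b1_clean <- droplevels(b1)
levels(b1_clean)

# Fix 2: merge the levels instead of recoding the values
b2 <- factor(c("A", "C", "1", "l", "L", "0"))
levels(b2)[levels(b2) == "l"] <- "L"
levels(b2)
```

Fix 2 avoids creating the empty level in the first place, because assigning a duplicate name through levels<- merges the two levels.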
Bert Gunter
"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Fri, Nov 16, 2018 at 7:38 AM Bill Poling <Bill.Poling using zelis.com> wrote:
>
> Hello:
>
> I am running windows 10 -- R3.5.1 -- RStudio Version 1.1.456
>
> I would like to know why when I replace a column value it still appears in subsequent routines:
>
> My example:
>
> r1$B1 is a Factor: It is created from the first character of a list of CPT codes, r1$CPT.
>
> head(r1$CPT, N= 25)
> [1] A4649 A4649 C9359 C1713 A0394 A0398
> 903 Levels: 00000 00001 00140 00160 00670 00810 00940 01400 01470 01961 01968 10160 11000 11012 11042 11043 11044 11045 11100 11101 11200 11201 11401 11402 ... l8699
>
> str(r1$CPT)
> Factor w/ 903 levels "00000","00001",..: 773 773 816 783 739 741 743 739 739 741 ...
>
>
> And I want only those CPT's with leading alpha char in this column so I set the numeric leading char to Z
>
> r1$B1 <- str_sub(r1$CPT,1,1)
>
> r1$B1 <- as.factor(r1$B1) #Redundant
> levels(r1$B1)[levels(r1$B1) %in% c('1','2','3','4','5','6','7','8','9','0')] <- 'Z'
>
> When I check what I have done I find l & L
>
> unique(r1$B1)
> #[1] A C Z L G Q U J V E S l D P
> #Levels: Z A C D E G J l L P Q S U V
>
> So I change l to L
> r1$B1[r1$B1 == 'l'] <- 'L'
>
> When I check again I have l & L but l = 0
> table(r1$B1)
> # Z A C D E G J l L P Q S U V
> #19639 1673 546 2 8 147 281 0 664 1 64 36 114 14
>
> When I go to find those rows as if they existed, they are not accounted for?
>
> tmp <- subset(r1, B1 == "l")
> print(tmp)
> Empty data.table (0 rows) of 9 cols: SavingsReversed,productID,ProviderID,PatientGender,ModCnt,Editnumber2...
>
> And I have actually visually inspected the whole darn column, sheesh!
>
> So I ignore it temporarily.
>
> Now later on it resurfaces in a tutorial I am following for caret pkg.
>
> preProcess(r1b, method = c("center", "scale"),
> thresh = 0.95, pcaComp = NULL, na.remove = TRUE, k = 5,
> knnSummary = mean, outcome = NULL, fudge = 0.2, numUnique = 3,
> verbose = FALSE, freqCut = 95/5, uniqueCut = 10, cutoff = 0.9,
> rangeBounds = c(0, 1))
> # Warning in preProcess.default(r1b, method = c("center", "scale"), thresh = 0.95, :
> # These variables have zero variances: B1l <-------------yes this is a remnant of the r1$B1 clean-up
> # Created from 23141 samples and 22 variables
> #
> # Pre-processing:
> # - centered (22)
> # - ignored (0)
> # - scaled (22)
>
>
> So my questions are, in consideration of regression modelling accuracy:
>
> Why is this happening?
> How do I remove it?
> Or is it irrelevant and leave it be?
>
> As always, thank you for your support.
>
> WHP