[R] replacing a factor value in a data frame
Farrel Buchinsky
fbuchins at wpahs.org
Fri Apr 28 02:05:18 CEST 2006
Dave Roberts <droberts <at> montana.edu> writes:
>
> Federico,
>
> There doesn't appear to be an instance of the value you want to
> change in your example, so I had to improvise. Part of the problem may
> be that the dataframe is composed of factors, and it's not possible to
> convert the value of a factor to another value that's not in the set of
> possible values, given by the levels() function. So, if you want to
> change GC to CG, but CG does not already exist in the set of possible
> values you'll have to add it. E.g.
>
> > tmp <- data
> > levels(tmp[,30]) <- c(levels(data[,30]),'CG')
>
> then, if the problem only occurs in one column it's an easy fix.
>
> > tmp[data=='GC'] <- 'CG'
>
> If GC occurs in multiple columns you'll either have to change the levels
> for each column as I did just above, or work with a single column.
> Since you don't have 30 columns in your example, let's pretend you want
> to change all the instances of 'CC' in data$V5 to 'XX'
>
> > tmp <- data
> > levels(tmp$V5) <- c(levels(data$V5),'XX')
> > tmp$V5[data$V5=='CC'] <- 'XX'
> > tmp
>    V4 V5 V6 V7   V8   V9 V10
> 1  TT GG TT AC   AG   AG  TT
> 2  AT XX TT AA   AA   AA  TT
> 3  AT XX TT AC   AA <NA>  TT
> 4  TT XX TT AA   AA   AA  TT
> 5  AT CG TT CC   AA   AA  TT
> 6  TT XX TT AA   AA   AA  TT
> 7  AT XX TT CC <NA> <NA>  TT
> 8  TT XX TT AC   AG   AG  TT
> 9  AT XX TT CC   AG <NA>  TT
> 10 TT XX TT CC   GG   GG  TT
>
> Notice that the instances of 'CC' in tmp$V7 did not change.
>
> HTH, Dave Roberts
>
>
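A small illustration of the point made above about factor levels, using a made-up factor rather than the original data. The names f, g, h and k are only for this sketch:

```r
# Toy factor, invented for illustration (not the poster's data)
f <- factor(c("GC", "AT", "GC"))

# Assigning a value that is not among the levels produces NA plus a warning
g <- f
suppressWarnings(g[g == "GC"] <- "CG")

# Adding the level first makes the assignment work
h <- f
levels(h) <- c(levels(h), "CG")
h[h == "GC"] <- "CG"
h <- droplevels(h)   # remove the now-unused "GC" level

# Alternative: rename the level itself, which changes every matching entry
k <- f
levels(k)[levels(k) == "GC"] <- "CG"
```

Renaming the level is usually the shorter route when every occurrence of a value in one column should change.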
OK, so I have a complication to add. I have a dataframe with about 6008
variables; 6000 of them are loci across the genome. Inadvertently, we coded
SNP data that did not satisfy quality control as "*". Somewhere along the line
the genotypes of these SNPs became "0/0". The tdt test in dgc.genetics does
not seem fond of this designation and would probably do better with it being
NA. So how does one recode every instance of "0/0" to missing across all 6000
variables?
I believe that R has interpreted the genotype variables as character and
therefore turned every one into a factor with levels.
One simplification (if one could call it that) would be to revert to my long
dataframe, where all the genotypes are in one variable, and change it there
before reshaping to wide. I do not really want to do that, since reshaping that
very big dataframe takes a very long time (more than 35 minutes, and less than
overnight).
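
One possible way to do the recoding in place on the wide dataframe, sketched on a made-up stand-in. The dataframe dat, its column names and the helper zero_to_na are all hypothetical, and the sketch assumes the genotype columns are factors (or character) as guessed above:

```r
# Made-up stand-in for the wide dataframe: one id column, three genotype columns
dat <- data.frame(
  id   = 1:4,
  snp1 = factor(c("A/A", "0/0", "A/G", "G/G")),
  snp2 = factor(c("0/0", "C/T", "C/C", "0/0")),
  snp3 = factor(c("T/T", "T/T", "A/T", NA))
)

# Hypothetical helper: turn "0/0" into NA in a single column
zero_to_na <- function(x) {
  if (is.factor(x)) {
    # Setting a level to NA drops it; entries with that level become NA
    levels(x)[levels(x) == "0/0"] <- NA
  } else if (is.character(x)) {
    x[x %in% "0/0"] <- NA
  }
  x   # other column types are returned unchanged
}

# dat[] keeps the data.frame structure while replacing every column
dat[] <- lapply(dat, zero_to_na)
```

Because the work is done on the levels of each factor rather than on each cell, this should stay cheap even with thousands of columns. If the data are read from a text file, passing na.strings = "0/0" to read.table may avoid the problem at import time.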