[R] replacing a factor value in a data frame

Fri Apr 28 02:05:18 CEST 2006

Dave Roberts <droberts <at> montana.edu> writes:

> 
> Federico,
> 
>      There doesn't appear to be an instance of the value you want to 
> change in your example, so I had to improvise.  Part of the problem may 
> be that the dataframe is composed of factors, and it's not possible to 
> convert the value of a factor to another value that's not in the set of 
> possible values, given by the levels() function.  So, if you want to 
> change GC to CG, but CG does not already exist in the set of possible 
> values you'll have to add it. E.g.
> 
>  > tmp <- data
>  > levels(tmp[,30]) <- c(levels(data[,30]),'CG')
> 
> then, if the problem only occurs in one column it's an easy fix.
> 
>  > tmp[data=='GC'] <- 'CG'
> 
> If GC occurs in multiple columns you'll either have to change the levels 
> for each column as I did just above, or work with a single column. 
> Since you don't have 30 columns in your example, let's pretend you want 
> to change all the instances of 'CC' in data$V5 to 'XX'
> 
>  > tmp <- data
>  > levels(tmp$V5) <- c(levels(data$V5),'XX')
>  > tmp$V5[data$V5=='CC'] <- 'XX'
>  > tmp
>     V4 V5 V6 V7   V8   V9 V10
> 1  TT GG TT AC   AG   AG  TT
> 2  AT XX TT AA   AA   AA  TT
> 3  AT XX TT AC   AA <NA>  TT
> 4  TT XX TT AA   AA   AA  TT
> 5  AT CG TT CC   AA   AA  TT
> 6  TT XX TT AA   AA   AA  TT
> 7  AT XX TT CC <NA> <NA>  TT
> 8  TT XX TT AC   AG   AG  TT
> 9  AT XX TT CC   AG <NA>  TT
> 10 TT XX TT CC   GG   GG  TT
> 
> Notice that the instances of 'CC' in tmp$V7 did not change.
> 
> HTH, Dave Roberts
> 
> 
OK, so I have a complication to add. I have a dataframe with about 6008 
variables; 6000 of them are loci across the genome. Inadvertently, we coded 
SNP data that did not satisfy quality control as "*". Somewhere along the line 
the genotypes of these SNPs became "0/0". The tdt test in dgc.genetics does 
not seem fond of this designation and would probably do better with it being 
NA. So how does one recode every instance of "0/0" to missing across all 6000 
variables?

I believe that R has interpreted the genotype variables as character and 
therefore turned every one into a factor with levels.
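
If the columns really are factors, one possible approach is to set the "0/0" 
level itself to NA in every column, rather than adding a new level as in 
Dave's example. The following is only a minimal sketch on a toy dataframe: 
the names geno, snp1, snp2 and recode_missing are made up for illustration 
and are not from my real data, and I have not timed it on 6000 columns.

```r
# Toy stand-in for the wide genotype dataframe (all names are hypothetical).
geno <- data.frame(
  id   = 1:4,
  snp1 = factor(c("A/A", "0/0", "A/G", "G/G")),
  snp2 = factor(c("0/0", "C/T", "0/0", "T/T"))
)

# Recode one column: "bad" values become NA, everything else is untouched.
recode_missing <- function(x, bad = "0/0") {
  if (is.factor(x)) {
    # Assigning NA to a level drops that level and turns its values into NA.
    levels(x)[levels(x) == bad] <- NA
  } else if (is.character(x)) {
    # %in% never returns NA, so existing missing values are left alone.
    x[x %in% bad] <- NA
  }
  x
}

# geno[] keeps the dataframe structure while every column is replaced.
geno[] <- lapply(geno, recode_missing)
```

Non-genotype columns such as id pass through unchanged because they are 
neither factor nor character. If only some columns hold genotypes, the same 
lapply() call could be restricted to those columns by indexing both sides.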

One simplification (if one could call it that) would be to revert to my long 
dataframe, where all the genotypes are in one variable, and change it there 
before reshaping to wide. I do not really want to do that, since reshaping that 
very big dataframe takes a very long time (more than 35 minutes, and less than 
overnight).