[R] Recoding variables based on reference values in data frame
Rui Barradas
ruipbarradas at sapo.pt
Tue Jul 2 23:15:20 CEST 2013
Hello,
I'm not sure I understood, but try the following.
Kgeno <- read.table(text = "
SNP_ID SNP1 SNP2 SNP3 SNP4
Maj_Allele C G C A
Min_Allele T A T G
ID1 CC GG CT AA
ID2 CC GG CC AA
ID3 CC GG nc AA
ID4 _ _ _ _
ID5 CC GG CC AA
ID6 CC GG CC AA
ID7 CC GG CT AA
ID8 _ _ _ _
ID9 CT GG CC AG
ID10 CC GG CC AA
ID11 CC GG CT AA
ID12 _ _ _ _
ID13 CC GG CC AA
", header = TRUE, stringsAsFactors = FALSE)
dat
fun <- function(x){
x[x %in% c("nc", "_")] <- NA
MM <- paste0(x[1], x[1]) # Major Major
Mm <- paste0(x[1], x[2]) # Major minor
mm <- paste0(x[2], x[2]) # minor minor
x[x == MM] <- 0
x[x == Mm] <- 1
x[x == mm] <- 2
x
}
Kgeno[, -1] <- sapply(Kgeno[, -1], fun)
Kgeno
Also, the best way to post data is by using ?dput.
dput(head(Kgeno[, 1:5], 30)) # post the output of this
Hope this helps,
Rui Barradas
Em 02-07-2013 21:46, kathleen askland escreveu:
> I'm new to R (previously used SAS primarily) and I have a genetics data
> frame consisting of genotypes for each of 300+ subjects (ID1, ID2, ID3,
> ...) at 3000+ genetic locations (SNP1, SNP2, SNP3...). A small subset of
> the data is shown below:
> SNP_ID SNP1 SNP2 SNP3 SNP4 Maj_Allele C G C A Min_Allele T A T G ID1
> CC GG CT AA ID2 CC GG CC AA ID3 CC GG
> nc
> AA ID4 _ _ _ _ ID5 CC GG CC AA ID6 CC GG CC
> AA ID7 CC GG CT AA ID8 _ _ _ _ ID9 CT GG
> CC AG ID10 CC GG CC AA ID11 CC GG CT AA
> ID12 _ _ _ _ ID13 CC GG CC AA
> The name of the data file is Kgeno.
> What I would like to do is recode all of the genotype values to standard
> integer notation, based on their values relative to the reference rows
> (Maj_Allele and Min_Allele). Standard notation sums the total of minor
> alleles in the genotype, so values can be 0, 1 or 2.
>
> Here are the changes I want to make:
> 1. If the genotype= "nc" or '_" then set equal to NA.
> 2. If genotype value = a character string comprised of two consecutive
> major allele values -- c(Maj_Allele, Maj_Allele) -- then set equal to 0.
> 3. If genotype value= c(Maj_Allele, Min_Allele) then set equal to 1.
> 4. If genotype value = c(Min_Allele, Min_Allele) then set equal to 2.
>
> I've tried the following ifelse processing but get error (Warning: Executed
> script did not end with R session at the top-level prompt. Top-level state
> will be restored) and can't seem to fix the code properly. I've counted the
> parentheses. Also, not sure if it would execute properly if I could fix it.
>
> # change 'nc' and '_' to NA, else leave as is:
> Kgeno[,2] <- ifelse(Kgeno[,2] == "nc", "NA", Kgeno[,2])
> Kgeno[,2] <- ifelse(Kgeno[,2] == "_", "NA", Kgeno[,2])
>
> #convert genotype strings in the first data column to numeric values #(two
> major alleles=0, 1 minor and 1 major=1, 2 minor alleles=2), else #leave as
> is (to preserve NA values).
>
> Kgeno[,2] <-
>
> ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character(
> Kgeno[1,2]), sep=""), 0,
>
> ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character(
> Kgeno[2,2]), sep=""), 1,
>
> ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[2,2]), as.character(
> Kgeno[2,2]), sep=""), 2,
> Kgeno[,2])))
>
>
> Finally, if above code were corrected, this would only change the first
> column of data, but I would like to change all 3000+ columns in the same
> way.
>
> I would greatly appreciate some suggestions on how to proceed.
>
> Thank you,
>
> Kathleen
>
> ---
> Kathleen Askland, MD
> Assistant Professor
> Department of Psychiatry & Human Behavior
> The Warren Alpert School of Medicine
> Brown University/Butler Hospital
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list