[R] Pairwise comparison between columns, logic
arun
smartpink111 at yahoo.com
Fri Jul 26 01:16:58 CEST 2013
Hi,
I am not sure what your expected output would be. Also, 'CEBPA' is not present in Data.txt.
gset <- read.table("Names.txt", header = TRUE, stringsAsFactors = FALSE)
temp1 <- read.table("Data.txt", header = TRUE, stringsAsFactors = FALSE)
lst1 <- split(temp1, temp1$Names)   # one data frame per name
mat1 <- combn(gset[-1, 1], 2)       # all pairs of names; removed CEBPA
library(plyr)
# for each pair, inner-join the two data frames on patient_id and keep the ids
lst2 <- lapply(split(mat1, col(mat1)), function(x) {
  x1 <- join_all(lst1[x], by = "patient_id", type = "inner")
  x1["patient_id"]
})
names(lst2) <- apply(mat1, 2, paste, collapse = "_")
do.call(rbind, lst2)
# patient_id
#DNMT3A_FLT3.1 LAML-AB-2811-TB #common ids between DNMT3A and FLT3
#DNMT3A_FLT3.2 LAML-AB-2816-TB
#DNMT3A_FLT3.3 LAML-AB-2818-TB
#DNMT3A_IDH1.1 LAML-AB-2802-TB #common ids between DNMT3A and IDH1. If you want these as separate data frames, use `lst2`.
#DNMT3A_IDH1.2 LAML-AB-2822-TB
#DNMT3A_NPM1.1 LAML-AB-2802-TB
#DNMT3A_NPM1.2 LAML-AB-2809-TB
#DNMT3A_NPM1.3 LAML-AB-2811-TB
#DNMT3A_NPM1.4 LAML-AB-2816-TB
#DNMT3A_NRAS LAML-AB-2816-TB
#FLT3_NPM1.1 LAML-AB-2811-TB
#FLT3_NPM1.2 LAML-AB-2812-TB
#FLT3_NPM1.3 LAML-AB-2816-TB
#FLT3_NRAS LAML-AB-2816-TB
#IDH1_NPM1 LAML-AB-2802-TB
#NPM1_NRAS LAML-AB-2816-TB
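If plyr is not available, the same common ids can probably be obtained in base R with split(), combn() and intersect(). This is only a sketch: the small `dat` data frame is made up inline (a few of the ids from the post) so that it runs without Data.txt, it assumes the columns are called Names and patient_id, and the object names `ids_by_gene`, `pairs`, `common` and `n_common` are mine, not from the code above.

```r
# Toy data standing in for Data.txt (assumption: columns Names and patient_id)
dat <- data.frame(
  Names = c("DNMT3A", "DNMT3A", "DNMT3A", "FLT3", "FLT3", "IDH1"),
  patient_id = c("LAML-AB-2802-TB", "LAML-AB-2811-TB", "LAML-AB-2816-TB",
                 "LAML-AB-2811-TB", "LAML-AB-2812-TB", "LAML-AB-2802-TB"),
  stringsAsFactors = FALSE
)

# One character vector of unique patient ids per name
ids_by_gene <- lapply(split(dat$patient_id, dat$Names), unique)

# All unordered pairs of names, one pair per column
pairs <- combn(names(ids_by_gene), 2)

# Intersect the id sets of each pair
common <- lapply(seq_len(ncol(pairs)), function(k) {
  intersect(ids_by_gene[[pairs[1, k]]], ids_by_gene[[pairs[2, k]]])
})
names(common) <- apply(pairs, 2, paste, collapse = "_")

# Number of shared patients per pair
n_common <- sapply(common, length)
```

Pairs with no shared patient are kept as empty vectors in `common`, whereas the rbind output above simply has no rows for them.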
A.K.
Hello R experts,
I am trying to solve the following logic.
I have two input files. The first file (Names.txt) has two columns:
Column1 Column2
CEBPA CEBPA
DNMT3A DNMT3A
FLT3 FLT3
IDH1 IDH1
NPM1 NPM1
NRAS NRAS
The second input file (Data.txt) has two columns, Names and patient_id:
Name patient_id
DNMT3A LAML-AB-2802-TB
DNMT3A LAML-AB-2809-TB
DNMT3A LAML-AB-2811-TB
DNMT3A LAML-AB-2816-TB
DNMT3A LAML-AB-2818-TB
DNMT3A LAML-AB-2822-TB
DNMT3A LAML-AB-2824-TB
FLT3 LAML-AB-2811-TB
FLT3 LAML-AB-2812-TB
FLT3 LAML-AB-2814-TB
FLT3 LAML-AB-2816-TB
FLT3 LAML-AB-2818-TB
FLT3 LAML-AB-2825-TB
FLT3 LAML-AB-2830-TB
FLT3 LAML-AB-2834-TB
IDH1 LAML-AB-2802-TB
IDH1 LAML-AB-2821-TB
What I am attempting to do is, for each name in the first column of
Names.txt, a pairwise comparison with each of the other names in the second
column, based on which patient ids they have in common.
To explain in detail:
As an example: I extract patient_ids for CEBPA and DNMT3A and see
which are common, then I do the same for CEBPA and FLT3 and so on for
CEBPA and the next name in column 2.
So far the script I have written only does the comparison with the
first name in the list, so essentially with itself. I am not sure why
this logic is not working for all the names in column 2 for a single
name in column 1.
Below is my script:
gset<-read.table("Names.txt",header=F,na.strings = ".", as.is=T) # reading in the genes
temp<-read.table("Data.txt",header=T,sep="\t")
#################################################
all<-length(unique(temp$fpatient_id))
final<-c()
both.ab <- list()
both <- list()
temp.b <- matrix()
for(i in 1:nrow(gset)) # Loop for genes in the first column
{
temp2<-temp[which(temp$Column1 %in% gset[i,]),]
num.mut<-length(unique(temp2$patient_id))
temp.a <-temp[which(temp$Column1 == gset[i,1]),]
for(j in 1:nrow(gset)) # Loop for genes in the second column
{
temp.b <-temp[which(temp$Column2 == gset[j,2]),]
# See which patient_ids of temp.a are in temp.b
both.ab[[i]]<-temp.a[which(temp.a$patient_id %in% temp.b$patient_id),]
}
both[[i]]<-both.ab[[i]]
num.both<-length(unique(both[[i]]$patient_id))
line<-c(paste(gset[i, which(!(is.na(gset[i,]))) ],collapse="/"), num.mut, all, num.mut/all, num.both)
final<-rbind(final,line)
}
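
For comparison, a minimal sketch of a nested loop that keeps one result per pair of names. In the script above, `both.ab[[i]]` is assigned on every pass of the inner loop, so only the last comparison for each `i` survives; the script also refers to `temp$Column1` and `temp$Column2`, which would not exist if the columns of Data.txt are Names and patient_id as described. The toy `genes` and `temp` objects below are made up so the code runs on its own, and `n.both` here is a hypothetical matrix, not the `num.both` of the script.

```r
# Toy stand-ins for Names.txt and Data.txt (assumed column names)
genes <- c("CEBPA", "DNMT3A", "FLT3", "IDH1")
temp <- data.frame(
  Names = c("DNMT3A", "DNMT3A", "FLT3", "FLT3", "IDH1"),
  patient_id = c("LAML-AB-2802-TB", "LAML-AB-2811-TB",
                 "LAML-AB-2811-TB", "LAML-AB-2812-TB", "LAML-AB-2802-TB"),
  stringsAsFactors = FALSE
)

# Matrix of shared-patient counts, one cell per (i, j) pair
n.both <- matrix(0L, nrow = length(genes), ncol = length(genes),
                 dimnames = list(genes, genes))

for (i in seq_along(genes)) {
  ids.a <- unique(temp$patient_id[temp$Names == genes[i]])
  for (j in seq_along(genes)) {
    ids.b <- unique(temp$patient_id[temp$Names == genes[j]])
    # Index the result by both i and j so nothing is overwritten
    n.both[i, j] <- length(intersect(ids.a, ids.b))
  }
}
```

The diagonal simply counts each name's own patients; `n.both[upper.tri(n.both)]` would give each unordered pair once.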