[R] Subsetting isolating a group of values in a group of variables

arun smartpink111 at yahoo.com
Sun Sep 8 04:25:43 CEST 2013


Hi,
Using the same example:
str1<-paste(colnames(dat1)[grepl("diag",colnames(dat1))],"%in%","vec1",collapse="|")
 subset(dat1,eval(parse(text=str1)))
#  ID diag1 diag2 diag3 proc1 proc2 proc3
#2  2   k69   i80  u456  z456  z123  z456
#3  3   l91  i801  g678  u456  u123  u123
#4  4   i80   i90  h983  z123  z456  z456
lapply(split(dat2,dat2$diag),function(x) {x1<- x$code; str1<- paste(colnames(dat1)[grepl("diag",colnames(dat1))],"%in%","x1",collapse="|"); x2<- subset(dat1,eval(parse(text=str1))); cbind(x2,Diag=x$diag)})
#$`Broken finger`
#  ID diag1 diag2 diag3 proc1 proc2 proc3          Diag
#1  1   k23  i269  j123  u123  u456  u123 Broken finger
#2  2   k69   i80  u456  z456  z123  z456 Broken finger
#
#$`Broken foot`
#  ID diag1 diag2 diag3 proc1 proc2 proc3        Diag
#1  1   k23  i269  j123  u123  u456  u123 Broken foot
#2  2   k69   i80  u456  z456  z123  z456 Broken foot
#
#$`Broken legs`
#  ID diag1 diag2 diag3 proc1 proc2 proc3        Diag
#1  1   k23  i269  j123  u123  u456  u123 Broken legs
#3  3   l91  i801  g678  u456  u123  u123 Broken legs
#
#$`Broken rib`
#  ID diag1 diag2 diag3 proc1 proc2 proc3       Diag
#3  3   l91  i801  g678  u456  u123  u123 Broken rib
#4  4   i80   i90  h983  z123  z456  z456 Broken rib
#
#$`Broken toe`
#  ID diag1 diag2 diag3 proc1 proc2 proc3       Diag
#2  2   k69   i80  u456  z456  z123  z456 Broken toe
#3  3   l91  i801  g678  u456  u123  u123 Broken toe
#4  4   i80   i90  h983  z123  z456  z456 Broken toe

##############You can also use a larger dataset:
set.seed(48)
dat1New<- as.data.frame(matrix(sample(paste0(letters,sample(1:800,700,replace=TRUE)),90*1e5,replace=TRUE),ncol=90),stringsAsFactors=FALSE)
set.seed(185)
 dat2New<- as.data.frame(matrix(sample(paste0(letters,sample(400:1200,700,replace=TRUE)),90*1e5,replace=TRUE),ncol=90),stringsAsFactors=FALSE)
 dat3<- cbind(ID=1:1e5,dat1New,dat2New)
colnames(dat3)[-1]<-c(paste0("diag",1:90),paste0("proc",1:90))

set.seed(1459)
Refdat<- data.frame(code=unique(unlist(dat1New)), diag=sample(c("Broken finger","Broken toe", "Broken legs", "Broken foot", "Broken rib", "Broken nose", "Broken elbow", "Broken hip"),length(unique(unlist(dat1New))),replace=TRUE),stringsAsFactors=FALSE)

res<- lapply(split(Refdat,Refdat$diag),function(x) {x1<- x$code; str1<- paste(colnames(dat3)[grepl("diag",colnames(dat3))],"%in%","x1",collapse="|"); x2<- subset(dat3,eval(parse(text=str1))) })

sapply(split(Refdat,Refdat$diag),function(x) {x1<- x$code; str1<- paste(colnames(dat3)[grepl("diag",colnames(dat3))],"%in%","x1",collapse="|"); x2<- subset(dat3,eval(parse(text=str1)));nrow(x2) })
# Broken elbow Broken finger   Broken foot    Broken hip   Broken legs 
 #       99997        100000         99994        100000         99997 
 # Broken nose    Broken rib    Broken toe 
  #      99996         99999        100000 


A.K.






Thanks for the prompt reply arun, this really has helped. 

My actual data frame has diagnostic codes diag1, diag2 etc which are range from 1 to 93. Is there any way to apply "subset(dat1,diag1%in%vec1|diag2%in%vec1|diag3%in% vec1)"  such that i can search many multiple columns in dat1 without specifying each column separately? 


----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: R help <r-help at r-project.org>
Cc: 
Sent: Saturday, September 7, 2013 4:12 PM
Subject: Re: Subsetting isolating a group of values in a group of variables

Hi,

The expected output is not clear.
dat1<- read.table(text="ID diag1 diag2 diag3 proc1 proc2 proc3
1 k23 i269 j123   u123  u456  u123
2 k69 i80 u456   z456  z123  z456
3 l91 i801 g678   u456  u123  u123
4 i80 i90 h983   z123  z456   z456",sep="",header=TRUE,stringsAsFactors=FALSE)

vec1<- c("i80","i90","l91")


subset(dat1,diag1%in%vec1|diag2%in%vec1|diag3%in% vec1)
#  ID diag1 diag2 diag3 proc1 proc2 proc3
#2  2   k69   i80  u456  z456  z123  z456
#3  3   l91  i801  g678  u456  u123  u123
#4  4   i80   i90  h983  z123  z456  z456


##Creating another data frame with codes and diagnosis

dat2<-data.frame(code=unique(unlist(dat1[,2:4])),diag=c(rep("Broken finger",2),rep("Broken toe",2),rep("Broken legs",2),"Broken toe",rep("Broken foot",2),rep("Broken rib",2)),stringsAsFactors=FALSE)



 lst1<- lapply(split(dat2,dat2$diag), function(x) {x1<- x$code;x2<- subset(dat1,diag1%in%x1|diag2%in%x1|diag3%in%x1);cbind(x2,Diag=x$diag)})
 lst1
#$`Broken finger`
 # ID diag1 diag2 diag3 proc1 proc2 proc3          Diag
#1  1   k23  i269  j123  u123  u456  u123 Broken finger
#2  2   k69   i80  u456  z456  z123  z456 Broken finger
#
#$`Broken foot`
 # ID diag1 diag2 diag3 proc1 proc2 proc3        Diag
#1  1   k23  i269  j123  u123  u456  u123 Broken foot
#2  2   k69   i80  u456  z456  z123  z456 Broken foot
#
#$`Broken legs`
 # ID diag1 diag2 diag3 proc1 proc2 proc3        Diag
#1  1   k23  i269  j123  u123  u456  u123 Broken legs
#3  3   l91  i801  g678  u456  u123  u123 Broken legs

#$`Broken rib`
 # ID diag1 diag2 diag3 proc1 proc2 proc3       Diag
#3  3   l91  i801  g678  u456  u123  u123 Broken rib
#4  4   i80   i90  h983  z123  z456  z456 Broken rib

#$`Broken toe`
#  ID diag1 diag2 diag3 proc1 proc2 proc3       Diag
#2  2   k69   i80  u456  z456  z123  z456 Broken toe
#3  3   l91  i801  g678  u456  u123  u123 Broken toe
#4  4   i80   i90  h983  z123  z456  z456 Broken toe


A.K.


Hello. 

I have date frame structured like this: 

ID    diag1 diag2 diag3 proc1 proc2 proc3 
1    k23    i269     j123       u123  u456  u123 
2    k69    i80     u456   z456  z123  z456 
3    l91    i801     g678   u456  u123  u123 
4    i80    i90     h983   z123  z456   z456 

Each observation has a group of diagnostics codes(diag) and procedure codes(proc). 

A single diagnosis maybe be described by more than one code eg broken toe maybe coded for by i80,i90,l91 or more. 

My aim to subset all rows with any of the codes representing a 
single diagnosis. So i would like to use multiple values (i80,i90,l91= 
broken toe) applied to specific columns, ie diag1,2 and 3 to isolate 
those rows which contain any of the specified codes. 

Your help would be greatly appreciated.



More information about the R-help mailing list