[R] Subsetting problem data, 2

Chris Campbell ccampbell at mango-solutions.com
Fri Jul 20 10:09:40 CEST 2012


Hi!

# toy data   

toyData <- data.frame(x = 1:4, y = 5:8, xy = 9:12, z = 13:16)    
vars <- c("x", "z")      
    
# "pattern" is an argument of grep      
    
args(grep)      
    
# "pattern" must be a single character string;
# if a longer vector is supplied, only the first element is used (with a warning)
    
grep(pattern = vars, x = names(toyData))       
    
# one way to do this - a loop     
# create a vector to collect the output of each call    
     
toyColIndexList <- vector(length = length(vars), mode = "list")    
    
# grep each element in turn     
    
for (i in seq_along(vars)) {      
    toyColIndexList[[i]] <- grep(pattern = vars[i], x = names(toyData))     
}      
     
# combine all of the answers     
     
toyColIndex <- unlist(toyColIndexList)     
    
# remove duplicated columns if present    
    
toyColIndex <- toyColIndex[!duplicated(toyColIndex)]     
     
# select the elements we want    
    
toyData[, toyColIndex]     
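
A more compact form of the same loop, as a sketch only: lapply() calls grep once per element of vars, and unique() drops repeated indices. It assumes the toyData and vars objects defined above and uses nothing beyond base R.

```r
toyData <- data.frame(x = 1:4, y = 5:8, xy = 9:12, z = 13:16)
vars <- c("x", "z")

# grep each pattern in turn; lapply returns one index vector per pattern
toyColIndexList <- lapply(vars, grep, x = names(toyData))

# flatten the list and drop any index that was matched more than once
toyColIndex <- unique(unlist(toyColIndexList))

# select the matching columns
toyData[, toyColIndex]
```

Note that "x" is matched anywhere in a name, so column xy is selected along with x and z.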

      
# alternatively we could combine the names into one regular expression

grep(pattern = "x|z", x = names(toyData))
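
The single pattern does not have to be typed by hand. The sketch below builds it from vars with paste(), and then shows an anchored variant for the case where only whole names should match. The object names anyPattern, exactPattern and exactIndex are made up for this illustration, and the approach assumes the names contain no regular expression metacharacters (such as "." or "[").

```r
toyData <- data.frame(x = 1:4, y = 5:8, xy = 9:12, z = 13:16)
vars <- c("x", "z")

# join the names with "|" so that grep receives one pattern, "x|z"
anyPattern <- paste(vars, collapse = "|")
anyIndex <- grep(pattern = anyPattern, x = names(toyData))

# anchor the alternation with ^ and $ to match whole names only
exactPattern <- paste0("^(", anyPattern, ")$")
exactIndex <- grep(pattern = exactPattern, x = names(toyData))
toyData[, exactIndex]

# when the exact names are known, no regular expression is needed
inIndex <- which(names(toyData) %in% vars)
```

Here anyIndex includes column xy, because "x" matches part of that name, while exactIndex and inIndex pick out only x and z. The same idea applies to the names in the original question: a single string such as "^L[1-8][12]$" can be passed to grep as the pattern, rather than the whole vars vector.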
     
# hope this helps

Best wishes

Chris

Chris Campbell
Mango Solutions
Data Analysis that Delivers
http://www.mango-solutions.com
+44 (0) 1249 705 450  


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Lib Gray
Sent: 20 July 2012 01:17
To: Rui Barradas
Cc: r-help
Subject: Re: [R] Subsetting problem data, 2

I'm still getting the message (if this is what you were suggesting I try).
The data set I'm using has many more columns other than these variables; could that be a problem? I didn't think it would affect it.

> pattern <- "L[1-8][12]"
> nms <- names(data)[grep(vars, names(data))]
Warning message:
In grep(vars, names(data)) :
  argument 'pattern' has length > 1 and only the first element will be used
>

On Thu, Jul 19, 2012 at 6:55 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:

> Hello,
>
> Sorry, forgot about that. It's trickier to write code without a 
> dataset to test it.
>
> Try
>
> pattern <- "L[1-8][12]"
>
> and after the grep print nms to see if it's right.
>
> Rui Barradas
>
> Em 20-07-2012 00:33, Lib Gray escreveu:
>
>> I'm getting this error message:
>>
>> nms<-names(data)[grep(vars,names(data))]
>> Warning message:
>> In grep(vars, names(data)) :
>>    argument 'pattern' has length > 1 and only the first element will be used
>>
>> Is there a way around this?
>>
>>
>> On Thu, Jul 19, 2012 at 6:17 PM, Rui Barradas <ruipbarradas at sapo.pt>
>> wrote:
>>
>>  Hello,
>>>
>>> I guess so, and I can save you some typing.
>>>
>>> vars <- sort(apply(expand.grid("L", 1:8, 1:2), 1, paste, 
>>> collapse=""))
>>>
>>>
>>> Then use it and see the result.
>>>
>>> Rui Barradas
>>>
>>> Em 20-07-2012 00:00, Lib Gray escreveu:
>>>
>>>> The variables are actually L11, L12, L21, L22, ... , L81, L82. Would
>>>> just creating a vector c(L11,... ,L82) be fine? (I'm about to try it,
>>>> but I wanted to check to see if that was going to be a big issue).
>>>>
>>>> On Thu, Jul 19, 2012 at 3:27 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Try the following. The data is your example of Patient A through 
>>>>> E, but from the output of dput().
>>>>>
>>>>> dat <- structure(list(Patient = structure(c(1L, 1L, 1L, 1L, 1L, 
>>>>> 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = 
>>>>> c("A", "B", "C", "D", "E"), class = "factor"), Cycle = c(1L, 2L, 
>>>>> 3L, 4L, 5L, 1L, 2L, 1L, 3L, 4L, 5L, 1L, 2L, 4L, 5L, 1L, 2L, 3L),
>>>>>       V1 = c(0.4, 0.3, 0.3, 0.4, 0.5, 0.4, 0.4, 0.9, 0.3, NA, 0.4,
>>>>>       0.2, 0.5, 0.6, 0.5, 0.1, 0.5, 0.4), V2 = c(0.1, 0.2, NA,
>>>>>       NA, 0.2, NA, NA, 0.9, 0.5, NA, NA, 0.5, 0.7, 0.4, 0.5, NA,
>>>>>       0.3, 0.3), V3 = c(0.5, 0.5, 0.6, 0.4, 0.5, NA, NA, 0.9, 0.6,
>>>>>       NA, NA, NA, NA, NA, NA, NA, NA, NA), V4 = c(1.5, 1.6, 1.7,
>>>>>       1.8, 1.5, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>>>       NA), V5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>>>       NA, NA, NA, NA, NA, NA)), .Names = c("Patient", "Cycle", 
>>>>> "V1", "V2", "V3", "V4", "V5"), class = "data.frame", row.names = 
>>>>> c(NA,
>>>>> -18L))
>>>>>
>>>>> dat
>>>>>
>>>>> nms <- names(dat)[grep("^V[1-9]$", names(dat))]
>>>>> dd <- split(dat, dat$Patient)
>>>>> fun <- function(x) any(is.na(x)) && any(!is.na(x))
>>>>> ix <- sapply(dd, function(x) Reduce(`|`, lapply(x[, nms], fun)))
>>>>>
>>>>> dd[ix]
>>>>> do.call(rbind, dd[ix])
>>>>>
>>>>>
>>>>> I'm assuming that the variable names are as posted, V followed by
>>>>> a single digit 1-9. To keep the patients with complete cases,
>>>>> just negate the index 'ix'; it is a logical index.
>>>>> Note also that dput() is the best way of posting a data example.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Rui Barradas
>>>>>
>>>>> Em 19-07-2012 15:15, Lib Gray escreveu:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I didn't give enough information when I sent a query before, so
>>>>>> I'm trying again with a more detailed explanation:
>>>>>>
>>>>>> In this data set, each patient has a different number of measured
>>>>>> variables (they represent tumors, so some people had 2 tumors,
>>>>>> some had 5, etc). The problem I have is that often in later cycles
>>>>>> for a patient, tumors that were originally measured are now missing
>>>>>> (or a "new" tumor showed up). We assume there are many different
>>>>>> reasons why a tumor would be measured in one cycle and not
>>>>>> another, and so I want to subset OUT the "problem" patients to
>>>>>> better study these patterns.
>>>>>>
>>>>>> An example:
>>>>>>
>>>>>> Patient  Cycle  V1  V2  V3  V4  V5
>>>>>> A  1  0.4  0.1  0.5  1.5  NA
>>>>>> A  2  0.3  0.2  0.5  1.6  NA
>>>>>> A  3  0.3  NA  0.6  1.7  NA
>>>>>> A  4  0.4  NA  0.4  1.8  NA
>>>>>> A  5  0.5  0.2  0.5  1.5  NA
>>>>>>
>>>>>> I want to keep patient A; they have 4 measured tumors, but tumor 
>>>>>> 2 is missing data for cycles 3 and 4
>>>>>>
>>>>>> B  1  0.4  NA  NA  NA  NA
>>>>>> B  2  0.4  NA  NA  NA  NA
>>>>>>
>>>>>> I do not want to keep patient B; they have 1 tumor that is
>>>>>> measured consistently in both cycles
>>>>>>
>>>>>> C  1  0.9  0.9  0.9  NA  NA
>>>>>> C  3  0.3  0.5  0.6  NA  NA
>>>>>> C  4  NA  NA  NA  NA  NA
>>>>>> C  5  0.4  NA  NA  NA  NA
>>>>>>
>>>>>> I do want to keep patient C; all their data is missing for cycle
>>>>>> 4, and cycle 5 only measured one tumor
>>>>>>
>>>>>> D  1  0.2  0.5  NA  NA  NA
>>>>>> D  2  0.5  0.7  NA  NA  NA
>>>>>> D  4  0.6  0.4  NA  NA  NA
>>>>>> D  5  0.5  0.5  NA  NA  NA
>>>>>>
>>>>>> I do not want patient D; their two tumors were measured each
>>>>>> cycle
>>>>>>
>>>>>> E  1  0.1  NA  NA  NA  NA
>>>>>> E  2  0.5  0.3  NA  NA  NA
>>>>>> E  3  0.4  0.3  NA  NA  NA
>>>>>>
>>>>>> I DO want patient E; they only had one tumor register in Cycle 1, 
>>>>>> but cycles 2 and 3 had two tumors.
>>>>>>
>>>>>>
>>>>>> Thanks for any help!
>>>>>>
>

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
