[R] selecting dataframe columns based on substring of col name(s)
Evan Cooch
evan.cooch at gmail.com
Thu Jun 22 14:47:09 CEST 2017
Thanks to all the good suggestions/solutions to the original problem.
On 6/21/2017 3:28 PM, David Winsemius wrote:
>> On Jun 21, 2017, at 9:11 AM, Evan Cooch <evan.cooch at gmail.com> wrote:
>>
>> Suppose I have the following sort of dataframe, where each column name has a common structure: prefix, followed by a number (for this example, col1, col2, col3 and col4):
>>
>> d = data.frame( col1=runif(10), col2=runif(10), col3=runif(10),col4=runif(10))
>>
>> What I haven't been able to suss out is how to efficiently 'extract/manipulate/play with' columns from the data frame, making use of this common structure.
>>
>> Suppose, for example, I want to 'work with' col2, col3, and col4. Now, I could subset the dataframe d in any number of ways -- for example
>>
>> piece <- d[,c("col2","col3","col4")]
>>
>> Works as expected, but for *big* problems (where I might have dozens -> hundreds of columns -- often the case with big design matrices output by some linear models program or another), having to write them all out using c("col2","col3",...."colXXXXX") takes a lot of time. What I'm wondering about is if there is a way to simply select over the "changing part" of the column name (you can do this relatively easily in a data step in SAS, for example). Heuristically, something like:
>>
>> piece <- d[,col2:col4]
>>
>> where the heuristic col2:col4 is interpreted as col2 -> col4 (parse the prefix 'col', and then simply select over the changing suffix -- i.e., column number).
>>
>> Now, if I use the "to" function in the lessR package, I can get there from here fairly easily:
>>
>> piece <- d[,to("col",4,from=2,same.size=FALSE)]
>>
>> But, is there a better way? Beyond 'efficiency' (ease of implementation), part of what constitutes 'better' might be something in base R, rather than relying on a package?
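
For the record, a minimal base-R sketch of the kind of thing asked for above, assuming the column names really are a fixed prefix followed by an integer. The names `wanted`, `is_col`, `suffix` and `piece2` are just illustrative, and the regular expression is an assumption about how the names look:

```r
# Same shape as the example data frame in the question.
d <- data.frame(col1 = runif(10), col2 = runif(10),
                col3 = runif(10), col4 = runif(10))

# Option 1: build the names col2..col4 from the shared prefix
# and a numeric range, then index by name.
wanted <- paste0("col", 2:4)
piece <- d[, wanted]

# Option 2: find the columns that match the prefix pattern,
# pull out the numeric suffix, and keep those in the range 2..4.
is_col <- grepl("^col[0-9]+$", names(d))
suffix <- as.integer(sub("^col", "", names(d)[is_col]))
piece2 <- d[, names(d)[is_col][suffix >= 2 & suffix <= 4]]
```

Option 1 is the shorter one when the numbering is known in advance; option 2 does not depend on the columns being stored in any particular order.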
> After staring at the code for the base function subset, with a thought to hacking it to do this, I realized that this is already part of what it does in its current form:
>
> names(airquality)
> #[1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
>
> subset(airquality,
> Temp > 90, # this is the row selection
> select = Ozone:Solar.R) # and this selects columns
> #--------
>     Ozone Solar.R
> 42     NA     259
> 43     NA     250
> 69     97     267
> 70     97     272
> 75     NA     291
> 102    NA     222
> 120    76     203
> 121   118     225
> 122    84     237
> 123    85     188
> 124    96     167
> 125    78     197
> 126    73     183
> 127    91     189
>
> Bert's advice to work with the numbers is good, but conversion to numeric designations of columns inside the `select`-expression is actually what is occurring inside `subset`.
>
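
Applied to the original example, the `select` approach and the name-to-position conversion described above might look like the sketch below. The second half only imitates the idea (bind each column name to its position, then evaluate the expression); it is not a copy of the `subset` source, and `pos`, `idx` and `by_position` are names made up for illustration:

```r
# Same shape as the example data frame in the question.
d <- data.frame(col1 = runif(10), col2 = runif(10),
                col3 = runif(10), col4 = runif(10))

# Select a run of adjacent columns by name, in base R.
piece <- subset(d, select = col2:col4)

# Imitation of the conversion: each column name stands for its
# position, so the expression col2:col4 evaluates to 2:4.
pos <- as.list(seq_along(d))
names(pos) <- names(d)
idx <- eval(quote(col2:col4), pos)

# Indexing by those positions picks out the same columns.
by_position <- d[, idx]
```

Note that `col2:col4` here means "from the position of col2 to the position of col4", so it depends on the order of the columns in the data frame, not on the numbers in their names.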