[R] selecting dataframe columns based on substring of col name(s)

Thu Jun 22 14:47:09 CEST 2017

Thanks to all the good suggestions/solutions to the original problem.

On 6/21/2017 3:28 PM, David Winsemius wrote:
>> On Jun 21, 2017, at 9:11 AM, Evan Cooch <evan.cooch at gmail.com> wrote:
>>
>> Suppose I have the following sort of dataframe, where each column name has a common structure: prefix, followed by a number (for this example, col1, col2, col3 and col4):
>>
>> d = data.frame( col1=runif(10), col2=runif(10), col3=runif(10),col4=runif(10))
>>
>> What I haven't been able to suss out is how to efficiently 'extract/manipulate/play with' columns from the data frame, making use of this common structure.
>>
>> Suppose, for example, I want to 'work with' col2, col3, and col4. Now, I could subset the dataframe d in any number of ways -- for example
>>
>> piece <- d[,c("col2","col3","col4")]
>>
>> Works as expected, but for *big* problems (where I might have dozens -> hundreds of columns -- often the case with big design matrices output by some linear models program or another), having to write them all out using c("col2","col3",...."colXXXXX") takes a lot of time. What I'm wondering about is if there is a way to simply select over the "changing part" of the column name (you can do this relatively easily in a data step in SAS, for example). Heuristically, something like:
>>
>> piece <- d[,col2:col4]
>>
>> where the heuristic col2:col4 is interpreted as col2 -> col4 (parse the prefix 'col', and then simply select over the changing suffix -- i.e., column number).
>>
>> Now, if I use the "to" function in the lessR package, I can get there from here fairly easily:
>>
>> piece <- d[,to("col",4,from=2,same.size=FALSE)]
>>
>> But, is there a better way? Beyond 'efficiency' (ease of implementation), part of what constitutes 'better' might be something in base R, rather than relying on a package?
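
One possible base-R sketch of the "prefix plus changing number" idea, assuming (as the lessR call above implies) that the goal is the names col2, col3 and col4. The object names `wanted`, `piece_by_name` and `piece_by_pattern` are made up for illustration, and set.seed() is only there to make the example reproducible.

```r
# Rebuild the example data frame from the original question.
set.seed(1)
d <- data.frame(col1 = runif(10), col2 = runif(10),
                col3 = runif(10), col4 = runif(10))

# Build the column names from the common prefix and the changing number.
wanted <- paste0("col", 2:4)
piece_by_name <- d[, wanted]

# Alternative: match the names against a pattern. grep() returns the
# positions of the matching names; drop = FALSE keeps a data frame even
# if only one column matches.
piece_by_pattern <- d[, grep("^col[2-4]$", names(d)), drop = FALSE]

names(piece_by_name)
```

Both forms scale to hundreds of columns by changing only the 2:4 (or the pattern), which is the point of the question.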
> After staring at the code for the base function subset, with a thought to hacking it to do this, I realized that this is already part of how it evaluates its `select` argument:
>
>   names(airquality)
> #[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
>
> subset(airquality,
>            Temp > 90,             # this is the row selection
>            select = Ozone:Solar.R) # and this selects columns
> #--------
>      Ozone Solar.R
> 42     NA     259
> 43     NA     250
> 69     97     267
> 70     97     272
> 75     NA     291
> 102    NA     222
> 120    76     203
> 121   118     225
> 122    84     237
> 123    85     188
> 124    96     167
> 125    78     197
> 126    73     183
> 127    91     189
>
> Bert's advice to work with the numbers is good, but conversion to numeric designations of columns inside the `select`-expression is actually what is occurring inside `subset`.
>
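
Applied to the data frame from the original question, the same approach might look like the sketch below. It assumes the wanted columns sit next to each other in d, since col2:col4 is turned into a range of column positions. The second half imitates, in simplified form, the name-to-number conversion described above; the names `nl`, `idx` and `piece_by_index` are illustrative only.

```r
# Rebuild the example data frame from the original question.
set.seed(1)
d <- data.frame(col1 = runif(10), col2 = runif(10),
                col3 = runif(10), col4 = runif(10))

# Column range by name, all rows kept (no row condition supplied).
piece <- subset(d, select = col2:col4)

# Simplified imitation of the conversion: map each column name to its
# position, then evaluate the unquoted expression against that mapping.
nl <- as.list(seq_along(d))
names(nl) <- names(d)
idx <- eval(quote(col2:col4), nl)   # col2 is 2 and col4 is 4, so this is 2:4

piece_by_index <- d[, idx]

names(piece)
```

Because the expression is evaluated on positions, col2:col4 picks up whatever columns lie between col2 and col4 in the data frame, whatever their names.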
