[R] selecting dataframe columns based on substring of col name(s)

Wed Jun 21 21:28:20 CEST 2017

> On Jun 21, 2017, at 9:11 AM, Evan Cooch <evan.cooch at gmail.com> wrote:
> 
> Suppose I have the following sort of dataframe, where each column name has a common structure: prefix, followed by a number (for this example, col1, col2, col3 and col4):
> 
> d = data.frame( col1=runif(10), col2=runif(10), col3=runif(10),col4=runif(10))
> 
> What I haven't been able to suss out is how to efficiently 'extract/manipulate/play with' columns from the data frame, making use of this common structure.
> 
> Suppose, for example, I want to 'work with' col2, col3, and col4. Now, I could subset the dataframe d in any number of ways -- for example
> 
> piece <- d[,c("col2","col3","col4")]
> 
> Works as expected, but for *big* problems (where I might have dozens -> hundreds of columns -- often the case with big design matrices output by some linear models program or another), having to write them all out using c("col2","col3",...."colXXXXX") takes a lot of time. What I'm wondering about is if there is a way to simply select over the "changing part" of the column name (you can do this relatively easily in a data step in SAS, for example). Heuristically, something like:
> 
> piece <- d[,col2:col4]
> 
> where the heuristic col2:col4 is interpreted as col2 -> col4 (parse the prefix 'col', and then simply select over the changing suffix -- i.e., column number).
> 
> Now, if I use the "to" function in the lessR package, I can get there from here fairly easily:
> 
> piece <- d[,to("col",4,from=2,same.size=FALSE)]
> 
> But, is there a better way? Beyond 'efficiency' (ease of implementation), part of what constitutes 'better' might be something in base R, rather than relying on a package?

After staring at the code for the base function `subset` with a thought to hacking it to do this, I realized that this behavior is already available in its current form: a range of column names given to `select` picks out every column from the first name through the last:

 names(airquality)
#[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"  

subset(airquality, 
          Temp > 90,             # this is the row selection
          select = Ozone:Solar.R) # and this selects columns
#--------
    Ozone Solar.R
42     NA     259
43     NA     250
69     97     267
70     97     272
75     NA     291
102    NA     222
120    76     203
121   118     225
122    84     237
123    85     188
124    96     167
125    78     197
126    73     183
127    91     189
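
The same construct should carry over to the data frame from the original question. A minimal sketch, assuming the `d` defined there; the `set.seed` call is my own addition, only there to make the example reproducible:

```r
# Rebuild the example data frame from the original question
set.seed(1)  # assumption: not in the original post, added for reproducibility
d <- data.frame(col1 = runif(10), col2 = runif(10),
                col3 = runif(10), col4 = runif(10))

# No row condition is given, so all rows are kept;
# select takes the range of column names directly
piece <- subset(d, select = col2:col4)

names(piece)
```

This relies only on base R, and the range is resolved by column position, so it picks up whatever columns sit between col2 and col4 in the data frame, regardless of their names.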

Bert's advice to work with the column numbers is good; in fact, converting column names to their numeric positions is exactly what `subset` does internally when it evaluates the `select` expression.
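
To make that conversion concrete, here is a rough sketch of the mechanism as I read it. It is a simplified reconstruction rather than the actual source of `subset`, and the names `nl`, `idx` and `idx2` are just placeholders of mine:

```r
# Build a list that maps each column name to its position
nl <- as.list(seq_along(airquality))
names(nl) <- names(airquality)

# Evaluating the name range in that list turns it into a numeric range
idx <- eval(quote(Ozone:Solar.R), nl)
idx
# [1] 1 2

# The same positions obtained explicitly, without non-standard evaluation
idx2 <- match("Ozone", names(airquality)):match("Solar.R", names(airquality))

# Plain indexing with those positions gives the same rows and columns
# (Temp has no missing values, so the logical row index is safe here)
res <- airquality[airquality$Temp > 90, idx]
```

The `match` form is handy inside functions, where the non-standard evaluation used by `subset` can be awkward to program with.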

-- 

David Winsemius
Alameda, CA, USA