[R] selecting dataframe columns based on substring of col name(s)
David Winsemius
dwinsemius at comcast.net
Wed Jun 21 21:28:20 CEST 2017
> On Jun 21, 2017, at 9:11 AM, Evan Cooch <evan.cooch at gmail.com> wrote:
>
> Suppose I have the following sort of dataframe, where each column name has a common structure: prefix, followed by a number (for this example, col1, col2, col3 and col4):
>
> d = data.frame( col1=runif(10), col2=runif(10), col3=runif(10),col4=runif(10))
>
> What I haven't been able to suss out is how to efficiently 'extract/manipulate/play with' columns from the data frame, making use of this common structure.
>
> Suppose, for example, I want to 'work with' col2, col3, and col4. Now, I could subset the dataframe d in any number of ways -- for example
>
> piece <- d[,c("col2","col3","col4")]
>
> Works as expected, but for *big* problems (where I might have dozens -> hundreds of columns -- often the case with big design matrices output by some linear models program or another), having to write them all out using c("col2","col3",...."colXXXXX") takes a lot of time. What I'm wondering about is if there is a way to simply select over the "changing part" of the column name (you can do this relatively easily in a data step in SAS, for example). Heuristically, something like:
>
> piece <- d[,col2:col4]
>
> where the heuristic col2:col4 is interpreted as col2 -> col4 (parse the prefix 'col', and then simply select over the changing suffix -- i.e., column number).
>
> Now, if I use the "to" function in the lessR package, I can get there from here fairly easily:
>
> piece <- d[,to("col",4,from=2,same.size=FALSE)]
>
> But, is there a better way? Beyond 'efficiency' (ease of implementation), part of what constitutes 'better' might be something in base R, rather than relying on a package?
After staring at the code for the base function `subset`, with a thought to hacking it to do this, I realized that this is already part of how it evaluates its `select` argument in its current form:
names(airquality)
#[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
subset(airquality,
       Temp > 90,               # this is the row selection
       select = Ozone:Solar.R)  # and this selects columns
#--------
    Ozone Solar.R
42     NA     259
43     NA     250
69     97     267
70     97     272
75     NA     291
102    NA     222
120    76     203
121   118     225
122    84     237
123    85     188
124    96     167
125    78     197
126    73     183
127    91     189
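Applied to the data frame from the original question, the same idiom would look roughly like this. It is a minimal sketch that rebuilds `d` as in the question and omits the row condition, so all rows are kept:

```r
# Rebuild the example data frame from the original question
d <- data.frame(col1 = runif(10), col2 = runif(10),
                col3 = runif(10), col4 = runif(10))

# No row condition is given, so every row is kept;
# `select` takes the range of column names col2 through col4
piece <- subset(d, select = col2:col4)

names(piece)
#[1] "col2" "col3" "col4"
```

Note that the range follows the order of the columns in the data frame, not the numeric suffix in their names, so this only gives "col2 through col4" when the columns are stored in that order.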
Bert's advice to work with the numbers is good, but conversion to numeric designations of columns inside the `select`-expression is actually what is occurring inside `subset`.
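To illustrate that conversion, here is a rough sketch of the kind of evaluation `subset` performs on `select`: the column names are mapped to their integer positions, and the expression is evaluated in that mapping. The names `nl` and `idx` are just illustrative choices here, and `d` is rebuilt as in the original question:

```r
# Rebuild the example data frame from the original question
d <- data.frame(col1 = runif(10), col2 = runif(10),
                col3 = runif(10), col4 = runif(10))

# Map each column name to its integer position: col1 = 1, col2 = 2, ...
nl <- as.list(seq_along(d))
names(nl) <- names(d)

# Evaluate the unevaluated expression col2:col4 with the names bound
# to their positions, so it becomes the integer range 2:4
idx <- eval(quote(col2:col4), nl)
idx
#[1] 2 3 4

# Ordinary numeric column indexing then does the selection
piece <- d[, idx, drop = FALSE]
```

So `subset(d, select = col2:col4)` and `d[, 2:4]` end up selecting the same columns; the former simply lets you write the range with names.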
--
David Winsemius
Alameda, CA, USA