[R] data-mining with multiple filters applied to multiple columns

Morway, Eric emorway at usgs.gov
Wed Apr 8 20:47:32 CEST 2015


Using this representative dataset of a much larger dataset:

dat <- read.table(textConnection("ISEG  IRCH  div  gw
 1   1   265  229
 1   2   260  298
 1   3   234  196
54   1   432  485
54  39   467  485
54  40   468  468
54  41   460  381
54  42   489  502
 1   1   265  317
 1   2   276  225
 1   3   217  164
54   1   430  489
54  39   456  495
54  40   507  607
54  41   483  424
54  42   457  404
 1   1   265  278
 1   2   287  370
 1   3   224  274
54   1   412  585
54  39   473  532
54  40   502  595
54  41   497  441
54  42   447  467
 1   1   230  258
 1   2   251  152
 1   3   199  179
54   1   412  415
54  39   439  538
54  40   474  486
54  41   477  484
54  42   413  346
 1   1   230  171
 1   2   262  171
 1   3   217  263
54   1   432  485
54  39   455  482
54  40   493  419
54  41   489  536
54  42   431  504
 1   1  1002  1090
 1   2  1222  1178
 1   3  1198  1177
54   1  1432  1485
54  39  1876  1975
54  40  1565  1646
54  41  1455  1451
54  42  1427  1524
 1   1  1002  968
 1   2  1246  1306
 1   3  1153  1158
54   1  1532  1585
54  39  1790  1889
54  40  1490  1461
54  41  1518  1536
54  42  1486  1585
 1   1  1002  1081
 1   2  1229  1262
 1   3  1142  1241
54   1  1632  1659
54  39  1797  1730
54  40  1517  1466
54  41  1527  1589
54  42  1514  1612"),header=TRUE)

dat$seq <- ifelse(dat$ISEG==1 & dat$IRCH==1, 1, 0)
tmp <- diff(dat[dat$seq==1,]$div)!=0
dat$idx <- 0
dat[dat$seq==1,][c(TRUE,tmp),]$idx <- 1
dat$ts <- cumsum(dat$idx)
dat$iter <- ave(dat$seq, dat$ts,FUN=cumsum)
dat$ct <- seq(1:length(dat[,1]))

dat[1,]
#  ISEG IRCH div  gw seq idx ts iter ct
#     1    1 265 229   1   1  1    1  1

I'm attempting to carry out 2 data mining tasks:

1)  for each time step ('ts') and each ISEG within those time steps, I want
to difference the max and min 'div' values and store the result.  I got
close with this command, but the results were repeated:

with(subset(dat,ts==1 & IRCH==1),
ave(div,ISEG,FUN=function(x){max(x)-min(x)}))
#  0 20  0 20  0 20

There are a few shortcomings with this line of script that I'm hoping could
be improved upon: (1) I hard-coded ts==1, ideally, the R script would
iterate over all 'ts', (2) I really only need the results printed once ("0
 20"), and (3) it would be nice to store the results in something like this
(the results shown next are from a brute-force hand-calculation on the dat
data.frame):

ISEG IRCH ts div_diff
   1    1  1        0
  54    1  1       20
   1    1  2        0
  54    1  2       20
   1    1  3        0
  54    1  3      200

2) The second data-mining attempt is a bit more convoluted and to
demonstrate what I'd like to get, here it is in parts.  First:

aggregate(gw ~ ISEG + iter, subset(dat,ts==1), sum)
  ISEG iter   gw
1    1    1  723
2   54    1 2321
3    1    2  706
4   54    2 2419
5    1    3  922
6   54    3 2620

Once again, ts==1 is hard-coded, but this would ideally loop through all
unique 'ts' in the dataset.  Next, with this result, I'd like to difference
the maximum and minimum 'gw' values associated with each ISEG.  I tried:

with(aggregate(gw ~ ISEG + iter, subset(dat,ts==1), sum), ave(gw, ISEG,
function(x){max(x)-min(x)}))
#Error in unique.default(x, nmax = nmax) :
#  unique() applies only to vectors

but didn't know what to do with the error.  for the result above, the
answer I'm seeking is:

ISEG diff
   1  216
  54  289

For ts==1, the value of 216 results from 922 - 706 [max(gw) - min(gw) for
ISEG==1] and the value of 289 results from 2620 - 2321 [max(gw) - min(gw)
for ISEG==54].  So, the culmination of what I'm after would be the result
from data-mining effort 1 + data-mining effort 2:

ISEG IRCH ts div_diff gw_diff
   1    1  1        0     216
  54    1  1       20     289
   1    1  2        0      16
  54    1  2       20     157
   1    1  3        0     152
  54    1  3      200      25

	[[alternative HTML version deleted]]



More information about the R-help mailing list