[R] data-mining with multiple filters applied to multiple columns
Morway, Eric
emorway at usgs.gov
Wed Apr 8 20:47:32 CEST 2015
Using this representative dataset of a much larger dataset:
dat <- read.table(textConnection("ISEG IRCH div gw
1 1 265 229
1 2 260 298
1 3 234 196
54 1 432 485
54 39 467 485
54 40 468 468
54 41 460 381
54 42 489 502
1 1 265 317
1 2 276 225
1 3 217 164
54 1 430 489
54 39 456 495
54 40 507 607
54 41 483 424
54 42 457 404
1 1 265 278
1 2 287 370
1 3 224 274
54 1 412 585
54 39 473 532
54 40 502 595
54 41 497 441
54 42 447 467
1 1 230 258
1 2 251 152
1 3 199 179
54 1 412 415
54 39 439 538
54 40 474 486
54 41 477 484
54 42 413 346
1 1 230 171
1 2 262 171
1 3 217 263
54 1 432 485
54 39 455 482
54 40 493 419
54 41 489 536
54 42 431 504
1 1 1002 1090
1 2 1222 1178
1 3 1198 1177
54 1 1432 1485
54 39 1876 1975
54 40 1565 1646
54 41 1455 1451
54 42 1427 1524
1 1 1002 968
1 2 1246 1306
1 3 1153 1158
54 1 1532 1585
54 39 1790 1889
54 40 1490 1461
54 41 1518 1536
54 42 1486 1585
1 1 1002 1081
1 2 1229 1262
1 3 1142 1241
54 1 1632 1659
54 39 1797 1730
54 40 1517 1466
54 41 1527 1589
54 42 1514 1612"),header=TRUE)
dat$seq <- ifelse(dat$ISEG==1 & dat$IRCH==1, 1, 0)
tmp <- diff(dat[dat$seq==1,]$div)!=0
dat$idx <- 0
dat[dat$seq==1,][c(TRUE,tmp),]$idx <- 1
dat$ts <- cumsum(dat$idx)
dat$iter <- ave(dat$seq, dat$ts,FUN=cumsum)
dat$ct <- seq(1:length(dat[,1]))
dat[1,]
# ISEG IRCH div gw seq idx ts iter ct
# 1 1 265 229 1 1 1 1 1
I'm attempting to carry out 2 data mining tasks:
1) for each time step ('ts') and each ISEG within those time steps, I want
to difference the max and min 'div' values and store the result. I got
close with this command, but the results were repeated:
with(subset(dat,ts==1 & IRCH==1),
ave(div,ISEG,FUN=function(x){max(x)-min(x)}))
# 0 20 0 20 0 20
There are a few shortcomings with this line of script that I'm hoping could
be improved upon: (1) I hard-coded ts==1, ideally, the R script would
iterate over all 'ts', (2) I really only need the results printed once ("0
20"), and (3) it would be nice to store the results in something like this
(the results shown next are from a brute-force hand-calculation on the dat
data.frame):
ISEG IRCH ts div_diff
1 1 1 0
54 1 1 20
1 1 2 0
54 1 2 20
1 1 3 0
54 1 3 200
2) The second data-mining attempt is a bit more convoluted and to
demonstrate what I'd like to get, here it is in parts. First:
aggregate(gw ~ ISEG + iter, subset(dat,ts==1), sum)
ISEG iter gw
1 1 1 723
2 54 1 2321
3 1 2 706
4 54 2 2419
5 1 3 922
6 54 3 2620
Once again, ts==1 is hard-coded, but this would ideally loop through all
unique 'ts' in the dataset. Next, with this result, I'd like to difference
the maximum and minimum 'gw' values associated with each ISEG. I tried:
with(aggregate(gw ~ ISEG + iter, subset(dat,ts==1), sum), ave(gw, ISEG,
function(x){max(x)-min(x)}))
#Error in unique.default(x, nmax = nmax) :
# unique() applies only to vectors
but didn't know what to do with the error. for the result above, the
answer I'm seeking is:
ISEG diff
1 216
54 289
For ts==1, the value of 216 results from 922 - 706 [max(gw) - min(gw) for
ISEG==1] and the value of 289 results from 2620 - 2321 [max(gw) - min(gw)
for ISEG==54]. So, the culmination of what I'm after would be the result
from data-mining effort 1 + data-mining effort 2:
ISEG IRCH ts div_diff gw_diff
1 1 1 0 216
54 1 1 20 289
1 1 2 0 16
54 1 2 20 157
1 1 3 0 152
54 1 3 200 25
[[alternative HTML version deleted]]
More information about the R-help
mailing list