[R] Manipulating groups of boolean data subject to group size and distance from other groups

Mon Nov 28 23:20:49 CET 2016

To help with the clarification, I renamed 'col1' to 'year' and 'col2' to
'origDat'.  With that said...

The reason the second 'block' of 1's (four consecutive 1's appearing in
DF$origDat[11:14]) is preserved is because they are only separated by a
total of 1 year (1998 in DF$year) from a larger group of consecutive 1's
(years 1999 through 2002).  Because the first block of 1's are separated
from from any other block of ones by at least 2 years, which I have deemed
to be too large of a gap in data (0's are a surrogate for missing data),
the 1's appearing in DF$year[3:6] should be reset to 0.

I modified the script based on David's suggestion of rle (I was previously
unaware of it) to that shown below, and it works for all three example DF's
provided at the top of the script. That is, after running the script with
any of the first 3 DF's provided, the data in DF$finalDat (as compared to
DF$origDat) is reflective of what I'm after.

HOWEVER, the use of nested while loops and if statements strikes me as
antithetical to elegant R scripting.  Second, my script, as currently
constituted, has a significant bug in that the rules I've set forth are not
completely satisfied.  If DF4 is used (uncomment the line: "DF <- DF4") the
blocks of 1's at the beginning and end of DF$origDat are preserved, whereas
the middle (and largest continuous) block of 1's appearing in the middle of
DF$origDat are reset to 0.  Thus, I think I'm in need of a more elegant way
of pursuing this problem...should anyone be so inclined to offer of
additional thoughts.

The (semi-) working script using rle is:

DF <- data.frame(year=rep(1991:2004, each=2),

 origDat=c(0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1))

#DF <- data.frame(year=rep(1991:2004, each=2),
#
origDat=c(1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1))

#DF <- data.frame(year=rep(1991:2004, each=2),
#
origDat=c(1,1,1,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,0,0,1,1,1,1,1,1,1,1))

# An example that doesn't work
DF4 <- data.frame(year=rep(1991:2004, each=2),

 origDat=c(1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1))
#DF <- DF4

DF$inc <- c(1, abs(diff(DF$origDat)))
DF$cumsum <- cumsum(DF$inc)

ex1 <- aggregate(year ~ cumsum, data=DF, function(x) length(unique(x)))
names(ex1) <- c('cumsum','isl')

tmp1a <- merge(DF, ex1, by="cumsum", all.x=TRUE)
tmp1a$isl2 <- (-1*tmp1a$origDat) * tmp1a$isl
tmp1a$isl2[tmp1a$isl2==0] <- tmp1a$isl[tmp1a$isl2==0]
tmp1a$isl2 <- -1 * tmp1a$isl2

DF$grpng <- tmp1a$isl2

runlen <- data.frame(cumsum = seq(1:length(rle(DF$grpng)$lengths)),
                     len = rle(DF$grpng)$lengths,
                     val = rle(DF$grpng)$values)

i <- 1
while(i <= nrow(runlen)){
  if(runlen[i,'val'] >= 2){  # As long as a '-2' or smaller doesn't follow,
                            # then the current group of data is NOT
                            # too 'distant' from other data and should be
                            # preserved.  Otherwise, the current grp of
                            # 1's should be reset to 0
    j <- i + 1
    while(j <= nrow(runlen)){
      if(runlen[j,'val'] <= -2){
        # If code enters here, then swich the sign of 'val' to
        # effectively inactivate this block of 1's
        runlen[i,'val'] <- -1 * runlen[i,'val']
      }
      #print(paste0("j: ",as.character(j)))
      j <- j + 1
    }
  } else if (runlen[i,'val'] > 0 & runlen[i,'val'] < 2){
    # If the script enters here, then the current group of data
    # doesn't meet the minimum continuous length requirement of
    # 2 or more years (in this example a check of >0 & <2 seems
    # silly, but in the real-world dataset 2 will be replaced with
    # a much larger example.
    runlen[i,'val'] <- -1 * runlen[i,'val']
  }
  #print(paste0("i: ",as.character(i)))
  i <- i + 1
}

runlen$finalDat <- ifelse(runlen$val < 0, 0, 1)
DF <- merge(DF, runlen, by = 'cumsum', all.x = TRUE)
DF

	[[alternative HTML version deleted]]