[R] on how to make a skip-table

Fri Sep 13 06:18:28 CEST 2013

It is a nice surprise to wake up receiving three answers, all producing 
correct results. Many thanks to all of you.

Jim Holtman solved it with amazing clarity. Gang Peng using a traditioanl 
C-like pointer style and Arun with awesome tight code thanks to diff().

I am embrassed to see my mis-spellings inherited in the answers ('lenths' 
should be 'lengths' and 'valida' should be 'valid'). This experience is to 
behove me to not to code in midnight again.

For anyone wishing to test these methods, I have compiled them all into one 
R script file, pasted at the end of this email.

Jim Holtman asked me to elaborate the problem:

     It is a common problem in reading sparse variable-lenght record data
     file.  Records are stored in file one next to another. The length of
     each record is known in advance, but a lot of them records are invalid,
     and should be skipped to make efficient use of memory.

     Ideally the datafile-reading routine should receive a skip-table. Before
     reading each wanted/valid record, it seeks forward for the distance
     given in the skip-table. The problem is how to obtain such a skip table.

     What we have at hand to produce the skip table, is a set of two data
     frames: a record.lengths data frame about each record's length, and a
     valid.records data frame about which records are significant and should
     be read.

--

###### input data:

record.lengths <- read.table(text = "    NR     length
          1       100
          2       130
          3       150
          4       148
          5       100
          6        83
          7        60", header = TRUE)

valid.records <- read.table(text = "  NR     factor
          1       3
          2       4
          4       8
          7       9", header = TRUE)

####### Jim Holtman's method:

x <- merge(record.length, valid.records, by = "NR", all.x = TRUE)
x$seq <- cumsum(!is.na(x$factor))

# need to add 1 to lines with NA to associate with next group
x$seq[is.na(x$factor)] <- x$seq[is.na(x$factor)] + 1

# split by 'seq', output last record and sum of preceeding records
skip.table <- do.call(rbind
      , lapply(split(x, x$seq), function(.sk){
          if (nrow(.sk) > 1) .sk$skip <- sum(.sk$length[1:(nrow(.sk) - 1L)])
          else .sk$skip <- 0
          .sk[nrow(.sk), ] # return first value
          })
      )

print(skip.table)

####### Gang Peng's method:

n.record <- length(record.lengths$NR)
index    <- record.lengths$NR %in% valid.records$NR
tmp <- 1:n.record
ind <- tmp[index]
st  <- 1
skip <- rep(0,length(ind))
for (i in 1:length(ind)) {
 	if(st<ind[i]){
 		skip[i]<-sum(record.lengths$length[st:(ind[i]-1)])
 	}
 	st <- ind[i]+1
}
print(cbind(valid.records,skip))

####### Arun's method:
indx<-diff(valid.records$NR)-1
skip.table<- within(valid.records, {skip<-
with(record.lengths,tapply(length,c(-1,rep(indx,indx+1)),function(x)
sum(x[-length(x)])))})[,c(1,3,2)]
print(skip.table)