[R] on how to make a skip-table
Zhang Weiwu
zhangweiwu at realss.com
Fri Sep 13 06:18:28 CEST 2013
It is a nice surprise to wake up receiving three answers, all producing
correct results. Many thanks to all of you.
Jim Holtman solved it with amazing clarity. Gang Peng using a traditioanl
C-like pointer style and Arun with awesome tight code thanks to diff().
I am embrassed to see my mis-spellings inherited in the answers ('lenths'
should be 'lengths' and 'valida' should be 'valid'). This experience is to
behove me to not to code in midnight again.
For anyone wishing to test these methods, I have compiled them all into one
R script file, pasted at the end of this email.
Jim Holtman asked me to elaborate the problem:
It is a common problem in reading sparse variable-lenght record data
file. Records are stored in file one next to another. The length of
each record is known in advance, but a lot of them records are invalid,
and should be skipped to make efficient use of memory.
Ideally the datafile-reading routine should receive a skip-table. Before
reading each wanted/valid record, it seeks forward for the distance
given in the skip-table. The problem is how to obtain such a skip table.
What we have at hand to produce the skip table, is a set of two data
frames: a record.lengths data frame about each record's length, and a
valid.records data frame about which records are significant and should
be read.
--
###### input data:
record.lengths <- read.table(text = " NR length
1 100
2 130
3 150
4 148
5 100
6 83
7 60", header = TRUE)
valid.records <- read.table(text = " NR factor
1 3
2 4
4 8
7 9", header = TRUE)
####### Jim Holtman's method:
x <- merge(record.length, valid.records, by = "NR", all.x = TRUE)
x$seq <- cumsum(!is.na(x$factor))
# need to add 1 to lines with NA to associate with next group
x$seq[is.na(x$factor)] <- x$seq[is.na(x$factor)] + 1
# split by 'seq', output last record and sum of preceeding records
skip.table <- do.call(rbind
, lapply(split(x, x$seq), function(.sk){
if (nrow(.sk) > 1) .sk$skip <- sum(.sk$length[1:(nrow(.sk) - 1L)])
else .sk$skip <- 0
.sk[nrow(.sk), ] # return first value
})
)
print(skip.table)
####### Gang Peng's method:
n.record <- length(record.lengths$NR)
index <- record.lengths$NR %in% valid.records$NR
tmp <- 1:n.record
ind <- tmp[index]
st <- 1
skip <- rep(0,length(ind))
for (i in 1:length(ind)) {
if(st<ind[i]){
skip[i]<-sum(record.lengths$length[st:(ind[i]-1)])
}
st <- ind[i]+1
}
print(cbind(valid.records,skip))
####### Arun's method:
indx<-diff(valid.records$NR)-1
skip.table<- within(valid.records, {skip<-
with(record.lengths,tapply(length,c(-1,rep(indx,indx+1)),function(x)
sum(x[-length(x)])))})[,c(1,3,2)]
print(skip.table)
More information about the R-help
mailing list