[R] How to pre-filter large amounts of data effectively
Torsten Schindler
Torsten.Schindler at chello.at
Tue Aug 9 12:53:50 CEST 2005
You are right, but unfortunately this is not the limiting step or
bottleneck in the code below.
The filter.const() function is only used to find the non-constant
columns of the training data set, which is small (49 rows and 525
columns), and that call takes about 2 seconds on my PowerBook.
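
If anyone wants to reproduce that timing, something along these lines should
do, with the same training_data.txt as in the script below:

  training <- read.csv('training_data.txt')
  system.time(filter.const(training[,-1]))   # roughly 2 seconds here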
After filtering the training data set, just the list of column names
is used to filter the huge "prediction.set".
I think the really time- and memory-consuming part is the for-loop
below, but I don't know how to improve it.
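
One direction that might avoid the loop entirely (a rough, untested sketch,
not part of the script below, reusing its input.file and
training.filtered.property.colnames): read.csv skips a column when the
corresponding entry of colClasses is "NULL", so only the 249 kept columns
would ever be parsed and the whole file could be read and saved in one pass:

  # Column names of the big file, with the trailing _<number> stripped
  hdr.names <- colnames(read.csv(input.file, header=TRUE, nrows=1))
  hdr.names <- sub('_\\d+$', '', hdr.names, perl=TRUE)

  # "NULL" makes read.csv skip a column, NA means "guess the type".
  # The extra leading NA is for the unnamed row-ID column that the block
  # loop below handles with row.names=1 (an assumption about the file layout).
  classes <- rep("NULL", length(hdr.names))
  classes[hdr.names %in% training.filtered.property.colnames] <- NA
  classes <- c(NA, classes)

  prediction.set.filtered <- read.csv(input.file, header=TRUE, colClasses=classes)
  colnames(prediction.set.filtered) <- sub('_\\d+$', '',
                                           colnames(prediction.set.filtered), perl=TRUE)
  # Put the columns into the same order as in the filtered training set
  prediction.set.filtered <- prediction.set.filtered[training.filtered.property.colnames]
  save(prediction.set.filtered, file='prediction_set_filtered.Rdata')

This still keeps the full ~115,000 x 250 result in memory, but the skipped
columns are never parsed, so both time and memory should end up well below
the current 3 hours / 600 MB (untested on the full file).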
Anyway, thanks for the hint!!!
Best,
Torsten
On Aug 9, 2005, at 12:37 PM, Patrick Burns wrote:
> Building up an object like you do with 'realdata' is very
> wasteful (S Poetry says why). I think you want something
> along the lines of:
>
> if(vectors[1] == 'column') {
> realdata <- apply(X, 2, function(x) diff(range(x))) > tol
> filteredX <- X[, realdata]
> } else {
> realdata <- apply(X, 1, function(x) diff(range(x))) > tol
> filteredX <- X[realdata, ]
> }
>
> Patrick Burns
> patrick at burns-stat.com
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
>
> Torsten Schindler wrote:
>
>
>> Hi,
>>
>> I'm an R newbie and want to accelerate the following pre-filtering
>> step of a data set with more than 115,000 rows:
>>
>> #-----------------
>> # Function to filter out constant data columns
>> filter.const <- function(X, vectors=c('column', 'row'), tol=0){
>>   realdata <- c()
>>   filteredX <- matrix()
>>   if( vectors[1] == 'row' ){
>>     # Keep rows with more than 'tol' entries differing from the row median
>>     for( row in 1:nrow(X) ){
>>       if( length(which(X[row,] != median(X[row,]))) > tol ){
>>         realdata[length(realdata)+1] <- row
>>       }
>>     }
>>     filteredX <- X[realdata,]
>>   } else if( vectors[1] == 'column' ){
>>     # Keep columns with more than 'tol' entries differing from the column median
>>     for( col in 1:ncol(X) ){
>>       if( length(which(X[,col] != median(X[,col]))) > tol ){
>>         realdata[length(realdata)+1] <- col
>>       }
>>     }
>>     filteredX <- X[,realdata]
>>   }
>>   return(list(x=filteredX, ix=realdata))
>> }
>>
>> #-----------------
>> # Filter out the all-constant columns in my training data set
>> #
>> # Read training data set with class information in the first column
>> training <- read.csv('training_data.txt')
>> dim(training) # => 49 rows and 525 columns
>>
>> # Prepare column names by stripping the underscore and the number at the end
>> colnames(training) <- sub('_\\d+$', '', colnames(training), perl=TRUE)
>>
>> # Filter out the all-constant columns, excluding column 1, the class column called myclass
>> training.filter <- filter.const(training[,-1])
>>
>> # Reassemble the filtered data frame with the class column in front
>> training.filtered <- cbind(myclass=training[,1], training.filter$x)
>> dim(training.filtered) # => 49 rows and 250 columns
>>
>> # Save the filtered training set for later use in classification
>> filtered.data <- 'training_set_filtered.Rdata'
>> save(training.filtered, file=filtered.data)
>>
>> #-----------------
>> # THE FOLLOWING FILTERING STEP TAKES 3 HOURS ON MY PowerBook
>> # AND CONSUMES ABOUT 600 MB OF MEMORY.
>> #
>> # I WOULD BE HAPPY ABOUT ANY HINT HOW TO IMPROVE THIS.
>>
>> # Pre-filter the big data set (more than 115,000 rows and 524 columns)
>> # for later class predictions.
>> # The big data set contains the same column names as the training set,
>> # but in a different order.
>>
>> input.file <- 'big_data_set.txt'
>> filtered.file <- 'big_data_set_filtered.txt'
>>
>> # Read header with first row
>> prediction.set <- read.csv(input.file, header=TRUE, skip=0, nrow=1)
>>
>> # Prepare column names by stripping the underscore and the number at the end
>> colnames(prediction.set) <- sub('_\\d+$', '', colnames(prediction.set), perl=TRUE)
>> prediction.set.header <- colnames(prediction.set)
>>
>> # Get the descriptor column names of the training data set, without the first (class) column
>> training.filtered.property.colnames <- colnames(training.filtered)[-1]
>>
>> # Keep only the non-constant training-set columns in the prediction set
>> prediction.set.filtered <- prediction.set[training.filtered.property.colnames]
>> dim(prediction.set.filtered) # => 1 row and 249 columns
>>
>> # Write the header and the first filtered row
>> write.csv(prediction.set.filtered, file=filtered.file, append=FALSE,
>>           col.names=training.filtered.property.colnames)
>>
>> blocksize <- 1000
>> for (lineid in (0:120)*blocksize) {
>>   cat('lineid: ', lineid, '\n')
>>
>>   # Read a block of data.
>>   # We have to add a dummy colname "x" to col.names, because the header is not read!
>>   prediction.set <- try(read.csv(input.file, header=FALSE,
>>                                  col.names=c('x', prediction.set.header),
>>                                  row.names=1,
>>                                  skip=lineid+2, nrow=blocksize))
>>   if (class(prediction.set) == "try-error") break
>>
>>   # Keep only the non-constant training-set columns in this block
>>   prediction.set.filtered <- prediction.set[training.filtered.property.colnames]
>>
>>   # Append the block
>>   # (I know this function is slow, but I couldn't figure out how to do it faster, so far.)
>>   write.table(prediction.set.filtered, file=filtered.file,
>>               append=TRUE, col.names=FALSE, sep=",")
>> }
>>
>> #-------------
>> # Now read in the filtered data set and save it for later use in classification
>> prediction.set.filtered <- read.csv(filtered.file, header=TRUE, row.names=1)
>> filtered.data <- 'prediction_set_filtered.Rdata'
>> save(prediction.set.filtered, file=filtered.data)
>>
>>
>>
>> I would be very happy about any hints on how to improve the code above!!!
>>
>> Best regards,
>>
>> Torsten
>>
>> ______________________________________________
>> R-help at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html