[R] help using scan on large matrix (caveats to what has been discussed before)

baptiste Auguié baptiste.auguie at googlemail.com
Thu Aug 12 14:45:23 CEST 2010


Hi,

I don't know if this will be useful to you, but I recently wrote a small function that reads a large data file like yours in a number of steps, with the option of saving each intermediate block as an .rda file. It is based on read.table, which is not as efficient as the lower-level scan(), but it might be good enough:

file <- 'test.txt'
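## write.table(matrix(rnorm(1e6*14), ncol=14), file=file, row.names=FALSE,
##             col.names=FALSE)

## count the lines with wc (Unix-alikes only); reading from stdin keeps
## the file name, and any digits in it, out of the output
n <- as.numeric(system(paste("wc -l <", shQuote(file)), intern=TRUE))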
n

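blocks <- function(n=18, size=5){
  ## sizes of consecutive blocks covering n rows: full blocks, then the remainder
  ## (rep() also behaves when n < size, where replicate() returns an empty list)
  res <- rep(size, n %/% size)
  if(n %% size) res <- c(res, n %% size)
  if(sum(res) != n) stop("block sizes do not add up to n")
  res
}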
## blocks(1003, 500)


readBlocks <- function(file, nbk=1e5, out="tmp", save.inter=TRUE, 
                       classes= c("numeric", "numeric", rep("NULL", 6),
                         "numeric", "numeric", rep("NULL", 4))){
  
  ## number of lines in the file (Unix-alikes only; wc reads from stdin)
  n <- as.numeric(system(paste("wc -l <", shQuote(file)), intern=TRUE))

  ncols <- length(grep("NULL", classes, invert=TRUE))
  results <- matrix(0, nrow=n, ncol=ncols)
  Nb <- blocks(n, nbk)
  skip <- c(0, cumsum(Nb))
  for(ii in seq_along(Nb)){
    d <- read.table(file, colClasses = classes, nrows=Nb[ii], skip=skip[ii], comment.char = "")
    if(save.inter){
      save(d, file=paste(out, ".", ii, ".rda", sep=""))
      }
    print(ii)
    results[seq(1+skip[ii], skip[ii]+Nb[ii]), ] <- as.matrix(d)
    rm(d) ; gc() 
  }
  save(results, file=paste(out, ".rda", sep=""))
  invisible(results)
}

## test <- readBlocks(file)
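
A possible variant, in case the repeated skip= becomes slow: read.table(skip=...) has to re-read the file from the top for every block, whereas an open connection keeps its position between reads. Below is a minimal sketch along those lines using scan(). The names readChunks, path, ncol and chunk are made up for this illustration, and it assumes a purely numeric, whitespace-separated file with a known number of columns (it does not drop columns the way readBlocks does).

```r
## Sketch: read a whitespace-delimited numeric file in blocks through one
## open connection, so each scan() resumes where the previous one stopped.
## readChunks, path, ncol and chunk are hypothetical names.
readChunks <- function(path, ncol, chunk = 1e5) {
  con <- file(path, open = "r")
  on.exit(close(con))
  pieces <- list()
  repeat {
    ## read at most `chunk` lines from the current position
    x <- scan(con, what = numeric(), nlines = chunk, quiet = TRUE)
    if (length(x) == 0L) break          # nothing left to read
    pieces[[length(pieces) + 1L]] <- matrix(x, ncol = ncol, byrow = TRUE)
  }
  do.call(rbind, pieces)
}

## small self-contained check on a temporary 7 x 5 file, read 3 lines at a time
tmp <- tempfile(fileext = ".txt")
write.table(matrix(1:35, ncol = 5), file = tmp,
            row.names = FALSE, col.names = FALSE)
res <- readChunks(tmp, ncol = 5, chunk = 3)
dim(res)
unlink(tmp)
```

Collecting the blocks in a list and binding them once at the end avoids having to know the number of rows in advance, at the cost of holding the pieces and the final matrix in memory together for a moment.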

HTH,

baptiste



On Aug 12, 2010, at 1:34 PM, Martin Tomko wrote:

> Hi Peter,
> thank you for your reply. I still cannot get it to work.
> I have modified your code as follows:
> rows<-length(R)
> cols <- max(unlist(lapply(R,function(x) length(unlist(gregexpr(" ",x,fixed=TRUE,useBytes=TRUE))))))
> c<-scan(file=f,what=rep(c(list(NULL),rep(list(0L),cols-1),rows-1)), skip=1)
> m<-matrix(c, nrow = rows-1, ncol=cols+1,byrow=TRUE);
> 
> the list c seems ok, with all the values I would expect. Still, length(c) gives me a value = cols+1, which I find odd (I would expect =cols).
> I then repeated it rows-1 times (to account for the header row). The values seem ok.
> Anyway, I tried to construct the matrix, but when I print it, the values are odd:
> > m[1:10,1:10]
>      [,1] [,2]       [,3]       [,4]       [,5]       [,6]       [,7]
> [1,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
> [2,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
> [3,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
> [4,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
> [5,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
> [6,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
> [7,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
> [8,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
> [9,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
> [10,] NULL Integer,15 Integer,15 Integer,15 Integer,15 Integer,15 Integer,15
> ....
> 
> Any idea where the values are gone?
> Thanks
> Martin
> 
> Hence, I filled it into the matrix of dimensions
> 
> On 8/12/2010 12:24 PM, peter dalgaard wrote:
>> On Aug 12, 2010, at 11:30 AM, Martin Tomko wrote:
>> 
>>   
>>> c<-scan(file=f,what=list(c("",(rep(integer(0),cols)))), skip=1)
>>> m<-matrix(c, nrow = rows, ncol=cols,byrow=TRUE);
>>> 
>>> for some reason I end up with a character matrix, which I don't want. Is this the proper way to skip the first column (this is not documented anywhere - how does one skip the first column in scan???). is my way of specifying "integer(0)" correct?
>>>     
>> No. Well, integer(0) is just superfluous where 0L would do, since scan only looks at the types not the contents, but more importantly, what= wants a list of as many elements as there are columns and you gave it
>> 
>>   
>>> list(c("",(rep(integer(0),5))))
>>>     
>> [[1]]
>> [1] ""
>> 
>> I think what you actually meant was
>> 
>> c(list(NULL),rep(list(0L),5))
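
To make this what= specification concrete, here is a minimal sketch on a made-up temporary file (a header line, then a row label followed by five integers). The file contents and the names tmp, ncols, cols and m are assumptions for the illustration only. It also shows one way to get from the list that scan() returns to an ordinary integer matrix: binding the columns with cbind() rather than passing the list to matrix(), which would produce a matrix whose cells are list elements.

```r
## Demo file: a header line, then a row label followed by 5 integers.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id a b c d e",
             "r1 1 2 3 4 5",
             "r2 6 7 8 9 10",
             "r3 11 12 13 14 15"), tmp)

ncols <- 5L  # number of integer columns to keep (assumed known here)

## One list element per field: NULL skips the field, 0L reads it as integer.
cols <- scan(tmp, what = c(list(NULL), rep(list(0L), ncols)),
             skip = 1, quiet = TRUE)

length(cols)   # ncols + 1: the skipped field stays in the list as NULL

## cbind() ignores the NULL entry and binds the remaining integer vectors
## as the columns of a plain integer matrix.
m <- do.call(cbind, cols)
dim(m)
unlink(tmp)
```

Since scan() returns one vector per field, each list element is already a full column, so no byrow=TRUE is needed.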
>> 
>> 
>> 
>>   
>>> And finally - would any sparse matrix package be more appropriate, and can I use a sparse matrix for the image() function producing typical heatmaps? I have seen that some sparse matrix packages produce different looking outputs, which would not be appropriate.
>>> 
>>> Thanks
>>> Martin
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>     
>>   
> 
> 
> -- 
> Martin Tomko
> Postdoctoral Research Assistant
> 
> Geographic Information Systems Division
> Department of Geography
> University of Zurich - Irchel
> Winterthurerstr. 190
> CH-8057 Zurich, Switzerland
> 
> email: 	martin.tomko at geo.uzh.ch
> site:	http://www.geo.uzh.ch/~mtomko
> mob: 	+41-788 629 558
> tel: 	+41-44-6355256
> fax: 	+41-44-6356848
> 


