[R] How to transpose it in a fast way?
Martin Morgan
mtmorgan at fhcrc.org
Fri Mar 8 23:46:18 CET 2013
On 03/08/2013 06:01 AM, Jan van der Laan wrote:
>
> You could use the fact that scan reads the data rowwise, and the fact that
> arrays are stored columnwise:
>
> # generate a small example dataset
> exampl <- array(letters[1:25], dim=c(5,5))
> write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
>     sep="\t", quote=FALSE)
>
> # and read...
> d <- scan("example.dat", what=character())
> d <- array(d, dim=c(5,5))
>
> t(exampl) == d
>
>
> Although this is probably faster, it doesn't help with the large size. You could
> use the n option of scan to read chunks/blocks and feed those to, for example,
> an ff array (which you ideally have preallocated).
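A sketch of that chunked-scan-into-ff idea (assumptions: the ff package is
installed; since ff has no character mode, the 16 genotype strings are recoded as
small integers; "big.txt", the dimensions, and the chunk size are placeholders):

library(ff)
nr <- 60000L; nc <- 60000L
chunkRows <- 1000L                           # tune to available memory
code <- setNames(seq_len(16L),
                 outer(c("A","C","G","T"), c("A","C","G","T"), paste0))
a <- ff(vmode="integer", dim=c(nr, nc))      # preallocated, on disk
con <- file("big.txt", "r")
for (i in seq(1L, nr, by=chunkRows)) {
    v <- scan(con, character(), n=chunkRows * nc, quiet=TRUE)
    a[i:(i + chunkRows - 1L), ] <- matrix(code[v], ncol=nc, byrow=TRUE)
}
close(con)
## any slice, including the transpose, can then be read back by index,
## e.g., t(a[, 1:100])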
I think it's worth asking what the overall goal is; all we get from this
exercise is another large file that we can't easily manipulate in R!
But nothing like a little challenge. The idea, I think, would be to transpose in
chunks of rows, by scanning in some number of rows and writing their transpose to
a temporary file:
tpose1 <- function(fin, nrowPerChunk, ncol) {
    ## read the next 'nrowPerChunk' rows, row-wise
    v <- scan(fin, character(), nmax=ncol * nrowPerChunk)
    m <- matrix(v, ncol=ncol, byrow=TRUE)
    ## 'write' outputs column-wise, so this writes the chunk transposed
    fout <- tempfile()
    write(m, fout, nrow(m), append=TRUE)
    fout
}
Apparently the data are 60k x 60k, so we could easily read 10k rows of 60k columns
at a time from some file:

fl <- "big.txt"
ncol <- 60000L
nrowPerChunk <- 10000L
nChunks <- ncol / nrowPerChunk    # square matrix, so 60k / 10k = 6 chunks of rows
fin <- file(fl); open(fin)
fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
close(fin)
'fls' is now a vector of file paths, each containing a transposed slice of the
matrix. The next task is to splice these together. We could do this by taking a
slice of rows from each file, cbind'ing them together, and writing to an output
file:
splice <- function(fout, cons, nrowPerChunk, ncol) {
    ## read the next 'nrowPerChunk' rows from each transposed slice
    slices <- lapply(cons, function(con) {
        v <- scan(con, character(), nmax=nrowPerChunk * ncol)
        matrix(v, nrowPerChunk, byrow=TRUE)
    })
    ## bind the slices side by side and append, row-wise, to the output
    m <- do.call(cbind, slices)
    write(t(m), fout, ncol(m), append=TRUE)
}
We'd need to use open connections as inputs and output:

cons <- lapply(fls, file); for (con in cons) open(con)
fout <- file("big_transposed.txt"); open(fout, "w")
## each slice file has nrowPerChunk columns, hence the final argument
xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk, nrowPerChunk))
for (con in cons) close(con)
close(fout)
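As a sanity check, here's a small round trip on a toy 6 x 6 matrix using tpose1()
and splice() as above (a sketch only; it writes a couple of small files in the
working directory):

x <- matrix(as.character(1:36), 6, 6)
write(t(x), "big.txt", 6)                    # rows of 'x', six values per line
ncol <- 6L; nrowPerChunk <- 2L; nChunks <- 3L
fin <- file("big.txt"); open(fin)
fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
close(fin)
cons <- lapply(fls, file); for (con in cons) open(con)
fout <- file("big_transposed.txt"); open(fout, "w")
xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk, nrowPerChunk))
for (con in cons) close(con)
close(fout)
y <- matrix(scan("big_transposed.txt", character()), 6, byrow=TRUE)
stopifnot(identical(y, t(x)))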
As another approach, it looks like the data are genotypes. If they really consist
only of pairs of A, C, G, T, then two pairs, e.g., 'AA' and 'CT', could be encoded
as a single byte:
alf <- c("A", "C", "G", "T")
nms <- outer(alf, alf, paste0)                            # the 16 genotype strings
map <- outer(setNames(as.raw(0:15), nms),                 # low nibble: first pair
             setNames(as.raw(bitwShiftL(0:15, 4)), nms),  # high nibble: second pair
             "|")
with e.g.,
> map[matrix(c("AA", "CT"), ncol=2)]
[1] d0
This reduces the problem from representing the 60k x 60k array as a 3.6 billion
element character vector of 60k * 60k * 8 bytes (approx. 30 Gbytes) to a 60k x 30k
raw matrix of 1.8 billion elements (within R-2.15's vector length limit), approx.
1.8 Gbytes (probably usable on an 8 Gbyte laptop).
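To make the packing concrete, a sketch of encoding and decoding with 'map' and
'nms' from above ('g' is a made-up genotype vector of even length):

g <- c("AA", "CT", "GG", "TA")
packed <- map[matrix(g, ncol=2)]             # pairs g[i] with g[i + length(g)/2]
lo <- as.integer(packed) %% 16L              # low nibble: first of each pair
hi <- as.integer(packed) %/% 16L             # high nibble: second of each pair
stopifnot(identical(c(nms[lo + 1L], nms[hi + 1L]), g))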
Personally, I would probably put these data in a netcdf / HDF5 file. Perhaps I'd
use snpStats or GWASTools from Bioconductor, http://bioconductor.org.
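A sketch of the HDF5 route with rhdf5 from Bioconductor (the file, dataset, and
block sizes are made up; genotypes stored as the integer codes 1:16):

library(rhdf5)
h5createFile("geno.h5")
h5createDataset("geno.h5", "geno", dims=c(60000, 60000),
                storage.mode="integer", chunk=c(1000, 1000))
## write, say, the first 100 rows (random codes stand in for real data here)
block <- matrix(sample(16L, 100 * 60000, replace=TRUE), 100)
h5write(block, "geno.h5", "geno", index=list(1:100, NULL))
## a block of *columns* read back and transposed -- no whole-file rewrite needed
tblock <- t(h5read("geno.h5", "geno", index=list(NULL, 1:100)))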
Martin
>
> HTH,
>
> Jan
>
> peter dalgaard <pdalgd at gmail.com> wrote:
>
>> On Mar 7, 2013, at 01:18, Yao He wrote:
>>
>>> Dear all:
>>>
>>> I have a big data file of 60000 columns and 60000 rows like this:
>>>
>>> AA AC AA AA .......AT
>>> CC CC CT CT.......TC
>>> ..........................
>>> .........................
>>>
>>> I want to transpose it, and the output is a new file like this:
>>> AA CC ............
>>> AC CC............
>>> AA CT.............
>>> AA CT.........
>>> ....................
>>> ....................
>>> AT TC.............
>>>
>>> The key point is that I can't read it into R with read.table() because the
>>> data is too large, so I tried this:
>>>
>>> c <- file("silygenotype.txt", "r")
>>> geno_t <- list()
>>> repeat {
>>>     line <- readLines(c, n=1)
>>>     if (length(line) == 0) break   # end of file
>>>     line <- unlist(strsplit(line, "\t"))
>>>     geno_t <- cbind(geno_t, line)
>>> }
>>> write.table(geno_t, "xxx.txt")
>>>
>>> It works but it is too slow; how to optimize it?
>>
>>
>> As others have pointed out, that's a lot of data!
>>
>> You seem to have the right idea: if you read the columns line by line, there is
>> nothing to transpose. A couple of points, though:
>>
>> - The cbind() is a potential performance hit, since it copies the list every
>>   time around. Instead, preallocate with geno_t <- vector("list", 60000) and
>>   then assign geno_t[[i]] <- <etc>
>>
>> - You might use scan() instead of readLines() + strsplit() (see the sketch
>>   below)
>>
>> - Perhaps consider the data type, as you seem to be reading strings with 16
>>   possible values (I suspect that R already optimizes string storage to make
>>   this point moot, though.)
>>
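Putting Peter's first two points together, a minimal sketch (the file name is from
the original post; note that scan() here splits on any whitespace, not tabs only):

con <- file("silygenotype.txt", "r")
geno_t <- vector("list", 60000)       # preallocated: no copying on each update
i <- 0L
repeat {
    line <- scan(con, character(), nlines=1, quiet=TRUE)
    if (length(line) == 0L) break     # end of file
    i <- i + 1L
    geno_t[[i]] <- line
}
close(con)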
>> --
>> Peter Dalgaard, Professor
>> Center for Statistics, Copenhagen Business School
>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>> Phone: (+45)38153501
>> Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
>>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793