[R] long to wide on larger data set

Matthew Dowle mdowle at mdowle.plus.com
Mon Jul 12 17:50:33 CEST 2010


Juliet,

I've been corrected off list. I did not read properly that you are on 64bit.

The calculation should be :
    53860858 * 4 * 8 /1024^3 = 1.6GB
since pointers are 8 bytes on 64bit.
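As a quick sanity check, the same arithmetic in R (the actual footprint of the loaded data.frame can then be confirmed with object.size()):

```r
# Approximate payload of a 53,860,858-row, 4-column character data.frame
# on 64bit: each character column is a vector of 8-byte pointers into
# R's global string cache, so the data alone is about
53860858 * 4 * 8 / 1024^3   # ~1.6 GB
# (the cached unique strings add a little on top, but usually not much
#  when values repeat heavily)
```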

Also, data.table is an add-on package so I should have included :

   install.packages("data.table")
   require(data.table)

data.table is available on all platforms both 32bit and 64bit.


The rest still applies, and you might have a much easier time than I thought
since you are on 64bit. I was working on the basis of squeezing into 32bit.
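Putting the corrections together, the keyed approach from the quoted message becomes the following. This is an untested sketch, assuming testData is the long-format data.frame from your read.table call, with reshape() arguments taken from your original loop:

```r
install.packages("data.table")   # add-on package, available on 32bit and 64bit
require(data.table)

testData <- as.data.table(testData)
setkey(testData, V2)             # sorts by V2 so lookups use binary search

# one keyed lookup per sample, instead of a vector scan with ==
for (one_ind in unique(testData$V2)) {
  one_sample <- testData[one_ind, ]
  mywide <- reshape(one_sample, timevar = "V1", idvar = "V2",
                    direction = "wide")
  # write.table(mywide, file = "newdata.txt", append = TRUE,
  #             row.names = FALSE, col.names = FALSE, quote = FALSE)
}
```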

Matthew


"Matthew Dowle" <mdowle at mdowle.plus.com> wrote in message 
news:i1faj2$lvi$1 at dough.gmane.org...
>
> Hi Juliet,
>
> Thanks for the info.
>
> It is very slow because of the == in  testData[testData$V2==one_ind,]
>
> Why? Imagine someoone looks for 10 people in the phone directory. Would
> they search the entire phone directory for the first person's phone 
> number, starting
> on page 1, looking at every single name, even continuing to the end of the 
> book
> after they had found them ?  Then would they start again from page 1 for 
> the 2nd
> person, and then the 3rd, searching the entire phone directory from start 
> to finish
> for each and every person ?  That code using == does that.  Some of us 
> call
> that a 'vector scan' and is a common reason for R being percieved as slow.
>
> To do that more efficiently try this :
>
> testData = as.data.table(testData)
> setkey(testData,V2)    # sorts data by V2
> for (one_ind in mysamples) {
>   one_sample <- testData[one_ind,]
>   reshape(one_sample)
> }
>
> or just this :
>
> testData = as.data.table(testData)
> setkey(testData,V2)
> testData[,reshape(.SD,...), by=V2]
>
> That should solve the vector scanning problem, and get you on to the
> memory problems, which will need to be tackled next. Since the 4
> columns are character, the object size should be roughly :
>
>    53860858 * 4 * 4 /1024^3 = 0.8GB
>
> That is more promising to work with in 32bit so there is hope. [ That
> 0.8GB ignores the (likely small) size of the unique strings in the
> global string hash (depending on your data). ]
>
> It's likely that the as.data.table() fails with out of memory. That is
> not data.table but unique. There is a change in unique.c in R 2.12
> which makes unique more efficient, and since factor calls unique, it
> may be necessary to use R 2.12.
>
> If that still doesn't work, then there are several more tricks (and we
> will need further information), and there may be some tweaks needed to
> that code as I didn't test it, but I think it should be possible in
> 32bit using R 2.12.
>
> Is it an option to just keep it in long format and use a data.table ?
>
>   testData[, somecomplexrfunction(onecolumn, anothercolumn), by=list(V2) ]
>
> Why do you need to reshape from long to wide ?
>
> HTH,
> Matthew
>
>
>
> "Juliet Hannah" <juliet.hannah at gmail.com> wrote in message 
> news:AANLkTinYvgMrVdP0SvC-fYlGOGn2RO0OMNuGQbXx_H2b at mail.gmail.com...
> Hi Jim,
>
> Thanks for responding. Here is the info I should have included before.
> I should be able to access 4 GB.
>
>> str(myData)
> 'data.frame':   53860857 obs. of  4 variables:
> $ V1: chr  "200003" "200006" "200047" "200050" ...
> $ V2: chr  "cv0001" "cv0001" "cv0001" "cv0001" ...
> $ V3: chr  "A" "A" "A" "B" ...
> $ V4: chr  "B" "B" "A" "B" ...
>> sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> On Mon, Jul 12, 2010 at 7:54 AM, jim holtman <jholtman at gmail.com> wrote:
>> What is the configuration you are running on (OS, memory, etc.)? What
>> does your object consist of? Is it numeric, factors, etc.? Provide a
>> 'str' of it. If it is numeric, then the size of the object is
>> probably about 1.8GB. Doing the long to wide you will probably need
>> at least that much additional memory to hold the copy, if not more.
>> This would be impossible on a 32-bit version of R.
>>
>> On Mon, Jul 12, 2010 at 1:25 AM, Juliet Hannah <juliet.hannah at gmail.com> 
>> wrote:
>>> I have a data set that has 4 columns and 53860858 rows. I was able to
>>> read this into R with:
>>>
>>> cc <- rep("character",4)
>>> myData <- 
>>> read.table("myData.csv",header=FALSE,skip=1,colClasses=cc,nrow=53860858,sep=",")
>>>
>>>
>>> I need to reshape this data from long to wide. On a small data set the
>>> following lines work. But on the real data set, it didn't finish even
>>> when I took a sample of two (rows in new data). I didn't receive an
>>> error. I just stopped it because it was taking too long. Any
>>> suggestions for improvements? Thanks.
>>>
>>> # start example
>>> # i have commented out the write.table statement below
>>>
>>> testData <- read.table(textConnection("rs9999853,cv0084,A,A
>>> rs999986,cv0084,C,B
>>> rs9999883,cv0084,E,F
>>> rs9999853,cv0085,G,H
>>> rs999986,cv0085,I,J
>>> rs9999883,cv0085,K,L"),header=FALSE,sep=",")
>>> closeAllConnections()
>>>
>>> mysamples <- unique(testData$V2)
>>>
>>> for (one_ind in mysamples) {
>>> one_sample <- testData[testData$V2==one_ind,]
>>> mywide <- reshape(one_sample, timevar = "V1", idvar =
>>> "V2",direction = "wide")
>>> # write.table(mywide,file
>>> ="newdata.txt",append=TRUE,row.names=FALSE,col.names=FALSE,quote=FALSE)
>>> }
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>>
>
