[R] long to wide on larger data set

Juliet Hannah juliet.hannah at gmail.com
Wed Jul 14 19:39:13 CEST 2010


Hi Matthew and Jim,

Thanks for all the suggestions as always. Matthew's post was very
informative in showing how things can be done much more efficiently
with data.table. I haven't had a chance to finish the reshaping
because my group was in a rush,
and someone else decided to do it in Perl. However, I did get a chance
to use the data.table package for the first time. In some preliminary
steps, I had to do some subsetting and recoding and this was superfast
with data.table. The tutorials were helpful in getting me up to speed.
Over the next few days
I plan to carry out the reshaping as a learning exercise so I'll be
ready next time. I'll post my results afterwards.
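
As a toy example of the kind of keyed subsetting that was so fast for me
(made-up data, not my real file):

library(data.table)
toy <- data.frame(V2 = c("cv0001", "cv0001", "cv0002"),
                  V3 = c("A", "B", "A"), stringsAsFactors = FALSE)
toy <- as.data.table(toy)
setkey(toy, V2)     # sort once by V2
toy["cv0001", ]     # keyed binary-search subset, instead of toy[toy$V2 == "cv0001", ]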

Thanks,

Juliet

On Mon, Jul 12, 2010 at 11:50 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Juliet,
>
> I've been corrected off list. I did not read properly that you are on 64bit.
>
> The calculation should be :
>    53860858 * 4 * 8 /1024^3 = 1.6GB
> since pointers are 8 bytes on 64bit.
>
> Also, data.table is an add-on package so I should have included :
>
>   install.packages("data.table")
>   require(data.table)
>
> data.table is available on all platforms both 32bit and 64bit.
>
> Please forgive mistakes: 'someoone' should be 'someone', 'percieved'
> should be 'perceived', and 'testDate' should be 'testData' at the end.
>
> The rest still applies, and you might have a much easier time than I thought
> since you are on 64bit. I was working on the basis of squeezing into 32bit.
>
> Matthew
>
>
> "Matthew Dowle" <mdowle at mdowle.plus.com> wrote in message
> news:i1faj2$lvi$1 at dough.gmane.org...
>>
>> Hi Juliet,
>>
>> Thanks for the info.
>>
>> It is very slow because of the == in  testData[testData$V2==one_ind,]
>>
>> Why? Imagine someone looks for 10 people in the phone directory. Would
>> they search the entire phone directory for the first person's phone
>> number, starting on page 1, looking at every single name, even continuing
>> to the end of the book after they had found them ?  Then would they start
>> again from page 1 for the 2nd person, and then the 3rd, searching the
>> entire phone directory from start to finish for each and every person ?
>> That is what the code using == does.  Some of us call that a 'vector
>> scan', and it is a common reason for R being perceived as slow.
>>
>> To do that more efficiently try this :
>>
>> testData = as.data.table(testData)
>> setkey(testData,V2)    # sorts data by V2
>> for (one_ind in mysamples) {
>>   one_sample <- testData[one_ind,]   # keyed binary-search lookup, not a vector scan
>>   reshape(one_sample, timevar = "V1", idvar = "V2", direction = "wide")
>> }
>>
>> or just this :
>>
>> testData = as.data.table(testData)
>> setkey(testData,V2)
>> testData[,reshape(.SD,...), by=V2]
>>
>> That should solve the vector scanning problem, and get you on to the
>> memory problems which will need to be tackled. Since the 4 columns are
>> character, the object size should be roughly :
>>
>>    53860858 * 4 * 4 /1024^3 = 0.8GB
>>
>> That is more promising to work with in 32bit, so there is hope. [ That
>> 0.8GB ignores the (likely small) size of the unique strings in the global
>> string hash (depending on your data). ]
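>>
>> To avoid also holding the combined wide table in R, the loop above could
>> append each wide row to a file as it goes, as in your original post
>> (untested sketch) :
>>
>> for (one_ind in mysamples) {
>>   one_sample <- as.data.frame(testData[one_ind,])  # keyed lookup, back to data.frame for reshape
>>   mywide <- reshape(one_sample, timevar = "V1", idvar = "V2", direction = "wide")
>>   write.table(mywide, file = "newdata.txt", append = TRUE,
>>               row.names = FALSE, col.names = FALSE, quote = FALSE)
>> }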
>>
>> It's likely that the as.data.table() fails with an out-of-memory error.
>> That is not data.table but unique. There is a change in unique.c in R 2.12
>> which makes unique more efficient, and since factor calls unique, it may
>> be necessary to use R 2.12.
>>
>> If that still doesn't work, then there are several more tricks (and we
>> will need
>> further information), and there may be some tweaks needed to that code as
>> I
>> didn't test it,  but I think it should be possible in 32bit using R 2.12.
>>
>> Is it an option to just keep it in long format and use a data.table ?
>>
>>   testData[, somecomplexrfunction(onecolumn, anothercolumn), by=list(V2) ]
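>>
>> For example (illustrative only; somecomplexrfunction above is just a
>> placeholder), counting the distinct genotypes per sample stays entirely in
>> long format :
>>
>>   testData[, list(n.genotypes = length(unique(paste(V3, V4)))), by=list(V2)]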
>>
>> Why do you need to reshape from long to wide ?
>>
>> HTH,
>> Matthew
>>
>>
>>
>> "Juliet Hannah" <juliet.hannah at gmail.com> wrote in message
>> news:AANLkTinYvgMrVdP0SvC-fYlGOGn2RO0OMNuGQbXx_H2b at mail.gmail.com...
>> Hi Jim,
>>
>> Thanks for responding. Here is the info I should have included before.
>> I should be able to access 4 GB.
>>
>>> str(myData)
>> 'data.frame':   53860857 obs. of  4 variables:
>> $ V1: chr  "200003" "200006" "200047" "200050" ...
>> $ V2: chr  "cv0001" "cv0001" "cv0001" "cv0001" ...
>> $ V3: chr  "A" "A" "A" "B" ...
>> $ V4: chr  "B" "B" "A" "B" ...
>>> sessionInfo()
>> R version 2.11.0 (2010-04-22)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> On Mon, Jul 12, 2010 at 7:54 AM, jim holtman <jholtman at gmail.com> wrote:
>>> What is the configuration you are running on (OS, memory, etc.)? What
>>> does your object consist of? Is it numeric, factors, etc.? Provide a
>>> 'str' of it. If it is numeric, then the size of the object is
>>> probably about 1.8GB. Doing the long-to-wide conversion, you will probably
>>> need at least that much additional memory to hold the copy, if not more.
>>> This would be impossible on a 32-bit version of R.
>>>
>>> On Mon, Jul 12, 2010 at 1:25 AM, Juliet Hannah <juliet.hannah at gmail.com>
>>> wrote:
>>>> I have a data set that has 4 columns and 53860858 rows. I was able to
>>>> read this into R with:
>>>>
>>>> cc <- rep("character",4)
>>>> myData <-
>>>> read.table("myData.csv",header=FALSE,skip=1,colClasses=cc,nrow=53860858,sep=",")
>>>>
>>>>
>>>> I need to reshape this data from long to wide. On a small data set the
>>>> following lines work. But on the real data set, it didn't finish even
>>>> when I took a sample of only two individuals (two rows in the new wide
>>>> data). I didn't receive an error; I just stopped it because it was taking
>>>> too long. Any suggestions for improvements? Thanks.
>>>>
>>>> # start example
>>>> # I have commented out the write.table statement below
>>>>
>>>> testData <- read.table(textConnection("rs9999853,cv0084,A,A
>>>> rs999986,cv0084,C,B
>>>> rs9999883,cv0084,E,F
>>>> rs9999853,cv0085,G,H
>>>> rs999986,cv0085,I,J
>>>> rs9999883,cv0085,K,L"),header=FALSE,sep=",")
>>>> closeAllConnections()
>>>>
>>>> mysamples <- unique(testData$V2)
>>>>
>>>> for (one_ind in mysamples) {
>>>>   one_sample <- testData[testData$V2 == one_ind, ]
>>>>   mywide <- reshape(one_sample, timevar = "V1", idvar = "V2",
>>>>                     direction = "wide")
>>>>   # write.table(mywide, file = "newdata.txt", append = TRUE,
>>>>   #             row.names = FALSE, col.names = FALSE, quote = FALSE)
>>>> }
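>>>>
>>>> # For reference, the intended output is one wide row per sample, roughly
>>>> # like this (exact column order may differ):
>>>> #     V2 V3.rs9999853 V4.rs9999853 V3.rs999986 V4.rs999986 V3.rs9999883 V4.rs9999883
>>>> # cv0084            A            A           C           B            E            F
>>>> # cv0085            G            H           I           J            K            L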
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>>
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


