[R] Median computation

Wed May 23 22:28:36 CEST 2012

Yes, thanks Henrik. I neglected to mention that rowMedians could just
be plugged in instead of apply(..., 1, ...).

However, my main point is that that's probably not what matters, as
Benno points out. Maybe it's the data frames instead of the matrices,
but ... The process should execute in a few seconds even with my
inefficient code. So there's something fishy here.
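
To make the matrix-versus-data-frame point concrete, here is a minimal sketch, not the code from earlier in the thread. It assumes the data fit in a numeric matrix with repeated column names; the names `m`, `groups`, and `row_medians_by_name` are made up for illustration, and the data are shrunk to 1000 x 30. It keeps everything as a matrix, and falls back to apply() if matrixStats is not installed:

```r
# Hypothetical small stand-in for the 250,000 x 300 data set.
set.seed(1)
m <- matrix(sample(200, 1000 * 30, replace = TRUE), nrow = 1000)
colnames(m) <- sample(letters[1:5], 30, replace = TRUE)

# Row medians over each group of identically named columns.
row_medians_by_name <- function(m) {
  # Use matrixStats::rowMedians() if available, else plain apply().
  med <- if (requireNamespace("matrixStats", quietly = TRUE)) {
    function(x) matrixStats::rowMedians(x, na.rm = TRUE)
  } else {
    function(x) apply(x, 1, median, na.rm = TRUE)
  }
  # Column indices grouped by column name.
  groups <- split(seq_len(ncol(m)), colnames(m))
  # drop = FALSE keeps single-column groups as matrices.
  sapply(groups, function(idx) med(m[, idx, drop = FALSE]))
}

res <- row_medians_by_name(m)
dim(res)  # one row per input row, one column per distinct name
```

If the data start out as a data frame, a single as.matrix(df) before the loop would avoid converting a subset on every iteration. Whether that explains the hours-long run time is only a guess.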

--Bert

On Wed, May 23, 2012 at 10:39 AM, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
> Just adding a few cents to this:
>
> rowMedians(x) is roughly 4-10 times faster than apply(x, MARGIN=1,
> FUN=median) - at least on my local Windows 7 64bit tests.  You can do
> these simple benchmark runs yourself via the
> matrixStats/tests/rowMedians.R system test, cf. http://goo.gl/YCJed
> [R-forge].
>
> /Henrik
>
> On Wed, May 23, 2012 at 10:30 AM, Preeti <preeti at sci.utah.edu> wrote:
>> Hmm.. that is interesting... I did this on our server machine which has
>> about 200 cores. So memory is not an issue. Also, building the dataframe
>> takes a few minutes at most for me. My code is similar to yours except
>> that I create my dataframe from read.delim("filename") and
>> then drop the first column because it has characters. I don't know why it
>> takes so long on my machine.
>>
>> On Wed, May 23, 2012 at 11:26 AM, Benno Pütz <puetz at mpipsykl.mpg.de> wrote:
>>
>>> I wonder how you do this (or maybe on what kind of machine you execute it).
>>>
>>> I tried it out of curiosity and get
>>>
>>> > df = as.data.frame(lapply(1:300,function(x)sample(200,250000,T)))
>>> > colnames(df) = sample(letters[1:20],300,T)
>>> > system.time(dfmed<-lapply(unique(colnames(df)), function(x)
>>> + rowMedians(as.matrix(df[,colnames(df) == x]),na.rm=TRUE)))
>>>    user  system elapsed
>>>   5.680   0.952   7.171
>>>
>>> and those times are in seconds! The time-consuming part was building the
>>> data.frame, not the calculation.
>>>
>>> The only thing I noticed is that my R process claims some 1.4 GB of memory.
>>> That should not be a problem on any recent hardware, but my guess is that
>>> memory might be your problem, especially if you have other memory-hogging
>>> variables like this data frame lying around and you see severe swapping.
>>>
>>> Benno
>>>
>>> Hello Everybody,
>>>
>>> The code:
>>>
>>> dfmed<-lapply(unique(colnames(df)), function(x)
>>> rowMedians(as.matrix(df[,colnames(df) == x]),na.rm=TRUE))
>>>
>>> takes a really long time to execute (hours). Is there a faster way to do
>>> this?
>>>
>>> Thanks!
>>>
>>> On Tue, May 22, 2012 at 3:46 PM, Preeti <preeti at sci.utah.edu> wrote:
>>>
>>> Thanks Henrik! Here is the one-liner that I wrote:
>>>
>>> dfmed<-lapply(unique(colnames(df)), function(x)
>>> rowMedians(as.matrix(df[,colnames(df) == x]),na.rm=TRUE))
>>>
>>> Thanks again!
>>>
>>>
>>>
>>> On Tue, May 22, 2012 at 3:23 PM, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
>>>
>>> See rowMedians() of the matrixStats package for replacing apply(x,
>>> MARGIN=1, FUN=median). /Henrik
>>>
>>>
>>> On Tue, May 22, 2012 at 12:34 PM, Preeti <preeti at sci.utah.edu> wrote:
>>>
>>> Hi,
>>>
>>> I have a 250,000 by 300 matrix. I am trying to calculate the median of
>>> those columns (by row) with column names that are identical. I would like
>>> this to be efficient since apply(x,1,median) where x is created by choosing
>>> only those columns with same column name and looping on this is taking a
>>> really long time. Is there an efficient way to do this?
>>>
>>> Thanks!
>>>
>>>
>>>       [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> Benno Pütz
>>> Statistical Genetics
>>> MPI of Psychiatry
>>> Kraepelinstr. 2-10
>>> 80804 Munich, Germany
>>> T: ++49-(0)89-306 22 222
>>> F: ++49-(0)89-306 22 601
>>>

-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm