[R] Median computation

Henrik Bengtsson hb at biostat.ucsf.edu
Wed May 23 19:39:14 CEST 2012


Just adding a few cents to this:

rowMedians(x) is roughly 4-10 times faster than apply(x, MARGIN=1,
FUN=median), at least in my local tests on 64-bit Windows 7.  You can
run these simple benchmarks yourself via the
matrixStats/tests/rowMedians.R system test, cf. http://goo.gl/YCJed
[R-forge].
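
As a rough illustration of that comparison, here is a minimal sketch (not
the system test itself). It assumes the matrixStats package is installed;
the matrix dimensions and variable names are arbitrary choices for the
example, not taken from the thread.

```r
# Sketch only: assumes the matrixStats package is installed.
library(matrixStats)

set.seed(1)
# Hypothetical test matrix; the dimensions are arbitrary.
x <- matrix(rnorm(20000 * 30), nrow = 20000, ncol = 30)

# Time the row-wise median with apply() and with rowMedians().
t_apply <- system.time(m_apply <- apply(x, MARGIN = 1, FUN = median))
t_row   <- system.time(m_row   <- rowMedians(x))

# Both approaches should give the same medians.
stopifnot(isTRUE(all.equal(m_apply, m_row)))

# Elapsed wall-clock time in seconds for each approach.
print(c(apply = t_apply[["elapsed"]], rowMedians = t_row[["elapsed"]]))
```

Timings depend on the machine, the matrix shape and the matrixStats
version, so treat any speedup you observe as indicative only.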

/Henrik

On Wed, May 23, 2012 at 10:30 AM, Preeti <preeti at sci.utah.edu> wrote:
> Hmm, that is interesting. I did this on our server machine, which has
> about 200 cores, so memory is not an issue. Also, building the data frame
> takes a few minutes at most for me. My code is similar to yours, except
> that I create my data frame with read.delim("filename") and then drop the
> first column because it contains character data. I don't know why it
> takes so long on my machine.
>
> On Wed, May 23, 2012 at 11:26 AM, Benno Pütz <puetz at mpipsykl.mpg.de> wrote:
>
>> I wonder how you do this (or maybe on what kind of machine you execute it).
>>
>> I tried it out of curiosity and get
>>
>> > df = as.data.frame(lapply(1:300,function(x)sample(200,250000,T)))
>> > colnames(df) = sample(letters[1:20],300,T)
>> > system.time(dfmed<-lapply(unique(colnames(df)), function(x)
>> + rowMedians(as.matrix(df[,colnames(df) == x]),na.rm=TRUE)))
>>    user  system elapsed
>>   5.680   0.952   7.171
>>
>> and those times are in seconds! The time-consuming part was building the
>> data.frame, not the calculation.
>>
>> The only thing I noticed is that my R process claims some 1.4 GB of memory.
>> That should not be a problem on any recent hardware, but my guess is that
>> memory might be your problem, especially if you have other memory-hogging
>> variables like this data frame lying around and you see severe memory
>> swapping effects.
>>
>> Benno
>>
>> Hello Everybody,
>>
>> The code:
>>
>> dfmed<-lapply(unique(colnames(df)), function(x)
>> rowMedians(as.matrix(df[,colnames(df) == x]),na.rm=TRUE))
>>
>> takes a really long time to execute (hours). Is there a faster way to do
>> this?
>>
>> Thanks!
>>
>> On Tue, May 22, 2012 at 3:46 PM, Preeti <preeti at sci.utah.edu> wrote:
>>
>> Thanks Henrik! Here is the one-liner that I wrote:
>>
>> dfmed<-lapply(unique(colnames(df)), function(x)
>> rowMedians(as.matrix(df[,colnames(df) == x]),na.rm=TRUE))
>>
>> Thanks again!
>>
>> On Tue, May 22, 2012 at 3:23 PM, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
>>
>> See rowMedians() of the matrixStats package for replacing apply(x,
>> MARGIN=1, FUN=median). /Henrik
>>
>> On Tue, May 22, 2012 at 12:34 PM, Preeti <preeti at sci.utah.edu> wrote:
>>
>> Hi,
>>
>> I have a 250,000 by 300 matrix. I am trying to calculate the median of
>> those columns (by row) with column names that are identical. I would like
>> this to be efficient since apply(x,1,median) where x is created by choosing
>> only those columns with same column name and looping on this is taking a
>> really long time. Is there an efficient way to do this?
>>
>> Thanks!
>>
>>       [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>> Benno Pütz
>> Statistical Genetics
>> MPI of Psychiatry
>> Kraepelinstr. 2-10
>> 80804 Munich, Germany
>> T: ++49-(0)89-306 22 222
>> F: ++49-(0)89-306 22 601
>>
>>
>>
>>


