[R] Median computation

Wed May 23 19:17:35 CEST 2012

Assuming your original matrix IS a matrix, call it yourmat, and not a
data frame (whose columns *must* have unique names if you haven't
messed with the check.names default), then maybe:

#### UNTESTED!!! ###
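cn <- dimnames(yourmat)[[2]]   # all column names, duplicates included
thenames <- unique(cn)
ans <- lapply(thenames, function(nm) {
   # compare against ALL column names; drop=FALSE keeps a lone column a matrix
   apply(yourmat[, cn == nm, drop=FALSE], 1, median, na.rm=TRUE)
   })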

If I got it right, ans should be a list of vectors, one per unique
column name, each of which gives the rowwise medians of the columns with
that name. This can be combined into a new matrix, e.g. by
do.call(cbind, ans), if you like (the list is unnamed, so set
names(ans) <- thenames first if you want the columns labelled). You
could get a matrix answer directly if you use sapply or, maybe faster,
vapply instead of lapply, but I find lists simpler to begin with.
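
For the vapply route, a minimal sketch on made-up data might look like
the following. The toy matrix and its column names are invented for
illustration, and cn and medmat are just names I picked. FUN.VALUE
tells vapply to expect one numeric value per row from each group.

```r
# Toy matrix with duplicated column names (made-up data, illustration only).
# matrix() fills column by column, so the columns are a, b, a, b.
yourmat <- matrix(c(1, 2, 3,
                    5, 6, 7,
                    9, NA, 2,
                    4, 4, 4),
                  nrow = 3,
                  dimnames = list(NULL, c("a", "b", "a", "b")))

cn <- colnames(yourmat)      # all column names, duplicates included
thenames <- unique(cn)

# One column of rowwise medians per unique name, returned as a matrix.
medmat <- vapply(thenames, function(nm) {
  # drop = FALSE keeps a single matching column as a one-column matrix
  apply(yourmat[, cn == nm, drop = FALSE], 1, median, na.rm = TRUE)
}, FUN.VALUE = numeric(nrow(yourmat)))
```

Because thenames is a character vector, vapply uses it to label the
columns of medmat, so no separate names() step is needed here.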

I believe this should be reasonably fast. Converting to and from data
frames and operating on them slows things down a lot, because data
frames are very general structures that carry a lot of bookkeeping
overhead every time they are subset or modified. Matrices do not.
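
If the data do start out as a data frame, one way to act on this is to
convert once, up front, and then subset only the matrix inside the
loop. A rough sketch, using a made-up data frame df built with
check.names=FALSE so that the duplicate names survive (m and res are
names chosen for illustration). I have not timed this on data of the
size in question.

```r
# Made-up data frame with duplicated column names (illustration only).
df <- data.frame(a = c(1, 2, 3), b = c(5, 6, 7),
                 a = c(9, NA, 2), b = c(4, 4, 4),
                 check.names = FALSE)

m  <- as.matrix(df)          # a single conversion, outside the loop
cn <- colnames(m)            # all column names, duplicates included

# Subset the matrix, not the data frame, for each unique name.
res <- lapply(unique(cn), function(nm) {
  apply(m[, cn == nm, drop = FALSE], 1, median, na.rm = TRUE)
})
names(res) <- unique(cn)     # label the list elements by column name
```

The point of the design is only that as.matrix() and the data frame
subsetting method are each called once rather than once per group.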

-- Bert

On Wed, May 23, 2012 at 9:46 AM, Preeti <preeti at sci.utah.edu> wrote:
> Hello Everybody,
>
> The code:
>
> dfmed<-lapply(unique(colnames(df)), function(x)
> rowMedians(as.matrix(df[,colnames(df) == x]),na.rm=TRUE))
>
> takes a really long time to execute (hours). Is there a faster way to do
> this?
>
> Thanks!
>
> On Tue, May 22, 2012 at 3:46 PM, Preeti <preeti at sci.utah.edu> wrote:
>
>> Thanks Henrik! Here is the one-liner that I wrote:
>>
>> dfmed<-lapply(unique(colnames(df)), function(x)
>> rowMedians(as.matrix(df[,colnames(df) == x]),na.rm=TRUE))
>>
>> Thanks again!
>>
>>
>> On Tue, May 22, 2012 at 3:23 PM, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
>>
>>> See rowMedians() of the matrixStats package for replacing apply(x,
>>> MARGIN=1, FUN=median). /Henrik
>>>
>>> On Tue, May 22, 2012 at 12:34 PM, Preeti <preeti at sci.utah.edu> wrote:
>>> > Hi,
>>> >
>>> > I have a 250,000 by 300 matrix. I am trying to calculate the median of
>>> > those columns (by row) with column names that are identical. I would like
>>> > this to be efficient since apply(x,1,median) where x is created by choosing
>>> > only those columns with same column name and looping on this is taking a
>>> > really long time. Is there an efficient way to do this?
>>> >
>>> > Thanks!
>>> >
>>> >        [[alternative HTML version deleted]]
>>> >
>>> > ______________________________________________
>>> > R-help at r-project.org mailing list
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.

-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm