[R] distance metrics
Jari Oksanen
jarioksa at sun3.oulu.fi
Tue Mar 13 09:00:56 CET 2007
On Tue Mar 13 00:21:22 CET 2007 Gavin Simpson wrote:
> On Mon, 2007-03-12 at 16:02 -0700, Sender wrote:
> > Thanks for the suggestion Christian. I'm trying to avoid expanding the dist
> > object to a matrix, since i'm usually working with microarray data which
> > produces a distance matrix of size 5000 x 5000.
> >
> > If i can keep it in its condensed form i think it will speed things up.
> >
> > Is my thinking correct?
>
> That will all depend on what you want to do with it...
>
> A dist object of that size is c. 100 MB in memory, and c. 200 MB in size
> as the full dissimilarity matrix - values from object.size(). Of course,
> you'll need a reasonable amount of free memory over and above this to do
> anything useful with the matrix as copies may be required during
> analysis/processing etc.
>
> Of course, a dist object is just a vector of observed distances with
> various attributes, so one can always use "[" for vectors, but I imagine
> that anything other than trivial operations will become fiddly,
> complicated and time consuming - if you have the memory, give the
> as.matrix option a try and see how it works for your specific problems.
>
One such piece of fiddling could be a function that returns the index into
the dist vector:
    idx <- function(i, j, Size)
    {
        a <- min(i, j)   # smaller of the two indices
        b <- max(i, j)   # larger of the two indices
        Size*(a-1) - a*(a-1)/2 + b - a
    }
where i and j are the desired matrix indices and Size is the number of
observations, or the attribute "Size" of a 'dist' object. (The function
does no checking: it will not give a valid result if i==j or
any(c(i,j) > Size), and it is open to some other potential abuse. It also
handles only one pair of indices at a time.)
You can refer to your individual distances from 5000 observations as:
dis[idx(2417, 1105, 5000)]
This is slower, of course, but avoids expanding to a matrix.
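
A quick way to convince yourself that the index formula is right is to
compare it against as.matrix() on a data set small enough that expanding is
harmless. The following is only a sketch: the 6 x 3 random data set and the
names x, full, n and ok are made up for the illustration.

```r
# Index into the condensed 'dist' vector (same formula as above)
idx <- function(i, j, Size)
{
    a <- min(i, j)
    b <- max(i, j)
    Size*(a-1) - a*(a-1)/2 + b - a
}

# Made-up example data: 6 observations, 3 variables
set.seed(1)
x <- matrix(rnorm(18), nrow = 6)
dis <- dist(x)
full <- as.matrix(dis)     # cheap here, only 6 x 6
n <- attr(dis, "Size")

# Compare every off-diagonal pair with the full matrix
ok <- TRUE
for (i in 1:n) {
    for (j in 1:n) {
        if (i != j) {
            ok <- ok && isTRUE(all.equal(dis[idx(i, j, n)], full[i, j]))
        }
    }
}
ok
```

If ok comes out TRUE, the condensed index agrees with the full matrix for
every pair in this small case.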
Perhaps a nicer and easier to use (but more opaque) way is to write the
function as:
    getidx <- function(dist, i, j)
    {
        dist[idx(i, j, attr(dist, "Size"))]
    }
which can be used with fewer bracket types: getidx(dis, 2417, 1105).
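
If you need several distances at once, for instance all distances from one
observation to the others, the same formula can be vectorised with pmin()
and pmax(). This is a hedged sketch and not something from the posts above:
the function name idxv, the argument checks and the small random data set
are my assumptions for the illustration.

```r
# Hypothetical vectorised variant of idx(): i and j may be vectors
# (recycled against each other); invalid input stops with an error.
idxv <- function(i, j, Size)
{
    stopifnot(all(i != j), all(i >= 1), all(j >= 1),
              all(i <= Size), all(j <= Size))
    a <- pmin(i, j)    # elementwise smaller index
    b <- pmax(i, j)    # elementwise larger index
    Size*(a-1) - a*(a-1)/2 + b - a
}

# Made-up example data: 6 observations, 3 variables
set.seed(1)
dis <- dist(matrix(rnorm(18), nrow = 6))
n <- attr(dis, "Size")

# Distances from observation 3 to all others, without as.matrix()
others <- setdiff(seq_len(n), 3)
row3 <- dis[idxv(3, others, n)]
row3
```

This still touches only the condensed vector, so for 5000 observations it
extracts 4999 values without building the 5000 x 5000 matrix.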
cheers, jari oksanen