[R-sig-eco] Clustering large data

ONKELINX, Thierry Thierry.ONKELINX at inbo.be
Tue Oct 7 12:12:28 CEST 2008


Dear all,

We have a problem with a large dataset that we want to cluster. The
dataset is in a long format: 1154024 rows with presence data. Each row
has the name of the species and the location. We have 1381 species and
6354 locations.
The main problem is that the clustering algorithms need the data in wide
format (one row for each location, one column for each species). The
6354 x 1381 data frame does not fit into memory, at least not when we
use cast from the reshape package to convert the data frame from long to
wide format.

Are there any clustering tools available that can work with the data in
long format or with sparse matrices (only 13% of the matrix is
non-zero)? If they work with sparse matrices, how do we convert our
dataset to a sparse matrix? Other suggestions are welcome.
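
To make the question concrete, here is a minimal sketch of the
long-to-sparse conversion being asked about. It assumes the Matrix
package is available and uses hypothetical column names (location,
species) on toy data, not the real dataset. The Jaccard distance and the
average-linkage clustering at the end are only one possible choice, not
something prescribed in this post.

```r
library(Matrix)

# Toy long-format presence data; column names are assumptions.
long <- data.frame(
  location = c("A", "A", "B", "C", "C", "C"),
  species  = c("sp1", "sp2", "sp2", "sp1", "sp3", "sp3")
)

# Drop duplicated records so each (location, species) pair occurs once;
# sparseMatrix() would otherwise sum the duplicated entries.
long <- unique(long)

loc <- factor(long$location)
sp  <- factor(long$species)

# Sparse location x species presence matrix: only the presences are stored.
m <- sparseMatrix(
  i = as.integer(loc),
  j = as.integer(sp),
  x = rep(1, nrow(long)),
  dims = c(nlevels(loc), nlevels(sp)),
  dimnames = list(levels(loc), levels(sp))
)

# Jaccard distance between locations, computed from the sparse matrix.
shared <- as.matrix(tcrossprod(m))        # species shared by each pair
n      <- rowSums(m)                      # species count per location
d      <- as.dist(1 - shared / (outer(n, n, "+") - shared))

# The distance matrix itself is dense (locations x locations).
fit <- hclust(d, method = "average")
```

Note that only the presence matrix is sparse here. The pairwise distance
matrix is still dense, so with 6354 locations it has to fit in memory on
its own.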

We are working with R 2.7.2 on WinXP with 2 GB RAM. --max-mem-size is
set to 2047M.

Thanks,

Thierry


----------------------------------------------------------------------------
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature
and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium 
tel. + 32 54/436 185
Thierry.Onkelinx at inbo.be 
www.inbo.be 

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey
