[R-sig-eco] Clustering large data

tyler tyler.smith at mail.mcgill.ca
Tue Oct 7 14:35:39 CEST 2008


"ONKELINX, Thierry" <Thierry.ONKELINX at inbo.be>
writes:

> Dear all,
>
> We have a problem with a large dataset that we want to cluster. The
> dataset is in a long format: 1154024 rows with presence data. Each row
> has the name of the species and the location. We have 1381 species and
> 6354 locations.
> The main problem is that we need the data in wide format (one row for
> each location, one column for each species) for the clustering
> algorithms. But the 6354 x 1381 dataframe is too big to fit into the
> memory. At least when we use cast from the reshape package to convert
> the dataframe from a long to a wide format.
>
> Are there any clustering tools available that can work with the data in
> a long format or with sparse matrices (only 13% of the matrix is
> non-zero)? If they work with sparse matrices: how to convert our dataset
> to a sparse matrix? Other suggestions are welcome.
>

A 6354 x 1381 matrix has about 8.8 million cells, roughly 70 MB as
doubles, so it should be well within your memory limit. I assume it's
the intermediate steps that are fouling you up. Maybe you can do it in
pieces:

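1. subset the original two-column data frame to include only the first 100 sites
2. convert this subset to wide form, keeping a column for every species
   so that all pieces have the same columns
3. repeat 63 times for the remaining subsets (64 pieces in total)
4. rbind the resulting matrices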
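
The steps above could look roughly like this in base R. It is only a
sketch: the data frame `long`, its column names `location` and `species`,
and the toy data are assumptions for illustration, and table() stands in
for cast from the reshape package.

```r
# Sketch of the piecewise long-to-wide conversion (base R only).
# 'long' is a hypothetical data frame with one row per presence record;
# a small toy data set stands in for the real 1154024-row one.
set.seed(1)
long <- unique(data.frame(
  location = sample(paste0("loc", 1:25), 200, replace = TRUE),
  species  = sample(paste0("sp", 1:8), 200, replace = TRUE),
  stringsAsFactors = FALSE
))

sites      <- sort(unique(long$location))
species    <- sort(unique(long$species))
chunk_size <- 10  # 100 in the steps above; smaller here for the toy data

# Split the site names into consecutive groups of at most chunk_size
chunks <- split(sites, ceiling(seq_along(sites) / chunk_size))

pieces <- lapply(chunks, function(s) {
  sub <- long[long$location %in% s, ]
  # Fixed factor levels give every piece the same species columns,
  # even for species that are absent from this subset of sites
  tab <- table(factor(sub$location, levels = s),
               factor(sub$species,  levels = species))
  # Store presence/absence as a plain 0/1 integer matrix
  matrix(as.integer(tab > 0), nrow = length(s),
         dimnames = list(s, species))
})

# Stack the pieces into one sites-by-species matrix
wide <- do.call(rbind, unname(pieces))
```

Storing the result as an integer matrix rather than a data frame also
halves the memory needed compared with doubles.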

Good luck,

Tyler

-- 
Watching a recorded television broadcast more than once will be illegal
under Bill C-61. 

http://www.michaelgeist.ca/content/view/3046/125/


