[R-sig-eco] Clustering large data

Christian A. Parker cparker at pdx.edu
Tue Oct 7 20:23:15 CEST 2008


That's great, thanks. I always like it when someone can suggest a better 
or cleaner way to do something in code.

-Chris

Farrar.David at epamail.epa.gov wrote:
> Thanks for the illustration of xtabs. 
> 
> A quibble: Doesn't the following work, substituting as.matrix() for 
> matrix()? 
> (Does seem to conserve the dimensions and dimension names.) 
> 
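> matrify <- function(datatable, formula = units ~ site + spp, relativize = FALSE) {
>   # cross-tabulate; duplicate site/spp rows are summed
>   tbl <- xtabs(formula, data = datatable)
>   mx <- as.matrix(tbl)
>   # optionally convert each row to proportions of its row total
>   if (relativize) mx <- mx / rowSums(mx)
>   return(mx)
> }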
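
One way to check the quibble above is to run both conversions on a small table and compare the results. The sketch below uses a made-up data frame (`toy` and its values are assumptions for illustration, not data from this thread). It prints the classes of the two results instead of assuming them, since as.matrix() may return the object with its "xtabs"/"table" class still attached.

```r
# Toy long-format data (hypothetical values, for illustration only)
toy <- data.frame(
  site  = c("A", "A", "B", "B", "C"),
  spp   = c("sp1", "sp2", "sp1", "sp3", "sp2"),
  units = c(2, 1, 4, 3, 5)
)

tbl <- xtabs(units ~ site + spp, data = toy)

# Variant 1: rebuild the matrix and copy the names by hand
m1 <- matrix(tbl, ncol = ncol(tbl))
rownames(m1) <- rownames(tbl)
colnames(m1) <- colnames(tbl)

# Variant 2: let as.matrix() carry dimensions and names over
m2 <- as.matrix(tbl)

# Compare shape, names and cell values of the two results
same_dim    <- identical(dim(m1), dim(m2))
same_names  <- identical(rownames(m1), rownames(m2)) &&
               identical(colnames(m1), colnames(m2))
same_values <- all(as.vector(m1) == as.vector(m2))

print(c(dim = same_dim, names = same_names, values = same_values))
print(class(m1))
print(class(m2))  # inspect this: it may still carry the table classes
```

If a bare matrix is needed downstream, the printed classes show whether an extra conversion step is worth adding.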
> 
> 
> 
> 
> From: "Christian A. Parker" <cparker at pdx.edu>
> Sent by: r-sig-ecology-bounces at r-project.org
> Date: 10/07/2008 11:04 AM
> To: "ONKELINX, Thierry" <Thierry.ONKELINX at inbo.be>
> cc: r-sig-ecology at r-project.org
> Subject: Re: [R-sig-eco] Clustering large data
> 
> This method for converting long to wide format seems to work well with 
> pretty large datasets and it uses only base functions.
> 
> # this function will return a site*species matrix
> # based on the formula variable. Data does not need 
> # to be grouped, the xtabs function will take care of
> # summing any rows that are equal according to the 
> # formula.
> ### units are the cell value
> ### site is the row value
> ### spp is the column value
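> matrify <- function(datatable, formula = units ~ site + spp, relativize = FALSE) {
>   # cross-tabulate; duplicate site/spp rows are summed
>   tbl <- xtabs(formula, data = datatable)
>   # rebuild as a plain matrix and restore the row and column names
>   mx <- matrix(tbl, ncol = ncol(tbl))
>   colnames(mx) <- colnames(tbl)
>   rownames(mx) <- rownames(tbl)
>   # optionally convert each row to proportions of its row total
>   if (relativize) mx <- mx / rowSums(mx)
>   return(mx)
> }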
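
A small usage sketch of the function above, on invented data (the `obs` data frame, its column values and the species names are assumptions for illustration). It shows the summing of repeated site/species rows, the relativize option, and a one-sided formula for presence records that have no units column, in which case xtabs counts rows.

```r
# Same logic as the matrify() posted above, repeated so this runs on its own
matrify <- function(datatable, formula = units ~ site + spp, relativize = FALSE) {
  tbl <- xtabs(formula, data = datatable)
  mx <- matrix(tbl, ncol = ncol(tbl))
  colnames(mx) <- colnames(tbl)
  rownames(mx) <- rownames(tbl)
  if (relativize) mx <- mx / rowSums(mx)
  mx
}

# Hypothetical long-format data; s1/oak appears twice on purpose
obs <- data.frame(
  site  = c("s1", "s1", "s1", "s2", "s2"),
  spp   = c("oak", "oak", "elm", "elm", "ash"),
  units = c(1, 2, 3, 4, 4)
)

counts <- matrify(obs)                         # the two s1/oak rows are summed
props  <- matrify(obs, relativize = TRUE)      # each row divided by its total
nrecs  <- matrify(obs, formula = ~ site + spp) # no units: counts the records

print(counts)
print(props)
print(nrecs)
```

Note that relativize divides by the row total, so a site with no records at all would produce NaN in its row.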
> 
> 
> 
> ONKELINX, Thierry wrote:
> 
>>Dear all,
>>
>>We have a problem with a large dataset that we want to cluster. The
>>dataset is in a long format: 1154024 rows with presence data. Each row
>>has the name of the species and the location. We have 1381 species and
>>6354 locations.
>>The main problem is that we need the data in wide format (one row for
>>each location, one column for each species) for the clustering
>>algorithms. But the 6354 x 1381 dataframe is too big to fit into the
>>memory. At least when we use cast from the reshape package to convert
>>the dataframe from a long to a wide format.
>>
>>Are there any clustering tools available that can work with the data in
>>a long format or with sparse matrices (only 13% of the matrix is
>>non-zero)? If they work with sparse matrices, how do we convert our dataset
>>to a sparse matrix? Other suggestions are welcome.
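
On the conversion part of the question, a minimal sketch using sparseMatrix() from the Matrix package, assuming that package is available (it normally ships with R). The `pres` data frame and its column names are invented stand-ins for the real long-format table. Whether a given clustering function accepts the resulting object is a separate matter that this sketch does not settle.

```r
library(Matrix)  # recommended package, normally installed with R

# Toy presence records in long format (hypothetical; one row per record)
pres <- data.frame(
  location = c("L1", "L1", "L2", "L3", "L3", "L3"),
  species  = c("a", "b", "b", "a", "c", "c")  # the L3/c record is repeated
)

# Drop repeated records so each location/species pair appears once
pres <- unique(pres)

loc <- factor(pres$location)
sp  <- factor(pres$species)

# Only the non-zero cells are stored: row index, column index, value
sm <- sparseMatrix(
  i = as.integer(loc),
  j = as.integer(sp),
  x = rep(1, nrow(pres)),
  dims = c(nlevels(loc), nlevels(sp)),
  dimnames = list(levels(loc), levels(sp))
)

print(sm)
print(nnzero(sm) / prod(dim(sm)))  # fraction of cells that are non-zero
```

The factor codes serve as row and column indices, so the wide data frame is never built in memory.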
>>
>>We are working with R 2.7.2 on WinXP with 2 GB RAM. --max-mem-size is
>>set to 2047M.
>>
>>Thanks,
>>
>>Thierry
>>
>>
>>----------------------------------------------------------------------------
>>ir. Thierry Onkelinx
>>Instituut voor natuur- en bosonderzoek / Research Institute for Nature
>>and Forest
>>Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
>>methodology and quality assurance
>>Gaverstraat 4
>>9500 Geraardsbergen
>>Belgium 
>>tel. + 32 54/436 185
>>Thierry.Onkelinx at inbo.be 
>>www.inbo.be 
>>
>>To call in the statistician after the experiment is done may be no more
>>than asking him to perform a post-mortem examination: he may be able to
>>say what the experiment died of.
>>~ Sir Ronald Aylmer Fisher
>>
>>The plural of anecdote is not data.
>>~ Roger Brinner
>>
>>The combination of some data and an aching desire for an answer does not
>>ensure that a reasonable answer can be extracted from a given body of
>>data.
>>~ John Tukey
>>
>>Dit bericht en eventuele bijlagen geven enkel de visie van de schrijver weer
>>en binden het INBO onder geen enkel beding, zolang dit bericht niet bevestigd is
>>door een geldig ondertekend document. The views expressed in this message
>>and any annex are purely those of the writer and may not be regarded as stating
>>an official position of INBO, as long as the message is not confirmed by a duly
>>signed document.
>>
>>_______________________________________________
>>R-sig-ecology mailing list
>>R-sig-ecology at r-project.org
>>https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>>
>>


