[R-sig-eco] reshape (was clustering large data)

Wed Oct 8 14:10:03 CEST 2008

reshape did not work for large datasets on my computer either. I got
around the problem by writing my own program using loops. This takes a
few hours for a non-expert like me (my code is slow and not portable, but
it works ...). Another possibility that came to my mind would be to
run the data transformation in, for example, SAS and re-export the data to
R. I think that the R data transformation procedures (like reshape) are
not the most efficient ones.
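
For illustration only (this is not the program mentioned above): a
minimal sketch of one way to do the long-to-wide step in base R without
reshape. The column names location and species and the toy data frame
are assumptions. It preallocates the matrix and fills it through matrix
indexing rather than an explicit loop.

```r
# Toy long-format presence data; the column names are hypothetical.
long <- data.frame(
  location = c("A", "A", "B", "C", "C", "C"),
  species  = c("sp1", "sp3", "sp2", "sp1", "sp2", "sp3"),
  stringsAsFactors = FALSE
)

locs <- sort(unique(long$location))
spec <- sort(unique(long$species))

# Preallocate the full location x species matrix with zeros.
wide <- matrix(0L, nrow = length(locs), ncol = length(spec),
               dimnames = list(locs, spec))

# Index with a two-column matrix of (row, column) positions, which
# marks every recorded presence in one vectorised assignment.
wide[cbind(match(long$location, locs), match(long$species, spec))] <- 1L

wide
```

At the size discussed in this thread (6354 x 1381) such an integer
matrix has about 8.8 million cells, roughly 35 MB.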

Frank

r-sig-ecology-request at r-project.org wrote:
> Today's Topics:
>
>    1. Clustering large data (ONKELINX, Thierry)
>    2. Re: Clustering large data (tyler)
>    3. Re: Clustering large data (Peter Solymos)
>    4. Re: Clustering large data (Farrar.David at epamail.epa.gov)
>    5. Re: Clustering large data (Christian A. Parker)
>    6. Re: Clustering large data (Farrar.David at epamail.epa.gov)
>    7. Re: Clustering large data (Brian Campbell)
>    8. Re: Clustering large data (Christian A. Parker)
>    9. Mortality anslisis (Marcelo Luiz de Laia)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 7 Oct 2008 12:12:28 +0200
> From: "ONKELINX, Thierry" <Thierry.ONKELINX at inbo.be>
> Subject: [R-sig-eco] Clustering large data
> To: <r-sig-ecology at r-project.org>
> Message-ID:
> 	<2E9C414912813E4EB981326983E0A10405903F59 at inboexch.inbo.be>
> Content-Type: text/plain;	charset="us-ascii"
>
> Dear all,
>
> We have a problem with a large dataset that we want to cluster. The
> dataset is in a long format: 1154024 rows with presence data. Each row
> has the name of the species and the location. We have 1381 species and
> 6354 locations.
> The main problem is that we need the data in wide format (one row for
> each location, one column for each species) for the clustering
> algorithms. But the 6354 x 1381 dataframe is too big to fit into
> memory, at least when we use cast from the reshape package to convert
> the dataframe from a long to a wide format.
>
> Are there any clustering tools available that can work with the data in
> a long format or with sparse matrices (only 13% of the matrix is
> non-zero)? If they work with sparse matrices: how to convert our dataset
> to a sparse matrix? Other suggestions are welcome.
>
> We are working with R 2.7.2 on WinXP with 2 GB RAM. --max-mem-size is
> set to 2047M.
>
> Thanks,
>
> Thierry
>
>
> ----------------------------------------------------------------------------
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature
> and Forest
> Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
> methodology and quality assurance
> Gaverstraat 4
> 9500 Geraardsbergen
> Belgium 
> tel. + 32 54/436 185
> Thierry.Onkelinx at inbo.be 
> www.inbo.be 
>
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to
> say what the experiment died of.
> ~ Sir Ronald Aylmer Fisher
>
> The plural of anecdote is not data.
> ~ Roger Brinner
>
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of
> data.
> ~ John Tukey
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 07 Oct 2008 09:35:39 -0300
> From: tyler <tyler.smith at mail.mcgill.ca>
> Subject: Re: [R-sig-eco] Clustering large data
> To: r-sig-ecology at r-project.org
> Message-ID: <87zllg7fc4.fsf at blackbart.sedgenet>
> Content-Type: text/plain; charset=us-ascii
>
> "ONKELINX, Thierry" <Thierry.ONKELINX at inbo.be>
> writes:
>
>   
>> Dear all,
>>
>> We have a problem with a large dataset that we want to cluster. The
>> dataset is in a long format: 1154024 rows with presence data. Each row
>> has the name of the species and the location. We have 1381 species and
>> 6354 locations.
>> The main problem is that we need the data in wide format (one row for
>> each location, one column for each species) for the clustering
>> algorithms. But the 6354 x 1381 dataframe is too big to fit into
>> memory, at least when we use cast from the reshape package to convert
>> the dataframe from a long to a wide format.
>>
>> Are there any clustering tools available that can work with the data in
>> a long format or with sparse matrices (only 13% of the matrix is
>> non-zero)? If they work with sparse matrices: how to convert our dataset
>> to a sparse matrix? Other suggestions are welcome.
>>
>>     
>
> 6354 x 1381 should be well within your memory limit, so I assume it's
> the intermediate steps that are fouling you up. Maybe you can do it in
> pieces: 
>
> 1. subset the original two-column matrix to include only the first 100 sites
> 2. convert this subset to wide form
> 3. repeat 63 times for different subsets
> 4. rbind the resulting matrices
>
> Good luck,
>
> Tyler
>
>
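
The piecewise approach Tyler suggests in Message 2 (subset the sites,
convert each subset to wide form, rbind the pieces) could be sketched
roughly as follows in base R. This is only an illustration on toy data:
the column names location and species, the chunk size and the use of
table() in place of cast are all assumptions.

```r
# Toy long-format presence data; the column names are hypothetical.
long <- data.frame(
  location = c("A", "A", "B", "C", "C", "D", "E"),
  species  = c("sp1", "sp3", "sp2", "sp1", "sp2", "sp3", "sp1"),
  stringsAsFactors = FALSE
)

# Fix the species levels up front so every piece gets the same columns.
long$species <- factor(long$species)

locs <- sort(unique(long$location))
chunk_size <- 2  # 100 sites per piece in the suggestion; 2 for the toy data
chunks <- split(locs, ceiling(seq_along(locs) / chunk_size))

pieces <- lapply(chunks, function(ids) {
  sub <- long[long$location %in% ids, ]
  # table() on factors keeps every level, including species absent here.
  tab <- table(factor(sub$location, levels = ids), sub$species)
  # Convert the counts to a plain 0/1 integer matrix.
  matrix(as.integer(tab > 0), nrow = length(ids),
         dimnames = list(ids, levels(long$species)))
})

# Stack the per-chunk matrices into the full location x species matrix.
wide <- do.call(rbind, unname(pieces))
wide
```

Only one subset of the long table is converted at a time, so the
intermediate objects stay small; the final rbind still needs room for
the complete wide matrix.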