[R-sig-eco] Clustering large data

Dave Roberts droberts at montana.edu
Fri Oct 24 20:10:59 CEST 2008


Thierry and Hadley,

     Sorry to be late coming into this (I forgot I subscribed to sig-eco).

     Package labdsv has a function called matrify() which takes a
three-column data.frame (sample, taxon, abundance) and creates a full
matrix representation of the sparse data.  I've never tried it on a data
set as large as yours, and I'm curious whether it would work.  It's pure
R, but if worst comes to worst I used to have a FORTRAN version that would
probably work.  Please give matrify() a try and let me know.

Dave R.

matrify <- function (data)
{
     # Expect exactly three columns: sample, taxon, abundance.
     if (ncol(data) != 3)
         stop("data frame must have three column format")
     plt <- data[, 1]
     spc <- data[, 2]
     abu <- data[, 3]
     # Sorted unique codes become the row and column labels.
     plt.codes <- levels(factor(plt))
     spc.codes <- levels(factor(spc))
     taxa <- matrix(0, nrow = length(plt.codes),
                    ncol = length(spc.codes))
     row <- match(plt, plt.codes)
     col <- match(spc, spc.codes)
     # seq_along() is safe for zero-row input, where 1:length(abu) is c(1, 0).
     for (i in seq_along(abu)) {
         taxa[row[i], col[i]] <- abu[i]
     }
     taxa <- data.frame(taxa)
     names(taxa) <- spc.codes
     row.names(taxa) <- plt.codes
     taxa
}
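
The element-by-element loop in matrify() may be slow for roughly a
million records. A minimal sketch of a vectorised variant follows. It
assumes the same three-column layout and that the abundance column is
numeric. The name matrify_vec and the toy data are hypothetical and not
part of labdsv. Note that a dense 6354 x 1381 matrix of doubles needs
about 67 MB either way.

```r
# Hypothetical vectorised variant of matrify(); not part of labdsv.
matrify_vec <- function(data) {
    if (ncol(data) != 3)
        stop("data frame must have three column format")
    plt <- factor(data[[1]])
    spc <- factor(data[[2]])
    taxa <- matrix(0, nrow = nlevels(plt), ncol = nlevels(spc))
    # A two-column index matrix assigns all (row, column) cells in one step.
    taxa[cbind(as.integer(plt), as.integer(spc))] <- data[[3]]
    taxa <- data.frame(taxa)
    names(taxa) <- levels(spc)
    row.names(taxa) <- levels(plt)
    taxa
}

# Made-up toy data in (sample, taxon, abundance) form.
toy <- data.frame(sample = c("p1", "p1", "p2", "p3"),
                  taxa = c("sppA", "sppB", "sppA", "sppC"),
                  abundance = c(2, 1, 5, 3))
res <- matrify_vec(toy)
```

Cells with no record stay at 0, as in matrify().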


hadley wickham wrote:
> Hi Thierry,
> 
> Thanks for the more detailed report.  I think the new version of
> reshape will help, but I just checked and it's currently a total mess
> and will need a lot of work before it's ready for anyone to try.
> Unfortunately I'm unlikely to get to it until the ggplot2 book is
> finished, so it might be a bit of a wait.
> 
> Hadley
> 
> On Tue, Oct 14, 2008 at 2:52 AM, ONKELINX, Thierry
> <Thierry.ONKELINX at inbo.be> wrote:
>> Hi Hadley,
>>
>> Here is a more elaborate report of what I did and what went wrong. The
>> example is not reproducible because the dataset is too large. A smaller
>> dummy dataset is not an option as it works with smaller datasets. I'm
>> willing to run the code again with a development version of reshape.
>>
>> Cheers,
>>
>> Thierry
>>
>>
>>> library(RODBC)
>>> library(reshape)
>> Loading required package: plyr
>>> setwd("d:/wouter")
>>> Sys.info()
>>                     sysname                      release
>>                   "Windows"                         "XP"
>>                     version                     nodename
>> "build 2600, Service Pack 2"                 "LHPA000838"
>>                     machine                        login
>>                       "x86"           "thierry_onkelinx"
>>                        user
>>          "thierry_onkelinx"
>>> sessionInfo()
>> R version 2.7.2 (2008-08-25)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=Dutch_Belgium.1252;LC_CTYPE=Dutch_Belgium.1252;
>> LC_MONETARY=Dutch_Belgium.1252;LC_NUMERIC=C;LC_TIME=Dutch_Belgium.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices datasets  tcltk     utils     methods
>> [8] base
>>
>> other attached packages:
>> [1] reshape_0.8.1  plyr_0.1       RODBC_1.2-3    svSocket_0.9-5 svIO_0.9-5
>> [6] R2HTML_1.59    svMisc_0.9-5   svIDE_0.9-5
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.7.2
>>> channel <- odbcConnectAccess("db1.mdb")
>>> km <- sqlQuery(channel = channel, query = "SELECT KMhokcode AS
>> Location, TaxonFK AS Species FROM kmhok_periode2_selectie ORDER BY
>> KMhokcode, TaxonFK", as.is = TRUE)
>>> odbcCloseAll()
>>> km$value <- 1
>>> dim(km)
>> [1] 1157024       3
>>> length(unique(km$Location))
>> [1] 6354
>>> length(unique(km$Species))
>> [1] 1381
>>> system.time(tmp <- cast(Location ~ Species, data = km[1:1000, ], fill = 0))
>>   user  system elapsed
>>   0.11    0.00    0.17
>>> system.time(tmp <- cast(Location ~ Species, data = km[1:10000, ], fill = 0))
>>   user  system elapsed
>>    1.7     0.0     1.7
>>> system.time(tmp <- cast(Location ~ Species, data = km[1:100000, ], fill = 0))
>>   user  system elapsed
>>  46.42    0.45   47.02
>>> system.time(tmp <- cast(Location ~ Species, data = km, fill = 0))
>> Error: cannot allocate vector of size 33.5 Mb
>> Timing stopped at: 322.95 3.43 327.4
>>> system.time(tmp <- table(km$Location, km$Species))
>>   user  system elapsed
>>   1.10    0.00    1.11
>>
>>
>>
>> ----------------------------------------------------------------------------
>> ir. Thierry Onkelinx
>> Instituut voor natuur- en bosonderzoek / Research Institute for Nature
>> and Forest
>> Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
>> methodology and quality assurance
>> Gaverstraat 4
>> 9500 Geraardsbergen
>> Belgium
>> tel. + 32 54/436 185
>> Thierry.Onkelinx at inbo.be
>> www.inbo.be
>>
>> To call in the statistician after the experiment is done may be no more
>> than asking him to perform a post-mortem examination: he may be able to
>> say what the experiment died of.
>> ~ Sir Ronald Aylmer Fisher
>>
>> The plural of anecdote is not data.
>> ~ Roger Brinner
>>
>> The combination of some data and an aching desire for an answer does not
>> ensure that a reasonable answer can be extracted from a given body of
>> data.
>> ~ John Tukey
>>
>> -----Original message-----
>> From: r-sig-ecology-bounces at r-project.org
>> [mailto:r-sig-ecology-bounces at r-project.org] On behalf of hadley wickham
>> Sent: Friday 10 October 2008 14:40
>> To: ONKELINX, Thierry
>> CC: r-sig-ecology at r-project.org
>> Subject: Re: [R-sig-eco] Clustering large data
>>
>>> Thanks for your responses. The biggest problem seems to be cast() from
>>> the reshape package, which could not handle the dataset. Peter's solution
>>> using the mefa package worked fine. I found another solution: table(),
>>> which works fine to cross-tabulate presence-only data.
>> Exactly what error did you get?  Or did it just take a very long time
>> and then you gave up?  I have an experimental rewrite of the reshape
>> package that is more memory efficient and much faster (10 - 20x) -
>> however, it's still some time from being ready for production use.
>>
>> Hadley
>>
>>
>> --
>> http://had.co.nz/
>>
>> _______________________________________________
>> R-sig-ecology mailing list
>> R-sig-ecology at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>>
>> The views expressed in this message and any annex are purely those of the
>> writer and may not be regarded as stating an official position of INBO, as
>> long as the message is not confirmed by a duly signed document.
>>
> 
> 
> 
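
For reference, the table() cross-tabulation that Thierry reports in the
quoted transcript can be tried on a toy data set. This is only a sketch:
km_toy and its values are made up to mimic the Location/Species query
result, and it assumes one record per Location/Species pair, so the
counts are 0/1 presences.

```r
# Toy stand-in for the Location/Species query result (values are made up).
km_toy <- data.frame(Location = c("A1", "A1", "B2", "C3", "C3"),
                     Species  = c(10, 12, 10, 11, 12))

# table() counts each Location/Species combination; with one record per
# pair this gives a 0/1 presence matrix.
tab <- table(km_toy$Location, km_toy$Species)

# Dropping the "table" class leaves a plain integer matrix with dimnames.
pres <- unclass(tab)
```

Because table() stores integer counts, the result takes half the memory
of the same matrix held as doubles.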


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
David W. Roberts                                     office 406-994-4548
Professor and Head                                      FAX 406-994-3190
Department of Ecology                         email droberts at montana.edu
Montana State University
Bozeman, MT 59717-3460


