[R-sig-eco] Clustering large data

Mon Oct 27 10:11:29 CET 2008

Dear Dave,

Below you'll find a test report of your function. It works fine with my
dataset, although it is slower than the plain and simple table() function.
But table() will of course only work with presence-only data like mine.
On the other hand, matrify() is about 2.5 times faster than the mefa
package (11.39 s versus 28.61 s elapsed).
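
To illustrate the presence-only caveat, here is a minimal sketch on a made-up toy data frame (the object `obs` and its values are hypothetical, not taken from my database). table() counts records per cell and ignores any abundance column, whereas xtabs() sums the abundances. Note that table() only returns a clean 0/1 matrix when each location/species pair occurs at most once.

```r
# Toy long-format data (hypothetical): one row per location/species record.
obs <- data.frame(
  Location = c("A", "A", "B", "B", "B"),
  Species  = c("s1", "s2", "s1", "s1", "s3"),
  value    = c(3, 1, 2, 5, 4)
)

# table() counts rows per Location x Species cell and ignores `value`.
counts <- table(obs$Location, obs$Species)

# xtabs() sums `value` per cell, so abundances are preserved.
abund <- xtabs(value ~ Location + Species, data = obs)

# Presence/absence from the counts: 1 where at least one record exists.
pres <- (counts > 0) * 1
```

With abundance data, xtabs() (or matrify) is therefore the safer choice; table() is enough when only presence matters.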

HTH,

Thierry

> matrify <- function (data)
+ {
+      if (ncol(data) != 3)
+          stop("data frame must have three column format")
+      plt <- data[, 1]
+      spc <- data[, 2]
+      abu <- data[, 3]
+      plt.codes <- levels(factor(plt))
+      spc.codes <- levels(factor(spc))
+      taxa <- matrix(0, nrow = length(plt.codes), ncol =
+               length(spc.codes))
+      row <- match(plt, plt.codes)
+      col <- match(spc, spc.codes)
+      for (i in 1:length(abu)) {
+          taxa[row[i], col[i]] <- abu[i]
+      }
+      taxa <- data.frame(taxa)
+      names(taxa) <- spc.codes
+      row.names(taxa) <- plt.codes
+      taxa
+ }
> library(RODBC)
> setwd("d:/wouter")
> Sys.info()
                     sysname                      release 
                   "Windows"                         "XP" 
                     version                     nodename 
"build 2600, Service Pack 2"                 "LHPA000838" 
                     machine                        login 
                       "x86"           "thierry_onkelinx" 
                        user 
          "thierry_onkelinx" 
> sessionInfo()
R version 2.8.0 (2008-10-20) 
i386-pc-mingw32 

locale:
LC_COLLATE=Dutch_Belgium.1252;LC_CTYPE=Dutch_Belgium.1252;LC_MONETARY=Dutch_Belgium.1252;LC_NUMERIC=C;LC_TIME=Dutch_Belgium.1252

attached base packages:
[1] stats     graphics  grDevices datasets  tcltk     utils     methods
[8] base

other attached packages:
[1] RODBC_1.2-3    svSocket_0.9-5 svIO_0.9-5     R2HTML_1.59    svMisc_0.9-5
[6] svIDE_0.9-5

loaded via a namespace (and not attached):
[1] tools_2.8.0
> channel <- odbcConnectAccess("db1.mdb")
> km <- sqlQuery(channel = channel, query = "SELECT KMhokcode AS
Location, TaxonFK AS Species FROM kmhok_periode2_selectie ORDER BY
KMhokcode, TaxonFK", as.is = TRUE)
> odbcCloseAll()
> dim(km)
[1] 1157024       2
> length(unique(km$Location))
[1] 6354
> length(unique(km$Species))
[1] 1381
> system.time(tmp <- table(km$Location, km$Species))
   user  system elapsed 
   1.32    0.26    1.58 
> km$value <- 1
> dim(km)
[1] 1157024       3
> system.time(tmp <- matrify(km))
   user  system elapsed 
  10.81    0.58   11.39 
> library(mefa)
This is mefa 2.0-1 
> system.time(x <- mefa(stcs(km[, 1:2]))$xtabs)
   user  system elapsed 
  27.05    0.76   28.61 
> 

----------------------------------------------------------------------------
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature
and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium 
tel. + 32 54/436 185
Thierry.Onkelinx at inbo.be 
www.inbo.be 

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey

-----Original message-----
From: Dave Roberts [mailto:droberts at montana.edu] 
Sent: Friday 24 October 2008 20:11
To: r-sig-ecology at r-project.org
CC: ONKELINX, Thierry
Subject: Re: [R-sig-eco] Clustering large data

Thierry and Hadley,

     Sorry to be late coming into this (I forgot I subscribed to
sig-eco).

     The labdsv package has a function called matrify() which takes a
three-column data.frame (sample, taxa, abundance) and creates a full
(sparse) matrix representation.  I've never tried it on a data set as
large as yours, and I'm curious whether it would work.  It's pure R, but
if worst comes to worst I used to have a FORTRAN version that would
probably work. Please give matrify a try and let me know.

Dave R.

matrify <- function (data)
{
     if (ncol(data) != 3)
         stop("data frame must have three column format")
     plt <- data[, 1]
     spc <- data[, 2]
     abu <- data[, 3]
     plt.codes <- levels(factor(plt))
     spc.codes <- levels(factor(spc))
     taxa <- matrix(0, nrow = length(plt.codes), ncol =
              length(spc.codes))
     row <- match(plt, plt.codes)
     col <- match(spc, spc.codes)
     for (i in 1:length(abu)) {
         taxa[row[i], col[i]] <- abu[i]
     }
     taxa <- data.frame(taxa)
     names(taxa) <- spc.codes
     row.names(taxa) <- plt.codes
     taxa
}
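
If the loop over records turns out to be the slow part, one possible variation is to fill the matrix with a two-column index matrix instead. The sketch below is only an illustration, not the labdsv implementation: the name `matrify_vec` and the toy data are hypothetical, it assumes each plot/species pair occurs only once, and it has not been timed on the large dataset discussed here.

```r
# Hypothetical variant of matrify(): same input and output format,
# but the element-wise loop is replaced by matrix indexing.
matrify_vec <- function(data) {
  if (ncol(data) != 3)
    stop("data frame must have three column format")
  plt.codes <- levels(factor(data[, 1]))
  spc.codes <- levels(factor(data[, 2]))
  taxa <- matrix(0, nrow = length(plt.codes), ncol = length(spc.codes))
  # A two-column index matrix addresses one (row, column) cell per record.
  idx <- cbind(match(data[, 1], plt.codes), match(data[, 2], spc.codes))
  taxa[idx] <- data[, 3]
  taxa <- data.frame(taxa)
  names(taxa) <- spc.codes
  row.names(taxa) <- plt.codes
  taxa
}

# Toy input (hypothetical): three records, two plots, two species.
toy <- data.frame(
  plot    = c("p1", "p1", "p2"),
  species = c("a", "b", "a"),
  abund   = c(2, 5, 1)
)
res <- matrify_vec(toy)
```

Cells without a record (here plot p2, species b) stay at 0, as in the loop version.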

hadley wickham wrote:
> Hi Thierry,
> 
> Thanks for the more detailed report.  I think the new version of
> reshape will help, but I just checked and it's currently a total mess
> and will need a lot of work before it's ready for anyone to try.
> Unfortunately I'm unlikely to get to it until the ggplot2 book is
> finished, so it might be a bit of a wait.
> 
> Hadley
> 
> On Tue, Oct 14, 2008 at 2:52 AM, ONKELINX, Thierry
> <Thierry.ONKELINX at inbo.be> wrote:
>> Hi Hadley,
>>
>> Here is a more elaborate report of what I did and what went wrong. The
>> example is not reproducible because the dataset is too large. A smaller
>> dummy dataset is not an option, as the code works with smaller datasets.
>> I'm willing to run the code again with a development version of reshape.
>>
>> Cheers,
>>
>> Thierry
>>
>>
>>> library(RODBC)
>>> library(reshape)
>> Loading required package: plyr
>>> setwd("d:/wouter")
>>> Sys.info()
>>                     sysname                      release
>>                   "Windows"                         "XP"
>>                     version                     nodename
>> "build 2600, Service Pack 2"                 "LHPA000838"
>>                     machine                        login
>>                       "x86"           "thierry_onkelinx"
>>                        user
>>          "thierry_onkelinx"
>>> sessionInfo()
>> R version 2.7.2 (2008-08-25)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=Dutch_Belgium.1252;LC_CTYPE=Dutch_Belgium.1252;LC_MONETARY=Dutch_Belgium.1252;LC_NUMERIC=C;LC_TIME=Dutch_Belgium.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices datasets  tcltk     utils     methods
>> [8] base
>>
>> other attached packages:
>> [1] reshape_0.8.1  plyr_0.1       RODBC_1.2-3    svSocket_0.9-5 svIO_0.9-5
>> [6] R2HTML_1.59    svMisc_0.9-5   svIDE_0.9-5
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.7.2
>>> channel <- odbcConnectAccess("db1.mdb")
>>> km <- sqlQuery(channel = channel, query = "SELECT KMhokcode AS
>> Location, TaxonFK AS Species FROM kmhok_periode2_selectie ORDER BY
>> KMhokcode, TaxonFK", as.is = TRUE)
>>> odbcCloseAll()
>>> km$value <- 1
>>> dim(km)
>> [1] 1157024       3
>>> length(unique(km$Location))
>> [1] 6354
>>> length(unique(km$Species))
>> [1] 1381
>>> system.time(tmp <- cast(Location ~ Species, data = km[1:1000, ], fill = 0))
>>   user  system elapsed
>>   0.11    0.00    0.17
>>> system.time(tmp <- cast(Location ~ Species, data = km[1:10000, ], fill = 0))
>>   user  system elapsed
>>    1.7     0.0     1.7
>>> system.time(tmp <- cast(Location ~ Species, data = km[1:100000, ], fill = 0))
>>   user  system elapsed
>>  46.42    0.45   47.02
>>> system.time(tmp <- cast(Location ~ Species, data = km, fill = 0))
>> Error: cannot allocate vector of size 33.5 Mb
>> Timing stopped at: 322.95 3.43 327.4
>>> system.time(tmp <- table(km$Location, km$Species))
>>   user  system elapsed
>>   1.10    0.00    1.11
>>
>> -----Original message-----
>> From: r-sig-ecology-bounces at r-project.org
>> [mailto:r-sig-ecology-bounces at r-project.org] On behalf of hadley wickham
>> Sent: Friday 10 October 2008 14:40
>> To: ONKELINX, Thierry
>> CC: r-sig-ecology at r-project.org
>> Subject: Re: [R-sig-eco] Clustering large data
>>
>>> Thanks for your responses. The biggest problem seems to be cast() from
>>> the reshape package, which could not handle the dataset. Peter's solution
>>> using the mefa package worked fine. I found another solution: table(),
>>> which works fine to cross-tabulate presence-only data.
>> Exactly what error did you get?  Or did it just take a very long time
>> and then you gave up?  I have an experimental rewrite of the reshape
>> package that is more memory efficient and much faster (10 - 20x) -
>> however, it's still some time from being ready for production use.
>>
>> Hadley
>>
>>
>> --
>> http://had.co.nz/
>>
>> _______________________________________________
>> R-sig-ecology mailing list
>> R-sig-ecology at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>>
>> The views expressed in this message and any annex are purely those of the
>> writer and may not be regarded as stating an official position of INBO, as
>> long as the message is not confirmed by a duly signed document.
>>
> 
> 
> 

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
David W. Roberts                                     office 406-994-4548
Professor and Head                                      FAX 406-994-3190
Department of Ecology                         email droberts at montana.edu
Montana State University
Bozeman, MT 59717-3460
