[R-sig-eco] Clustering large data

hadley wickham h.wickham at gmail.com
Wed Oct 15 18:36:27 CEST 2008


Hi Thierry,

Thanks for the more detailed report.  I think the new version of
reshape will help, but I just checked and it's currently a total mess
and will need a lot of work before it's ready for anyone to try.
Unfortunately I'm unlikely to get to it until the ggplot2 book is
finished, so it might be a bit of a wait.

Hadley

On Tue, Oct 14, 2008 at 2:52 AM, ONKELINX, Thierry
<Thierry.ONKELINX at inbo.be> wrote:
> Hi Hadley,
>
> Here is a more elaborate report of what I did and what went wrong. The
> example is not reproducible because the dataset is too large. A smaller
> dummy dataset is not an option as it works with smaller datasets. I'm
> willing to run the code again with a development version of reshape.
>
> Cheers,
>
> Thierry
>
>
>> library(RODBC)
>> library(reshape)
> Loading required package: plyr
>> setwd("d:/wouter")
>> Sys.info()
>                     sysname                      release
>                   "Windows"                         "XP"
>                     version                     nodename
> "build 2600, Service Pack 2"                 "LHPA000838"
>                     machine                        login
>                       "x86"           "thierry_onkelinx"
>                        user
>          "thierry_onkelinx"
>> sessionInfo()
> R version 2.7.2 (2008-08-25)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=Dutch_Belgium.1252;LC_CTYPE=Dutch_Belgium.1252;LC_MONETARY=Dutch_Belgium.1252;LC_NUMERIC=C;LC_TIME=Dutch_Belgium.1252
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  tcltk     utils     methods
> [8] base
>
> other attached packages:
> [1] reshape_0.8.1  plyr_0.1       RODBC_1.2-3    svSocket_0.9-5 svIO_0.9-5
> [6] R2HTML_1.59    svMisc_0.9-5   svIDE_0.9-5
>
> loaded via a namespace (and not attached):
> [1] tools_2.7.2
>> channel <- odbcConnectAccess("db1.mdb")
>> km <- sqlQuery(channel = channel, query = "SELECT KMhokcode AS Location, TaxonFK AS Species FROM kmhok_periode2_selectie ORDER BY KMhokcode, TaxonFK", as.is = TRUE)
>> odbcCloseAll()
>> km$value <- 1
>> dim(km)
> [1] 1157024       3
>> length(unique(km$Location))
> [1] 6354
>> length(unique(km$Species))
> [1] 1381
>> system.time(tmp <- cast(Location ~ Species, data = km[1:1000, ], fill = 0))
>   user  system elapsed
>   0.11    0.00    0.17
>> system.time(tmp <- cast(Location ~ Species, data = km[1:10000, ], fill = 0))
>   user  system elapsed
>    1.7     0.0     1.7
>> system.time(tmp <- cast(Location ~ Species, data = km[1:100000, ], fill = 0))
>   user  system elapsed
>  46.42    0.45   47.02
>> system.time(tmp <- cast(Location ~ Species, data = km, fill = 0))
> Error: cannot allocate vector of size 33.5 Mb
> Timing stopped at: 322.95 3.43 327.4
>> system.time(tmp <- table(km$Location, km$Species))
>   user  system elapsed
>   1.10    0.00    1.11
>
>
>
> ----------------------------------------------------------------------------
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature
> and Forest
> Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
> methodology and quality assurance
> Gaverstraat 4
> 9500 Geraardsbergen
> Belgium
> tel. + 32 54/436 185
> Thierry.Onkelinx at inbo.be
> www.inbo.be
>
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to
> say what the experiment died of.
> ~ Sir Ronald Aylmer Fisher
>
> The plural of anecdote is not data.
> ~ Roger Brinner
>
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of
> data.
> ~ John Tukey
>
> -----Original message-----
> From: r-sig-ecology-bounces at r-project.org
> [mailto:r-sig-ecology-bounces at r-project.org] On behalf of hadley wickham
> Sent: Friday 10 October 2008 14:40
> To: ONKELINX, Thierry
> CC: r-sig-ecology at r-project.org
> Subject: Re: [R-sig-eco] Clustering large data
>
>> Thanks for your responses. The biggest problem seems to be cast() from
>> the reshape package, which could not handle the dataset. Peter's
>> solution using the mefa package worked fine. I found another solution:
>> table(), which works fine to crosstabulate presence-only data.
>
> Exactly what error did you get?  Or did it just take a very long time
> and then you gave up?  I have an experimental rewrite of the reshape
> package that is more memory efficient and much faster (10 - 20x) -
> however, it's still some time from being ready for production use.
>
> Hadley
>
>
> --
> http://had.co.nz/
>
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology



-- 
http://had.co.nz/
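
The table() workaround mentioned in the quoted message can be sketched on a small made-up data set. This is only an illustration under stated assumptions: the data frame below is hypothetical and just mirrors the Location and Species column names from the transcript, and collapsing counts to 0/1 is one possible way to get a presence/absence matrix, not necessarily what was done with the real data.

```r
# Hypothetical presence-only records: one row per (Location, Species) record.
# The values are invented for illustration; only the column names follow the thread.
km <- data.frame(
  Location = c("A", "A", "B", "C", "C", "C"),
  Species  = c("s1", "s2", "s1", "s2", "s3", "s3"),  # C/s3 is recorded twice
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are locations, columns are species, cells are record counts.
counts <- table(km$Location, km$Species)

# Collapse counts to presence/absence (1/0) so duplicate records do not inflate cells.
presence <- (unclass(counts) > 0) * 1L

print(presence)
```

Combinations that never occur come out as 0 automatically, which plays the role of fill = 0 in the cast() calls above.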


