[R-sig-eco] Clustering large data
Dave Roberts
droberts at montana.edu
Fri Oct 24 20:10:59 CEST 2008
Thierry and Hadley,
Sorry to be late coming into this (I forgot I subscribed to sig-eco).
Package labdsv has a function called matrify() which takes a three-column
data.frame (sample, taxa, abundance) and creates a full (sparse, i.e.
mostly zero) matrix representation with samples as rows and taxa as
columns. I've never tried it on a data set as large as yours, and I'm
curious whether it would work. It's pure R, but if worst comes to worst I
used to have a FORTRAN version that would probably work. Please give
matrify() a try and let me know.
Dave R.
matrify <- function (data)
{
    if (ncol(data) != 3)
        stop("data frame must have three column format")
    plt <- data[, 1]    # sample (plot) identifiers
    spc <- data[, 2]    # taxon identifiers
    abu <- data[, 3]    # abundances
    plt.codes <- levels(factor(plt))
    spc.codes <- levels(factor(spc))
    # full samples x taxa matrix, zero where a taxon was not recorded
    taxa <- matrix(0, nrow = length(plt.codes), ncol = length(spc.codes))
    row <- match(plt, plt.codes)
    col <- match(spc, spc.codes)
    # seq_along() is safe when the input has zero rows
    for (i in seq_along(abu)) {
        taxa[row[i], col[i]] <- abu[i]
    }
    taxa <- data.frame(taxa)
    names(taxa) <- spc.codes
    row.names(taxa) <- plt.codes
    taxa
}
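
On a data set this large the element-by-element loop may turn out to be the slow part. Below is a minimal sketch of the same idea with the loop replaced by a single matrix-index assignment. The name matrify_vec and the toy data frame obs are made up for illustration; this is not a function in labdsv, and it has not been timed on data of this size.

```r
# Hypothetical vectorized variant of matrify(); not part of labdsv.
matrify_vec <- function(data) {
    if (ncol(data) != 3)
        stop("data frame must have three column format")
    plt.codes <- levels(factor(data[, 1]))
    spc.codes <- levels(factor(data[, 2]))
    taxa <- matrix(0, nrow = length(plt.codes), ncol = length(spc.codes))
    # A two-column (row, column) index matrix fills every cell in one step.
    idx <- cbind(match(data[, 1], plt.codes), match(data[, 2], spc.codes))
    taxa[idx] <- data[, 3]
    taxa <- data.frame(taxa)
    names(taxa) <- spc.codes
    row.names(taxa) <- plt.codes
    taxa
}

# Toy input in (sample, taxa, abundance) format; values are invented.
obs <- data.frame(sample = c("p1", "p1", "p2", "p3"),
                  taxa = c("sppA", "sppB", "sppA", "sppC"),
                  abundance = c(3, 1, 2, 5))
veg <- matrify_vec(obs)
```

Here veg should have one row per sample and one column per taxon, with zeros for the combinations that do not occur in obs.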
hadley wickham wrote:
> Hi Thierry,
>
> Thanks for the more detailed report. I think the new version of
> reshape will help, but I just checked and it's currently a total mess
> and will need a lot of work before it's ready for anyone to try.
> Unfortunately I'm unlikely to get to it until the ggplot2 book is
> finished, so it might be a bit of a wait.
>
> Hadley
>
> On Tue, Oct 14, 2008 at 2:52 AM, ONKELINX, Thierry
> <Thierry.ONKELINX at inbo.be> wrote:
>> Hi Hadley,
>>
>> Here is a more elaborate report of what I did and what went wrong. The
>> example is not reproducible because the dataset is too large. A smaller
>> dummy dataset is not an option as it works with smaller datasets. I'm
>> willing to run the code again with a development version of reshape.
>>
>> Cheers,
>>
>> Thierry
>>
>>
>>> library(RODBC)
>>> library(reshape)
>> Loading required package: plyr
>>> setwd("d:/wouter")
>>> Sys.info()
>>                      sysname                      release
>>                    "Windows"                         "XP"
>>                      version                     nodename
>> "build 2600, Service Pack 2"                 "LHPA000838"
>>                      machine                        login
>>                        "x86"           "thierry_onkelinx"
>>                         user
>>           "thierry_onkelinx"
>>> sessionInfo()
>> R version 2.7.2 (2008-08-25)
>> i386-pc-mingw32
>>
>> locale:
>> LC_COLLATE=Dutch_Belgium.1252;LC_CTYPE=Dutch_Belgium.1252;LC_MONETARY=Dutch_Belgium.1252;LC_NUMERIC=C;LC_TIME=Dutch_Belgium.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices datasets  tcltk     utils     methods
>> [8] base
>>
>> other attached packages:
>> [1] reshape_0.8.1  plyr_0.1       RODBC_1.2-3    svSocket_0.9-5 svIO_0.9-5
>> [6] R2HTML_1.59    svMisc_0.9-5   svIDE_0.9-5
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.7.2
>>> channel <- odbcConnectAccess("db1.mdb")
>>> km <- sqlQuery(channel = channel, query = "SELECT KMhokcode AS Location, TaxonFK AS Species FROM kmhok_periode2_selectie ORDER BY KMhokcode, TaxonFK", as.is = TRUE)
>>> odbcCloseAll()
>>> km$value <- 1
>>> dim(km)
>> [1] 1157024 3
>>> length(unique(km$Location))
>> [1] 6354
>>> length(unique(km$Species))
>> [1] 1381
>>> system.time(tmp <- cast(Location ~ Species, data = km[1:1000, ], fill = 0))
>>    user  system elapsed
>>    0.11    0.00    0.17
>>> system.time(tmp <- cast(Location ~ Species, data = km[1:10000, ], fill = 0))
>>    user  system elapsed
>>     1.7     0.0     1.7
>>> system.time(tmp <- cast(Location ~ Species, data = km[1:100000, ], fill = 0))
>>    user  system elapsed
>>   46.42    0.45   47.02
>>> system.time(tmp <- cast(Location ~ Species, data = km, fill = 0))
>> Error: cannot allocate vector of size 33.5 Mb
>> Timing stopped at: 322.95 3.43 327.4
>>> system.time(tmp <- table(km$Location, km$Species))
>>    user  system elapsed
>>    1.10    0.00    1.11
>>
>>
>>
>> ----------------------------------------------------------------------------
>> ir. Thierry Onkelinx
>> Instituut voor natuur- en bosonderzoek / Research Institute for Nature
>> and Forest
>> Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
>> methodology and quality assurance
>> Gaverstraat 4
>> 9500 Geraardsbergen
>> Belgium
>> tel. + 32 54/436 185
>> Thierry.Onkelinx at inbo.be
>> www.inbo.be
>>
>> To call in the statistician after the experiment is done may be no more
>> than asking him to perform a post-mortem examination: he may be able to
>> say what the experiment died of.
>> ~ Sir Ronald Aylmer Fisher
>>
>> The plural of anecdote is not data.
>> ~ Roger Brinner
>>
>> The combination of some data and an aching desire for an answer does not
>> ensure that a reasonable answer can be extracted from a given body of
>> data.
>> ~ John Tukey
>>
>> -----Original message-----
>> From: r-sig-ecology-bounces at r-project.org
>> [mailto:r-sig-ecology-bounces at r-project.org] On behalf of hadley wickham
>> Sent: Friday 10 October 2008 14:40
>> To: ONKELINX, Thierry
>> CC: r-sig-ecology at r-project.org
>> Subject: Re: [R-sig-eco] Clustering large data
>>
>>> Thanks for your responses. The biggest problem seems to be cast() from
>>> the reshape package, which could not handle the dataset. Peter's
>>> solution using the mefa package worked fine. I found another solution:
>>> table(), which works fine to crosstabulate presence-only data.
>> Exactly what error did you get? Or did it just take a very long time
>> and then you gave up? I have an experimental rewrite of the reshape
>> package that is more memory efficient and much faster (10 - 20x) -
>> however, it's still some time from being ready for production use.
>>
>> Hadley
>>
>>
>> --
>> http://had.co.nz/
>>
>> _______________________________________________
>> R-sig-ecology mailing list
>> R-sig-ecology at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>>
>
>
>
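
For presence-only records like the ones in Thierry's report (one row per Location/Species pair), the table() route he mentions can be sketched on a toy data frame as follows. Only table() and the column names Location and Species come from the thread; the object names km_toy, tab and pa and all the values are made up for illustration.

```r
# Toy presence-only records; values are invented for illustration.
km_toy <- data.frame(Location = c("a1", "a1", "b2", "c3", "c3"),
                     Species  = c("s1", "s2", "s1", "s2", "s3"))

# table() counts every Location x Species pair; with one record per pair
# the counts are 0 or 1, which amounts to a presence/absence matrix.
tab <- table(km_toy$Location, km_toy$Species)

# Drop the "table" class to get a plain integer matrix, and cap counts at 1
# in case a pair was recorded more than once.
pa <- unclass(tab)
pa[pa > 1] <- 1L
```

The result pa has locations as rows and species as columns, which is the layout most distance and clustering functions expect.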
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
David W. Roberts office 406-994-4548
Professor and Head FAX 406-994-3190
Department of Ecology email droberts at montana.edu
Montana State University
Bozeman, MT 59717-3460