[R] data summarization etc...

Daniel Malter daniel at umd.edu
Sat Jul 12 01:53:04 CEST 2008


The problem is that you do not really have categories. You draw 70000
random normal values three times and then try to group one variable by
another. Since none of the values will coincide exactly with another, your
code creates on the order of 70000 groups, each holding a single
observation. No wonder you are running out of memory. This kind of
aggregation only makes sense if you really have groups/categories that
cluster your data and that each contain a substantial number of
observations (see example below).

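# two numeric variables, 30000 observations each
x1 <- rnorm(30000, 0, 1)
x2 <- rnorm(30000, 10, 5)
# two grouping variables with 3 levels each, giving 3 x 3 = 9 cells
group1 <- rep(1:3, each = 10000)
group2 <- rep(1:3, 10000)

# mean of x1 and x2 within each group1 x group2 cell
aggregate(cbind(x1, x2), list(group1, group2), FUN = mean)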
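
Before aggregating, it can help to count the distinct values of the grouping variable. The following is only a minimal sketch: `fid_int` is a hypothetical integer id with 500 levels that does not appear in the original data, and the seed is arbitrary.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility only
n <- 70000

fid_cont <- rnorm(n, 100)                     # continuous "id", as in the original post
fid_int  <- sample(1:500, n, replace = TRUE)  # hypothetical integer id (assumption)
pti      <- rnorm(n, 10)

# Number of groups each variable would produce
n_groups_cont <- length(unique(fid_cont))     # close to n: nearly every value is its own group
n_groups_int  <- length(unique(fid_int))      # at most 500 groups

# Sum of pti within each integer group; the result is a vector named by group id
sums <- tapply(pti, fid_int, sum)
head(sums)
```

With a continuous grouping variable the number of groups is about as large as the number of rows, so nothing is actually summarized. With a real id the result has one entry per group.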

Best,
Daniel


-------------------------
cuncta stricte discussurus
-------------------------

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of sj
Sent: Friday, July 11, 2008 6:47 PM
To: r-help
Subject: [R] data summarization etc...

Hello,

I am trying to do some fairly straightforward data summarization, i.e., the
kind you would do with a pivot table in Excel or with SQL queries. I
have a moderately sized data set of ~70,000 records and I am trying to
compute some group averages and sum values within groups. The code example
below shows how I am trying to go about doing this:

pti <-rnorm(70000,10)
fid <- rnorm(70000,100)
finc <- rnorm(70000,1000)


### compute the sums of pti within fid groups
sum_pinc <- aggregate(cbind(fid,pti),list(fid),FUN=sum)

#### compute mean finc within fid groups
tot_finc <- aggregate(cbind(fid,finc),list(fid),FUN=mean)

When I try to do it this way I get an error message telling me that enough
memory cannot be allocated (I am using R 2.7.1 on Windows XP with 2 GB of
memory). I figure that there must be a more efficient way to go about doing
this. Please suggest one.

I would typically do this kind of task in a database and use SQL to push the
data around. I know RODBC allows you to write SQL to query external DBs. Is
there any mechanism that allows you to write SQL queries against datasets
internal to R? E.g., in the case above I could do something like

set <- cbind(fid,pti,finc)

select fid, sum(pti)
from set
group by fid

that would be handy!
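
For reference, a GROUP BY query of this shape has a close base R counterpart in the formula interface of `aggregate()`. The sketch below is illustrative only: it assumes `fid` is a hypothetical integer id with 1000 levels rather than a continuous value, and the data frame name `dat` and the seed are made up. The CRAN package sqldf is one option for running SQL itself against data frames; it is not used here.

```r
set.seed(42)                                  # arbitrary seed, for reproducibility only
n <- 70000
dat <- data.frame(
  fid  = sample(1:1000, n, replace = TRUE),   # hypothetical integer id (assumption)
  pti  = rnorm(n, 10),
  finc = rnorm(n, 1000)
)

# Roughly: select fid, sum(pti) from dat group by fid
sum_pinc <- aggregate(pti ~ fid, data = dat, FUN = sum)

# Roughly: select fid, avg(finc) from dat group by fid
tot_finc <- aggregate(finc ~ fid, data = dat, FUN = mean)

head(sum_pinc)
head(tot_finc)
```

Each result is a data frame with one row per distinct `fid` and the summarized column next to it.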

Thanks,

Spencer




