[R] A manipulation problem for a large data set in R

Charles C. Berry cberry at tajo.ucsd.edu
Wed Aug 27 17:58:59 CEST 2008


On Wed, 27 Aug 2008, Giuseppe Paleologo wrote:

> I have two questions for the group. One is very concrete, and is dangerously
> close to a "please do my homework" posting. The second follows from the
> first one but is more general. I would welcome the advice of experienced R
> users.
>
> As for the first one: I have a data frame with two variables
>
> X,   Y
> A,   chris
> D,   chris
> B,   chris
> B,   chris
> C,   andrew
> E,   andrew
> C,   andrew
> B,   beth
> D,   chris
> D,   beth
> C,   beth
> D,   beth
> D,   beth
> A,   andrew
> A,   andrew
> A,   andrew
> C,   chris
> B,   beth
> D,   chris
> E,   andrew
> D,   chris
> D,   beth
> D,   chris
> A,   andrew
> A,   chris
> C,   chris
> A,   chris
> B,   chris
> C,   beth
> A,   chris
>
> I would like to produce a table with one row for every level of the factor
> X, and multiple columns filled with the levels of the factor Y that are
> observed jointly with that level of X. Hence:
>
> X,  Z1,  Z2,  Z3
> A,  andrew,  chris
> B,  chris,  beth
> C,  andrew,  beth,  chris
> D,  chris,  beth
> E,  andrew
>
> A solution would be to do something like
>
> temp = tapply(Y, X, function(a) levels(a[, drop=TRUE]))

 	lapply( split(Y,X), unique )

or

 	lapply( split(Y,X), function(x) as.character(unique(x)))
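
To get from that list to a table with one row per level of X, one option is to pad each character vector with NA and bind the pieces together. What follows is only a sketch under assumptions: X and Y are typed in from the example data in the post, the names per_x, width, padded and result are made up for illustration, and I have not timed it on anything near 200K rows.

```r
# Example data typed in from the post (30 rows); X and Y are factors.
X <- factor(c("A","D","B","B","C","E","C","B","D","D",
              "C","D","D","A","A","A","C","B","D","E",
              "D","D","D","A","A","C","A","B","C","A"))
Y <- factor(c("chris","chris","chris","chris","andrew",
              "andrew","andrew","beth","chris","beth",
              "beth","beth","beth","andrew","andrew",
              "andrew","chris","beth","chris","andrew",
              "chris","beth","chris","andrew","chris",
              "chris","chris","chris","beth","chris"))

# Distinct Y values per level of X, as character vectors,
# in order of first appearance within each group.
per_x <- lapply(split(Y, X), function(y) as.character(unique(y)))

# Indexing past the end of a vector yields NA, so this pads every
# group to the length of the longest one; rbind stacks them as rows.
width  <- max(lengths(per_x))
padded <- do.call(rbind, lapply(per_x, function(v) v[seq_len(width)]))
colnames(padded) <- paste0("Z", seq_len(width))

# One row per level of X, columns Z1, Z2, ... (hypothetical names).
result <- data.frame(X = names(per_x), padded,
                     row.names = NULL, stringsAsFactors = FALSE)
print(result)
```

On the example data this gives five rows (A to E) and three Z columns; only C fills all three, and the shorter rows are padded with NA. Note that unique() keeps order of first appearance, so wrap it in sort() if you want the names alphabetical within each row.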

HTH,

Chuck


>
> and then putting the output in an appropriately sized data frame. The issue
> I have with this is that it is inelegant and rather slow for my typical data
> set (~200K rows). So I was wondering if a more efficient, nicer solution
> exists.
>
> This leads me to a second question. Maybe out of laziness, maybe because R
> is good enough, I tend to do all my local data manipulations in R. This
> includes de-duping records, joining tables, and grouping observations. I do
> this also for larger data sets (say, dense tables with 100M+ elements). Is
> this common practice among R users? If so, is there a tutorial, or an R
> view on it?  If not, what do you use?
>
> Thanks in advance,
>
> -gappy
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901


