[R] faster version of split()?
David Winsemius
dwinsemius at comcast.net
Fri Jan 16 18:27:02 CET 2009
Henrique's solution seems sensible. Another might be:
> df = data.frame(x = sample(7:9, 10, replace = TRUE),
+                 y = sample(1:5, 10, replace = TRUE))
> table(df)
   y
x   1 2 3 4 5
  7 1 0 1 0 2
  8 0 1 0 0 1
  9 0 1 1 2 0
> rowSums(table(df) > 0)
7 8 9
3 2 3
#---------same as Henrique's--------
> count <- function(x) length(unique(na.omit(x)))
> with(df, tapply(y, x, count))
7 8 9
3 2 3
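
To see how these behave on something nearer the size in the question, here is a rough sketch on simulated data. The row count comes from the question; the number of levels, the NA handling and the seed are assumptions, and the unique()-on-pairs variant is just one more alternative, not something proposed earlier in the thread.

```r
# Simulated stand-in for the real data frame (sizes other than n are assumptions).
set.seed(1)
n <- 159503
big <- data.frame(x = sample(1:500, n, replace = TRUE),
                  y = sample(c(1:50, NA), n, replace = TRUE))

count <- function(x) length(unique(na.omit(x)))

# 1. tapply with the count() helper
r1 <- with(big, tapply(y, x, count))

# 2. cross-tabulate, then count the non-empty cells per row
#    (table() drops NA values by default)
r2 <- rowSums(table(big$x, big$y) > 0)

# 3. keep the distinct non-NA (x, y) pairs, then count pairs per x
ok <- !is.na(big$y)
pairs <- unique(big[ok, c("x", "y")])
r3 <- table(pairs$x)

# All three should agree level by level.
stopifnot(all(as.vector(r1) == as.vector(r2)),
          all(as.vector(r1) == as.vector(r3)))
```

Wrapping each step in system.time() would show which one is quickest on a given machine; no timings are claimed here.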
--
David Winsemius
On Jan 16, 2009, at 5:10 AM, Simon Pickett wrote:
> Hi all,
>
> I want to calculate the number of unique observations of "y" in each
> level of "x" from my data frame "df".
>
> this does the job but it is very slow for this big data frame
> (159503 rows, 11 columns).....
>
> group.list <- split(df$y,df$x)
> count <- function(x) length(unique(na.omit(x)))
> sapply(group.list, count, USE.NAMES=TRUE)
>
> I couldn't find the answer searching for "slow split" and "split
> time" on the help forum.
>
> I am running R version 2.2.1, on a machine with 4gb of memory and
> I'm using windows 2000.
>
> thanks in advance,
>
> Simon.
>
> ----- Original Message ----- From: "Wacek Kusnierczyk" <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
> To: "Gundala Viswanath" <gundalav at gmail.com>
> Cc: "R help" <R-help at stat.math.ethz.ch>
> Sent: Friday, January 16, 2009 9:30 AM
> Subject: Re: [R] Value Lookup from File without Slurping
>
>
>> you might try to iteratively read a limited number of lines in a
>> batch using readLines:
>>
>> # filename, the name of your file
>> # n, the maximal count of lines to read in a batch
>> connection = file(filename, open="rt")
>> while (length(lines <- readLines(con=connection, n=n))) {
>> # do your stuff here
>> }
>> close(connection)
>>
>> ?file
>> ?readLines
>>
>> vQ
>>
>>
>> Gundala Viswanath wrote:
>>> Dear all,
>>>
>>> I have a repository file (let's call it repo.txt)
>>> that contain two columns like this:
>>>
>>> # tag value
>>> AAA 0.2
>>> AAT 0.3
>>> AAC 0.02
>>> AAG 0.02
>>> ATA 0.3
>>> ATT 0.7
>>>
>>> Given another query vector
>>>
>>>
>>>> qr <- c("AAC", "ATT")
>>>>
>>>
>>> I would like to find the corresponding value for each query above,
>>> yielding:
>>>
>>> 0.02
>>> 0.7
>>>
>>> However, I want to avoid slurping the whole repo.txt into an object
>>> (e.g. hash).
>>> Is there any way to do that?
>>>
>>> The reason I want to do that is that repo.txt is very, very large
>>> (millions of lines,
>>> with tag length > 30 bp), and my PC memory is too small to hold it.
>>>
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
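
The batch-reading loop quoted above from the other thread leaves the "do your stuff here" step open. A minimal sketch of how it might be filled in for the tag lookup follows. The function name lookup_tags, the whitespace-separated two-column layout, the "#" comment convention and the temporary stand-in for repo.txt are all assumptions made for illustration.

```r
# Small stand-in for repo.txt, using the rows shown in the quoted message.
repo <- tempfile(fileext = ".txt")
writeLines(c("# tag value", "AAA 0.2", "AAT 0.3", "AAC 0.02",
             "AAG 0.02", "ATA 0.3", "ATT 0.7"), repo)

# Hypothetical helper: scan the file n lines at a time and keep only
# the values whose tag is in `query`.
lookup_tags <- function(filename, query, n = 2L) {
  found <- setNames(rep(NA_real_, length(query)), query)
  connection <- file(filename, open = "rt")
  on.exit(close(connection))
  while (length(lines <- readLines(connection, n = n)) > 0) {
    lines <- lines[!grepl("^#", lines)]        # skip header/comment lines
    if (!length(lines)) next
    fields <- strsplit(lines, "[[:space:]]+")
    tags   <- vapply(fields, `[`, character(1), 1)
    values <- as.numeric(vapply(fields, `[`, character(1), 2))
    hit <- tags %in% query
    found[tags[hit]] <- values[hit]
    if (!anyNA(found)) break                   # every query resolved, stop early
  }
  found
}

qr <- c("AAC", "ATT")
res <- lookup_tags(repo, qr)
# res holds AAC -> 0.02 and ATT -> 0.7
```

Only one batch of lines is held in memory at a time, so n can be set as large as memory comfortably allows; tags that never appear in the file stay NA in the result.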