[R] Splitting a DF into rows according to a column
peter dalgaard
pdalgd at gmail.com
Mon Oct 4 17:30:32 CEST 2010
On Oct 4, 2010, at 16:57 , Johannes Graumann wrote:
> Hi,
>
> I'm spinning my wheels on this and keep coming around to the same wrong
> solution - please have a look and give a hand ...
>
> The premise is: a DF like so
>
>> loremIpsum <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit.
> Quisque leo ipsum, ultricies scelerisque volutpat non, volutpat et nulla.
> Curabitur consequat ullamcorper tellus id imperdiet. Duis semper malesuada
> nulla, blandit lobortis diam fringilla at. Vestibulum nec tellus orci, eu
> sollicitudin quam. Phasellus sit amet enim diam. Phasellus mattis hendrerit
> varius. Curabitur ut tristique enim. Lorem ipsum dolor sit amet, consectetur
> adipiscing elit. Sed convallis, tortor id vehicula facilisis, nunc justo
> facilisis tellus, sed eleifend nisi lacus id purus. Maecenas tempus
> sollicitudin libero, molestie laoreet metus dapibus eu. Mauris justo ante,
> mattis et pulvinar a, varius pretium eros. Curabitur fringilla dui ac dui
> rutrum pretium. Donec sed magna adipiscing nisi accumsan congue sed ac est.
> Vivamus lorem urna, tristique quis accumsan quis, ullamcorper aliquet
> velit."
>> tmpDF <- data.frame(Column1=rep(unlist(strsplit(loremIpsum," ")),length.out=510),Column2=runif(510,min=0,max=1e8))
>
> is to be split into DFs with 50 entries in an ordered manner according to
> Column2 (first DF is to contain the rows with the 50 largest numbers, ...).
>
> Here is what I have been doing:
>
>> binSize <- 50
>> splitMembership <-
> pmin(ceiling(order(tmpDF[["Column2"]],decreasing=TRUE)/binSize),floor(nrow(tmpDF)/binSize))
>> splitList <- split(tmpDF,splitMembership)
>
> Distribution seems to work ...
>> sapply(splitList,nrow)
>
> But this is NOT what I wanted ...
>> sapply(splitList,function(x){max(x[["Column2"]])})
> This was supposed to give me bins that are Column2-sorted, so bin 1 should
> have a higher max than bin 2, bin 2 a higher max than bin 3, ...
>
> Can anyone point out where (my now 3 reimplementations) fail?
>
> Thanks, Stupid Joh
Dear Stupid Joh,
Have you considered something along the lines of the following (with x standing in for your tmpDF)?
o <- order(-x$Column2)
xx <- x[o,]
split(xx, (seq_len(NROW(x))-1) %/% 50)
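Your code above is a bit hard to follow, but it seems to work better with rank() instead of order(). order() gives the row indices that would sort the column, whereas rank() gives each row's position in the sorted order, which is what the bin assignment needs: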
> splitMembership <-
+ pmin(ceiling(rank(-tmpDF[["Column2"]])/binSize),floor(nrow(tmpDF)/binSize))
> splitList <- split(tmpDF,splitMembership)
> sapply(splitList,nrow)
1 2 3 4 5 6 7 8 9 10
50 50 50 50 50 50 50 50 50 60
> sapply(splitList,function(x){max(x[["Column2"]])})
1 2 3 4 5 6
99877498 90567877 81965382 69112280 59814266 52130373
7 8 9 10
41557660 32630212 21226996 11880032
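
To see the order()/rank() difference at a glance, here is a toy sketch, not taken from the session above: the data frame `df`, its columns `id` and `value`, and the bin size of 2 are made up for illustration. It runs the same binning expression with order() and with rank(), and then the sort-first approach.

```r
# Hypothetical toy data: five rows, to be binned two at a time by 'value'.
df <- data.frame(id = letters[1:5], value = c(20, 50, 10, 40, 30))
binSize <- 2
nBins <- floor(nrow(df) / binSize)   # 2; the last bin absorbs the remainder

# order(): indices of the rows in decreasing order of 'value'
o <- order(-df$value)   # 2 4 5 1 3
# rank(): position of each row within that decreasing order
r <- rank(-df$value)    # 4 1 5 2 3

# Same binning expression, once with order() and once with rank()
byOrder <- split(df, pmin(ceiling(o / binSize), nBins))
byRank  <- split(df, pmin(ceiling(r / binSize), nBins))

# With order() the bins are not sorted: the top value ends up in bin 2
maxByOrder <- sapply(byOrder, function(d) max(d$value))   # 40 50
# With rank() the bin maxima decrease as intended
maxByRank  <- sapply(byRank, function(d) max(d$value))    # 50 20

# Sort-first alternative: reorder the rows, then cut into consecutive blocks.
# Without the pmin() cap, the leftover row forms its own (third) group.
sorted <- df[o, ]
bySort <- split(sorted, (seq_len(nrow(sorted)) - 1) %/% binSize)
sizesBySort <- sapply(bySort, nrow)   # 2 2 1
```

Note that the sort-first version, as written, puts the leftover rows in an extra, smaller group instead of folding them into the last full bin; wrapping the grouping vector in pmin() as in the original code would restore that behaviour.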
--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com