[R] Splitting a DF into rows according to a column

peter dalgaard pdalgd at gmail.com
Mon Oct 4 17:30:32 CEST 2010


On Oct 4, 2010, at 16:57 , Johannes Graumann wrote:

> Hi,
> 
> I'm spinning my wheels on this and keep coming around to the same wrong 
> solution - please have a look and give a hand ...
> 
> The premise is: a DF like so
> 
>> loremIpsum <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
> Quisque leo ipsum, ultricies scelerisque volutpat non, volutpat et nulla. 
> Curabitur consequat ullamcorper tellus id imperdiet. Duis semper malesuada 
> nulla, blandit lobortis diam fringilla at. Vestibulum nec tellus orci, eu 
> sollicitudin quam. Phasellus sit amet enim diam. Phasellus mattis hendrerit 
> varius. Curabitur ut tristique enim. Lorem ipsum dolor sit amet, consectetur 
> adipiscing elit. Sed convallis, tortor id vehicula facilisis, nunc justo 
> facilisis tellus, sed eleifend nisi lacus id purus. Maecenas tempus 
> sollicitudin libero, molestie laoreet metus dapibus eu. Mauris justo ante, 
> mattis et pulvinar a, varius pretium eros. Curabitur fringilla dui ac dui 
> rutrum pretium. Donec sed magna adipiscing nisi accumsan congue sed ac est. 
> Vivamus lorem urna, tristique quis accumsan quis, ullamcorper aliquet 
> velit."
>> tmpDF <- data.frame(Column1=rep(unlist(strsplit(loremIpsum," 
> ")),length.out=510),Column2=runif(510,min=0,max=1e8))
> 
> is to be split into DFs with 50 entries in an ordered manner according to 
> Column2 (first DF is to contain the rows with the 50 largest numbers, ...).
> 
> Here is what I have been doing:
> 
>> binSize <- 50
>> splitMembership <- 
> pmin(ceiling(order(tmpDF[["Column2"]],decreasing=TRUE)/binSize),floor(nrow(tmpDF)/binSize))
>> splitList <- split(tmpDF,splitMembership)
> 
> Distribution seems to work ...
>> sapply(splitList,nrow)
> 
> But this is NOT what I wanted ...
>> sapply(splitList,function(x){max(x[["Column2"]])})
> This was supposed to give me bins that are Column2-sorted: bin 1 should 
> have a higher max than bin 2, bin 2 a higher max than bin 3, ...
> 
> Can anyone point out where my (by now 3) reimplementations fail?
> 
> Thanks, Stupid Joh

Dear Stupid Joh, 

Have you considered something along the lines of

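o <- order(-tmpDF$Column2)                # row indices, largest Column2 first
xx <- tmpDF[o,]                           # data frame sorted by decreasing Column2
split(xx, (seq_len(NROW(xx))-1) %/% 50)   # consecutive blocks of 50 rows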
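
For anyone who wants to run the sort-then-split idea end to end, here is a minimal sketch. The stand-in data (letters instead of the lorem ipsum words) and the seed are assumptions added only to keep it self-contained and reproducible; the names sorted, bins, binSizes and binMax are made up for the illustration.

```r
# Hypothetical stand-in for tmpDF; the seed is only for reproducibility
set.seed(1)
tmpDF <- data.frame(Column1 = rep(letters, length.out = 510),
                    Column2 = runif(510, min = 0, max = 1e8))
binSize <- 50

# Sort rows by decreasing Column2, then cut consecutive blocks of binSize rows
sorted <- tmpDF[order(tmpDF$Column2, decreasing = TRUE), ]
bins <- split(sorted, (seq_len(nrow(sorted)) - 1) %/% binSize)

binSizes <- sapply(bins, nrow)                        # rows per bin
binMax   <- sapply(bins, function(b) max(b$Column2))  # largest value per bin
```

Note that with 510 rows this yields eleven bins (ten of 50 rows and a final one of 10), whereas the pmin() construction in the question folds the leftover rows into the tenth bin.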

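Your code above is a bit hard to follow, but it seems to work better with rank() instead of order(). order() returns the row positions in sorted order, whereas rank() returns each row's own rank, which is what a per-row bin label needs: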

> splitMembership <- 
+ pmin(ceiling(rank(-tmpDF[["Column2"]])/binSize),floor(nrow(tmpDF)/binSize))
> splitList <- split(tmpDF,splitMembership)
> sapply(splitList,nrow)
 1  2  3  4  5  6  7  8  9 10 
50 50 50 50 50 50 50 50 50 60 
> sapply(splitList,function(x){max(x[["Column2"]])})
       1        2        3        4        5        6 
99877498 90567877 81965382 69112280 59814266 52130373 
       7        8        9       10 
41557660 32630212 21226996 11880032 
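
To see the difference between the two functions on something small, here is a toy sketch. The vector v and the bin size of 2 are made up for the illustration and are not part of the data above.

```r
v <- c(10, 30, 20)   # hypothetical values; 30 is the largest

ord <- order(-v)     # positions of the elements, largest first: 2 3 1
rnk <- rank(-v)      # rank of each element, 1 = largest:        3 1 2

# A per-element bin label must be derived from each element's rank
membership <- ceiling(rnk / 2)   # bins of size 2: 2 1 1
```

The first element (10) is the smallest, so it belongs in the last bin; rank() gives it 3, while order() puts a 2 in that position because the largest value sits at position 2.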


-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


