[R] Split data frame into 250-row chunks

Wed Jun 10 21:21:21 CEST 2015

> On Jun 10, 2015, at 7:39 AM, Liz Hare <doggene at earthlink.net> wrote:
> 
> Hi R-Experts,
> 
> I have a data.frame like this:
> 
>> head(map)
>  chr snp   poscm   posbp    dist
> 1   1  M1 2.99043 3249189      NA
> 2   1  M2 3.06457 3273096 0.07414
> 3   1  M3 3.17018 3307151 0.10561
> 4   1  M4 3.20892 3319643 0.03874
> 5   1  M5 3.28120 3342947 0.07228
> 6   1  M6 3.29624 3347798 0.01504
> 
> I need to split this into chunks of 250 rows (there will usually be a last chunk with < 250 rows).
> 
> If I only had to extract one 250-line chunk, it would be easy:
> 
> map1 <- map[1:250, ]
> 
> or using subset().
> 
> I tried to make it a loop iterating through num and using beg and nd for starting and ending indices, but I couldn’t figure out how to reference all the variables I needed in this:
> 
>> chunks
>    beg   nd let num
> 1     1  250   a   1
> 2   251  500   b   2
> 3   501  750   c   3
> 4   751 1000   d   4
> 5  1001 1250   e   5
> 6  1251 1500   f   6
> 7  1501 1750   g   7
> 8  1751 2000   h   8
> 9  2001 2250   i   9
> 10 2251 2500   j  10
> …
> 
> Remembering that loops are not always the best answer in R, I looked at other options like split, following this example but not being able to adapt it from a vector to a data.frame version
> http://stackoverflow.com/questions/3318333/split-a-vector-into-chunks-in-r (Yes, I’ve reviewed the language documentation). I checked out ddply and data.table, but couldn’t find a way to use them with index positions instead of column values.
> 
> Thanks,
> Liz

Hi,

  map.split <- split(map, (as.numeric(rownames(map)) - 1) %/% 250)

That will create a list of data frames, each a subset of ‘map’ with 250 rows, except for the last one, which holds whatever rows remain.
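As a quick check of the chunk sizes, here is a minimal sketch. The full ‘map’ is not shown in the question, so this uses a made-up 600-row stand-in with hypothetical column values, and it assumes the default row names 1, 2, ..., n:

```r
# Hypothetical stand-in for 'map': 600 rows with default row names "1".."600"
map <- data.frame(chr   = 1,
                  snp   = paste0("M", 1:600),
                  poscm = seq(3, 30, length.out = 600))

# (row number - 1) %/% 250 gives the chunk index 0, 1, 2, ...
map.split <- split(map, (as.numeric(rownames(map)) - 1) %/% 250)

# Rows per chunk: two full chunks of 250 and a final chunk of 100
sapply(map.split, nrow)
```

The list elements are named after the chunk index ("0", "1", "2"), so a single chunk can be pulled out with something like map.split[["0"]].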

Essentially, you are creating a grouping variable by taking the numeric row names, subtracting 1, and integer-dividing (%/%) by the chunk length that you want. This relies on the row names being the default 1, 2, ..., n. For example, using the built-in ‘iris’ dataset, which has 150 rows:

> (as.numeric(rownames(iris)) - 1) %/% 50
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [34] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [67] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[100] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[133] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

iris.split <- split(iris, (as.numeric(rownames(iris)) - 1) %/% 50)

> length(iris.split)
[1] 3

> lapply(iris.split, nrow)
$`0`
[1] 50

$`1`
[1] 50

$`2`
[1] 50

> lapply(iris.split, head)
$`0`
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

$`1`
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
51          7.0         3.2          4.7         1.4 versicolor
52          6.4         3.2          4.5         1.5 versicolor
53          6.9         3.1          4.9         1.5 versicolor
54          5.5         2.3          4.0         1.3 versicolor
55          6.5         2.8          4.6         1.5 versicolor
56          5.7         2.8          4.5         1.3 versicolor

$`2`
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
104          6.3         2.9          5.6         1.8 virginica
105          6.5         3.0          5.8         2.2 virginica
106          7.6         3.0          6.6         2.1 virginica
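If the row names are not the default 1, 2, ..., n (for example, after subsetting or sorting), the same grouping can be built from row positions instead. This is a sketch of that variant, not part of the approach above as posted; the names ‘iris.sub’, ‘chunk.size’ and ‘grp’ are made up for illustration:

```r
# Subsetting keeps the original row names, so they are no longer consecutive
iris.sub <- iris[iris$Sepal.Length > 5, ]

# Group by row position rather than by row name
chunk.size <- 50
grp <- (seq_len(nrow(iris.sub)) - 1) %/% chunk.size
iris.sub.split <- split(iris.sub, grp)

# Apply a function to each chunk in turn, here just counting rows;
# every chunk except possibly the last has chunk.size rows
sapply(iris.sub.split, nrow)
```

Using seq_len(nrow(...)) also behaves sensibly for a data frame with zero rows, where 1:nrow(...) would not.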

Regards,

Marc Schwartz