[Rd] Memory allocation in read.table

Simon Urbanek simon.urbanek at r-project.org
Wed Aug 28 19:44:57 CEST 2013


On Aug 28, 2013, at 12:17 PM, Hadley Wickham wrote:

> Hi all,
> 
> I've been trying to learn more about memory profiling in R and I've
> been trying memory profiling out on read.table. I'm getting a bit of a
> strange result, and I hope that someone might be able to explain why.
> 
> After running
> 
> Rprof("read-table.prof", memory.profiling = TRUE, line.profiling = TRUE,
>  gc.profiling = TRUE, interval = interval)
> diamonds <- read.table("diamonds.csv", sep = ",", header = TRUE)
> Rprof(NULL)
> 
> and doing a lot of data manipulation, I end up with a table that
> displays the total memory (in megabytes) allocated and released (by
> gc) from each line of (a local copy of) read.table:
> 
>          file line  alloc release
> 1 read-table.r  122 1.9797  1.1435
> 2 read-table.r  165 1.1148  0.6511
> 3 read-table.r  221 0.0763  0.0321
> 4 read-table.r  222 0.4922  1.5057
> 
> Lines 122 and 165 are where I expect to see big allocations and
> releases - they're calling scan and convert.type respectively. Lines
> 221 and 222 are more of a mystery:
> 
>    class(data) <- "data.frame"
>    attr(data, "row.names") <- row.names
> 
> Why do those lines need any allocations? I thought class<- and attr<-
> were primitives, and hence would modify in place.
> 

... but only if there is no other reference to the data (i.e. NAMED < 2). If there are two references, they have to copy, because modifying in place would also change the object seen through the other reference.
Here, however, data already has NAMED = 2 because of

data <- data[keep]

If you remove that line and reverse the order of the class<- and attr<- calls, then you get 0 copies.
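
A minimal sketch of the effect, assuming an R build with memory profiling so that tracemem() is available. The objects `data` and `alias` below are made-up stand-ins, not the ones inside read.table, and how many duplications get reported depends on the R version:

```r
# Hypothetical stand-in for the list that read.table builds internally.
data <- list(carat = c(0.23, 0.21, 0.23), price = c(326L, 326L, 327L))
alias <- data                  # a second reference to the same object

# tracemem() prints a message whenever the marked object is duplicated.
# It needs an R build with memory profiling enabled, so guard the call.
if (capabilities("profmem")) tracemem(data)

# With a second reference alive, the replacement functions cannot modify
# in place: R duplicates `data` first so that `alias` keeps its old value.
attr(data, "row.names") <- 1:3
class(data) <- "data.frame"

if (capabilities("profmem")) untracemem(data)

# `alias` is untouched, which is exactly why the copy was necessary.
is.null(attr(alias, "class"))
```

Dropping the `alias <- data` line removes the second reference, which is the situation in which the replacement functions are free to modify in place.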

Cheers,
Simon

PS: if you are loading any sizable data, the one thing you don't want to do is to use read.table() ;)


> Re-running with gctorture(TRUE) yields roughly similar numbers,
> although there is no memory release because gc is called earlier, and
> the assignment of allocations to line is probably more accurate given
> that gctorture runs the code about 20x slower:
> 
>           file line    alloc  release
> 25 read-table.r  221 0.387299 0.00e+00
> 26 read-table.r  222 0.362964 0.00e+00
> 
> The whole object, when loaded, is ~4 meg, so those allocations
> represent fairly sizeable chunks of the total.
> 
> Any suggestions would be greatly appreciated.  Thanks!
> 
> Hadley
> 
> -- 
> Chief Scientist, RStudio
> http://had.co.nz/
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 