[R] [R-sig-hpc] Quickest way to make a large "empty" file on disk?

Rui Barradas ruipbarradas at sapo.pt
Fri Sep 28 20:27:02 CEST 2012


Hello,

I've written a function to try to answer your original request, but I've 
run into a problem; see the end of this message.
In the meantime, replies inline.
On 28-09-2012 17:44, Jonathan Greenberg wrote:
> Rui:
>
> Quick follow-up -- it looks like seek does do what I want (I see Simon
> suggested it some time ago) -- what do mean by "trash your disk"?
Nothing in particular, just that there are sometimes better ways of 
doing it. mmap seems to be safe.
>    What I'm
> trying to accomplish is getting parallel, asynchronous writes to a large
> binary image (just a binary file) working.  Each node writes to a different
> sector of the file via mmap, "filling in" the values as the process runs,
> but the file needs to be pre-created before I can mmap it.  Running a
> writeBin with a bunch of 0s would mean I'd basically have to write the file
> twice, but the seek/ff trick seems to be much faster.
>
> Do I risk doing some damage to my filesystem if I use seek?  I see there is
> a strongly worded warning in the help for ?seek:
>
> "Use of seek on Windows is discouraged. We have found so many errors in the
> Windows implementation of file positioning that users are advised to use it
> only at their own risk, and asked not to waste the *R* developers' time
> with bug reports on Windows' deficiencies." --> there's no detail here on
> which errors people have experienced, so I'm not sure if doing something as
> simple as just "creating" a file using seek falls under the "discouraging"
> category.

I'm not a great system programmer, but 20+ years of using seek on 
Windows have shown me nothing of the sort. In fact, I've just found a 
problem with Ubuntu 12.04: where seek gives the expected result on 
Windows, on Ubuntu it goes up to a certain point and then "stops 
seeking", or whatever is happening. I installed Ubuntu very recently, so 
I really don't know the reason for the behavior you can see in the 
example run below. But I do know that Windows 7 causes no problem, as 
expected.
> As a note, we are trying to work this up on both Windows and *nix systems,
> hence our wanting to have a single approach that works on both OSs.
>
> --j

#
# Function: creates a file of ASCII nulls using seek/writeBin.
# The file size can be big.
#
createBig <- function(filename, size){
    if(size == 0) return(0)
    chunk <- .Machine$integer.max                  # largest integer offset for one seek()
    nchunks <- as.integer(size / chunk)            # number of full chunks
    rest <- size - as.double(nchunks)*as.double(chunk)  # leftover bytes
    fl <- file(filename, open = "wb")
    on.exit(close(fl))                             # close the connection even on error
    for(i in seq_len(nchunks)){
        # skip forward chunk - 1 bytes, then write one byte to extend the file
        seek(fl, where = chunk - 1, origin = "current", rw = "write")
        writeBin(raw(1), fl)
        # ---------- debug ----------
        print(seek(fl, where = NA))
    }
    if(rest > 0){
        seek(fl, where = rest - 1, origin = "current", rw = "write")
        writeBin(raw(1), fl)
    }
}
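
For illustration, the chunk arithmetic at the top of createBig() can be 
checked in isolation; the 5e9-byte target size below is just an example 
value, not from the runs above:

```r
chunk <- .Machine$integer.max                   # 2147483647, the per-seek jump
size <- 5e9                                     # illustrative target size in bytes
nchunks <- as.integer(size / chunk)             # 2 full chunks
rest <- size - as.double(nchunks) * as.double(chunk)  # 705032706 leftover bytes
nchunks * as.double(chunk) + rest == size       # TRUE: chunks + remainder add up
```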

As you can see from the debug prints, on Windows 7 everything works as 
planned, while on Ubuntu 12.04 seek stops seeking when it reaches 17 GB. 
The increments in file size then become one byte at a time, explained by 
the writeBin call. (The different, slightly larger size is irrelevant; 
the code was run several times, all with the same result: at 
17179869176 bytes it no longer works.)
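
One observation (my own speculation, not a confirmed diagnosis): the size 
at which the Ubuntu run stalls is exactly 8 full chunks of 
.Machine$integer.max, which suggests the write offset stops advancing 
after the 8th seek/write pair rather than at a round power of two:

```r
chunk <- .Machine$integer.max   # 2147483647, the per-iteration jump in createBig()
stall <- 17179869176            # file size at which the Ubuntu run stops growing
stall / chunk                   # exactly 8
stall == 8 * chunk              # TRUE: the stall point is the end of the 8th chunk
```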

#----------------------------------------------------------------------------
#
# System: Windows 7 / R 2.15.1

size <- 10*.Machine$integer.max + sample(.Machine$integer.max, 1)
size
[1] 22195364413

createBig("Test.txt", size)
[1] 2147483647
[1] 4294967294
[1] 6442450941
[1] 8589934588
[1] 10737418235
[1] 12884901882
[1] 15032385529
[1] 17179869176
[1] 19327352823
[1] 21474836470

file.info("Test.txt")$size
[1] 22195364413
file.info("Test.txt")$size %/% .Machine$integer.max
[1] 10
file.info("Test.txt")$size %% .Machine$integer.max
[1] 720527943

sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Portuguese_Portugal.1252 LC_CTYPE=Portuguese_Portugal.1252
[3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Portugal.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods base

loaded via a namespace (and not attached):
[1] fortunes_1.5-0

#----------------------------------------------------------------------------
#
# System: Ubuntu 12.04 Precise Pangolin / R 2.15.1
size <- 10*.Machine$integer.max + sample(.Machine$integer.max, 1)
size
[1] 23091487381

createBig("Test.txt", size)
[1] 2147483647
[1] 4294967294
[1] 6442450941
[1] 8589934588
[1] 10737418235
[1] 12884901882
[1] 15032385529
[1] 17179869176
[1] 17179869177
[1] 17179869178

file.info("Test.txt")$size
[1] 17179869179
file.info("Test.txt")$size %/% .Machine$integer.max
[1] 8
file.info("Test.txt")$size %% .Machine$integer.max
[1] 3


sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=pt_PT.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=pt_PT.UTF-8        LC_COLLATE=pt_PT.UTF-8
  [5] LC_MONETARY=pt_PT.UTF-8    LC_MESSAGES=pt_PT.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

loaded via a namespace (and not attached):
[1] tools_2.15.1



>
> On Thu, Sep 27, 2012 at 3:49 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
>
>>   Hello,
>>
>> If you really need to trash your disk, why not use seek()?
>>
>>> fl <- file("Test.txt", open = "wb")
>>> seek(fl, where = 1024, origin = "start", rw = "write")
>> [1] 0
>>> writeChar(character(1), fl, nchars = 1, useBytes = TRUE)
>> Warning message:
>> In writeChar(character(1), fl, nchars = 1, useBytes = TRUE) :
>>    writeChar: more characters requested than are in the string - will
>> zero-pad
>>> close(fl)
>>
>> File "Test.txt" is now 1Kb in size.
>>
>> Hope this helps,
>>
>> Rui Barradas
>> On 27-09-2012 20:17, Jonathan Greenberg wrote:
>>
>> Folks:
>>
>> Asked this question some time ago, and found what appeared (at first) to be
>> the best solution, but I'm now finding a new problem.  First off, it seemed
>> like ff as Jens suggested worked:
>>
>> # outdata_ncells = the number of rows * number of columns * number of bands in an image:
>> out<-ff(vmode="double",length=outdata_ncells,filename=filename)
>> finalizer(out) <- close
>> close(out)
>>
>> This was working fine until I attempted to set length to a VERY large
>> number: outdata_ncells = 17711913600.  This would create a file that is
>> 131.964GB.  Big, but not obscenely so (and certainly not larger than the
>> filesystem can handle).  However, length appears to be restricted
>> by .Machine$integer.max (I'm on a 64-bit windows box):
>>
>>   .Machine$integer.max
>>
>>   [1] 2147483647
>>
>> Any suggestions on how to solve this problem for much larger file sizes?
>>
>> --j
>>
>>
>> On Thu, May 3, 2012 at 10:44 AM, Jonathan Greenberg <jgrn at illinois.edu> wrote:
>>
>>
>>   Thanks, all!  I'll try these out.  I'm trying to work up something that is
>> platform independent (if possible) for use with mmap.  I'll do some tests
>> on these suggestions and see which works best. I'll try to report back in a
>> few days.  Cheers!
>>
>> --j
>>
>>
>>
>> 2012/5/3 "Jens Oehlschlägel" <jens.oehlschlaegel at truecluster.com>
>>
>>   Jonathan,
>>
>> On some filesystems (e.g. NTFS, see below) it is possible to create
>> 'sparse' memory-mapped files, i.e. reserving the space without the cost of
>> actually writing initial values.
>> Package 'ff' does this automatically and also allows accessing the file
>> in parallel. Check the example below and see how the creation of a big
>> file is immediate.
>>
>> Jens Oehlschlägel
>>
>>
>>
>>   library(ff)
>> library(snowfall)
>> ncpus <- 2
>> n <- 1e8
>> system.time(
>>
>>   + x <- ff(vmode="double", length=n, filename="c:/Temp/x.ff")
>> + )
>>         User      System     elapsed
>>         0.01        0.00        0.02
>>
>>   # check finalizer, with an explicit filename we should have a 'close'
>>   # finalizer
>>
>>   finalizer(x)
>>
>>   [1] "close"
>>
>>   # if not, set it to 'close' in order not to let slaves delete x on slave
>>   # shutdown
>>
>>   finalizer(x) <- "close"
>> sfInit(parallel=TRUE, cpus=ncpus, type="SOCK")
>>
>>   R Version:  R version 2.15.0 (2012-03-30)
>>
>> snowfall 1.84 initialized (using snow 0.3-9): parallel execution on 2
>> CPUs.
>>
>>
>>   sfLibrary(ff)
>>
>>   Library ff loaded.
>> Library ff loaded in cluster.
>>
>> Warning message:
>> In library(package = "ff", character.only = TRUE, pos = 2, warn.conflicts
>> = TRUE,  :
>>    'keep.source' is deprecated and will be ignored
>>
>>   sfExport("x") # note: do not export the same ff multiple times
>> # explicitly opening avoids a gc problem
>> sfClusterEval(open(x, caching="mmeachflush")) # opening with
>> # 'mmeachflush' instead of 'mmnoflush' is a bit slower but prevents OS
>> # write storms when the file is larger than RAM
>> [[1]]
>> [1] TRUE
>>
>> [[2]]
>> [1] TRUE
>>
>>
>>   system.time(
>>
>>   + sfLapply( chunk(x, length=ncpus), function(i){
>> +   x[i] <- runif(sum(i))
>> +   invisible()
>> + })
>> + )
>>         User      System     elapsed
>>         0.00        0.00       30.78
>>
>>   system.time(
>>
>>   + s <- sfLapply( chunk(x, length=ncpus), function(i) quantile(x[i],
>> c(0.05, 0.95)) )
>> + )
>>         User      System     elapsed
>>         0.00        0.00        4.38
>>
>>   # for completeness
>> sfClusterEval(close(x))
>>
>>   [[1]]
>> [1] TRUE
>>
>> [[2]]
>> [1] TRUE
>>
>>
>>   csummary(s)
>>
>>                5%  95%
>> Min.    0.04998 0.95
>> 1st Qu. 0.04999 0.95
>> Median  0.05001 0.95
>> Mean    0.05001 0.95
>> 3rd Qu. 0.05002 0.95
>> Max.    0.05003 0.95
>>
>>   # stop slaves
>> sfStop()
>>
>>   Stopping cluster
>>
>>
>>   # with the close finalizer we are responsible for deleting the file
>>   # explicitly (unless we want to keep it)
>>
>>   delete(x)
>>
>>   [1] TRUE
>>
>>   # remove r-side metadata
>> rm(x)
>> # truly free memory
>> gc()
>>
>>    *Sent:* Thursday, 03 May 2012 at 00:23
>> *From:* "Jonathan Greenberg" <jgrn at illinois.edu>
>> *To:* r-help <r-help at r-project.org>, r-sig-hpc at r-project.org
>> *Subject:* [R-sig-hpc] Quickest way to make a large "empty" file on
>> disk?
>>   R-helpers:
>>
>> What would be the absolute fastest way to make a large "empty" file (e.g.
>> filled with all zeroes) on disk, given a byte size and a given number of
>> empty values? I know I can use writeBin, but the "object" in this case may
>> be far too large to store in main memory. I'm asking because I'm going to
>> use this file in conjunction with mmap to do parallel writes to it. Say I
>> want to create a blank file of 10,000 floating point numbers.
>>
>> Thanks!
>>
>> --j
>>
>> --
>> Jonathan A. Greenberg, PhD
>> Assistant Professor
>> Department of Geography and Geographic Information Science
>> University of Illinois at Urbana-Champaign
>> 607 South Mathews Avenue, MC 150
>> Urbana, IL 61801
>> Phone: 415-763-5476
>>
>> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007
>> http://www.geog.illinois.edu/people/JonathanGreenberg.html
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> R-sig-hpc mailing list
>> R-sig-hpc at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>
>>
>>   --
>> Jonathan A. Greenberg, PhD
>> Assistant Professor
>> Department of Geography and Geographic Information Science
>> University of Illinois at Urbana-Champaign
>> 607 South Mathews Avenue, MC 150
>> Urbana, IL 61801
>> Phone: 415-763-5476
>> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007
>> http://www.geog.illinois.edu/people/JonathanGreenberg.html
>>
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>




More information about the R-help mailing list