[R-sig-hpc] How to manage the .ff files made by using the ff package

Jens Oehlschlägel Jens.Oehlschlaegel at truecluster.com
Thu Nov 1 18:24:23 CET 2012


Matthew,

You asked how to automatically re-use a 'pattern' when modifying an ffdf 
object. To start with, help(read.table.ffdf) has an example of how to 
create an ffdf in a specific directory with a specific file name 
'pattern'. If we put the pattern in a variable, this looks like

     mypattern <- "c:/tmp/csv"
     ffy <- read.csv.ffdf(file=csvfile, header=TRUE
     , colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
     , asffdf_args=list(
         vmode = c(log="boolean", int="byte", dbl="single", fac="nibble", ord="nibble", dct="single", dat="single")
       , col_args=list(pattern = mypattern)  # create in c:/tmp with prefix csv
       )
     )

As long as you modify existing columns, you operate 'by reference', so 
there is no need to create new ffs.
For adding new columns, there are two little challenges in using the 
pattern from the ffdf instead of using the variable mypattern:

1. There is no 'pattern' stored at the ffdf level, 'pattern' is stored 
with each single ff column.
2. ffdfs are created from ffs, which need to exist *before* the code 
that creates the ffdf is executed.
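
To illustrate point 1, here is a minimal sketch that queries the pattern and the file name column by column. The demo directory, the prefix "csv" and the object names are assumptions made up for this illustration, they are not taken from your project:

```r
library(ff)

# Hypothetical demo directory and prefix (assumptions for illustration)
demodir <- tempfile("ffdemo")
dir.create(demodir)
demopattern <- file.path(demodir, "csv")

# Two ff columns created with the same pattern, then combined into an ffdf
x <- ff(1:3, pattern = demopattern)
y <- ff(c(0.5, 1.5, 2.5), pattern = demopattern)
d <- ffdf(x = x, y = y)

# The pattern is stored with each ff column, so query it per column
colpatterns <- sapply(seq_len(ncol(d)), function(i) pattern(d[[i]]))
colfiles    <- sapply(seq_len(ncol(d)), function(i) filename(d[[i]]))
```

Looking at colfiles should show one .ff file per column, each created from the pattern of its own column.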

The following code shows how to query the old pattern and apply it for 
creating a new column:

     namedcolindexes <- 1:ncol(ffy)
     names(namedcolindexes) <- colnames(ffy)
     ffx <- do.call("ffdf", c(
         lapply(namedcolindexes, function(i) ffy[[i]])
       , list(newcol=ff(1:nrow(ffy), pattern=pattern(ffy[[1]])))
     ))
     filename(ffx)

If you need to do this very often, you can create your own function to 
save yourself some typing.
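
A minimal sketch of such a function, assuming that the pattern of the first column is the one you want to re-use. The name addcol_ffdf and its arguments are hypothetical, this function is not part of the ff package:

```r
library(ff)

# Hypothetical helper (not part of ff): returns a new ffdf that consists of
# the existing columns of x plus one new column, where the new ff file is
# created with the pattern of the first existing column.
addcol_ffdf <- function(x, name, values) {
  stopifnot(inherits(x, "ffdf"), length(values) == nrow(x))
  idx <- seq_len(ncol(x))
  names(idx) <- colnames(x)
  # existing columns are re-used as they are, no data is copied
  oldcols <- lapply(idx, function(i) x[[i]])
  # the new ff must exist before the ffdf is created
  newcol <- ff(values, pattern = pattern(x[[1]]))
  do.call("ffdf", c(oldcols, structure(list(newcol), names = name)))
}
```

With the ffy from above, usage would be something like ffx <- addcol_ffdf(ffy, "newcol", 1:nrow(ffy)), followed by filename(ffx) to check where the files ended up.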

Kind regards

Jens Oehlschlägel


*Sent:* Friday, 26 October 2012 at 03:48
*From:* "Matthew Dubins" <matt.dubins at gmail.com>
*To:* Jens_Oehlschlaegel at truecluster.com
*Subject:* Re: How to manage the .ff files made by using the ff package
Hi Jens,

If I set a pattern for my ffdf once, will it continue to use the 
same pattern even when I modify that ffdf?  I do a lot of data 
processing and I just noticed that every time I made a new column, it 
would make a 'clone' ff file that wouldn't conform to the pattern I set.

Basically, I compile the data outside of R and read it into R using 
read.csv.ffdf, but what options do I use to make sure the data I've 
imported persists without getting that filename access error?

All that being said, once I wasn't having any issues accessing the data 
that I saved in my new tempdir, I found the ff package to be great for 
reducing RAM usage :) :)  When will it be updated next?

Cheers,
Matthew Dubins

On Tue, Oct 23, 2012 at 12:56 PM, <Jens_Oehlschlaegel at truecluster.com> 
wrote:

    Dear Matthew,

    I'd rather keep fftempdir where it is and give a different pattern
    or filename for those ffs that you want to keep.

    HTH

    Jens


    *Sent:* Friday, 19 October 2012 at 02:05
    *From:* "Matthew Dubins" <matt.dubins at gmail.com>
    *To:* Jens_Oehlschlaegel at truecluster.com
    *Subject:* How to manage the .ff files made by using the ff package
    Dear Jens,

    I'm now near completion of an analysis project wherein I almost
    exclusively used the ff package to manage, process, and analyze my
    data.  I kept noticing that it was saving .ff files in a temp
    directory on my windows machine as I was saving and loading my
    project.  I learned that it's better to set the fftempdir to another
    location so that the .ff files don't get deleted and the package
    doesn't complain that you can't load a vector because
    "file.access(filename, 0) == 0 is not TRUE".

    Now that I have a custom fftempdir, I don't seem to have any more
    problems loading up individual vectors but these .ff files keep
    piling on.  What's a good way of making sure that I keep only the
    .ff files that I need for analysis?  Do I need to keep them in the
    fftempdir that I set or does the data persist on the hard drive in
    the .RData and .ffData files in the project directory?

    Thanks,
    Matthew Dubins


