[R-sig-hpc] How to manage the .ff files made by using the ff package
Jens Oehlschlägel
Jens.Oehlschlaegel at truecluster.com
Thu Nov 1 18:24:23 CET 2012
Matthew,
You asked how to automatically re-use a 'pattern' when modifying an ffdf
object. To start with, help(read.table.ffdf) has an example of how to
create an ffdf in a specific directory with a specific file name
'pattern'. If we put the pattern in a variable, this looks like
mypattern <- "c:/tmp/csv"
ffy <- read.csv.ffdf(file=csvfile, header=TRUE,
    colClasses=c(ord="ordered", dct="POSIXct", dat="Date"),
    asffdf_args=list(
        vmode = c(log="boolean", int="byte", dbl="single",
                  fac="nibble", ord="nibble", dct="single", dat="single"),
        # ff files are created in c:/tmp with the prefix csv
        col_args=list(pattern = mypattern)
    )
)
As long as you only modify existing columns, you operate 'by reference',
so no new ff files need to be created.
For adding new columns, there are two little challenges in using the
pattern from the ffdf instead of using the variable mypattern:
1. There is no 'pattern' stored at the ffdf level; 'pattern' is stored
with each single ff column.
2. ffdfs are created from ffs, which need to exist *before* actually
executing the code that creates the ffdf.
The following code shows how to query the old pattern and apply it for
creating a new column:
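# named index vector, so that lapply() returns a named list of columns
namedcolindexes <- 1:ncol(ffy)
names(namedcolindexes) <- colnames(ffy)
# rebuild the ffdf from the existing ff columns plus one new ff column
# that re-uses the pattern of the first column
ffx <- do.call("ffdf", c(
    lapply(namedcolindexes, function(i) ffy[[i]]),
    list(newcol=ff(1:nrow(ffy), pattern=pattern(ffy[[1]])))
))
# check where the files of all columns live
filename(ffx)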
If you need to do this very often, you can create your own function to
save yourself some typing.
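A minimal sketch of such a function, assuming that the pattern of the first
column is the one to re-use. The name add_ffdf_column and its arguments are
hypothetical and not part of the ff package; only ff(), ffdf(), is.ffdf(),
pattern() and the [[ accessor are taken from ff.

```r
library(ff)

# Hypothetical helper (not part of ff): return a new ffdf consisting of
# all columns of 'x' plus one new column 'name' holding 'values'.
# The new ff file is created with the pattern of the first column of 'x'.
add_ffdf_column <- function(x, name, values) {
  stopifnot(is.ffdf(x), length(values) == nrow(x))
  # collect the existing ff columns as a named list (no data is copied)
  idx <- seq_len(ncol(x))
  names(idx) <- colnames(x)
  oldcols <- lapply(idx, function(i) x[[i]])
  # the new ff must exist before the ffdf is built
  newcol <- list(ff(values, pattern = pattern(x[[1]])))
  names(newcol) <- name
  do.call("ffdf", c(oldcols, newcol))
}
```

With the objects from above, the call would then be
ffx <- add_ffdf_column(ffy, "newcol", 1:nrow(ffy)), followed by
filename(ffx) to check where the new file was created.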
Kind regards
Jens Oehlschlägel
*Sent:* Friday, 26 October 2012 at 03:48
*From:* "Matthew Dubins" <matt.dubins at gmail.com>
*To:* Jens_Oehlschlaegel at truecluster.com
*Subject:* Re: How to manage the .ff files made by using the ff package
Hi Jens,
If I set a pattern for my ffdf once, will it continue to use the
same pattern even when I modify that ffdf? I do a lot of data
processing and I just noticed that every time I made a new column, it
would make a 'clone' ff file that wouldn't conform to the pattern I set.
Basically, I compile the data outside of R and read it into R using
read.csv.ffdf. What options do I use to make sure the data I've
imported persists without getting that filename access error?
All that being said, once I wasn't having any issues accessing the data
that I saved in my new tempdir, I found the ff package to be great for
reducing RAM usage :) :) When will it be updated next?
Cheers,
Matthew Dubins
On Tue, Oct 23, 2012 at 12:56 PM, <Jens_Oehlschlaegel at truecluster.com>
wrote:
Dear Matthew,
I'd rather keep fftempdir where it is and give a different pattern
or filename for those ffs that you want to keep.
HTH
Jens
*Sent:* Friday, 19 October 2012 at 02:05
*From:* "Matthew Dubins" <matt.dubins at gmail.com>
*To:* Jens_Oehlschlaegel at truecluster.com
*Subject:* How to manage the .ff files made by using the ff package
Dear Jens,
I'm now near completion of an analysis project wherein I almost
exclusively used the ff package to manage, process, and analyze my
data. I kept noticing that it was saving .ff files in a temp
directory on my Windows machine as I was saving and loading my
project. I learned that it's better to set the fftempdir to another
location so that the .ff files don't get deleted and the package
doesn't complain that you can't load a vector because
"file.access(filename, 0) == 0 is not TRUE".
Now that I have a custom fftempdir, I don't seem to have any more
problems loading up individual vectors but these .ff files keep
piling on. What's a good way of making sure that I keep only the
.ff files that I need for analysis? Do I need to keep them in the
fftempdir that I set or does the data persist on the hard drive in
the .RData and .ffData files in the project directory?
Thanks,
Matthew Dubins