[R] Appending new values to an existing factor vector
David Hall (coding)
hacking at gringer.org
Sat Mar 15 01:25:03 CET 2008
Hello,
I've recently come across a situation where I'm trying to read in [genotype
data] files that have around 80,000,000 lines and 4 fields, with a high proportion
of repeated strings. Here's a sample:
rsXXXXXXX SAMPLE0001 CG 0.05302
rsXXXXXX SAMPLE0001 CC 0.06817
rsXXXXXXXX SAMPLE0001 CC 0.01369
rsXXXXXXY SAMPLE0001 GG 0.01816
rsXXXXXXZ SAMPLE0001 GG 0.006711
rsXXXXXXX SAMPLE0002 GG 0.05813
[For the purpose of the work I'm doing at the moment, I don't care about the
last column]
What's the best way to read in these data?
My understanding of what happens when I do read.table on such a file is that it
reads the file into a matrix (or perhaps a list) of character strings, then
carries out the character conversions [i.e. as.factor(data[[i]])].
infile.df <- read.table(gzfile("large_file.txt.gz"), nrows = 82000000)
Doing this all in one go results in R complaining about not having enough memory
to store a data structure of that size [I'm running on Linux, with 1.5GB memory
+ 2GB swap], so I need to do it piecewise, but I suspect the memory issues will
still be present if I do that.
What I'd like is a way to read in, say, a million lines at a time, do the factor
conversion, then append to my existing data frame, which has columns of factors.
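A rough sketch of the piecewise approach I have in mind is below. It assumes that read.table continues from the current position of an already-open connection. The names extend_factor, read_genotypes, chunk_size and the column names are made up for illustration, and I haven't tried it on the full-size file:

```r
# Hypothetical helper: append character values to a factor by extending
# its levels and integer codes, leaving the existing codes as they are.
extend_factor <- function(f, new_values) {
  new_levels <- setdiff(unique(new_values[!is.na(new_values)]), levels(f))
  all_levels <- c(levels(f), new_levels)   # new levels go at the end, unsorted
  codes <- c(as.integer(f), match(new_values, all_levels))
  structure(codes, levels = all_levels, class = "factor")
}

# Hypothetical reader: chunk_size and the column names are assumptions.
read_genotypes <- function(path, chunk_size = 1000000) {
  con <- gzfile(path, open = "rt")   # keep the connection open between reads
  on.exit(close(con))
  result <- list(marker   = factor(character(0)),
                 sample   = factor(character(0)),
                 genotype = factor(character(0)))
  repeat {
    chunk <- tryCatch(
      read.table(con, nrows = chunk_size,
                 # "NULL" skips the last column, which is not needed here
                 colClasses = c("character", "character", "character", "NULL"),
                 col.names = c("marker", "sample", "genotype", "score")),
      error = function(e) NULL)      # treat any read error as end of input
    if (is.null(chunk)) break
    for (col in names(result)) {
      result[[col]] <- extend_factor(result[[col]], chunk[[col]])
    }
    if (nrow(chunk) < chunk_size) break   # short (or empty) chunk: finished
  }
  as.data.frame(result)
}
```

Only one chunk is held as character strings at any time, although c() still copies the accumulated integer codes on every pass, so preallocating might be better for a file of this size. Catching every error as "end of input" is a simplification that would also hide genuine read failures.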
However, something I came across while participating in the ICFP 2007 contest
(http://www.icfpcontest.org/) using R was the strange behaviour when adding
new/unknown values to a factor vector:
> (a <- factor(c("I","C","I","C","F","I")))
[1] I C I C F I
Levels: C F I
> append(a,"P")
[1] "3" "1" "3" "1" "2" "3" "P"
What would be nice is for unknown levels to be added and encoded as a new value,
without having to re-factor the whole vector. The desired result is as follows:
> factor(append(as.character(a),"P"))
[1] I C I C F I P
Levels: C F I P
Is there a better way to do this that means I don't need to do the character
conversion process?
The need to do this character conversion seems to remove one of the useful
features of a factor vector: that it substantially reduces space requirements.
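One possibility that might avoid the round trip through character strings is to extend the levels first and then assign past the end of the vector. This is only a minimal sketch on the small example above, and I have not checked how well it scales:

```r
a <- factor(c("I", "C", "I", "C", "F", "I"))

# Register the new level first; the existing integer codes are untouched.
levels(a) <- c(levels(a), "P")

# Assigning one position past the end extends the factor by one element.
a[length(a) + 1] <- "P"

a
# [1] I C I C F I P
# Levels: C F I P
```

The new level is added at the end of the level set rather than in sorted order, which happens to coincide with the sorted order in this example.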
Thanks for your help,
David Hall