[R] loop for a large database
Petr Savicky
savicky at cs.cas.cz
Mon Feb 27 09:01:38 CET 2012
On Sun, Feb 26, 2012 at 11:39:01AM -0800, mari681 wrote:
> SORRY!
>
> The data in MyTable are tagsets of photos, like this:
>
>     V1       V2         V3       V4      V5      V6       V7        V8
> 230 green    nailpolish barrym   0       0       0        0         0
> 231 ny       green      brooklyn cleanup clean   gowanus  volunteer gcc
> 232 green    saul       lecture  0       0       0        0         0
> 233 green    colors     cores    market  colores marakesh mercado   malu
> 234 ny       green      brooklyn cleanup clean   gowanus  volunteer gcc
> 235 green    saul       lecture  0       0       0        0         0
> 236 portrait pet        white    green   cat     canon    square    eos
>
>     V9                      V10   V11  V12      V13 V14 V15
> 230 0                       0     0    0        0   0   0
> 231 gowanuscanalconservancy 0     0    0        0   0   0
> 232 0                       0     0    0        0   0   0
> 233 malugreen               maroc souk marrocos 0   0   0
> 234 gowanuscanalconservancy 0     0    0        0   0   0
> 235 0                       0     0    0        0   0   0
> 236 is                      eyes  mark taiwan   ii  mk2 5d
>
>
> while data of MyVector is a list of tags (none of the columns in particular)
> whose frequency in MyTable has to be computed. Like this:
>
> [1] "life" "wood" "pink" "house" "green" "fall"
Hi.
Just to be sure: in all the previous solutions, "malugreen" is not
counted as an occurrence of "green". Is this correct?
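To make the distinction concrete, here is a minimal sketch on made-up
toy data (not the real MyTable; the names tags, exact and
substr_counts are only illustrative). It compares whole-word matching
with substring matching and is not meant as a replacement for the
solutions already discussed.

```r
# Toy data, invented for illustration only.
MyTable <- data.frame(
  V1 = c("green", "ny", "portrait"),
  V2 = c("nailpolish", "green", "pet"),
  V3 = c("barrym", "malugreen", "green"),
  stringsAsFactors = FALSE
)
MyVector <- c("life", "green", "pink")

# Flatten all columns into one character vector of tags.
tags <- unlist(MyTable, use.names = FALSE)

# Whole-word matching: "malugreen" is NOT counted as "green".
exact <- sapply(MyVector, function(w) sum(tags == w))

# Substring matching: "malugreen" IS counted as "green".
substr_counts <- sapply(MyVector,
                        function(w) sum(grepl(w, tags, fixed = TRUE)))

print(exact)
print(substr_counts)
```

On this toy input the two counts for "green" differ by one, which is
exactly the "malugreen" entry.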
> MyTable has 21 millions rows and 15 columns, and the data is "character",
> they are words.
Do you use the argument stringsAsFactors=FALSE when reading the data
from a file? Otherwise, character data are converted to factors.
The discussed solutions work in both cases; however, if we prepare
simplified data for testing efficiency, we should use the same
column class as in the real situation.
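A small sketch of how to check the column class, assuming the data are
read with read.table(). The inline text here is a made-up stand-in for
the real file, and the argument is set explicitly in both calls so the
result does not depend on its default.

```r
# Two made-up rows standing in for the real file.
txt <- "green nailpolish barrym
ny green brooklyn"

# Read the same text once as character and once as factor columns.
a <- read.table(text = txt, stringsAsFactors = FALSE)
b <- read.table(text = txt, stringsAsFactors = TRUE)

print(sapply(a, class))
print(sapply(b, class))
```

Running sapply(MyTable, class) on the real data shows which of the two
cases applies.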
> When I tried the loop my computer crashed in the meaning that it freezed
> (froze?) and didn't allow me to do anything. The morning after I forced it
> off and rebooted.
This does not seem to be a consequence of a computation that merely
takes too long. A possible cause is that the memory requirements are
too large. How much memory does the R process use after loading the
data? Try the gc() command after loading the data and compare the
result with the amount of memory available. On a Linux machine, it is
also possible to see the memory usage with the "top" command, in the
row where R is reported.
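As a rough sketch of such a check, on a small made-up table (the name
x and the sizes are only for illustration, the real data are much
larger):

```r
# Small made-up table with 15 character columns.
set.seed(1)
x <- data.frame(
  matrix(sample(c("green", "ny", "0"), 15 * 10000, replace = TRUE),
         ncol = 15),
  stringsAsFactors = FALSE
)

# Approximate size of this one object.
print(object.size(x), units = "Mb")

# Memory used by the whole R session. gc() returns a matrix with rows
# "Ncells" and "Vcells"; its second column is the used memory in Mb.
mem <- gc()
print(mem)
used_mb <- sum(mem[, 2])
print(used_mb)
```

If the used memory reported for the real data is close to the
physical memory of the machine, the system may start swapping, which
can look like a freeze.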
Petr.