[R] Can't import this 4GB DATASET
iliketurtles
isaacm200 at gmail.com
Fri May 4 07:34:40 CEST 2012
Dear Experienced R Practitioners,
I have a 4GB .txt file called "dataset.txt" and have attempted to use the ff,
bigmemory, filehash and sqldf packages to import it, but have had no
success. The readLines output of this data is:
readLines("dataset.txt",n=20)
 [1] " "
 [2] ""
 [3] " "
 [4] "   PERMNO       DATE  SHRCD  COMNAM                    PRC      VOL"
 [5] ""
 [6] "    10001 01/09/1986     11  GREAT FALLS GAS CO   -5.75000    14160"
 [7] "    10001 01/10/1986     11  GREAT FALLS GAS CO   -5.87500        0"
 [8] "    10001 01/13/1986     11  GREAT FALLS GAS CO   -5.87500     2805"
 [9] "    10001 01/14/1986     11  GREAT FALLS GAS CO   -5.87500     2070"
[10] "    10001 01/15/1986     11  GREAT FALLS GAS CO   -6.06250     6000"
[11] "    10001 01/16/1986     11  GREAT FALLS GAS CO   -6.25000     1500"
[12] "    10001 01/17/1986     11  GREAT FALLS GAS CO   -6.25000     7100"
[13] "    10001 01/20/1986     11  GREAT FALLS GAS CO   -6.31250     1700"
[14] "    10001 01/21/1986     11  GREAT FALLS GAS CO   -6.18750     4000"
[15] "    10001 01/22/1986     11  GREAT FALLS GAS CO   -6.18750     5200"
[16] "    10001 01/23/1986     11  GREAT FALLS GAS CO   -6.18750     4100"
[17] "    10001 01/24/1986     11  GREAT FALLS GAS CO   -6.18750     1500"
[18] "    10001 01/27/1986     11  GREAT FALLS GAS CO   -6.18750     4000"
[19] "    10001 01/28/1986     11  GREAT FALLS GAS CO   -6.12500     3500"
[20] "    10001 01/29/1986     11  GREAT FALLS GAS CO   -6.06250     4600"
This data goes on for a huge number of rows (I'm not sure exactly how many).
The elements in each row are separated by an uneven number of (what seem to
be) spaces (maybe tabs? not sure). Further, some rows are "incomplete", i.e.
they have missing elements.
I took the first 29 rows of "dataset.txt" and saved them in a separate data
file, let's call it "dataset2.txt". read.table("dataset2.txt",skip=5) gives
exactly the table that I want to end up with, except that I want it for the
4GB data, via bigmemory, ff or filehash.
read.table('dataset2.txt',skip=5)
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 10001 01/09/1986 11 GREAT FALLS GAS CO -5.7500 14160
2 10001 01/10/1986 11 GREAT FALLS GAS CO -5.8750 0
3 10001 01/13/1986 11 GREAT FALLS GAS CO -5.8750 2805
4 10001 01/14/1986 11 GREAT FALLS GAS CO -5.8750 2070
5 10001 01/15/1986 11 GREAT FALLS GAS CO -6.0625 6000
6 10001 01/16/1986 11 GREAT FALLS GAS CO -6.2500 1500
7 10001 01/17/1986 11 GREAT FALLS GAS CO -6.2500 7100
8 10001 01/20/1986 11 GREAT FALLS GAS CO -6.3125 1700
9 10001 01/21/1986 11 GREAT FALLS GAS CO -6.1875 4000
10 10001 01/22/1986 11 GREAT FALLS GAS CO -6.1875 5200
11 10001 01/23/1986 11 GREAT FALLS GAS CO -6.1875 4100
12 10001 01/24/1986 11 GREAT FALLS GAS CO -6.1875 1500
13 10001 01/27/1986 11 GREAT FALLS GAS CO -6.1875 4000
14 10001 01/28/1986 11 GREAT FALLS GAS CO -6.1250 3500
15 10001 01/29/1986 11 GREAT FALLS GAS CO -6.0625 4600
16 10001 01/30/1986 11 GREAT FALLS GAS CO -6.0625 3830
17 10001 01/31/1986 11 GREAT FALLS GAS CO -6.1250 675
18 10001 02/03/1986 11 GREAT FALLS GAS CO -6.1250 2300
19 10001 02/04/1986 11 GREAT FALLS GAS CO -6.1250 4200
20 10001 02/05/1986 11 GREAT FALLS GAS CO -6.1250 1000
21 10001 02/06/1986 11 GREAT FALLS GAS CO -6.1250 4200
22 10001 02/07/1986 11 GREAT FALLS GAS CO -6.1250 1800
23 10001 02/10/1986 11 GREAT FALLS GAS CO -6.1875 100
24 10001 02/11/1986 11 GREAT FALLS GAS CO -6.3125 1500
25 10001 02/12/1986 11 GREAT FALLS GAS CO -6.2500 2500
26 10001 02/13/1986 11 GREAT FALLS GAS CO -6.2500 1000
27 10001 02/14/1986 11 GREAT FALLS GAS CO -6.1875 0
28 10001 02/18/1986 11 GREAT FALLS GAS CO -6.1875 7000
29 10001 02/19/1986 11 GREAT FALLS GAS CO -6.2500 5200
###/*MY ATTEMPT USING FILEHASH*/###
#First, try to import the 29-row dataset (the tiny subset of the whole dataset).
install.packages("filehash");require(filehash)
dumpList(read.table("dataset2.txt",skip=6),dbName="db02") #Note that dumpDF also works.
env02 <- db2env(db="db02")
with(env02,V1)
#
 [1] 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001
[27] 10001 10001
#
with(env02,V2)
#
 [1] 01/10/1986 01/13/1986 01/14/1986 01/15/1986 01/16/1986 01/17/1986 01/20/1986 01/21/1986 01/22/1986 01/23/1986 01/24/1986 01/27/1986 01/28/1986 01/29/1986
[15] 01/30/1986 01/31/1986 02/03/1986 02/04/1986 02/05/1986 02/06/1986 02/07/1986 02/10/1986 02/11/1986 02/12/1986 02/13/1986 02/14/1986 02/18/1986 02/19/1986
#
#So this is all good. But when we try to import the ENTIRE dataset:
dumpList(read.table("dataset.txt",skip=6),dbName="db01")
#Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : line 1991 did not have 9 elements
#This confirms that there are missing elements in later rows. So I added fill=TRUE:
dumpDF(read.table("dataset.txt",skip=6,fill=TRUE),dbName="db01")
Error: cannot allocate vector of size 500.0 Mb
In addition: Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
Reached total allocation of 4078Mb: see help(memory.size)
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
Reached total allocation of 4078Mb: see help(memory.size)
3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
Reached total allocation of 4078Mb: see help(memory.size)
4: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
Reached total allocation of 4078Mb: see help(memory.size)
#This error occurs after a 30-minute wait.
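One idea for getting around the 4078Mb ceiling is to never hold the whole table in memory at all: read from an open connection in chunks with read.table(..., fill=TRUE) and write each chunk straight back out as a clean comma-separated copy that the other packages can digest. A minimal sketch; the chunk size, the 5 skipped header lines and the output name "dataset_clean.csv" are all my own choices, not part of the data:

con <- file("dataset.txt", "rt")
out <- file("dataset_clean.csv", "wt")
junk <- readLines(con, n = 5)   # skip the blank/header lines once, as skip=5 did above
repeat {
  chunk <- try(read.table(con, nrows = 100000, fill = TRUE,
                          colClasses = "character"), silent = TRUE)
  if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
  # caveat: each chunk guesses its own column count from its first lines,
  # so a chunk that happens to start with a short row may still misbehave
  write.table(chunk, out, sep = ",", quote = TRUE,
              row.names = FALSE, col.names = FALSE)
}
close(con)
close(out)

The clean comma-separated copy is also what the sqldf sketch at the end assumes.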
###/*MY ATTEMPT USING FF*/###
#First, try with the 29-row "dataset2.txt".
# open a connection to the file
con <- file('dataset2.txt', 'rt')
# read the remainder using read.table.ffdf
ffdf <- read.table.ffdf(file=con)
# close connection
close(con)
ffdf
#ffdf (all open) dim=c(29,9), dimorder=c(1,2) row.names=NULL
#ffdf virtual mapping
#   PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
V1            V1      integer       integer FALSE           FALSE            FALSE                 1                1               1           TRUE
V2            V2      integer       integer FALSE           FALSE            FALSE                 2                1               1           TRUE
V3            V3      integer       integer FALSE           FALSE            FALSE                 3                1               1           TRUE
V4            V4      integer       integer FALSE           FALSE            FALSE                 4                1               1           TRUE
V5            V5      integer       integer FALSE           FALSE            FALSE                 5                1               1           TRUE
V6            V6      integer       integer FALSE           FALSE            FALSE                 6                1               1           TRUE
V7            V7      integer       integer FALSE           FALSE            FALSE                 7                1               1           TRUE
V8            V8       double        double FALSE           FALSE            FALSE                 8                1               1           TRUE
V9            V9      integer       integer FALSE           FALSE            FALSE                 9                1               1           TRUE
ffdf data
      V1         V2 V3    V4    V5  V6 V7      V8    V9
1  10001 01/09/1986 11 GREAT FALLS GAS CO -5.7500 14160
2  10001 01/10/1986 11 GREAT FALLS GAS CO -5.8750     0
3  10001 01/13/1986 11 GREAT FALLS GAS CO -5.8750  2805
4  10001 01/14/1986 11 GREAT FALLS GAS CO -5.8750  2070
5  10001 01/15/1986 11 GREAT FALLS GAS CO -6.0625  6000
6  10001 01/16/1986 11 GREAT FALLS GAS CO -6.2500  1500
7  10001 01/17/1986 11 GREAT FALLS GAS CO -6.2500  7100
8  10001 01/20/1986 11 GREAT FALLS GAS CO -6.3125  1700
:      :          :  :     :     :   :        :     :
22 10001 02/07/1986 11 GREAT FALLS GAS CO -6.1250  1800
23 10001 02/10/1986 11 GREAT FALLS GAS CO -6.1875   100
24 10001 02/11/1986 11 GREAT FALLS GAS CO -6.3125  1500
25 10001 02/12/1986 11 GREAT FALLS GAS CO -6.2500  2500
26 10001 02/13/1986 11 GREAT FALLS GAS CO -6.2500  1000
27 10001 02/14/1986 11 GREAT FALLS GAS CO -6.1875     0
28 10001 02/18/1986 11 GREAT FALLS GAS CO -6.1875  7000
29 10001 02/19/1986 11 GREAT FALLS GAS CO -6.2500  5200
#GOOD!!! Now let's try with the 4GB "dataset.txt".
# open a connection to the file
con <- file('dataset.txt', 'rt')
# read the remainder using read.table.ffdf
ffdf <- read.table.ffdf(file=con)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 4 did not have 9 elements
#Again, missing elements. Yet I don't know how to do something like fill=TRUE in read.table.ffdf (the way you can in read.table()).
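As far as I can tell from ?read.table.ffdf, anything extra in ... is handed on to FUN (read.table by default), so fill=TRUE may simply be passable in the call; a hedged sketch:

require(ff)
con <- file("dataset.txt", "rt")
junk <- readLines(con, n = 5)   # skip the blank/header lines, as with skip=5 earlier
# assuming ... really is forwarded to read.table, fill=TRUE (and friends such
# as na.strings or colClasses) can go straight in here; note that with
# fill=TRUE each chunk still guesses its own column count, so a chunk that
# starts with a short row could still trip it up
ffdf <- read.table.ffdf(file = con, fill = TRUE)
close(con)
dim(ffdf)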
###/*MY ATTEMPT USING BIGMEMORY*/###
#I can't even import "dataset2.txt", so I'm trying to figure this out before I try the big "dataset.txt".
E <- read.big.matrix("dataset2.txt", skip = 3,
                     backingpath = getwd(),
                     sep = " ",
                     type = "double")
#results in:
E[10,]
 [1]         NA         NA         NA 10001.0000         NA         NA         NA     1.0000         NA         NA         NA         NA    11.0000         NA
[15]         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA
[29]         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA
[43]         NA    -6.3125         NA         NA         NA         NA         NA         NA         NA         NA         NA  1700.0000         NA         NA
[57]         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA         NA
[71]         NA
#So all the character fields and the dates are getting mangled.
E <- read.big.matrix("dataset2.txt", skip = 3,
                     backingpath = getwd(),
                     sep = " ",
                     type = "char")
E[10,]
 [1]  NA  NA  NA  17  NA  NA  NA   1  NA  NA  NA  NA  11  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
[41]  NA  NA  NA  -6  NA  NA  NA  NA  NA  NA  NA  NA  NA -92  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
#Even worse.
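For what it's worth, I suspect two separate things go wrong here: sep = " " turns every single space into a separator (hence the long runs of NA), and a big.matrix can only hold one atomic type, so the company name and the date cannot live in it as-is. A sketch of what might work instead, assuming a hypothetical comma-separated, numeric-only file "dataset_num.csv" (e.g. PERMNO, the date recoded as YYYYMMDD, SHRCD, PRC, VOL) has been prepared first:

require(bigmemory)
# "dataset_num.csv" is hypothetical: numeric columns only, comma-separated,
# produced by a cleanup pass like the chunked one sketched earlier
E <- read.big.matrix("dataset_num.csv", sep = ",", type = "double",
                     backingfile = "dataset_num.bin",
                     descriptorfile = "dataset_num.desc",
                     backingpath = getwd())
E[10, ]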
###/*MY ATTEMPT USING sqldf*/###
No idea what to do here.
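If I understand sqldf correctly, read.csv.sql() loads the file into a temporary on-disk SQLite database and only brings the query result into R, but it wants a single-character separator, so it would have to run on a cleaned comma-separated copy (like the hypothetical "dataset_clean.csv" above) rather than on the space-padded original. A sketch:

require(sqldf)
# pull just a slice into R; a WHERE clause on the relevant column could
# restrict this to the rows actually needed, instead of the LIMIT used
# here purely for illustration
dat <- read.csv.sql("dataset_clean.csv", header = FALSE,
                    sql = "select * from file limit 10")
dat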
-----
----
Isaac
Research Assistant
Quantitative Finance Faculty, UTS