[R] Problems completely reading in a "large" sized data set

Charles C. Berry cberry at tajo.ucsd.edu
Thu Jan 21 17:41:01 CET 2010


On Wed, 20 Jan 2010, Robin Jeffries wrote:

> I have been through the help file archives a number of times, and still
> cannot figure out what is wrong.
> I have a tab-delimited text file. It's 76 MB, so while it's large, it's not
> -that- large. I'm running Win7 x64 with 4 GB RAM and R 2.10.1.
>
> When I open this data in Excel, I have 27 columns and 450932 rows, excluding
> the first row containing variable names.
>
> I am trying to get this into R as a dataset for analysis.
>
> zz <- "Data/media1y.txt"
> f <- file(zz, 'r')             # open the file
> rl <- readLines(f, 1)          # read the first line (the header)
> colnames <- strsplit(rl, '\t')
> p <- length(colnames[[1]])     # count the number of columns
> nobs <- 450932
> close(f)
>
> Using:
> d1 <- matrix(scan(zz, skip = 1, sep = "\t", fill = TRUE,
>                   what = rep("character", p), nlines = nobs),
>              ncol = p, nrow = nobs, byrow = TRUE,
>              dimnames = list(NULL, colnames[[1]]))
>
> produces the warning
> Read 5761719 items
> Warning message:
> In matrix(scan(zz, skip = 1, sep = "\t", fill = TRUE, what =
> rep("character",  :
>  data length [5761719] is not a sub-multiple or multiple of the number of
> rows [10]
>
> Now, 5761719/27 = 213397.
> If I change nobs<-213397 it reads in the file with no errors. It produces a
> matrix that I can work with from here. But the file obviously is not
> complete.
>

What does

 	length( grep( '\n', d1 ) )

report??

If non-zero, you have unmatched quotes in some lines.

HTH,

Chuck
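A minimal sketch of the usual follow-up, assuming the unmatched quotes are
just literal apostrophes or inch marks in the data rather than real field
quoting: re-read with quote processing turned off, reusing zz, p, nobs and
colnames from the code quoted above.

    ## quote = "" keeps embedded ' and " as ordinary characters, so scan()
    ## no longer reads across line breaks looking for a closing quote
    d1 <- matrix(scan(zz, skip = 1, sep = "\t", quote = "",
                      what = "", nlines = nobs),
                 ncol = p, nrow = nobs, byrow = TRUE,
                 dimnames = list(NULL, colnames[[1]]))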


> At first I thought it might be reading only the first x rows. So I
> sorted by the first variable alphabetically in Excel before saving it as a
> txt file and reading it into R.
> head(d1) shows the correct first 6 rows, but when I ask for tail(d1), the
> entry for the first variable in the last row is [213397,] "WSAH".
> The 213397th row in Excel starts with "MM1", and the actual last row starts
> with "YE". The "WSA" in question can be found on Excel row # 397548.
>
> That confuses the heck out of me. There are no blank lines.
>
> Since there are >1000 categories for that first variable, I'm not going to
> manually match all of the frequencies, but the first 10 were exact, "MM1"
> was correct, and the last few before "WSA" were also correct. "WSA" itself
> had 3001 observations in R, whereas Excel has 3093. That also makes it seem
> that R stops reading the table at some point.
>
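A quick sketch of a check on the raw file (the path is taken from the quoted
code above): count quote characters per line. Any line with an odd count is a
place where scan()'s default quoting would run past the end of the line and
merge the following records, which would produce exactly this kind of row
shift.

    raw <- readLines("Data/media1y.txt")
    ## keep only ' and " on each line, then count them
    odd <- nchar(gsub("[^'\"]", "", raw)) %% 2 == 1
    sum(odd)           # how many lines have unbalanced quotes
    head(which(odd))   # where the first few of them are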
>
>
> It shouldn't be a memory issue.... right?
>> object.size(d1)
> 56328480 bytes
>> memory.size(max=TRUE)
> [1] 444.06
>> memory.size(max=NA)
> [1] 3583.88
>> memory.size(max=FALSE)
> [1] 251.09
>
>
>
> As a side question, I'm reading it all in as characters for now because when
> I tried to define a vector of column types, wht
> <- list(rep("character",7), 0, "logical", 0, "character"), to use in scan(),
> it still read everything in as character. I'm also not sure about the "" 's;
> I had to put them in to get list() to even accept that. Or c(). Any ideas
> about this?
>
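A sketch on that side question: scan() takes each column's type from the mode
of the corresponding component of a list passed as what, not from a type name
given as a string, so the string "logical" makes that column character.
Something like the following, where the 7/1/1/1/1 layout is only illustrative
and the list needs one component per column (27 in this file):

    ## each component's mode sets the column type: "" = character,
    ## 0 = numeric, TRUE = logical (illustrative layout, one entry per column)
    wht  <- c(as.list(rep("", 7)), list(0, TRUE, 0, ""))
    cols <- scan(zz, skip = 1, sep = "\t", quote = "", what = wht)
    str(cols)   # scan() now returns a list of typed column vectors

read.delim(zz, quote = "", colClasses = c(rep("character", 7), "numeric",
"logical", "numeric", "character")) does the same job with type names and
returns a data frame directly.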
> Thanks!
>
> -- 
> Robin Jeffries
> Dr.P.H. Candidate
> Department of Biostatistics
> UCLA School of Public Health
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901


