[R] Speeding reading of large file
David L Carlson
dcarlson at tamu.edu
Wed Nov 28 20:32:44 CET 2012
Using your first approach, this should be faster
raw <- readLines(con=filename)
dta <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)], header=FALSE)
----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Fisher Dennis
> Sent: Wednesday, November 28, 2012 11:43 AM
> To: r-help at r-project.org
> Subject: [R] Speeding reading of large file
>
> R 2.15.1
> OS X and Windows
>
> Colleagues,
>
> I have a file that looks that this:
> TABLE NO. 1
> PTID TIME AMT FORM PERIOD IPRED
> CWRES EVID CP PRED RES WRES
> 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00
> 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
> 0.0000E+00
> 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00
> 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 0.0000E+00
> 0.0000E+00
> 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 0.0000E+00
> 0.0000E+00
> 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 0.0000E+00
> 0.0000E+00
> 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 0.0000E+00
> 0.0000E+00
> 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 0.0000E+00
> 0.0000E+00
> 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 0.0000E+00
> 0.0000E+00
> 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 0.0000E+00
> 0.0000E+00
>
> The file is reasonably large (> 10^6 lines) and the two line header is
> repeated periodically in the file.
> I need to read this file in as a data frame. Note that the number of
> columns, the column headers, and the number of replicates of the
> headers are not known in advance.
>
> I have tried two approaches to this:
> First Approach:
> 1. readLines(FILENAME) to read in the file
> 2. use grep to find the repeat headers; strip out the
> repeat headers
> 3. write() the object to tempfile, read in that temporary
> file using read.table(tempfile, header=TRUE, skip=1) [an alternative is
> to use textConnection but that does not appear to speed things]
>
> Second Approach:
> 1. TEMP <- read.table(FILENAME, header=TRUE, skip=1,
> fill=TRUE, as.is=TRUE)
> 2. get rid of the errant entries with:
> TEMP[!is.na(as.numeric(TEMP[,1])),]
> 3. reading of the character entries forced all columns to
> character mode. Therefore, I convert each column to numeric:
> for (COL in 1:ncol(TEMP)) TEMP[,COL] <-
> as.numeric(TEMP[,COL])
> The second approach is ~ 20% faster than the first. With the second
> approach, the conversion to numeric occupies 50% of the elapsed time.
>
> Is there some approach that would be much faster? For example, would a
> vectorized approach to conversion to numeric improve throughput? Or,
> is there some means to ensure that all data are read as numeric (I
> tried to use colClasses but that triggered an error when the text
> string was encountered).
>
> ############################
> A dput version of the data is:
> c("TABLE NO. 1", " PTID TIME AMT FORM
> PERIOD IPRED CWRES EVID CP PRED
> RES WRES",
> " 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00
> 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00
> 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 0.0000E+00
> 0.0000E+00"
> )
>
> This can be assembled into a large dataset and written to a file named
> FILENAME with the following code:
> cat(c("TABLE NO. 1", " PTID TIME AMT FORM
> PERIOD IPRED CWRES EVID CP PRED
> RES WRES",
> " 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00
> 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00
> 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 0.0000E+00
> 0.0000E+00",
> " 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00
> 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 0.0000E+00
> 0.0000E+00"
> )[rep(1:10, 1000)], file="FILENAME", sep="\n")
>
>
> Dennis
>
>
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list