[R] Speeding reading of a large file
Rui Barradas
ruipbarradas at sapo.pt
Thu Dec 6 18:46:38 CET 2012
Hello,
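Yes, x[] forces x to keep its dimensions. In your original post you
asked "how does this become a data frame". It doesn't _become_ one, it
already _is_ one: assigning to x[] replaces the contents of x but leaves
its attributes (class, names, dimensions) alone, whereas assigning to
plain x replaces the whole object with the list that lapply returns.
The same goes for vectors, matrices and arrays: the dimensions stay
the same.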
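
A small sketch of the difference, assuming a made-up two-column data
frame (the names df1, df2 and m and their values are only for
illustration, they are not from the original file):

```r
# Hypothetical data frame whose columns were read in as character
df1 <- data.frame(a = c("1", "2"), b = c("3.5", "x"),
                  stringsAsFactors = FALSE)
df2 <- df1

# Plain assignment: df1 is replaced by the list that lapply returns
df1 <- suppressWarnings(lapply(df1, as.numeric))
class(df1)   # "list"

# Empty-bracket assignment: only the columns are replaced,
# so df2 is still a data frame with the same dimensions
df2[] <- suppressWarnings(lapply(df2, as.numeric))
class(df2)   # "data.frame"
dim(df2)     # 2 2

# The same idea with a matrix: m[] keeps the dim attribute
m <- matrix(1:6, nrow = 2)
m[] <- 0
dim(m)       # 2 3
```

The non-numeric entry "x" turns into NA under as.numeric, which is what
happens to the repeated header rows in the code quoted below, so they
can then be dropped with x[!is.na(x[,1]), ].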
Rui Barradas
On 06-12-2012 17:39, Juliet Hannah wrote:
> Thanks, it does help. Is it possible to elaborate on specifically
> why this syntax
> preserves dimensions? Is it correct to just say that even though
> lapply returns a list, x[] forces x to have the
> same dimensions?
>
> On Thu, Dec 6, 2012 at 11:53 AM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
>> Hello,
>>
>> Because x[] keeps the dimensions, unlike just x.
>>
>> Hope this helps,
>>
>> Rui Barradas
>> On 06-12-2012 16:24, Juliet Hannah wrote:
>>
>>> All,
>>>
>>> Can someone describe what
>>>
>>> x[] <- lapply(x, as.numeric)
>>>
>>> does? I see that it is putting the list elements into a data frame. The
>>> results of lapply are a list, so how does this become
>>> a data frame?
>>>
>>> Thanks,
>>>
>>> Juliet
>>>
>>>
>>> On Mon, Dec 3, 2012 at 5:49 PM, Fisher Dennis <fisher at plessthan.com>
>>> wrote:
>>>> Colleagues,
>>>>
>>>> This past week, I asked the following question:
>>>>
>>>> I have a file that looks like this:
>>>>
>>>> TABLE NO. 1
>>>> PTID TIME AMT FORM PERIOD
>>>> IPRED CWRES EVID CP PRED RES WRES
>>>> 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00
>>>> 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
>>>> 0.0000E+00
>>>> 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00
>>>> 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 0.0000E+00
>>>> 0.0000E+00
>>>> 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00
>>>> 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 0.0000E+00
>>>> 0.0000E+00
>>>> 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00
>>>> 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 0.0000E+00
>>>> 0.0000E+00
>>>> 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00
>>>> 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 0.0000E+00
>>>> 0.0000E+00
>>>> 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00
>>>> 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 0.0000E+00
>>>> 0.0000E+00
>>>> 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00
>>>> 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 0.0000E+00
>>>> 0.0000E+00
>>>> 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00
>>>> 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 0.0000E+00
>>>> 0.0000E+00
>>>>
>>>> The file is reasonably large (> 10^6 lines) and the two line
>>>> header is repeated periodically in the file.
>>>> I need to read this file in as a data frame. Note that the
>>>> number of columns, the column headers, and the number of replicates of the
>>>> headers are not known in advance.
>>>>
>>>> I received a number of replies, many of them quite useful. Of these, one
>>>> beat out all the others in my benchmarking using files ranging from 10^5 to
>>>> 10^6 lines.
>>>> That version, provided by Jim Holtman, was:
>>>> x <- read.table(FILE, as.is = TRUE, skip=1,
>>>> fill=TRUE, header = TRUE)
>>>> x[] <- lapply(x, as.numeric)
>>>> x <- x[!is.na(x[,1]), ]
>>>>
>>>> Other versions involved readLines, followed by edits, followed by cat
>>>> (or write) to a temp file, then read.table again.
>>>> The overhead of invoking readLines, write/cat, and read.table was
>>>> substantially larger than that of the read.table / as.numeric / indexing
>>>> strategy.
>>>> Thanks for the input from many folks.
>>>>
>>>> Dennis
>>>>
>>>> Dennis Fisher MD
>>>> P < (The "P Less Than" Company)
>>>> Phone: 1-866-PLessThan (1-866-753-7784)
>>>> Fax: 1-866-PLessThan (1-866-753-7784)
>>>> www.PLessThan.com
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>