[R] the quote problem with readLines()

jim holtman jholtman at gmail.com
Wed Mar 18 17:35:33 CET 2009


Check out this reference:

http://tolstoy.newcastle.edu.au/R/e2/help/07/02/9709.html
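
For illustration, here is a minimal sketch of the "process portions of the
file and combine" idea from the thread quoted below. It reads the file in
chunks with readLines(), converts each chunk with as.numeric(), and adds up
histogram counts, so the full vector is never held in memory. The name
chunked_hist, the chunk size, and the fixed break points are assumptions made
for this example; they do not come from the original posts.

```r
# Hypothetical helper: histogram counts for a one-value-per-line file,
# computed chunk by chunk. `breaks` must be chosen in advance.
chunked_hist <- function(path, breaks, chunk_size = 1e6) {
  con <- file(path, open = "r")
  on.exit(close(con))
  counts <- numeric(length(breaks) - 1)
  n <- 0
  total <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break      # end of file reached
    x <- as.numeric(lines)             # readLines() returns character strings
    # Bin the chunk; values outside `breaks` become NA and are not counted.
    bins <- cut(x, breaks = breaks, include.lowest = TRUE)
    counts <- counts + as.vector(table(bins))
    n <- n + length(x)
    total <- total + sum(x)
  }
  list(breaks = breaks, counts = counts, n = n, mean = total / n)
}

# Small demonstration on a temporary file (made-up values).
tmp <- tempfile()
writeLines(c("0.1", "0.5", "1.2", "1.9", "2.5"), tmp)
res <- chunked_hist(tmp, breaks = c(0, 1, 2, 3), chunk_size = 2)
print(res$counts)
```

Only one chunk of strings and numbers is alive at a time, so memory use is
governed by chunk_size rather than by the size of the file.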


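Sampling, the other option mentioned in the quoted reply, can also be done in
a single pass even though the file is unordered. The sketch below uses
reservoir sampling (Algorithm R), a technique chosen for this example rather
than one proposed in the thread; reservoir_sample and its arguments are
hypothetical names.

```r
# Hypothetical helper: draw a uniform random sample of k values from a
# one-value-per-line file without knowing the number of lines in advance.
reservoir_sample <- function(path, k, chunk_size = 1e6) {
  con <- file(path, open = "r")
  on.exit(close(con))
  sample_vals <- numeric(k)
  seen <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break      # end of file reached
    x <- as.numeric(lines)
    for (v in x) {
      seen <- seen + 1
      if (seen <= k) {
        sample_vals[seen] <- v         # fill the reservoir first
      } else {
        j <- sample.int(seen, 1)       # uniform draw from 1..seen
        if (j <= k) sample_vals[j] <- v
      }
    }
  }
  # If the file has fewer than k lines, return only what was read.
  sample_vals[seq_len(min(k, seen))]
}

# Small demonstration on a temporary file holding 1..1000.
tmp2 <- tempfile()
writeLines(as.character(1:1000), tmp2)
set.seed(1)
s <- reservoir_sample(tmp2, k = 50, chunk_size = 128)
```

The resulting vector is small enough to pass to hist(), density() or
ks.test(). The element-by-element loop is slow in R for a file of this size,
so treat this as a sketch of the idea, not a tuned implementation.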

On Wed, Mar 18, 2009 at 11:16 AM, Dongyan Song <yzhskdls at hotmail.com> wrote:
>
> Hi Jim,
>
> Thank you very much! I will try to sample them then.
>
> Best,
> Dongyan
>
>
> jholtman wrote:
>>
>> The amount of data that you want to read in (136M numbers) will
>> require about 1GB of memory (8 bytes per number for floating point -
>> truncation does not reduce this number of bytes).  So if you want to
>> read it all in, then find a 64-bit version of R and probably at least
>> 4GB of memory for your process.  A 32-bit version might have just
>> enough space if you can allocate all the 4GB of memory to that
>> process.
>>
>> So if you want to have it all in memory, invest in a larger computer.
>> If you want to run on the system you have, then you will probably have
>> to sample your data so that you can get a portion that will fit in
>> memory to run your test, or see if there is a way of processing
>> portions of the file and then combining for a final result.
>>
>> On Wed, Mar 18, 2009 at 9:58 AM, Dongyan Song <yzhskdls at hotmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> Thank you for your concern!
>>>
>>> The file has 136,047,472 lines, with one value on each line, and is 1.7G
>>> in size. I am running on Linux (openSUSE) with 4G of memory in total. The
>>> error message is Error: cannot allocate vector of size 2.0 Gb. And the
>>> worst thing is that even if I read all the data into R after truncating
>>> the numbers' precision, i.e. from 1.234567e+00 to 1.2, I cannot manipulate
>>> these numbers; for example, I cannot run ks.test, a histogram, or a kernel
>>> density estimator, which is what I want to do with them. After I enter the
>>> commands above, the computer also gives error messages like Error: cannot
>>> allocate vector of size 809.1 Mb. I can read half of the file, but I want
>>> to know the overall distribution of those numbers; the values in this file
>>> are not ordered, and it is not easy to pick some numbers at random or to
>>> sort them.
>>>
>>> Is this information enough? Thank you again!
>>>
>>> Best,
>>> Dongyan
>>>
>>>
>>>
>>> jholtman wrote:
>>>>
>>>> readLines is doing exactly what you are asking:
>>>>
>>>> Value
>>>> A character vector of length the number of lines read.
>>>>
>>>> You still have to convert the character strings to numeric.  Exactly
>>>> how large is "quite large"?  What system are you running on?  How much
>>>> memory do you have?  What is the error message that you are getting?
>>>> Exactly what does your file look like?  Have you tried reading in
>>>> portions of the file?  How big will it be if you could read it in?
>>>> Will it take up more than 25% of real memory?  There is still some
>>>> information you need to provide so an assessment can be made.
>>>>
>>>> On Tue, Mar 17, 2009 at 8:50 AM, Dongyan Song <yzhskdls at hotmail.com>
>>>> wrote:
>>>>>
>>>>> Dear all,
>>>>>
>>>>> I read a file containing only numbers with the readLines function, as
>>>>> below,
>>>>>> f <- file("data.txt")
>>>>>> a <- readLines(f)
>>>>> but all the values in a are in the format "....", and I cannot do
>>>>> calculations with them since they are not numeric. I wonder how I should
>>>>> remove those quotes; thank you for your help!
>>>>> I have to use the readLines function instead of scan, read.table or
>>>>> matrix, because the file is quite large, and the other functions cannot
>>>>> allocate enough memory to read the input file.
>>>>>
>>>>> Best,
>>>>> Dongyan
>>>>> --
>>>>> View this message in context:
>>>>> http://www.nabble.com/the-quote-problem-with-readLines%28%29-tp22558454p22558454.html
>>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jim Holtman
>>>> Cincinnati, OH
>>>> +1 513 646 9390
>>>>
>>>> What is the problem that you are trying to solve?
>>>>
>>>
>>>
>>> -----
>>> Dongyan Song, Msc
>>> Medical informatics, Uppsala University, Sweden
>
>
> -----
> Dongyan Song, Msc
> Medical informatics, Uppsala University, Sweden
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



