[R] Reading large files

Vadlamani, Satish {FLNA} SATISH.VADLAMANI at fritolay.com
Sun Feb 7 01:53:53 CET 2010


Gabor:
The new version did suppress the message, and I was able to load the data. One question:

1. test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl parse_3wkout.pl") 

In the statement above, should the filename given to file= be the same as the file that the perl script reads via the filter= command? I would think not; I would expect that when filter= is supplied, the file= name is ignored. Is this how it works?

Thanks.
Satish


-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com] 
Sent: Saturday, February 06, 2010 4:58 PM
To: Vadlamani, Satish {FLNA}
Cc: r-help at r-project.org
Subject: Re: [R] Reading large files

I have uploaded another version which suppresses display of the error
message but otherwise works the same.  Omitting the redundant
arguments we have:

library(sqldf)
# next line is only needed once per session to read in devel version
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")

test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl parse_3wkout.pl")


On Sat, Feb 6, 2010 at 5:48 PM, Vadlamani, Satish {FLNA}
<SATISH.VADLAMANI at fritolay.com> wrote:
> Gabor:
> Please see the results below. Sourcing your new R script worked (although the same error message still appeared). If I include the eol="\n" option, a "\r" is appended to the last column, so I took the eol option out below. This is just some more feedback for you.
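> The stray "\r" makes me suspect Windows CRLF line endings, so one thing
> I may try, assuming the file really is CRLF-terminated, is an explicit
> eol = "\r\n" instead of "\n":
>
> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat",
>                         filter="perl parse_3wkout.pl", eol="\r\n")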
>
> I am thinking that I will just do an in-place edit in Perl (that is, create the csv file through Perl by overwriting the current file) and then use read.csv.sql without the filter= option, as sketched below. That path seems more tried and tested. If you have any suggestions, please let me know. Thanks.
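> A minimal sketch of that two-step idea (hedged: the output name and the
> assumption that parse_3wkout.pl rewrites the data as csv are mine to
> illustrate, not tested code):
>
> # step 1: run the Perl reformatter from R; assume it writes "out.txt" as csv
> system("perl parse_3wkout.pl")
> # step 2: load the already-converted csv file without any filter= argument
> test_df <- read.csv.sql(file = "out.txt")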
> Satish
>
>
> BEFORE SOURCING YOUR NEW R SCRIPT
>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl")
> Error in readRegistry(key, maxdepth = 3) :
>  Registry key 'SOFTWARE\R-core' not found
>> test_df
> Error: object 'test_df' not found
>
> AFTER SOURCING YOUR NEW R SCRIPT
>> source("f:/dp_modeling_team/downloads/R/sqldf.R")
>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl")
> Error in readRegistry(key, maxdepth = 3) :
>  Registry key 'SOFTWARE\R-core' not found
> In addition: Warning messages:
> 1: closing unused connection 5 (3wkoutstatfcst_small.dat)
> 2: closing unused connection 4 (3wkoutstatfcst_small.dat)
> 3: closing unused connection 3 (3wkoutstatfcst_small.dat)
>> test_df
>   allgeo area1 zone dist ccust1 whse bindc ccust2 account area2 ccust3
> 1       A     4    1   37     99 4925  4925     99      99     4     99
> 2       A     4    1   37     99 4925  4925     99      99     4     99
>
> -----Original Message-----
> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
> Sent: Saturday, February 06, 2010 4:28 PM
> To: Vadlamani, Satish {FLNA}
> Cc: r-help at r-project.org
> Subject: Re: [R] Reading large files
>
> The software attempts to read the registry and temporarily augment the
> path in case you have Rtools installed, so that the filter can access
> all the tools that Rtools provides.  I am not sure why it's failing on
> your system, but there are evidently some differences between systems
> here, and I have added some code to trap and bypass that portion in
> case it fails.  I have added the new version to the svn repository, so
> try this:
>
> library(sqldf)
> # overwrite with development version
> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
> # your code to call read.csv.sql
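>
> The trap is along these lines (a sketch of the idea only, not the exact
> code now in sqldf.R):
>
> # if the registry lookup fails, fall back to NULL instead of stopping
> rtools <- tryCatch(readRegistry("SOFTWARE\\R-core", maxdepth = 3),
>                    error = function(e) NULL)
> if (!is.null(rtools)) {
>   # only here would the PATH be augmented with the Rtools directories
> }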
>
>
> On Sat, Feb 6, 2010 at 5:18 PM, Vadlamani, Satish {FLNA}
> <SATISH.VADLAMANI at fritolay.com> wrote:
>>
>> Gabor:
>> Here is the update. As you can see, I got the same error, shown in 1. below.
>>
>> 1. Error
>>  test_df <- read.csv.sql(file="out_small.txt", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", eol="\n")
>> Error in readRegistry(key, maxdepth = 3) :
>>  Registry key 'SOFTWARE\R-core' not found
>>
>> 2. But loading the bigger file (857 MB, 333,250 rows, 227 columns) was successful, as you can see below. This is good.
>>
>> I will just have to do an in-place edit in Perl, converting the file to csv from within R, and then call read.csv.sql.
>>
>> If you have any suggestions to fix 1, I would like to try them.
>>
>>  system.time(test_df <- read.csv.sql(file="out.txt"))
>>   user  system elapsed
>>  192.53   15.50  213.68
>> Warning message:
>> closing unused connection 3 (out.txt)
>>
>> Thanks again.
>>
>> Satish
>>
>> -----Original Message-----
>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>> Sent: Saturday, February 06, 2010 3:02 PM
>> To: Vadlamani, Satish {FLNA}
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Reading large files
>>
>> Note that you can shorten #1 to read.csv.sql("out.txt") since your
>> other arguments are the default values.
>>
>> For the second one, use read.csv.sql, eliminate the arguments that are
>> defaults anyway (they should not cause a problem, but they are error
>> prone) and add an explicit eol= argument, since SQLite can have problems
>> with end of line in some cases.  Also test your perl script separately
>> from R first to ensure that it works:
>>
>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl parse_3wkout.pl", eol = "\n")
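>>
>> One way to test it separately (a sketch; it assumes parse_3wkout.pl
>> reads the data file on stdin and writes csv on stdout):
>>
>> # run the filter outside of read.csv.sql and inspect the first few lines;
>> # shell() rather than system() so the cmd.exe redirections work on Windows
>> shell("perl parse_3wkout.pl < 3wkoutstatfcst_small.dat > check.csv")
>> head(readLines("check.csv"))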
>>
>> SQLite has some known problems with end of line, so try it with and
>> without the eol= argument just in case.  When I made up the gawk example
>> mentioned below, I noticed that I did need to specify the eol= argument.
>>
>> Also I have added a complete example using gawk as Example 13c on the
>> home page just now:
>> http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
>>
>>
>> On Sat, Feb 6, 2010 at 3:52 PM, Vadlamani, Satish {FLNA}
>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>> Gabor:
>>>
>>> I had success with the following.
>>> 1. I created a csv file called "out.txt" with a perl script, and then ran the following successfully:
>>> library("sqldf")
>>> test_df <- read.csv.sql(file="out.txt", sql = "select * from file", header = TRUE, sep = ",", dbname = tempfile())
>>>
>>> 2. I did not have success with the following. Could you tell me what I may be doing wrong? I can paste the perl script if necessary. In the perl script, I read the file, create each csv record, print the records one by one, and then exit.
>>>
>>> Thanks.
>>>
>>> I have not had success with the call below:
>>> #test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>> test_df
>>>
>>> Error message below:
>>> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>> Error in readRegistry(key, maxdepth = 3) :
>>>  Registry key 'SOFTWARE\R-core' not found
>>> In addition: Warning messages:
>>> 1: closing unused connection 14 (3wkoutstatfcst_small.dat)
>>> 2: closing unused connection 13 (3wkoutstatfcst_small.dat)
>>> 3: closing unused connection 11 (3wkoutstatfcst_small.dat)
>>> 4: closing unused connection 9 (3wkoutstatfcst_small.dat)
>>> 5: closing unused connection 3 (3wkoutstatfcst_small.dat)
>>>> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>> Error in readRegistry(key, maxdepth = 3) :
>>>  Registry key 'SOFTWARE\R-core' not found
>>>
>>> -----Original Message-----
>>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>>> Sent: Saturday, February 06, 2010 12:14 PM
>>> To: Vadlamani, Satish {FLNA}
>>> Cc: r-help at r-project.org
>>> Subject: Re: [R] Reading large files
>>>
>>> No.
>>>
>>> On Sat, Feb 6, 2010 at 1:01 PM, Vadlamani, Satish {FLNA}
>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>> Gabor:
>>>> Can I pass colClasses as a vector to read.csv.sql? Thanks.
>>>> Satish
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>>>> Sent: Saturday, February 06, 2010 9:41 AM
>>>> To: Vadlamani, Satish {FLNA}
>>>> Cc: r-help at r-project.org
>>>> Subject: Re: [R] Reading large files
>>>>
>>>> It's just any Windows batch command string that filters stdin to
>>>> stdout.  What the command consists of should not be important.  An
>>>> invocation of perl that runs a perl script that filters stdin to
>>>> stdout might look like this:
>>>>  read.csv.sql("myfile.dat", filter = "perl myprog.pl")
>>>>
>>>> For an actual example see the source of read.csv2.sql which defaults
>>>> to using a Windows vbscript program as a filter.
>>>>
>>>> On Sat, Feb 6, 2010 at 10:16 AM, Vadlamani, Satish {FLNA}
>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>> Jim, Gabor:
>>>>> Thanks so much for the suggestions that I can use read.csv.sql with an embedded Perl (or gawk) filter. I just want to mention that I am running on Windows. I am going to read the documentation of the filter argument and see whether it can take a decent-sized Perl script and then use its output as input.
>>>>>
>>>>> Suppose that I write a Perl script that parses this fwf file and creates a CSV file. Can I embed this within the read.csv.sql call? Or can it only be a single statement? If you know the answer, please let me know. Otherwise, I will try a few things and report back the results.
>>>>>
>>>>> Thanks again.
>>>>> Satish
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: jim holtman [mailto:jholtman at gmail.com]
>>>>> Sent: Saturday, February 06, 2010 6:16 AM
>>>>> To: Gabor Grothendieck
>>>>> Cc: Vadlamani, Satish {FLNA}; r-help at r-project.org
>>>>> Subject: Re: [R] Reading large files
>>>>>
>>>>> In perl the 'unpack' command makes it very easy to parse fixed fielded data.
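>>>>>
>>>>> For example (a made-up sketch; the A5 A3 A10 field widths are
>>>>> placeholders, not the real record layout), unpack can drive the whole
>>>>> filter from a one-liner:
>>>>>
>>>>> # hypothetical: split each fixed-width line into 3 fields and emit csv
>>>>> test_df <- read.csv.sql("myfile.dat",
>>>>>   filter = "perl -lne \"print join(',', unpack('A5 A3 A10', $_))\"")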
>>>>>
>>>>> On Fri, Feb 5, 2010 at 9:09 PM, Gabor Grothendieck
>>>>> <ggrothendieck at gmail.com> wrote:
>>>>>> Note that the filter= argument on read.csv.sql can be used to pass the
>>>>>> input through a filter written in perl, [g]awk or other language.
>>>>>> For example: read.csv.sql(..., filter = "gawk -f myfilter.awk")
>>>>>>
>>>>>> gawk has the FIELDWIDTHS variable for automatically parsing fixed
>>>>>> width fields, e.g.
>>>>>> http://www.delorie.com/gnu/docs/gawk/gawk_44.html
>>>>>> making this very easy but perl or whatever you are most used to would
>>>>>> be fine too.
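>>>>>>
>>>>>> A sketch of that (again with made-up widths; myfilter.awk is
>>>>>> hypothetical and would contain the two lines shown in the comments):
>>>>>>
>>>>>> # myfilter.awk:
>>>>>> #   BEGIN { FIELDWIDTHS = "5 3 10"; OFS = "," }
>>>>>> #   { $1 = $1; print }    # reassigning $1 rebuilds the record with OFS
>>>>>> test_df <- read.csv.sql("myfile.dat", filter = "gawk -f myfilter.awk",
>>>>>>                         eol = "\n")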
>>>>>>
>>>>>> On Fri, Feb 5, 2010 at 8:50 PM, Vadlamani, Satish {FLNA}
>>>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>>>> Hi Gabor:
>>>>>>> Thanks. My files are all in fixed-width format, and there are a lot of them. It would take me some effort to convert them to CSV. I guess this cannot be avoided? I can write some Perl scripts to convert the fixed-width format to CSV and then start with your suggestion. Could you let me know your thoughts on this approach?
>>>>>>> Satish
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>>>>>>> Sent: Friday, February 05, 2010 5:16 PM
>>>>>>> To: Vadlamani, Satish {FLNA}
>>>>>>> Cc: r-help at r-project.org
>>>>>>> Subject: Re: [R] Reading large files
>>>>>>>
>>>>>>> If your problem is just how long it takes to load the file into R try
>>>>>>> read.csv.sql in the sqldf package.  A single read.csv.sql call can
>>>>>>> create an SQLite database and table layout for you, read the file into
>>>>>>> the database (without going through R so R can't slow this down),
>>>>>>> extract all or a portion into R based on the sql argument you give it
>>>>>>> and then remove the database.  See the examples on the home page:
>>>>>>> http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
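>>>>>>>
>>>>>>> For example, something along these lines (a sketch; the file name and
>>>>>>> the where clause are invented for illustration):
>>>>>>>
>>>>>>> library(sqldf)
>>>>>>> # only the selected rows ever reach R; the full file goes straight
>>>>>>> # into the temporary SQLite database and is dropped afterwards
>>>>>>> test_df <- read.csv.sql("bigfile.csv",
>>>>>>>                         sql = "select * from file where zone = 1")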
>>>>>>>
>>>>>>> On Fri, Feb 5, 2010 at 2:11 PM, Satish Vadlamani
>>>>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>>>>>
>>>>>>>> Matthew:
>>>>>>>> If it is going to help, here is the explanation. I have an end state in
>>>>>>>> mind; it is given below under the "End State" header. In order to get
>>>>>>>> there, I need to start somewhere, right? I started with an 850 MB file and
>>>>>>>> could not load it in what I think is a reasonable time (I waited for an hour).
>>>>>>>>
>>>>>>>> There are references to 64 bit. How will that help? It is a machine with
>>>>>>>> 4GB of RAM, and there is no paging activity when loading the 850 MB file.
>>>>>>>>
>>>>>>>> I have seen other threads on the same types of questions. I did not see any
>>>>>>>> clear-cut answers, or errors that I could have been making in the process. If
>>>>>>>> I am missing something, please let me know. Thanks.
>>>>>>>> Satish
>>>>>>>>
>>>>>>>>
>>>>>>>> End State
>>>>>>>>> Satish wrote: "at one time I will need to load say 15GB into R"
>>>>>>>>
>>>>>>>>
>>>>>>>> -----
>>>>>>>> Satish Vadlamani
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jim Holtman
>>>>> Cincinnati, OH
>>>>> +1 513 646 9390
>>>>>
>>>>> What is the problem that you are trying to solve?
>>>>>
>>>>
>>>
>>
>


