[R] exporting tables from an access database using parallel foreach
Vivek Sutradhara
viveksutra at gmail.com
Sat Nov 21 18:55:13 CET 2015
Hi John and Jeff,
Thanks a lot for your help. I agree that row numbers are not a standard
feature in SQL. What I am looking for is some kind of a hack. After all,
the sqlFetch command is able to return a specific number of rows, and the
sqlFetchMore command is able to take up the baton from that row onwards to
further return rows corresponding to the max parameter.
I wonder if it is possible to hop straight to a certain row number
(without going through sqlFetch and sqlFetchMore, and without loading any
data into memory) and then return the contents of a certain number of
rows. The question is: is there a "catch" for accessing a row location,
and what could be the "hook" for that? I am also interested in the rows
of a table updated after a certain date. Is it possible to identify them
in a quick way? Running SQL queries on such large tables appears to take
too long.
I understand that there is no provision to do this by available methods.
But is it possible to get under the hood and find some hack?
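For what it is worth, one "hook" that does exist is a plain SQL WHERE clause sent through sqlQuery(): Jet can use an index to jump to a key range or a date range without fetching the preceding rows. This is only a sketch, and it assumes the table has an indexed autonumber column `ID` and a date column `LastUpdated` (both column names are assumptions about the schema):

```r
library(RODBC)
ch <- odbcConnect("sample")  # DSN name taken from the code further below

# Jump to an arbitrary key range, assuming an indexed autonumber column `ID`.
# Key ranges only coincide with row numbers if the IDs are gap-free, but a
# key range is usually what actually matters.
db <- sqlQuery(ch,
  "SELECT * FROM Table1 WHERE ID > 8000000 AND ID <= 9000000",
  as.is = TRUE)

# Rows updated after a given date, assuming a `LastUpdated` date column.
# Access/Jet delimits date literals with # signs.
recent <- sqlQuery(ch,
  "SELECT * FROM Table1 WHERE LastUpdated > #2015-01-01#",
  as.is = TRUE)

odbcClose(ch)
```

Without an index on the column in the WHERE clause, Jet still has to scan the table, so whether this is fast depends entirely on the indexes.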
Jeff, I will take your suggestion and try my luck at the R-sig-db mailing
list.
Thanks,
Vivek
2015-11-20 20:25 GMT+01:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
> Row numbers are not a standard feature in SQL, and as far as I know the
> Access Jet engine does not support them. You are supposed to use the key
> columns to partition your data, but that may require knowing how many
> records fall within convenient bin sizes if the data are not uniformly
> distributed. You can find that out using SQL group by queries.
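[The GROUP BY binning described above can be sketched like this; the key column name `ID` and the bin width of 1e6 are assumptions about the schema:]

```r
library(RODBC)
ch <- odbcConnect("sample")

# Count records per bin of 1e6 key values to see how evenly they fall;
# Int() is Jet's integer-truncation function
bins <- sqlQuery(ch,
  "SELECT Int(ID/1000000) AS bin, COUNT(*) AS n
     FROM Table1
    GROUP BY Int(ID/1000000)
    ORDER BY Int(ID/1000000)")

odbcClose(ch)
```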
>
> Note that the resource you appear to be limited by is the database
> engine. Parallel processing (more CPUs) is unlikely to yield any
> improvement, and is in fact likely to slow you down.
>
> This looks like a good topic for the R-sig-db mailing list if you have
> further questions about R and databases; or find an SQL support forum if
> you need to learn more about using SQL in general.
>
> On November 20, 2015 10:32:31 AM PST, Vivek Sutradhara <
> viveksutra at gmail.com> wrote:
>
>> Hi John,
>> Thanks a lot for your quick reply, and thanks for drawing my attention to
>> the openxlsx package. I will certainly look into it when I work with Excel.
>> But right now, my problems are with Microsoft Access.
>>
>> There are huge tables there which I am not able to export to Excel, csv or
>> text files with native Access methods. The only solution that has worked so
>> far is to incrementally extract data with the help of RODBC. This was a
>> huge leap in my attempts to export the tables. Once I have the data in the
>> form of rds files (which are compressed as well), I have found that it is
>> much easier to work with them.
>>
>> But my wishes have suddenly expanded, and I want to find out if it is
>> possible to go beyond the normal capabilities of RODBC (the sqlFetch
>> command does not have a provision for specifying the row-number range). I
>> am a newbie with parallel methods (using the 4 cores on my PC), but I am
>> hoping to progress with that for processing the data from the multiple
>> chunks (the first step will be just to filter and gather the data of
>> relevance).
>>
>> I hope that I have explained what I am looking for.
>> Thanks,
>> Vivek
>>
>> 2015-11-20 19:09 GMT+01:00 John McKown <john.archie.mckown at gmail.com>:
>>
>>> A possibility could be not to use ODBC, but the CRAN package openxlsx (
>>> https://cran.revolutionanalytics.com/web/packages/openxlsx/index.html ).
>>> Then use the read.xlsx() function.
>>> <quote>
>>> Description: Read data from an Excel file or Workbook object into a
>>> data.frame.
>>>
>>> Usage:
>>>   read.xlsx(xlsxFile, sheet = 1, startRow = 1, colNames = TRUE,
>>>             rowNames = FALSE, detectDates = FALSE, skipEmptyRows = TRUE,
>>>             rows = NULL, cols = NULL, check.names = FALSE,
>>>             namedRegion = NULL)
>>>
>>> Arguments:
>>>   xlsxFile      An xlsx file or Workbook object.
>>>   sheet         The name or index of the sheet to read data from.
>>>   startRow      First row to begin looking for data. Empty rows at the
>>>                 top of a file are always skipped, regardless of the value
>>>                 of startRow.
>>>   colNames      If TRUE, the first row of data will be used as column
>>>                 names.
>>>   rowNames      If TRUE, the first column of data will be used as row
>>>                 names.
>>>   detectDates   If TRUE, attempt to recognise dates and perform
>>>                 conversion.
>>>   skipEmptyRows If TRUE, empty rows are skipped; else empty rows after
>>>                 the first row containing data will return a row of NAs.
>>>   rows          A numeric vector specifying which rows in the Excel file
>>>                 to read. If NULL, all rows are read.
>>>   cols          A numeric vector specifying which columns in the Excel
>>>                 file to read. If NULL, all columns are read.
>>>   check.names   Logical. If TRUE, the names of the variables in the data
>>>                 frame are checked to ensure that they are syntactically
>>>                 valid variable names.
>>>   namedRegion   A named region in the Workbook. If not NULL, the
>>>                 startRow, rows and cols parameters are ignored.
>>> </quote>
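[A minimal round trip with openxlsx, sketching how the rows argument limits what is read. The file and data are invented; note that rows counts spreadsheet rows, so the header row is included alongside the data rows when colNames = TRUE:]

```r
library(openxlsx)

# Write a small workbook to a temporary file
df <- data.frame(x = 1:10, y = letters[1:10])
f <- tempfile(fileext = ".xlsx")
write.xlsx(df, f)

# Read back only the header row (spreadsheet row 1) plus rows 4-6,
# i.e. data rows 3-5
part <- read.xlsx(f, rows = c(1, 4:6), colNames = TRUE)
```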
>>>
>>> On Fri, Nov 20, 2015 at 11:38 AM, Vivek Sutradhara <viveksutra at gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>> I want to extract data from a Microsoft Access database having many tables
>>>> with more than 1e7 rows. I find that the following code works to export a
>>>> table to an rds file:
>>>> #####################
>>>> setwd('C:/sFolder')
>>>> library(RODBC)
>>>> ch <- odbcConnect("sample")
>>>>
>>>> # Total number of rows in the table is not known
>>>> rowN <- 1e6  # chunk size
>>>> db <- sqlFetch(ch, "Table1", max=rowN, as.is=TRUE)
>>>>
>>>> file <- paste0('Table1', 1, '.rds')
>>>> saveRDS(db, file)
>>>>
>>>> rm(db); gc()  # garbage collection to free up the memory
>>>>
>>>> # Successively obtain more chunks from the Access database
>>>> for (i in 2:10) {
>>>>   df <- sqlFetchMore(ch, "Table1", max=rowN, as.is=TRUE)
>>>>   file <- paste0('Table1', i, '.rds')
>>>>   saveRDS(df, file)
>>>>   if (nrow(df) < rowN)
>>>>     break
>>>>   rm(df); gc()
>>>> }
>>>> odbcCloseAll()
>>>> ##############################
>>>>
>>>> I would like to know the following:
>>>> 1. Is there any way to extract data from a table by just specifying the
>>>> row-number range? I have extracted data before. Instead of repeating the
>>>> operations, I would just like to obtain data from, let's say, the 8e6 to
>>>> 9e6 row range. I cannot do this now; I have to use the sqlFetchMore
>>>> command successively. I would like to know if it is possible to go
>>>> straight to the 8e6 to 9e6 row range.
>>>> 2. Is it possible to use the foreach package in the extraction step (in
>>>> place of the for loop above)? I am planning to use the foreach command
>>>> in parallel later for processing the data in the multiple files. I just
>>>> wonder if it is possible to do parallel processing for the data
>>>> extraction also.
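[On question 2: the extraction itself runs serially through one ODBC connection, but processing the saved .rds chunks afterwards parallelises naturally. A sketch with foreach/doParallel; the file names follow the paste0() pattern above, and the filter column is a placeholder:]

```r
library(foreach)
library(doParallel)

cl <- makeCluster(4)   # the 4 cores mentioned above
registerDoParallel(cl)

# Chunk files written by the extraction loop above
files <- paste0('Table1', 1:10, '.rds')
files <- files[file.exists(files)]

# Read each chunk, keep only the rows of interest, and bind the results
result <- foreach(f = files, .combine = rbind) %dopar% {
  df <- readRDS(f)
  df[df$someColumn > 0, , drop = FALSE]  # placeholder filter
}

stopCluster(cl)
```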
>>>> Thanks,
>>>> Vivek Sutradhara
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> --
>>>
>>> Schrodinger's backup: The condition of any backup is unknown until
>>> a
>>> restore is attempted.
>>>
>>> Yoda of Borg, we are. Futile, resistance is, yes. Assimilated, you will be.
>>>
>>> He's about as useful as a wax frying pan.
>>>
>>> 10 to the 12th power microphones = 1 Megaphone
>>>
>>> Maranatha! <><
>>> John McKown
>>
>>
>>
>>
>>
>>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>