[Bioc-devel] BiocParallel and AnnotationDbi: database disk image is malformed

Martin Morgan martin.morgan at roswellpark.org
Fri Jan 19 22:10:57 CET 2018


On 01/19/2018 02:24 PM, Ludwig Geistlinger wrote:
> I apologize if I haven't been specific enough - however, I am also having trouble to reliably reproduce the error.
> It does not seem to be exclusively related to the combination of AnnotationDbi and parallel computation, but also with some other packages I load.
> 
> While still trying to produce a minimal reproducible example,  here is what returns the error quite reliably on an 8-core Linux machine:
>   
> # loading some dependencies of my package
> library(org.Hs.eg.db)
> library(pathview)
> library(graph)
> library(BiocParallel)
> 
> # annotation packages for the datasets in the compendium
> pkgs <- rep(c("hgu133plus2.db","hgu133a.db"), 42)
> 
> #
> getSymbols <- function ( anno.pkg )
> {
>      require(anno.pkg, character.only=TRUE)
>      anno.pkg <- get(anno.pkg)
>      syms <- AnnotationDbi::mapIds(anno.pkg, keys=keys(anno.pkg),
>                                      keytype="PROBEID", column="ENTREZID")
>      return(syms)
> }
> 
>> x <- bplapply( pkgs , getSymbols )       ### sometimes I have to run this 2 or 3 times in a row to produce this error
> Loading required package: hgu133plus2.db
> 
> Error: BiocParallel errors
>    element index: 29, 30, 31, 32, 33, 34, ...
>    first error: database disk image is malformed

My guess is that the database is being accessed by multiple processes 
simultaneously and, even though the data bases are opened read-only, 
this causes a corruption in the access of some sort. You can avoid 
multiple processes accessing the database at the same time by using a 'lock'

getSymbols <- function ( anno.pkg, id )
{
     nmspc <- loadNamespace(anno.pkg)
     anno.pkg <- get(anno.pkg, nmspc)

     BiocParallel::ipclock(id)
     syms <- suppressMessages({
         AnnotationDbi::mapIds(
             anno.pkg, keys=keys(anno.pkg), keytype="PROBEID",
             column="ENTREZID"
         )
     })
     BiocParallel::ipcunlock(id)

     length(syms)
}

x <- bplapply(pkgs , getSymbols, ipcid())

There are two additional considerations here.

The first is that one wants to worry about the amount of data transfered 
between worker and manager compared to the amount of time spent in 
computation. So in your previous formulation you sent back all the 
symbols -- this will be relatively expensive compared to the amount of 
work done in the function (reading the ids from the database), and you 
would rather do more work and transmit less (both to and from the 
worker) in each call to getSymbol().

The second is similar, but from the lock perspective -- since the lock 
imposes essential serial evaluation through that portion of the code, 
you'd like the locked portion of the worker's task to be just a small 
portion of the total work done by the worker.

I guess a more clever use of locks would be one per data base (generate 
two ipcid()'s in the manager, and pass these to the worker in such a way 
that the worker uses the same lock for each database.

Martin

> 
> 
>> sessionInfo()
> R version 3.4.1 (2017-06-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: SUSE Linux Enterprise Desktop 12 SP3
> 
> Matrix products: default
> BLAS: /mnt/raidbio/biosoft/software/R/R-3.4.1/lib/libRblas.so
> LAPACK: /mnt/raidbio/biosoft/software/R/R-3.4.1/lib/libRlapack.so
> 
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] parallel  stats4    stats     graphics  grDevices utils     datasets
> [8] methods   base
> 
> other attached packages:
> [1] BiocParallel_1.12.0  graph_1.56.0         pathview_1.18.0
> [4] org.Hs.eg.db_3.5.0   AnnotationDbi_1.40.0 IRanges_2.12.0
> [7] S4Vectors_0.16.0     Biobase_2.38.0       BiocGenerics_0.24.0
> 
> loaded via a namespace (and not attached):
>   [1] Rcpp_0.12.14      KEGGgraph_1.38.1  XVector_0.18.0    zlibbioc_1.24.0
>   [5] bit_1.1-12        R6_2.2.2          rlang_0.1.6       blob_1.1.0
>   [9] httr_1.3.1        tools_3.4.1       grid_3.4.1        png_0.1-7
> [13] DBI_0.7           bit64_0.9-7       digest_0.6.13     tibble_1.4.1
> [17] Rgraphviz_2.22.0  KEGGREST_1.18.0   memoise_1.1.0     RSQLite_2.0
> [21] compiler_3.4.1    pillar_1.1.0      Biostrings_2.46.0 XML_3.98-1.9
> [25] pkgconfig_2.0.1
> 
> 
> --
> Dr. Ludwig Geistlinger
> CUNY School of Public Health
> 
> ________________________________________
> From: Bioc-devel <bioc-devel-bounces at r-project.org> on behalf of Martin Morgan <martin.morgan at roswellpark.org>
> Sent: Friday, January 19, 2018 1:54 PM
> To: Gabe Becker; Vincent Carey
> Cc: bioc-devel at r-project.org
> Subject: Re: [Bioc-devel] BiocParallel and AnnotationDbi: database disk image is malformed
> 
> On 01/19/2018 12:37 PM, Gabe Becker wrote:
>> IT seems like you could also force a copy of the reference object via
>> <dbobject>$copy() and then force a refresh of the conn slot by assigning a
>> new db connection into it.
>>
>> I'm having trouble confirming that this would work, however, because I
>> actually can't reproduce the error. The naive way works for me on my mac
>> laptop (which is running an old R and Bioconductor) and on the linux
>> cluster I have access to (running Bioc 3.6):
>>
>>
>> (cluster)
>>
>>> getSymbol <- function ( x ) {
>>
>> + return( AnnotationDbi::mget( x , hgu95av2SYMBOL ) )
>>
>> + }
> 
> pass the data base connection to the function
> 
> getSymbol <- function ( x, db )
>       ## olde schoole
>       AnnotationDbi::mget(x, db)
>       ## AnnotationDbi::mapIds(db, x, "SYMBOL", "PROBEID")
> 
> and arrange for the general case, i.e., distinct processes with data
> serialized between them
> 
>   > cl = parallel::makePSOCKcluster(2)
>   > parLapply(cl, x, getSymbol, hgu95av2SYMBOL)
> Error in checkForRemoteErrors(val) :
>     2 nodes produced errors; first error: external pointer is not valid
> 
> (getSymbol would fail as originally written in the serial case, since
> the workers would not have access to hgu95av2SYMBOL
> 
> The workaround is to open the connection on the node, e.g.,
> 
> getSymbol <- function ( x, dbname ) {
>       nmspc <- loadNamespace(dbname)
>       db <- get(dbname, nmspc)
>       AnnotationDbi::mapIds(db, x, "SYMBOL", "PROBEID")
> }
> 
> lapply(x, getSymbol, "hgu95av2.db")
> bplapply(x, getSymbol, "hgu95av2.db")
> bplapply(x, getSymbol, "hgu95av2.db", BPPARAM = SnowParam())
> 
> Martin
> 
>>>
>>
>>> x <- list( "36090_at" , "38785_at" )
>>
>>>
>>
>>> mclapply( x , getSymbol )
>>
>> [[1]]
>>
>> [[1]]$`36090_at`
>>
>> [1] "TBL2"
>>
>>
>>
>> [[2]]
>>
>> [[2]]$`38785_at`
>>
>> [1] "MUC1"
>>
>>
>>
>>>
>>
>>> sessionInfo()
>>
>> R version 3.4.3 (2017-11-30)
>>
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)
>>
>>
>> Matrix products: default
>>
>> BLAS:
>> /gnet/is2/p01/apps/R/3.4.3-20171201-current/x86_64-linux-2.6-rhel6/lib64/R/lib/libRblas.so
>>
>> LAPACK:
>> /gnet/is2/p01/apps/R/3.4.3-20171201-current/x86_64-linux-2.6-rhel6/lib64/R/lib/libRlapack.so
>>
>>
>> locale:
>>
>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>
>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>
>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>
>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>
>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>>
>> attached base packages:
>>
>> [1] stats4    parallel  stats     graphics  grDevices utils     datasets
>>
>> [8] methods   base
>>
>>
>> other attached packages:
>>
>> [1] hgu95av2.db_3.2.3    org.Hs.eg.db_3.5.0   AnnotationDbi_1.40.0
>>
>> [4] IRanges_2.12.0       S4Vectors_0.16.0     Biobase_2.38.0
>>
>> [7] BiocGenerics_0.24.0
>>
>>
>> loaded via a namespace (and not attached):
>>
>>    [1] Rcpp_0.12.14    digest_0.6.14   DBI_0.7         RSQLite_2.0
>>
>>    [5] pillar_1.1.0    rlang_0.1.6     blob_1.1.0      bit64_0.9-8
>>
>>    [9] bit_1.1-13      compiler_3.4.3  pkgconfig_2.0.1 memoise_1.1.0
>>
>> [13] tibble_1.4.1
>>
>>>
>>
>>
>> ~G
>>
>> On Fri, Jan 19, 2018 at 9:23 AM, Vincent Carey <stvjc at channing.harvard.edu>
>> wrote:
>>
>>> good question
>>>
>>> some of the discussion on
>>>
>>> http://sqlite.1065341.n5.nabble.com/Parallel-access-to-
>>> read-only-in-memory-database-td91814.html
>>>
>>> seems relevant.
>>>
>>> converting the relatively small annotation package content to pure R
>>> read-only tables on the master before parallelizing
>>> might be very simple?
>>>
>>> On Fri, Jan 19, 2018 at 11:43 AM, Ludwig Geistlinger <
>>> Ludwig.Geistlinger at sph.cuny.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> Within a package I am developing, I would like to enable parallel probe
>>> to
>>>> gene mapping for a compendium of microarray datasets.
>>>>
>>>> This accordingly makes use of annotation packages such as hgu133a.db,
>>>> which in turn connect to the SQLite database via AnnotationDbi.
>>>>
>>>> When running in multi-core mode (i.e. using a MulticoreParam with
>>>> BiocParallel) using more than 2 cores, this causes the error:
>>>>
>>>> database disk image is malformed
>>>>
>>>>
>>>> In a very similar problem:
>>>>
>>>> https://support.bioconductor.org/p/38541/
>>>>
>>>> Adi Tarca and Dan Tenenbaum identified and resolved this problem by
>>>> ensuring that each process has its own unique database connection, i.e.
>>>> AnnotationDbi is not loaded before sending the job to the workers.
>>>>
>>>> This solution was easily realized as this analysis was carried out within
>>>> a script and not a package.
>>>>
>>>> However, within my package, AnnotationDbi is loaded as a dependency of my
>>>> package's imports.
>>>>
>>>> How to resolve this here?
>>>> I am not sure whether I perfectly understand the underlying mechanisms,
>>>> but is there a way to make my workers load their own version of
>>>> AnnotationDbi instead of using the one of the parent process?
>>>> Or am I supposed to unload all packages depending on AnnotationDbi, and
>>>> AnnotationDbi itself, before sending the job to the workers (and reload
>>> all
>>>> of them after the job has finished?)
>>>>
>>>> Thanks a lot,
>>>> Ludwig
>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Ludwig Geistlinger
>>>> CUNY School of Public Health
>>>>
>>>>           [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>
>>>           [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>
>>
> 
> 
> This email message may contain legally privileged and/or...{{dropped:2}}
> 
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> 


This email message may contain legally privileged and/or...{{dropped:2}}



More information about the Bioc-devel mailing list