[Bioc-devel] is it normal for makeDBPackage to take a VERY long time?

Sat Jan 15 02:50:53 CET 2011

Hi Tim,

FWIW also make sure you use the latest version of RSQLite (currently
0.9-4). The SQLite engine shipped with RSQLite was updated a couple
of months ago and seems to be significantly faster than the previous
versions in some situations.
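
A quick way to check what you have installed is something like the
sketch below. The helper name needs_update is made up for illustration
and is not part of RSQLite; the 0.9-4 threshold is simply the version
mentioned above.

```r
# Hypothetical helper (not part of RSQLite): TRUE if 'installed' is older
# than 'minimum'. 0.9-4 is the version mentioned in this thread.
needs_update <- function(installed, minimum = "0.9-4") {
  package_version(installed) < package_version(minimum)
}

# Look up the installed RSQLite version, if the package is present at all.
pkgs <- installed.packages()
if ("RSQLite" %in% rownames(pkgs)) {
  v <- pkgs["RSQLite", "Version"]
  if (needs_update(v)) {
    message("RSQLite ", v, " is older than 0.9-4; consider updating it")
  } else {
    message("RSQLite ", v, " looks recent enough")
  }
} else {
  message("RSQLite is not installed")
}
```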

Cheers,
H.

On 01/13/2011 12:20 PM, Marc Carlson wrote:
> Hi Tim,
>
> It certainly is doing something.  There are a couple of different things
> that can cause slowness with this.  The first is that SQLForge is doing a
> lot of busy work so that the packages it produces can "know" how many
> mappings are expected from each (there is a table called map_counts that
> requires this information).  The second problem is that you are mapping
> over an order of magnitude more probes than we usually would ever need
> to do.  In the past, performance has not been a big problem, in part
> because you only need to do this step once and in part because normally
> people only want to map a few tens of thousands of probes.  But you seem
> to be pushing the envelope pretty hard and the wait time has become
> pretty extreme as a result.  My guess is that I need to add some
> temporary indices into the initial mapping process to speed up this step
> when there are a lot of probes.  I will take a look and see what I can do.
>
>    Marc
>
>
>
> On 01/13/2011 09:45 AM, Tim Triche, Jr. wrote:
>> Hi Sean, (and others)
>>
>> Does it usually take an obscenely long time for makeDBPackage to run when
>> given a bunch of refseq IDs?  I ran the following:
>>
>>
>> > makeDBPackage("HUMANCHIP_DB", affy=FALSE,
>> +     prefix="IlluminaHumanMethylation450k", fileName="acc450k.txt",
>> +     baseMapType='refseq', version='1.0.0', manufacturer='Illumina',
>> +     chipName='Human Methylation 450k',
>> +     manufacturerUrl='http://illumina.com/')
>> baseMapType is refseq # time passes...
>>
>> and it's doing SOMETHING, because the table temp_probe_map in the sqlite
>> file it created is filling up.  But it's been grinding away at this for the
>> past 12 hours, which seems a bit excessive for mapping 806,334 refseq
>> accessions.  Really, all I want is for my bimap objects to work as expected;
>> the annotation integration is just gravy.
>>
>> The idea was to release updated 27k and completed 450k packages to handle
>> the IDAT mappings, immediately followed by a methylumIDAT release, and start
>> merging that and my preprocessing stuff into the [methy]lumi toolchain.  The
>> 450k probes are split between two designs, so I kind of had to bite the
>> bullet and roll my own schema to do the mappings efficiently, plus between
>> 450k and FFPE samples there has been a LOT of weirdness lately that I'd like
>> not to depend on Illumina's software to handle.  So...
>>
>> I started out doing this on my laptop (gave up after a few hours and moved
>> it to the server):
>>
>>
>> > sessionInfo()
>> R version 2.13.0 Under development (unstable) (2010-12-21 r53879)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] grid      stats     graphics  grDevices datasets  utils     methods
>> [8] base
>>
>> other attached packages:
>>   [1] human.db0_2.4.1                      ChIPpeakAnno_1.7.0
>>   [3] limma_3.5.20                         GO.db_2.4.1
>>   [5] BSgenome.Ecoli.NCBI.20080805_1.3.16  BSgenome_1.19.2
>>   [7] GenomicRanges_1.3.7                  Biostrings_2.19.2
>>   [9] IRanges_1.9.17                       multtest_2.6.0
>> [11] biomaRt_2.7.1                        groupedPMA_0.2
>> [13] betareg_2.2-3                        Formula_1.0-0
>> [15] PMA_1.0.7                            huge_0.9
>> [17] MASS_7.3-9                           igraph_0.5.5-1
>> [19] glasso_1.4                           glmnet_1.5.1
>> [21] Matrix_0.999375-46                   lattice_0.19-17
>> [23] grplasso_0.4-2                       impute_1.25.0
>> [25] rGammaGamma_1.0                      methylumIDAT_0.1
>> [27] IlluminaHumanMethylation27k.db_1.4.0 org.Hs.eg.db_2.4.6
>> [29] RSQLite_0.9-4                        DBI_0.2-5
>> [31] AnnotationDbi_1.13.0                 ggplot2_0.8.9
>> [33] proto_0.3-8                          lumi_2.3.5
>> [35] nleqslv_1.8                          matrixStats_0.2.2
>> [37] R.methodsS3_1.2.1                    gsl_1.9-8
>> [39] methylumi_1.3.3                      Biobase_2.11.7
>> [41] gtools_2.6.2                         reshape_0.8.3
>> [43] plyr_1.4
>>
>> loaded via a namespace (and not attached):
>>   [1] affy_1.27.2           affyio_1.17.4         annotate_1.27.1
>>   [4] digest_0.4.2          hdrcde_2.15           KernSmooth_2.23-4
>>   [7] lmtest_0.9-27         mgcv_1.7-2            nlme_3.1-97
>> [10] preprocessCore_1.11.0 RCurl_1.5-0           sandwich_2.2-6
>> [13] splines_2.13.0        survival_2.36-2       tools_2.13.0
>> [16] XML_3.2-0             xtable_1.5-6
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319