[Bioc-devel] GIntervalTree objects are corrupted during save/load

Hervé Pagès hpages at fhcrc.org
Tue Jul 1 21:04:57 CEST 2014


On 07/01/2014 10:38 AM, Michael Lawrence wrote:
> The difference of course being that you implemented those trees from
> scratch, while we're relying on the Kent library for the low-level
> management of the tree. We would probably need to break from the Kent
> library to pursue this approach.

I see. That makes things a little bit more complicated. I wonder if the
whole effort is worth it, given that serialization of a GIntervalTree
doesn't seem like a common use case and that rebuilding the
GIntervalTree from the GRanges object probably doesn't take that much
time (I didn't do any timings to back this up though). For PDict
objects it was nice to be able to serialize them, even though it's
probably not something the user should do: turning a DNAStringSet
object into a PDict object is very fast, and the resulting object is
so big that a save/load cycle would actually take much longer than
rebuilding the PDict object at each new session.

Also, my feeling is that the time and effort required to break from the
Kent library would perhaps be better spent trying to implement something
new, like the Nested Containment List algorithm. Since this would
probably have to be implemented from scratch anyway, it would make sense
to use SEXP-based memory or, even better, to put a thin abstraction layer
between the algorithm itself and memory management so that they are
decoupled.
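
To make the decoupling idea a bit more concrete, here is a minimal
sketch in C of a tree whose nodes live in a caller-supplied array of
ints and refer to each other by offsets rather than by pointers. It is
not code from IRanges, Biostrings or the Kent library, and it is a plain
augmented binary search tree, not a Nested Containment List. The names
Arena, tree_init, tree_insert and tree_count_overlaps are made up for
this illustration, and the sketch deliberately does not call the R API.

```c
#include <assert.h>

/* Thin memory layer (hypothetical): the algorithm only sees a block of
 * ints. The caller decides where the block comes from; in an R package
 * it could be the data area of an integer vector, here it is any array. */
typedef struct {
    int *cells;     /* backing storage, owned by the caller */
    int  capacity;  /* number of ints available in cells    */
} Arena;

/* Layout: cells[USED] = number of cells in use, cells[ROOT] = offset of
 * the root node (0 means "no node"). Each node takes NODE_SIZE cells. */
enum { USED = 0, ROOT = 1, HEADER = 2, NODE_SIZE = 5 };
enum { START = 0, END = 1, MAX_END = 2, LEFT = 3, RIGHT = 4 };

void tree_init(Arena *a)
{
    assert(a->capacity >= HEADER);
    a->cells[USED] = HEADER;
    a->cells[ROOT] = 0;
}

/* Insert the closed interval [start, end].
 * Returns 0 on success, -1 if the arena is full. */
int tree_insert(Arena *a, int start, int end)
{
    int *c = a->cells;
    int node = c[USED];
    if (node + NODE_SIZE > a->capacity)
        return -1;
    c[USED] = node + NODE_SIZE;
    c[node + START] = start;
    c[node + END] = end;
    c[node + MAX_END] = end;
    c[node + LEFT] = 0;
    c[node + RIGHT] = 0;

    /* Walk down from the root. Stored links are offsets, never addresses;
     * every node passed on the way gets its subtree maximum updated. */
    int *link = &c[ROOT];
    while (*link != 0) {
        int cur = *link;
        if (c[cur + MAX_END] < end)
            c[cur + MAX_END] = end;
        link = &c[cur + (start < c[cur + START] ? LEFT : RIGHT)];
    }
    *link = node;
    return 0;
}

/* Count intervals overlapping [start, end] in the subtree at `node`. */
static int count_from(const int *c, int node, int start, int end)
{
    if (node == 0 || c[node + MAX_END] < start)
        return 0;  /* empty subtree, or everything in it ends too early */
    int n = count_from(c, c[node + LEFT], start, end);
    if (c[node + START] <= end) {  /* otherwise node and right side start too late */
        if (c[node + END] >= start)
            n++;
        n += count_from(c, c[node + RIGHT], start, end);
    }
    return n;
}

int tree_count_overlaps(const Arena *a, int start, int end)
{
    return count_from(a->cells, a->cells[ROOT], start, end);
}
```

Because nothing in the array is an address, copying the ints somewhere
else (which is in effect what a save/load cycle does to an ordinary
vector) gives a tree that can be queried as is. The algorithm never
allocates or frees anything itself, so swapping the source of the memory
only means filling in a different Arena.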

Cheers,
H.

>
>
> On Tue, Jul 1, 2014 at 9:05 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>     Hi Hector, Michael,
>
>
>     On 07/01/2014 05:57 AM, Michael Lawrence wrote:
>
>         It seems tough to make this work. There is no way for the R
>         serialization machinery to understand what needs to be serialized
>         behind the external pointer. The easiest approach to fixing this
>         would be to reimplement everything on top of SEXPs, which is to
>         say, it would not be easy.
>
>
>     This is what I did with PDict objects to store the Aho-Corasick tree.
>     It's actually easier than it sounds. You can use any atomic type, say
>     INTSXP or RAWSXP, it doesn't matter. That's just a way to get memory.
>     Then you do what you want with it (by casting the pointer to it).
>     It not only solves the serialization problem, it also automatically
>     manages the memory, which is now in the hands of the garbage collector.
>
>     Cheers,
>     H.
>
>         Alternatively, we could write our own serializer. It seems R
>         needs a way to register (de)serializers for external pointers.
>
>
>         On Tue, Jul 1, 2014 at 5:37 AM, Hector Corrada Bravo
>         <hcorrada at gmail.com> wrote:
>
>             Confirmed. Will look into it now.
>             Thanks for writing!
>             Hector
>
>
>             On Tue, Jul 1, 2014 at 2:40 AM, Kristoffer Vitting-Seerup
>             <kristoffer.vittingseerup at bio.ku.dk> wrote:
>
>                 Hi bioc-devel
>
>                 I’ve found an error in the usage of GIntervalTree:
>
>
>                     test <- GRanges(seqnames='Chr1',
>                     range=IRanges(start=10,end=20))
>                     test
>
>                 GRanges with 1 range and 0 metadata columns:
>                         seqnames    ranges strand
>                            <Rle> <IRanges>  <Rle>
>                     [1]     Chr1  [10, 20]      *
>
>                 this object I can save and load without problem:
>
>                 save(test, file='test.Rdata')
>
>                     rm(test)
>                     load('test.Rdata')
>                     test
>
>                 GRanges with 1 range and 0 metadata columns:
>                         seqnames    ranges strand
>                            <Rle> <IRanges>  <Rle>
>                     [1]     Chr1  [10, 20]      *
>
>
>                 But if I convert it to a GIntervalTree (for faster
>                 overlap finding) I get a fatal error when loading:
>
>                 test2 <- GIntervalTree(test)
>
>                     test2
>
>                 GIntervalTree with 1 range and 0 metadata columns:
>                         seqnames    ranges strand
>                            <Rle> <IRanges>  <Rle>
>                     [1]     Chr1  [10, 20]      *
>
>                     save(test2, file='test2.Rdata')
>                     rm(test2)
>                     load('test2.Rdata')
>                     test2
>
>                 GIntervalTree with 1 range and 0 metadata columns:
>
>                    *** caught segfault ***
>                 address 0xc, cause 'memory not mapped'
>
>                 Traceback:
>                    1: .Call(.NAME, ..., PACKAGE = PACKAGE)
>                    2: .Call2(fun, object at ptr, ..., PACKAGE = "IRanges")
>                    3: .IntervalForestCall(from, "asIRanges")
>                    4: asMethod(object)
>                    5: as(x at ranges, "IRanges")
>                    6: .GT_reorderValue(x, as(x at ranges, "IRanges"))
>                    7: .local(x, ...)
>                    8: ranges(x)
>                    9: ranges(x)
>
>                 Possible actions:
>                 1: abort (with core dump, if enabled)
>                 2: normal R exit
>                 3: exit R without saving workspace
>                 4: exit R saving workspace
>
>
>                 My session info:
>                 sessionInfo()
>                 R version 3.1.0 (2014-04-10)
>                 Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
>                 locale:
>                 [1] C
>
>                 attached base packages:
>                 [1] grDevices datasets  grid      parallel  stats     graphics
>                     utils     methods   base
>
>                 other attached packages:
>                  [1] spliceR_1.5.0         plyr_1.8.1            RColorBrewer_1.0-5
>                      VennDiagram_1.6.5     cummeRbund_2.7.1      Gviz_1.9.4
>                      rtracklayer_1.25.8    GenomicRanges_1.17.14 GenomeInfoDb_1.1.5
>                      IRanges_1.99.13
>                 [11] S4Vectors_0.0.6       fastcluster_1.1.13    reshape2_1.4
>                      ggplot2_0.9.3.1       RSQLite_0.11.4        DBI_0.2-7
>                      BiocGenerics_0.11.2
>
>                 loaded via a namespace (and not attached):
>                  [1] AnnotationDbi_1.27.6     BBmisc_1.6               BSgenome_1.33.5
>                      BatchJobs_1.2            Biobase_2.25.0           BiocParallel_0.7.0
>                      Biostrings_2.33.8        Formula_1.1-1            GenomicAlignments_1.1.10
>                 [10] GenomicFeatures_1.17.6   Hmisc_3.14-4             MASS_7.3-33
>                      R.methodsS3_1.6.1        RCurl_1.95-4.1           Rcpp_0.11.1
>                      Rsamtools_1.17.14        VariantAnnotation_1.11.5 XML_3.98-1.1
>                 [19] XVector_0.5.6            biomaRt_2.21.0           biovizBase_1.13.7
>                      bitops_1.0-6             brew_1.0-6               cluster_1.15.2
>                      codetools_0.2-8          colorspace_1.2-4         dichromat_2.0-0
>                 [28] digest_0.6.4             fail_1.2                 foreach_1.4.2
>                      gtable_0.1.2             iterators_1.0.7          lattice_0.20-29
>                      latticeExtra_0.6-26      matrixStats_0.8.14       munsell_0.4.2
>                 [37] proto_0.3-10             scales_0.2.4             sendmailR_1.1-2
>                      splines_3.1.0            stats4_3.1.0             stringr_0.6.2
>                      survival_2.37-7          tools_3.1.0              zlibbioc_1.11.1
>
>
>
>                 --
>                 Kindest regards
>                 Kristoffer Vitting-Seerup, cand.scient. (M.Sc.),
>                 Ph.D Fellow
>                 Sandelin Group
>
>                 Bioinformatics Centre | Biotech Research & Innovation
>                 Centre (BRIC), Dep.
>                 Of Biology
>                 University of Copenhagen
>                 Building 1, 3th floor, office 3 (1-3-03)
>                 Ole Maaløes Vej 5
>
>                 DK-2200 Copenhagen N
>                 Denmark
>                 http://binf.ku.dk | http://www.bric.ku.dk
>
>                 _________________________________________________
>                 Bioc-devel at r-project.org mailing list
>                 https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org
>     Phone: (206) 667-5791
>     Fax: (206) 667-1319
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


