[BioC] Biostrings bug?
Hervé Pagès
hpages at fhcrc.org
Sat Oct 9 00:33:15 CEST 2010
Arne,
I completely forgot to check this, but I just realized that it has
already been addressed in the devel version of Biostrings (which will
soon become the new release version). Starting with Biostrings 2.17,
a DNAStringSet object can be much bigger: up to 2^31-1 sequences
per object, and each sequence can itself be up to 2^31-1 letters
long (before that, the cumulative length of all the sequences in an
object had to be <= 2^31-1).
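To make the old limit concrete, here is a small base-R sketch of the arithmetic behind the failing example. It does not call Biostrings at all; the variable names are only illustrative, and the last step assumes that the total length was held in a 32-bit signed C int that wraps around, which would be consistent with the "negative length vectors" message but is not confirmed here.

```r
# Largest value R's 32-bit "integer" type can hold: 2^31 - 1.
int_max <- .Machine$integer.max

# Cumulative length of the failing example, computed as doubles so that
# the arithmetic itself cannot overflow.
seq_width <- 1200
n_seqs <- 2000000
total_letters <- seq_width * n_seqs

# TRUE: the total exceeds the pre-2.17 cumulative-length limit.
total_letters > int_max

# The same product in integer arithmetic overflows; R yields NA (with a warning).
overflowed <- suppressWarnings(1200L * 2000000L)
is.na(overflowed)

# Assumption: a wrapped 32-bit C int would hold this (negative) value.
wrapped <- total_letters - 2^32
wrapped < 0

# Half of the sequences stay under the limit.
half_letters <- seq_width * (n_seqs / 2)
half_letters <= int_max
```

The smaller cases in the report (100 letters x 2000000, or 1200 letters x 2000) are well below 2^31-1, which fits with them not triggering the error.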
So as long as your machine has enough memory (and your OS knows
how to make use of that memory), you should be able to create big
DNAStringSet objects like this:
> myseq.bs = DNAStringSet(rep(paste(rep("A", 1200), collapse=""), 2000000))
> myseq.bs
A DNAStringSet instance of length 2000000
width seq
[1] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[2] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[3] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[4] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[5] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[6] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[7] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[8] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[9] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
... ... ...
[1999992] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[1999993] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[1999994] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[1999995] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[1999996] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[1999997] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[1999998] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[1999999] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
[2000000] 1200 AAAAAAAAAAAAAAAAAAAAA...AAAAAAAAAAAAAAAAAAAA
> sessionInfo()
R version 2.12.0 alpha (2010-09-27 r53048)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] Biostrings_2.17.47 IRanges_1.7.39
loaded via a namespace (and not attached):
[1] Biobase_2.9.2
Note that Biostrings_2.17 and IRanges_1.7 belong to BioC 2.7,
the current development version of Bioconductor (which is about
to be released). You need R 2.12 (which is about to be released
too) if you want to use BioC 2.7. Just use biocLite() from within
R-2.12 to install packages.
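If you want a script to check which side of the change it is running on, a minimal sketch could look like the following. The helper name supports_big_sets is made up for illustration, and the 2.17.0 threshold is just my reading of "starting with Biostrings 2.17" above, not an exact changelog entry.

```r
# Hypothetical helper: TRUE if a Biostrings version string belongs to the
# 2.17 series or later, where the cumulative-length limit is lifted.
supports_big_sets <- function(version) {
  package_version(version) >= package_version("2.17.0")
}

# The version from the original report (release) ...
old_ok <- supports_big_sets("2.16.5")
# ... and the devel version shown in the sessionInfo() above.
new_ok <- supports_big_sets("2.17.47")

# With Biostrings installed, the installed version could be passed instead:
# supports_big_sets(as.character(packageVersion("Biostrings")))
```

package_version() compares version components numerically, so "2.17.47" is correctly treated as newer than "2.17.5", which a plain string comparison would get wrong.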
Cheers,
H.
On 10/07/2010 09:18 AM, arne.mueller at novartis.com wrote:
> Hi,
>
> sorry, the sequence in my original posting got screwed during copy/paste,
> this is the "real" sequence:
>
>>
> CTATGTGTGAGGGCAGCAACCAGAACTGTCTGCCCTGACTTCGCTCAGGATGCTGTGAACATGTGGCTCAGATGGTGCTA
> GGCATTTTCCTCTAGAGTCAGAAACGTGGACAGAGAGTCATCTCCTCTGGCTTCCCAGGCATGTCTGCCACTCTGAAGGT
> CTGAAGGTCTGGGTCTCCCTCCCATGGGATTTGAGTGCAGAGAGCTGTGTGACTGGGTCCCTTCAGATCCAGGTGGTGTC
> TGGACTGTAGCGTTGAGTGCCCTATCTTCCTGGTCTCAGAGCACCTATACAGTTTCCTCTTGGGCCAGGGATGTGGGCAG
> TGGTGGGCTGTACTGGAAGTCTCTCCTGTCCTGCAGTCTCAGGAGTGGCCACCTGTCTGGGTGGTGAGCTCTCTCTCCCA
> TGGGGTTAGGGAGCAGGGAGGTTTTGCAAGATTCAGATTTAAGGTCACATTTTATCATCATAATGGAGGACATTAGGAAG
> GTCAGAAATAACTCCCTTAAGGAAATACTTGACAACACAAGCAAACTAGTAGAAATCTTTTTAAAAGGAAACACAAAAGT
> ATTTTAAAGAATTACAGCAAACCACAACCAAATAGGAGAGGAAATTGAACAAAATCATCCAGGAGTTAAATATGGAAATA
> GAAACAATGAAGAGAGCACACAGCGAGACAACCCTGGAGATAGAAAATCTAAGGAAGAGATCAGGAGTCATAGATGCAAG
> CATCACTGACGGACTACATGAGATAGAAGAGAGAATTTTGGGAGCAGAAGATATCATAGAAAACATTGACACAACCTTCA
> AAGAGAACGTAAATAGGAAAAAGCTCCTAGCCCTAAACATGCAGGAAATCAGGAAACAAATCAAAGATCAAACCTAAATA
> TATCAGGTATAGAAGAGAGTGAAGACTCCCAACATAAAGGGATGGTAAATATCTTCAACAATATAAACAATATAAAGGAA
> AACATCCCTAACCAAAAGAAATAAATGTCCATAAATAGACATGAAGCCTGCAGAATTCCAAATAGAATGGACCAGAAAAT
> AAATTCCTCCTGTCACATAATAGTCAAAACACCAAATGCACAAAACAAAGAATGAATATTAAAAGCATTAACGGATAAAG
> GTCAAGTACATTTAAAGGCAGACATGTCAGAATTACACCAGAATTCTTACCATGGACTATGAAAGCCAGAAGACAGATGT
>
> It doesn't matter which sequence one uses to get the DNAStringSet error,
> it just has to be long and
> there have to be many of them, here's a more generic example:
>
>> myseq.bs = DNAStringSet(rep(paste(rep("A", 100), collapse=""), 2000))
>> myseq.bs = DNAStringSet(rep(paste(rep("A", 100), collapse=""), 2000000))
>> myseq.bs = DNAStringSet(rep(paste(rep("A", 1200), collapse=""), 2000))
>> myseq.bs = DNAStringSet(rep(paste(rep("A", 1200), collapse=""),
> 2000000))
> Error in .Call("new_SharedRaw_from_STRSXP", x, start(solved_SEW),
> width(solved_SEW), :
> negative length vectors are not allowed
>
> Arne
>
> arne.mueller at novartis.com
> Sent by: bioconductor-bounces at stat.math.ethz.ch
> 10/07/2010 05:55 PM
>
> To: bioconductor at stat.math.ethz.ch
> Subject: [BioC] Biostrings bug?
>
> Dear All,
>
> I came across the following error in DNAStringSet from the Biostrings
> package:
>
>> myseq =
> "CTATGTGTGAGGGCAGCAACCAGAACTGTCTGCCCTGACTTCGCTCAGGATGCTGTGAACATGTGGCTCAGATGGTGCTAGGCATTTTCCTCTAGAGTCAGAAACGTGGACAGAGAGTCATCTCCTCTGGCTTCCCAGGCATGTCTGCCACTCTGAAGGTCTGAAGGTCTGGGTCTCCCTCCCATGGGATTTGAGTGCAGAGAGCTGTGTGACTGGGTCCCTTCAGATCCAGGTGGTGTCTGGACTGTAGCGTTGAGTGCCCTATCTTCCTGGTCTCAGAGCACCTATACAGTTTCCTCTTGGGCCAGGGATGTGGGCAGTGGTGGGCTGTACTGGAAGTCTCTCCTGTCCTGCAGTCTCAGGAGTGGCCACCTGTCTGGGTGGTGAGCTCTCTCTCCCATGGGGTTAGGGAGCAGGGAGGTTTTGCAAGATTCAGATTTAAGGTCACATTTTATCATCATAATGGAGGACATTAGGAAGGTCAGAAATAACTCCCTTAAGGAAATACTTGACAACACAAGCAAACTAGTAGAAATCTTTTTAAAAGGAAACACAAAAGTATTTTAAAGAATTACAGCAAACCACAACCAAATAGGAGAGGAAATTGAACAAAATCATCCAGGAGTTAAATATGGAAATAGAAACAATGAAGAGAGCACACAGCGAGACAACCCTGGAGATAGAAAATCTAAGGAAGAGATCAGGAGTCATAGATGCAAGCATCACTGACGGACTACATGAGATAGAAGAGAGAATTTTGGGAGCAGAAGATATCATAGAAAACATTGACACAACCTTCAAAGAGAACGTAAATAGGAAAAAGCTCCTAGCCCTAAACATGCAGGAAATCAGGAAACAAATCAAAGATCAAACCTAAATATATCAGGTATAGAAGAGAGTGAAGACTCCCAACATAAAGGGATGGTAAATATCTTCAACAATATAAACAATATAAAGGAAAACATCCCTAACCAAAAGAAATAAATG
T!
>
> CCATAAATAGACATGAAGCCTGCAGAATTCCAAATAGAATGGACCAGAAAATAAATTCCTCCTGTCACATAATAGTCAAAACACCAAATGCACAAAACAAAGAATGAATATTAAAAGCATTAACGGATAAAGGTCAAGTACATTTAAAGGCAGACATGTCAGAATTACACCAGAATTCTTACCATGGACTATGAAAGCCAGAAGACAGATGT"
>> mysDNA = DNAStringSet(myseq) # ok!
>> myseq = rep(myseq, 2000000)
>> myseq.bs = DNAStringSet(myseq)
> Error in .Call("new_SharedRaw_from_STRSXP", x, start(solved_SEW),
> width(solved_SEW), :
> negative length vectors are not allowed
>
> Enter a frame number, or 0 to exit
> 1: DNAStringSet(myseq)
> 2: XStringSet("DNA", x, start = start, end = end, width = width, use.names = u
> 3: XStringSet("DNA", x, start = start, end = end, width = width, use.names = u
> 4: .charToXStringSet(basetype, x, start, end, width, use.names)
> 5: .charToXString(basetype, x, solved_SEW)
>
> Selection: 0
>>
>
> Strangely the following works ...:
>
> myseq.bs = c(DNAStringSet(myseq[1:1000000]),
> DNAStringSet(myseq[1000001:2000000]))
>
> Somehow there must be an overflow ... .
>
> Here's some more info on my system:
>
>> sessionInfo()
> R version 2.11.1 Patched (2010-06-20 r52342)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices datasets utils methods base
>
> other attached packages:
> [1] BSgenome.Rnorvegicus.UCSC.rn4_1.3.16 BSgenome_1.16.4
> [3] Biostrings_2.16.5 GenomicRanges_1.0.3
> [5] IRanges_1.6.11
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0 tools_2.11.1
>
> Linux version 2.6.18-92.el5 (brewbuilder at ls20-bc2-13.build.redhat.com)
> (gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 29 13:16:15
> EDT 2008
>
> 64 Gb memory
>
> thanks for your help
> +kind regards,
>
> Arne
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319