[Bioc-devel] GenomicRanges: Storing 'seqlengths' as numeric

Nicolas Delhomme nicolas.delhomme at umu.se
Thu Dec 5 13:40:27 CET 2013


Hej all!

Just my 2 cents. There will eventually be chromosomes larger than 2 Gbp. Spruce has 12 evenly sized chromosomes for a genome of ~20 Gbp, and that’s not the largest of the gymnosperm genomes. Pinus taeda - currently being sequenced - has a genome of around 24 Gbp, again with evenly sized chromosomes, so that’s getting close to 2 Gbp per chromosome :-). But because these genomes are excessively repetitive, we are nowhere close to having a full chromosome sequence, and I believe that will remain the case for some time; it would not surprise me if that took years. So I think there’s still time before this becomes an issue, and by then the IT will have evolved further.
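
As a rough check of the arithmetic above, here is a minimal base-R sketch comparing approximate per-chromosome sizes with the integer limit discussed in the quoted thread. The genome sizes are the approximate figures given above; 12 chromosomes for Pinus taeda is an assumption for illustration, and the variable names are made up.

```r
# Approximate per-chromosome sizes, assuming evenly sized chromosomes.
# Genome sizes are the rough figures quoted above; 12 chromosomes for
# pine is an assumption made for this illustration.
spruce_chr <- 20e9 / 12
pine_chr   <- 24e9 / 12

int_max <- .Machine$integer.max   # 2^31 - 1

spruce_chr < int_max   # still below the limit
pine_chr   < int_max   # below the limit, but only just

# Wheat chr 3B (~995 Mbp, from the quoted thread) fits in an integer
wheat_3B <- as.integer(995e6)

# Past the limit, coercion to integer yields NA (with a warning),
# whereas a double still stores the whole number exactly
too_big <- 3e9
as_int  <- suppressWarnings(as.integer(too_big))
is.na(as_int)
too_big + 1 == 3000000001
```

So with these numbers an integer 'seqlengths' still works, but the margin for a pine-sized chromosome would be small.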

On the other hand, what might become a problem earlier is that these released genomes are drafts made of literally millions of scaffolds, hence very different from most model organisms, whose genomes consist of a few handfuls of chromosomes. Hervé, Martin, do you think this could be an issue for IRanges / GenomicRanges / Rsamtools? I’ve now got a number of BAM files aligned to Norway spruce - half a million scaffolds - that I can share if you wish to use them for benchmarks.
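
To give an idea of the scale, here is a small base-R sketch with simulated data only: half a million made-up scaffold names and lengths, stored as a named integer vector. It does not use IRanges / GenomicRanges / Rsamtools and says nothing about how they behave; the names, length range, and query size are all assumptions for illustration.

```r
# Simulated stand-in for a draft assembly: half a million scaffolds.
# Names and lengths are invented for illustration only.
set.seed(1)
n_scaffolds <- 500000L
seqlens <- sample(200:50000, n_scaffolds, replace = TRUE)
names(seqlens) <- sprintf("scaffold_%06d", seq_len(n_scaffolds))

# Each individual length fits comfortably in an integer
is.integer(seqlens)

# Memory footprint of the named vector
print(object.size(seqlens), units = "MB")

# The total length can exceed the integer range, so sum as double
total_bp <- sum(as.numeric(seqlens))
total_bp > .Machine$integer.max

# Time a batch of name-based lookups
query <- sample(names(seqlens), 10000L)
system.time(hits <- seqlens[query])
```

The real BAM files would of course be the proper benchmark; this only shows that the count of sequences, not the size of any one of them, is what grows here.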

Next step in my pipeline is anyway to analyse these alignments, so I’m bound to see how it works out in R :-)

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 03 Dec 2013, at 21:41, Hervé Pagès <hpages at fhcrc.org> wrote:

> Hi Kasper,
> 
> On 12/03/2013 12:25 PM, Kasper Daniel Hansen wrote:
>> Is integer.max dependent on 32bit vs 64bit?
> 
> I don't think so. AFAIK integers are always 32-bit in R (at least on
> Intel platforms), even on 64-bit OSes. So .Machine$integer.max is
> always 2^31 - 1 (roughly 2 billion).
> 
>> It seems to me that the OP
>> specifically complains that he cannot represent 995*10^6 as an integer.
> 
> 995*10^6 is roughly 1 billion so it can be represented as an integer,
> except maybe on some exotic systems.
> 
>> Also, is there a sign issue here as well?
> 
> Not that I know of.
> 
> H.
> 
>> 
>> 
>> On Tue, Dec 3, 2013 at 2:53 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>> 
>>   Hi,
>> 
>>   Agreed with Martin that until someone comes up with a chromosome that
>>   is longer than .Machine$integer.max I don't see the need for switching
>>   to double or int64 to represent the seqlengths.
>> 
>>   Furthermore, since the seqlengths are used in many range operations
>>   like checking the validity of the ranges in a GRanges object, trimming
>>   them, computing coverage, handling circularity, etc... it would not
>>   make much sense to make the switch for the seqlengths without also
>>   making it for Ranges objects. That would be a serious undertaking though
>>   and probably with many backward compatibility issues.
>> 
>>   H.
>> 
>> 
>> 
>>   On 12/03/2013 10:07 AM, Martin Morgan wrote:
>> 
>>       On 12/03/2013 02:29 AM, Julian Gehring wrote:
>> 
>>           Hi,
>> 
>>           Some of the chromosomes out in the world are fairly large
>>           (e.g. wheat chr 3B with > 995 Mbp [1]).  Currently, the
>>           'seqlengths' of the reference sequence are stored as
>>           'integers' which do not allow to store lengths of this size.
>>           Are there any plans of switching to 'doubles' or 64-bit
>>           integers for the 'seqlengths' slot?  Or extending the slot
>>           such that a user can store it either as integer or
>>           floating-point number?
>> 
>> 
>>       But
>> 
>>       > .Machine$integer.max
>>       [1] 2147483647
>> 
>>       so we at least survive wheat chr 3B?
>> 
>>       If there is movement to support this I'd encourage exact
>>       representation as double (this is how R deals with long vectors,
>>       and I believe it is the javascript representation of integers so
>>       not completely unprecedented) rather than 64 bit integers (which
>>       do not have any support in R).
>> 
>>       I guess this would be quite a big undertaking so real use cases
>>       need to be present. And support for larger integers would seem
>>       to be useful to R generally rather than just to Bioc.
>> 
>>       Martin
>> 
>> 
>>           Best wishes
>>           Julian
>> 
>> 
>>           [1] http://www.sciencemag.org/content/322/5898/101
>> 
>>           _________________________________________________
>>           Bioc-devel at r-project.org mailing list
>>           https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> 
>> 
>> 
>> 
>>   --
>>   Hervé Pagès
>> 
>>   Program in Computational Biology
>>   Division of Public Health Sciences
>> 
>>   Fred Hutchinson Cancer Research Center
>>   1100 Fairview Ave. N, M1-B514
>>   P.O. Box 19024
>>   Seattle, WA 98109-1024
>> 
>>   E-mail: hpages at fhcrc.org
>>   Phone: (206) 667-5791
>>   Fax: (206) 667-1319
>> 
>> 
>> 
>> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
> 

---------------------------------------------------------------
Nicolas Delhomme

Nathaniel Street Lab
Department of Plant Physiology
Umeå Plant Science Center

Tel: +46 90 786 7989
Email: nicolas.delhomme at plantphys.umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden


