# [R] big matrix reading and writing

Dave Roberts dvrbts at ecology.msu.montana.edu
Thu Jun 5 17:20:10 CEST 2014

Adrienne,

I have a crew just starting work on this problem this week, but I
think in your case the best solution is more memory.  R stores the
distance matrix as a vector with (n^2-n)/2 entries.  It's perfectly
dense and immune to sparse matrix compaction.  It would be quite
possible to multiply your values by 1000 and store as integers (given
your range of values) and reduce the space required significantly, but
the regression function you ultimately pass this to expects doubles and
you would have to cast the values back to double on the fly, resulting
in wasted time and a matrix that is as big as it would have been anyway.
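For scale, a quick back-of-the-envelope sketch (assuming n = 32000 points, as in the thread below):

```r
n <- 32000
entries <- (n^2 - n) / 2   # lower-triangle entries in a "dist" object
entries * 8 / 2^30         # ~3.8 GiB stored as doubles
entries * 4 / 2^30         # ~1.9 GiB stored as scaled integers
# ...but casting the integer copy back to double recreates the full-size
# vector anyway, which is the problem described above.
```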

Our view is that while it's possible to store the distance matrix on
disk rather than in memory, all the functions that accept this matrix as
an argument would also have to be rewritten to work with the distances
on disk.
We're looking into doing this for PCO and NMDS, but someone would have
to do the same for the geographically weighted regression.  I'm sure
that's doable, but certainly not trivial.
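As a sketch of what "on disk" could look like, the bigmemory package keeps a matrix in a memory-mapped file and hands functions a lightweight pointer instead of an in-RAM copy. This is one option, not a settled choice, and the file names and small dimensions below are purely illustrative:

```r
# Sketch only: file-backed matrix via bigmemory (if installed).
if (requireNamespace("bigmemory", quietly = TRUE)) {
  D <- bigmemory::filebacked.big.matrix(
    1000, 1000, type = "double",        # small here; the real case is 32000^2
    backingfile = "dist.bin", backingpath = tempdir(),
    descriptorfile = "dist.desc")
  D[1, 2] <- 1.5   # reads and writes go through the memory-mapped file
}
# Every downstream function (PCO, NMDS, the GWR) would then need to index
# D[i, j] on demand instead of expecting an ordinary in-RAM matrix.
```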

Dave

On 06/05/2014 08:56 AM, Adrienne Wootten wrote:
> Jim,
>
> There are not going to be additional copies of the distance matrix.  The
> distance matrix is what is needed for a geographically weighted regression,
> and I expect the results from that to be a SpatialPointsDataFrame of
> roughly 30000 rows by 7 columns.  That's much smaller than the distance
> matrix, though I'm not sure of the exact size given the class of that object.
>
> That's potentially a lot of room taken up.  At the moment the machine in
> question has 32GB to work with, but we may be able to shift this onto
> machines with more RAM.
>
> A
>
>
> On Thu, Jun 5, 2014 at 10:00 AM, jim holtman <jholtman at gmail.com> wrote:
>
>> The real question is how much memory does the machine that you are working
>> on have.  The 32000x32000 matrix will take up ~8GB of physical memory, so
>> how much memory will the rest of your objects take up?  Are any of them
>> going to be copies of the distance matrix, or is it always going to be
>> unchanged?  Normally my rule of thumb is that I should have 3-4 times the
>> largest object I am working with since I may be making copies of it.  So it
>> is important to understand how the rest of your program will be using this
>> large matrix and what type of operations you will be doing on it.
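For reference, the arithmetic behind that estimate:

```r
n <- 32000
n^2 * 8 / 2^30       # one copy as doubles: ~7.6 GiB (the "~8GB" above)
4 * n^2 * 8 / 2^30   # with the 3-4x rule of thumb: up to ~30.5 GiB
```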
>>
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>>
>> On Thu, Jun 5, 2014 at 9:48 AM, Adrienne Wootten <amwootte at ncsu.edu>
>> wrote:
>>
>>> Jim
>>>
>>> At the moment I'm using write.table.  I tried using write.matrix from the
>>> MASS package, but that failed.  Integers are not appropriate here because
>>> we are working with fractions of miles for some locations and that needs to
>>> be retained.  The range is from 0 to about 3.5 (it's a little less than
>>> that with the digits)
>>>
>>> I haven't tried the save function yet, but I wasn't aware of that one
>>> previously. Thanks for pointing that out.
>>>
>>> The bigger concern is reading and working with that dataset in the other
>>> calculation though.
>>>
>>>
>>>
>>> On Thu, Jun 5, 2014 at 9:37 AM, jim holtman <jholtman at gmail.com> wrote:
>>>
>>>> How are you writing it out now?  Are you using 'save', which will
>>>> compress the file?  What is the range of the numbers in the matrix?  Can
>>>> you scale them to integers, which might save some space?  You did not
>>>> provide enough information to offer a definitive solution.
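To illustrate the point about 'save': a compressed binary copy (here via saveRDS, a close cousin; the object and file names are made up) is much smaller than write.table's plain text, and it round-trips exactly:

```r
m <- matrix(runif(1e4), nrow = 100)    # small stand-in for the real matrix
tf_bin <- tempfile(); tf_txt <- tempfile()
saveRDS(m, tf_bin)                     # gzip-compressed binary by default
write.table(m, tf_txt, row.names = FALSE, col.names = FALSE)
file.size(tf_bin) < file.size(tf_txt)  # binary copy is considerably smaller
identical(m, readRDS(tf_bin))          # and restores the identical object
```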
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jun 5, 2014 at 9:26 AM, Adrienne Wootten <amwootte at ncsu.edu>
>>>> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> Got a tricky situation and unfortunately because it's a big file I can't
>>>>> exactly provide an example, so I'll describe this as best I can for
>>>>> everyone.
>>>>>
>>>>> I have a distance matrix that we are using for a modeling calculation in
>>>>> space for multiple days.  Since the matrix is never going to change for
>>>>> different dates, I want to keep the matrix in a file and refer to that
>>>>> so I
>>>>> don't have to repeat the calculation over and over again for that.  The
>>>>> problem is it's a 32000 X 32000 matrix and roughly works out to 15GB of
>>>>> storage.  This makes it tricky to read the file back into R, and it
>>>>> leaves me with two questions for the group.
>>>>>
>>>>> Is there any way to have R write this out so that it takes up less
>>>>> space?  I know R primarily treats numbers as doubles, but I'm trying to
>>>>> find a way to get R to write the values as floats or singles.
>>>>>
>>>>> With how big it is, it may not be wise to save it as an object in R
>>>>> when read in, so I'm wondering: is there any way to have R do the
>>>>> calculation it needs to do without saving the matrix as an object in R?
>>>>> Basically, can I have it run the calculation off the file itself?
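On the single-precision question: base R's writeBin() and readBin() accept size = 4, which halves the file at the cost of about 7 significant digits. A minimal sketch (the stand-in matrix and file name are illustrative, not from the thread):

```r
m <- matrix(runif(16), nrow = 4)       # stand-in for the 32000 x 32000 matrix
f <- tempfile(fileext = ".bin")
con <- file(f, "wb")
writeBin(as.vector(m), con, size = 4)  # write as 4-byte floats
close(con)
con <- file(f, "rb")
m2 <- matrix(readBin(con, "double", n = length(m), size = 4), nrow = nrow(m))
close(con)
max(abs(m - m2))                       # rounding error on the order of 1e-8
```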
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> State Climate Office of North Carolina
>>>>> Department of Marine, Earth and Atmospheric Sciences
>>>>> North Carolina State University
>>>>>
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
David W. Roberts                                     office 406-994-4548