[R-sig-hpc] Why pure computation time in parallel is longer than the serial version?

Drew Schmidt schmidt at math.utk.edu
Sun Feb 16 21:36:47 CET 2014


Someone politely pointed out to me in private that I meant to say 762 MiB,
not 762 GiB.  A 10000x10000 matrix is obviously not that big!  However,
the point still stands.

--
Drew Schmidt
National Institute for Computational Sciences
University of Tennessee, USA
http://r-pbd.org/


> There are a few things going on here.  Most notably, the script you
> provided is comparing two completely different operations: t(dx) %*% dx
> produces a matrix of dimension 10000x10000 (762 GiB), while mat %*% t(mat)
> produces a matrix of dimension 100x100 (78 KiB).  Of course the second
> one will be faster.  You also include the data generation within the
> parallel timing but not the serial one, which doesn't fairly compare the
> computation time (especially for such small data, where hundredths of a
> second count).  Making only these changes, on my machine the timings
> with 2 ranks are 0.033 s for the serial operation and 0.153 s for the
> parallel one.
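>
> (For illustration only -- not the exact script I ran -- a minimal sketch of
> such an adjusted comparison might look like the following, reusing the names
> from the script below: both sides compute the same product and the data
> generation/distribution happens before either timing.)
>
> # distribute once, outside of any timing
> if (comm.rank() == 0) mat <- matrix(1:1e6, ncol = 1e4) else mat <- NULL
> dx <- as.ddmatrix(x = mat, bldim = c(4, 4))
>
> # serial product, timed on rank 0 only
> if (comm.rank() == 0) print(system.time(mat %*% t(mat)))
>
> # the same product on the distributed matrix
> comm.print(system.time(dx %*% t(dx)))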
>
> Also, as noted above, this matrix is actually quite small: only about 8
> MiB once it is converted to the double-precision storage used for
> LAPACK/ScaLAPACK operations.  For small matrices, the communication
> overhead in ScaLAPACK can eat you alive.  You can see this by making the
> bldim large enough to encompass the entire matrix; even then, when the
> "parallel" product is done on one rank, there is some communication
> overhead.  On my machine, again with 2 ranks, the timings in this case
> are 0.033 s and 0.043 s for serial and parallel, respectively.  You can use
> pbdDMAT effectively on a small shared-memory machine, but it really
> begins to shine on larger, distributed platforms (servers, clusters,
> supercomputers).
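>
> (For example, something along these lines -- just a sketch, using the names
> from the script below -- leaves the whole matrix in a single block owned by
> one rank:)
>
> # a blocking dimension at least as large as the matrix itself means the
> # entire matrix lives on one rank, so the "parallel" product involves
> # essentially no real distribution of the data
> bldim <- c(100, 10000)                    # >= dim of the 100 x 10000 matrix
> dx <- as.ddmatrix(x = mat, bldim = bldim)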
>
> As a final side note, you can improve the performance of both the serial
> and parallel operations by using crossprod()/tcrossprod().
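>
> (That is, something like the following; the same functions have methods for
> distributed matrices, so they apply to dx as well:)
>
> crossprod(mat)       # same result as t(mat) %*% mat, but cheaper
> tcrossprod(mat)      # same result as mat %*% t(mat)
> tcrossprod(dx)       # the ddmatrix method, same idea in parallel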
>
> --
> Drew Schmidt
> National Institute for Computational Sciences
> University of Tennessee, USA
> http://r-pbd.org/
>
> On 02/16/2014 08:16 AM, Xuening Zhu wrote:
>> Hi George:
>> I wonder whether pbdR is better suited to multiple machines (a computer
>> cluster) for speeding up matrix computation? Since I have only a single
>> machine, I didn't see much better performance here. I tried to compare the
>> 'parallel' and 'pbdDMAT' packages in the parallel matrix multiplication
>> experiment below:
>>
>> ############################################
>>
>> library(parallel)
>>
>> # 100 x 10000 matrix, split column-wise into 4 random groups of columns
>> mat = matrix(1:1e6, ncol = 1e4)
>> group = sample(rep(1:4, length.out = ncol(mat)))
>> mm = lapply(split(seq(ncol(mat)), group), function(i) mat[, i])
>>
>> # mat %*% t(mat) as the sum of per-block products, on 2 cores
>> system.time({
>>    a = mclapply(mm, function(m) {
>>      m %*% t(m)
>>    }, mc.cores = 2)
>>    b = Reduce("+", a)
>> })
>>
>> ##############################################
>>
>> library(pbdDMAT, quiet = TRUE)
>> init.grid()
>> tt = system.time({
>>    # ScaLAPACK blocking dimension
>>    bldim <- c(4, 4)
>>    # Generate data on process 0, then distribute to the others
>>    if (comm.rank() == 0) {
>>      mat = matrix(1:1e6, ncol = 1e4)
>>    } else {
>>      mat = NULL
>>    }
>>    dx <- as.ddmatrix(x = mat, bldim = bldim)
>>
>>    # Computation in parallel
>>    ddx <- t(dx) %*% dx
>> })
>> mm = as.matrix(ddx)
>>
>> if (comm.rank() == 0) {
>>    # serial timing on process 0 for comparison
>>    print(system.time({
>>      MM = mat %*% t(mat)
>>    }))
>>    print(all.equal(MM, mm))
>> }
>>
>> comm.print(tt)
>> finalize()
>>
>> ###############################################
>>
>> The second one takes about 4.561 seconds while the first one takes only
>> 0.104 seconds.
>>
>>
>>
>> 2014-02-14 1:21 GMT+08:00 George Ostrouchov <georgeost at gmail.com>:
>>
>>> Consider using pbdR. It puts PBLAS and ScaLAPACK at your disposal for
>>> FORTRAN-speed matrix parallelism without the need to learn their API.
>>> While built for truly big machines, you will already see a lot of benefit
>>> on a machine of your size. Start with pbdDEMO to learn the basics. It is
>>> batch computing with Rscript (because that's what's done on big machines),
>>> but the speed and simplicity are worth it!
>>>
>>> Cheers,
>>> George
>>>
>>>
>>> On 2/13/14 2:32 AM, romunov wrote:
>>>
>>>> When doing calculations in parallel, there are also overhead costs. If
>>>> the computation time per core is short, the overhead may exceed the
>>>> computation time itself, making the parallel task more expensive overall.
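>>>>
>>>> (A quick way to see this on any multicore machine -- a toy sketch, not
>>>> from the original post: the per-task work here is trivial, so the
>>>> fork/collect overhead of mclapply dominates and the "parallel" version
>>>> comes out slower than plain lapply.)
>>>>
>>>> library(parallel)
>>>> system.time(lapply(1:1000, sqrt))                   # serial
>>>> system.time(mclapply(1:1000, sqrt, mc.cores = 4))   # parallel, slower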
>>>>
>>>> Cheers,
>>>> Roman
>>>>
>>>>
>>>> On Thu, Feb 13, 2014 at 5:26 AM, Xuening Zhu <puddingnnn529 at gmail.com>
>>>> wrote:
>>>>
>>>>> I am learning about parallel computing in R, and I found this happening
>>>>> in my experiments.
>>>>>
>>>>> Briefly, in the following example, why are most of the user times in t
>>>>> smaller than those in mc_t?  My machine has 32 GB of memory and 2 CPUs
>>>>> with 4 cores and 8 hyperthreads in total.  Performance-enhancing tools
>>>>> such as an optimized BLAS are not installed either.
>>>>>
>>>>> # serial: 4 tasks via lapply, each timing only its matrix product
>>>>> system.time({t = lapply(1:4, function(i) {
>>>>>       m = matrix(1:10^6, ncol = 100)
>>>>>       t = system.time({
>>>>>           m %*% t(m)
>>>>>       })
>>>>>       return(t)})})
>>>>>
>>>>> # parallel: the same 4 tasks via multicore::mclapply on 4 cores
>>>>> library(multicore)
>>>>> system.time({
>>>>>       mc_t = mclapply(1:4, function(m) {
>>>>>           m = matrix(1:10^6, ncol = 100)
>>>>>           t = system.time({
>>>>>               m %*% t(m)
>>>>>           })
>>>>>           return(t)
>>>>>       }, mc.cores = 4)})
>>>>>
>>>>> t[[1]]
>>>>>    user  system elapsed
>>>>>  11.136   0.548  11.703
>>>>>
>>>>> [[2]]
>>>>>    user  system elapsed
>>>>>  11.533   0.548  12.098
>>>>>
>>>>> [[3]]
>>>>>    user  system elapsed
>>>>>  11.665   0.432  12.115
>>>>>
>>>>> [[4]]
>>>>>    user  system elapsed
>>>>>  11.580   0.512  12.115
>>>>>
>>>>> mc_t[[1]]
>>>>>    user  system elapsed
>>>>>  16.677   0.496  17.199
>>>>>
>>>>> [[2]]
>>>>>    user  system elapsed
>>>>>  16.741   0.428  17.198
>>>>>
>>>>> [[3]]
>>>>>    user  system elapsed
>>>>>  16.653   0.520  17.198
>>>>>
>>>>> [[4]]
>>>>>    user  system elapsed
>>>>>  11.056   0.444  11.520
>>>>>
>>>>> As I understand it, mc_t and t both measure pure computation time.
>>>>> The same thing happens with parLapply from the parallel package as well.
>>>>> The memory in my machine is more than enough for this computation (it
>>>>> uses only a few percent of it).
>>>>>
>>>>> I also tried running 4 similar R scripts (as below) by hand with the
>>>>> 'Rscript' command at the same time on the same machine and saving the
>>>>> results. The elapsed time for each of them is about 12 s as well, so I
>>>>> don't think it is contention for the cores.
>>>>>
>>>>> system.time({t = lapply(1, function(i) {
>>>>>       m = matrix(1:10^6, ncol = 100)
>>>>>       t = system.time({
>>>>>           m %*% t(m)
>>>>>       })
>>>>>       return(t)})})
>>>>>
>>>>> So what happens during the parallel run? Does mc_t really measure the
>>>>> pure computation time? Can someone explain the whole process step by
>>>>> step, in detail?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> --
>>>>>
>>>>> Xuening Zhu
>>>>>
>>>>
>



More information about the R-sig-hpc mailing list