[R-sig-Geo] Parallel processing in doesn´t use all cores
Jaime Burbano Girón
jaimebg27 at gmail.com
Wed Sep 13 09:47:06 CEST 2017
Hi every one.
I´m trying to parallelize a process that runs over the rows of a matrix. I
expect for each element of the row, which is a species, that the process
extracts and writes a file (raster) corresponding to the distribution of
each species over their habitats.
Habitas layer is a raster file and each species distribution is a polygon
(or group of polygons) from a shapefile. I first convert the species
polygon(s) to a raster, second extract the habitats for the species (stored
in a matrix where the species habitats codes are matched with the habitats
raster values), and third intersect (multiply) the distribution and the
habitats.
In addition, I would like to produce a richness (number of species map)
file (raster). Then, I add (sum) to an empty raster (with values of zero)
each final species distribution. I wrote the following function:
extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover){
#Libraries
library(rgdal)
library(raster)
#raster file with zeros
richness_cur=raster("richness_current.tif")
#Selection of species polygons
rows=as.numeric(which(as.character(spp_polygons at data$binomial)==
as.character(spp_data$binomial[k])))
spp_poly=spp_polygons[rows,]
#Covert polygon(s) to raster
spp_poly=rasterize(spp_poly,cover,1,background=0)
#Match species habitats codes with habitats raster values
habs=as.character(spp_data$hab_code[k])
habs=unlist(strsplit(habs, split=sep))#habitat codes are separeted by a ";"
cov_classes=as.numeric(as.character(habitat_codes[,2]#Get the hab
[which(as.character(habitat_codes[,1])%in%habs)]))
#Intersect species distributions with habitats
cov_mask=spp_poly*cover
#Extract species habitats
cov_mask=Which(cov_mask%in%cov_classes)
writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep=""))
#Sum species richness
richness_cur=richness_cur+cov_mask
return (richness_cur)}
I try to parallelize the process using clusterApply and foreach functions.
However, I couldn't return a raster object from the function (which is
something obvious to get in a regular loop function) in any of both
functions to add to that object the sum of the species richness. So, here
is my first question. *1. Does anyone know how to return an object
different from a list (I could save each raster inside a single list but no
sum each one of them to the previos one), matrix, or vector in a
parallelization process?*
I solve this problem writing the richness file in each "iteration".
Nevertheless, this option causes the process to be slower, so for me, it is
not the ideal. Then, the function was rewritten as follows:
extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover){
#Libraries
library(rgdal)
library(raster)
#raster file with zeros
richness_cur=raster("richness_current.tif")
#Selection of species polygons
rows=as.numeric(which(as.character(spp_polygons at data$binomial)==
as.character(spp_data$binomial[k])))
spp_poly=spp_polygons[rows,]
#Covert polygon(s) to raster
spp_poly=rasterize(spp_poly,cover,1,background=0)
#Match species habitats codes with habitats raster values
habs=as.character(spp_data$hab_code[k])
habs=unlist(strsplit(habs, split=sep))#habitat codes are separeted by a ";"
cov_classes=as.numeric(as.character(habitat_codes[,2]#Get the hab
[which(as.character(habitat_codes[,1])%in%habs)]))
#Intersect species distributions with habitats
cov_mask=spp_poly*cover
#Extract species habitats
cov_mask=Which(cov_mask%in%cov_classes)
writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep=""))
#Sum species richness
richness_cur=richness_cur+cov_mask
writeRaster(richness_cur,"richness_current.tif")}
Complete code to run parallelization was:
#Number of cores
no_cores=detectCores()-1#Initiate cluster
cl=makeCluster(no_cores,type="PSOCK")
registerDoParallel(cl)
#Table with name and habitat information (columns) for each species (rows)
spp_data=read.xlsx2("species_file.xls",sheetIndex=1)#Shape file with
species distributions as polygons
spp_polygons=readOGR("distributions.shp")#Separation symbol for
species habitats stored in spp_data
sep=";"#Tabla joining habitas species codes with habitats raster
habitat_codes=read.xlsx2("spp_habitats_final.xls",sheetIndex=1)#Habitats raster
cover=raster("Z:/Data/cover_2015_proj_fixed_reclas_1km.tif")
#Paralelization
foreach(k=1:nrow(spp_data)) %dopar% extract_habitats(k=k,
spp_data=spp_data,
spp_polygons=spp_polygons,sep=sep,
habitat_codes=habitat_codes,
cover=cover)
stopImplicitCluster()
stopCluster(cl)
The parallelization process runs; however, it didn't works how I expected,
since it is not using all the cores: Image of processors working
<https://i.stack.imgur.com/m4h5y.jpg>. So, what the parallelization process
does, is to start 39 (the number of cores) processes: Image of processes
opened <https://i.stack.imgur.com/nDkX4.jpg>, but it does't write one by
one the files, how I could expect in a regular loop. It suddenly write
blocks of 39 files (something that I can understand), but taking a lot of
time (because it seems to work in few cores), even more than if I ran a
regular loop (running the regular loop each file is written each two or
three minutes, while the block of 39 files is written approximately each
one hour).
*So, here is my second group of questions. 2. What I´m doing bad? 3. Why it
is not using all the 39 processors, or if it uses them, why it doesn't use
them at the maximum level? 4. Why it not begins another task (run an other
process, e.g., the process 40) when it finishes one (I suppose it because
it always write the files in blocks of 39)?*
Thanks in advance for your help.
Cheers,
Jaime
--
Jaime Burbano Girón
Estudiante Doctoral
Doctorado en Estudios Ambientales y Rurales
Pontificia Universidad Javeriana
Bogotá, Colombia
(57-1) 3208320 Ext. 4844
[[alternative HTML version deleted]]
More information about the R-sig-Geo
mailing list