[R-sig-Geo] How to parallelize my code?

Julian M. Burgos julian at hafro.is
Mon Feb 3 15:58:24 CET 2014


Yes, but why are you querying each point individually (one by one) in a
loop, and why are you writing every single data point to a different
file?  Loops are very slow in R, and reading/writing from/to a disk
file is also a very slow process.  Do this several hundred thousand
times and it is no surprise that it takes forever.  I do not have time
to examine your code in detail (next time provide a minimal
reproducible example, not a long script), but it seems that you have
some netCDF files with environmental data and a shapefile with
locations, and you want to get the environmental data for each
location.  If this is what you want to do, you should:

a) Read the shapefile into a SpatialPoints object.
b) Use the raster package to read the netCDF rasters into a RasterBrick
object.
c) Use the extract function (the raster-package equivalent of an overlay
with over) to overlay the points on the RasterBrick.  You will get the
values associated with each point, which you can combine with the points
into a SpatialPointsDataFrame.
d) Save the data into a single file (see the sketch below).
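
For example, something along these lines (an untested sketch; the layer
name and the netCDF file names are taken from your script):

library(rgdal)
library(raster)

pl <- readOGR(".", "US_2_points")    # all ~460,000 points at once

out <- list()
for (year in 1981:2012) {
  b_tmin <- brick(paste0("tmin_", year, ".nc"), varname = "tmin")
  b_tmax <- brick(paste0("tmax_", year, ".nc"), varname = "tmax")
  b_ppt  <- brick(paste0("ppt_",  year, ".nc"), varname = "ppt")
  # one extract() call per brick pulls the values for ALL points and
  # ALL days of the year: one row per point, one column per day
  out[[as.character(year)]] <- list(tmin = extract(b_tmin, pl),
                                    tmax = extract(b_tmax, pl),
                                    ppt  = extract(b_ppt,  pl))
}
# save everything into one file instead of 460,000 small ones
save(out, file = "weather_all_points.RData")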

This should take some time, say 20-30 minutes perhaps, maybe much less,
depending on the speed of your machine.
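
If you still want to parallelize after that, it is easier to let R manage
the workers than to open several R sessions by hand.  A rough sketch with
the parallel package (untested; the number of workers is a placeholder,
and it assumes the points object "pl" and the yearly files from above):

library(parallel)

cl <- makeCluster(4)                  # e.g. 4 worker processes
clusterEvalQ(cl, library(raster))     # load raster on each worker
clusterExport(cl, "pl")               # ship the points to the workers
res <- parLapply(cl, 1981:2012, function(year) {
  b_tmin <- brick(paste0("tmin_", year, ".nc"), varname = "tmin")
  extract(b_tmin, pl)                 # one matrix per year
})
stopCluster(cl)

But the run time will be dominated by disk reads, so do not expect a
linear speedup.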

Julian


ping yang writes:

> Hi,
>
> The purpose is to generate weather information (tmin, tmax, prcp) for
> every 4 km grid cell (each cell is represented by one point) for 32
> years throughout the contiguous U.S.  I want to process several states
> simultaneously, i.e. run them in parallel.
>
> Thanks,
>
> Ping
>
>
> On Sun, Feb 2, 2014 at 1:45 PM, Julian Burgos <julian at hafro.is> wrote:
>
>> A clear explanation of what you are trying to do would help.  I cannot
>> figure out why you need to generate such a huge number of files!  Writing
>> data to disk is a very slow step.
>>
>> > Hi r-sig-geo,
>> >
>> >
>> >
>> > I am using the following script to extract weather information for each
>> > point (from a rasterized shapefile for the CONUS), which generates about
>> > 460,000 files:
>> >
>> > Originally I read the coordinates from a shapefile:
>> >
>> > library(rgdal)
>> > pl <- readOGR(".", "US_2_points")
>> >
>> > then I used the following code to extract the information from three
>> > NetCDF files that contain the tmin, tmax and ppt for each year:
>> >
>> > library(raster)   # for brick() and extract()
>> > library(gdata)    # for write.fwf()
>> >
>> > for (year in 1981:2012)
>> > {
>> >   # get the combined netCDF files for this year
>> >   b_tmin <- brick(paste("tmin_", year, ".nc", sep=''), varname='tmin')
>> >   b_ppt  <- brick(paste("ppt_",  year, ".nc", sep=''), varname='ppt')
>> >   b_tmax <- brick(paste("tmax_", year, ".nc", sep=''), varname='tmax')
>> >   print(paste("processing year: ", year, sep=''))
>> >   for (l in 1:length(pl))
>> >   {
>> >     # file name convention N(latitude)W(longitude).wth, 5 digits after
>> >     # the decimal point
>> >     filename <- paste("e:/PRISM/US3/N", round(coordinates(pl[l,])[2], 5),
>> >                       "W", abs(round(coordinates(pl[l,])[1], 5)),
>> >                       ".wth", sep='')
>> >     print(paste("processing file: ", filename, sep=''))
>> >     tmin <- as.numeric(round(extract(b_tmin, coordinates(pl[l,])), digits=1))
>> >     tmax <- as.numeric(round(extract(b_tmax, coordinates(pl[l,])), digits=1))
>> >     ppt  <- as.numeric(round(extract(b_ppt,  coordinates(pl[l,])), digits=2))
>> >     v <- data.frame(tmin = tmin, tmax = tmax, ppt = ppt)
>> >     v["default"] <- 0
>> >     v["year"]    <- year
>> >     date <- seq(as.Date(paste(year, "/1/1", sep='')),
>> >                 as.Date(paste(year, "/12/31", sep='')), "days")
>> >     v["month"] <- as.numeric(substr(date, 6, 7))
>> >     v["day"]   <- as.numeric(substr(date, 9, 10))
>> >     v <- v[c("year","month","day","default","tmin","tmax","ppt")]
>> >     # write into the file in fixed-width format, appending if it exists
>> >     write.fwf(x=v, filename, append=file.exists(filename), na="NA",
>> >               rownames=FALSE, colnames=FALSE, width=c(6,3,3,5,5,5,6))
>> >   }
>> > }
>> >
>> >
>> > I found it would take weeks to finish, so I divided the points from the
>> > shapefile by state (generating a shapefile of points for each state) and
>> > used the following function:
>> >
>> >
>> >
>> > weather4Point <- function(state)
>> > {
>> >   # get the point file for the state
>> >   setwd("c:/PRISM/US1Points/")
>> >   pl <- readOGR(".", paste("points4_", state, sep=''))
>> >   setwd("e:/PRISM/NetCDF/")
>> >   for (year in 1981:2012)
>> >   {
>> >     # get the combined netCDF files for this year
>> >     b_tmin <- brick(paste("tmin_", year, ".nc", sep=''), varname='tmin')
>> >     b_ppt  <- brick(paste("ppt_",  year, ".nc", sep=''), varname='ppt')
>> >     b_tmax <- brick(paste("tmax_", year, ".nc", sep=''), varname='tmax')
>> >     print(paste("processing year: ", year, sep=''))
>> >     for (l in 1:length(pl))
>> >     {
>> >       # file name convention N(latitude)W(longitude).wth
>> >       filename <- paste("e:/PRISM/US3/N", round(coordinates(pl[l,])[2], 5),
>> >                         "W", abs(round(coordinates(pl[l,])[1], 5)),
>> >                         ".wth", sep='')
>> >       print(paste("processing file: ", filename, sep=''))
>> >       tmin <- as.numeric(round(extract(b_tmin, coordinates(pl[l,])), digits=1))
>> >       tmax <- as.numeric(round(extract(b_tmax, coordinates(pl[l,])), digits=1))
>> >       ppt  <- as.numeric(round(extract(b_ppt,  coordinates(pl[l,])), digits=2))
>> >       v <- data.frame(tmin = tmin, tmax = tmax, ppt = ppt)
>> >       v["default"] <- 0
>> >       v["year"]    <- year
>> >       date <- seq(as.Date(paste(year, "/1/1", sep='')),
>> >                   as.Date(paste(year, "/12/31", sep='')), "days")
>> >       v["month"] <- as.numeric(substr(date, 6, 7))
>> >       v["day"]   <- as.numeric(substr(date, 9, 10))
>> >       v <- v[c("year","month","day","default","tmin","tmax","ppt")]
>> >       # write into the file in the APEX fixed-width format,
>> >       # appending if it already exists
>> >       write.fwf(x=v, filename, append=file.exists(filename), na="NA",
>> >                 rownames=FALSE, colnames=FALSE, width=c(6,3,3,5,5,5,6))
>> >     }
>> >   }
>> > }
>> >
>> >
>> >
>> > Then I ran the following code, opening several R sessions to make this
>> > process "parallel":
>> >
>> >
>> >
>> > States <- c("AL","AZ","AR","CO","CT","DC","FL","GA","ID","IA",
>> >             "IN","KS","KY","LA","MA","ME","MI","MN","MT","NC")
>> > for (state in States)
>> > {
>> >   weather4Point(state)
>> > }
>> >
>> >
>> >
>> > However, it is still slow when I plan to do the whole US.  I am asking
>> > here: is there a better solution to speed up this process (any
>> > possibility of employing parallel programming in R)?
>> >
>> >
>> >
>> > Environment information:
>> >
>> > R version 3.0.1 (2013-05-16)
>> >
>> > Platform: x86_64-w64-mingw32/x64 (64-bit)
>> >
>> > Memory: 16 GB
>> >
>> > Processor: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
>> >
>> >
>> >
>> > Looking forward to any suggestions.
>> >
>> >
>> >
>> > Regards,
>> >
>> >
>> >
>> > Ping
>> >
>> >


-- 
Julian Mariano Burgos, PhD
Hafrannsóknastofnun/Marine Research Institute
Skúlagata 4, 121 Reykjavík, Iceland
Sími/Telephone : +354-5752037
Bréfsími/Telefax:  +354-5752001
Netfang/Email: julian at hafro.is


