[R-sig-Geo] How to parallelize my code?

Jonathan Greenberg jgrn at illinois.edu
Mon Feb 3 17:46:51 CET 2014


Ping:

First: For anyone getting started with parallel programming in R, you
should really read:
http://trg.apbionet.org/euasiagrid/docs/parallelR.notes.pdf

Note that wherever it says "snow" you should swap in the package
"parallel".  I'm a big proponent of the "foreach" package, which
generalizes parallel processing across different parallel
infrastructures and more closely follows "for"-style looping (*apply
statements are, in general, much more difficult to understand).
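
For example, here is a minimal sketch of the kind of foreach loop I
mean, using the doParallel backend (the core count, the state list,
and process_state() are placeholders, not code from the script quoted
below):

library(foreach)
library(doParallel)

cl <- makeCluster(4)                   # assume 4 local cores
registerDoParallel(cl)

states <- c("AL", "AZ", "AR")          # hypothetical subset of states
out <- foreach(state = states, .packages = "raster") %dopar% {
  process_state(state)                 # placeholder per-state function
}

stopCluster(cl)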

Second: if you are doing high I/O, low CPU processes (lots of
read/write), in general parallel processing doesn't help unless you
are on a RAID array or a parallel filesystem.

Third, as Julian points out: it is easier to help if you start with what
you are trying to do, rather than with "fix my script".  What do you mean
by:

"The purpose is to generate weather information(tmin,tmim,prcp) for every 4
KM (every 4KM there is a point represent that grid) for 32 years throughout
Contingent U.S."

?

Are you interpolating data?  Downscaling it?  Extracting statistics
per-state?  Summarizing it?  All of these have very different
interpretations, and different approaches (not all of which are
parallel processing).

--j

On Mon, Feb 3, 2014 at 8:58 AM, Julian M. Burgos <julian at hafro.is> wrote:
> Yes, but why are you querying each point individually (one by one) in a
> loop, and why are you writing every single data point to a different
> file?  Loops are very slow in R, and reading/writing from/to a file on
> disk is also a very slow process.  Do that several hundred thousand
> times and it is no surprise that this takes forever.  I do not have
> time to examine your code either (next time provide a minimal
> reproducible example, not a long script), but it seems that you have
> some netCDF files with environmental data and a shapefile with
> locations, and you want to get the environmental data for each
> location.  If this is what you want to do, you should:
>
> a) Read the shapefile into a SpatialPoints object.
> b) Use the raster package to read the netCDF rasters into a RasterBrick
> object.
> c) Use the extract function to do an overlay between the points and the
> RasterBrick.  You will get the data associated with each point, which
> you can combine into a SpatialPointsDataFrame.
> d) Save the data into a single file.
>
> This should take some time, perhaps 20-30 minutes, maybe much less,
> depending on the speed of your machine.  A rough sketch:
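>
> (A minimal sketch of steps (a)-(d) for a single year's tmin file,
> reusing the file and variable names from your script; write.csv at the
> end is just one way to do step (d).)
>
> library(rgdal)
> library(raster)
>
> pl <- readOGR(".", "US_2_points")                  # (a) the points
> b_tmin <- brick("tmin_1981.nc", varname = "tmin")  # (b) a RasterBrick
>
> vals <- extract(b_tmin, pl)     # (c) one overlay for all points at once
> out <- data.frame(coordinates(pl), vals)
>
> write.csv(out, "tmin_1981.csv", row.names = FALSE) # (d) a single file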
>
> Julian
>
>
> ping yang writes:
>
>> Hi,
>>
>> The purpose is to generate weather information (tmin, tmax, prcp) for
>> every 4 km (every 4 km there is a point representing that grid cell) for
>> 32 years throughout the contiguous U.S.
>> I want to process several states simultaneously, i.e. run them in
>> parallel.
>>
>> Thanks,
>>
>> Ping
>>
>>
>> On Sun, Feb 2, 2014 at 1:45 PM, Julian Burgos <julian at hafro.is> wrote:
>>
>>> A clear explanation of what you are trying to do would help.  I cannot
>>> figure out why you need to generate such a huge number of files!  Writing
>>> data to disk is a very slow step.
>>>
>>> > Hi r-sig-geo,
>>> >
>>> >
>>> >
>>> > I am using the following script to extract weather information for each
>>> > point (from a rasterized shapefile for the conterminous U.S.), which
>>> > generates about 460,000 files:
>>> >
>>> > Originally I read the coordinates from a shapefile:
>>> >
>>> > library(rgdal)
>>> > library(raster)   # for brick() and extract()
>>> > library(gdata)    # for write.fwf()
>>> >
>>> > pl <- readOGR(".", "US_2_points")
>>> >
>>> > Then I used the following code to extract the information from the three
>>> > NetCDF files that contain tmin, tmax, and ppt for each year:
>>> >
>>> > for (year in 1981:2012)
>>> > {
>>> >   # open the combined netCDF file for each variable
>>> >   tminfile <- paste("tmin", "_", year, ".nc", sep = '')
>>> >   b_tmin <- brick(tminfile, varname = 'tmin')
>>> >   pptfile <- paste("ppt", "_", year, ".nc", sep = '')
>>> >   b_ppt <- brick(pptfile, varname = 'ppt')
>>> >   tmaxfile <- paste("tmax", "_", year, ".nc", sep = '')
>>> >   b_tmax <- brick(tmaxfile, varname = 'tmax')
>>> >
>>> >   print(paste("processing year :", year, sep = ''))
>>> >
>>> >   for (l in 1:length(pl))
>>> >   {
>>> >     # file name convention N(latitude)W(longitude).wth,
>>> >     # with 5 digits after the decimal point
>>> >     filename <- paste("e:/PRISM/US3/N",
>>> >                       round(coordinates(pl[l, ])[2], 5), "W",
>>> >                       abs(round(coordinates(pl[l, ])[1], 5)),
>>> >                       ".wth", sep = '')
>>> >     print(paste("processing file :", filename, sep = ''))
>>> >
>>> >     tmin <- as.numeric(round(extract(b_tmin, coordinates(pl[l, ])), digits = 1))
>>> >     tmax <- as.numeric(round(extract(b_tmax, coordinates(pl[l, ])), digits = 1))
>>> >     ppt  <- as.numeric(round(extract(b_ppt,  coordinates(pl[l, ])), digits = 2))
>>> >
>>> >     v <- data.frame(tmax, tmin, ppt)
>>> >     v["default"] <- 0
>>> >     v["year"] <- year
>>> >
>>> >     date <- seq(as.Date(paste(year, "/1/1", sep = '')),
>>> >                 as.Date(paste(year, "/12/31", sep = '')), "days")
>>> >     v["month"] <- as.numeric(substr(date, 6, 7))
>>> >     v["day"]   <- as.numeric(substr(date, 9, 10))
>>> >     v <- v[c("year", "month", "day", "default", "tmin", "tmax", "ppt")]
>>> >
>>> >     # write to the file in fixed-width format, appending if it exists
>>> >     write.fwf(x = v, filename, append = file.exists(filename), na = "NA",
>>> >               rownames = FALSE, colnames = FALSE,
>>> >               width = c(6, 3, 3, 5, 5, 5, 6))
>>> >   }
>>> > }
>>> >
>>> >
>>> >
>>> > I found it would take weeks to finish, so I divided the points from the
>>> > shapefile by state (generating one shapefile of points per state) and
>>> > used the following function:
>>> >
>>> >
>>> >
>>> > weather4Point <- function(state)
>>> > {
>>> >   # assumes rgdal, raster, and gdata are already loaded
>>> >   # get the point file for the state
>>> >   setwd("c:/PRISM/US1Points/")
>>> >   pl <- readOGR(".", paste("points4_", state, sep = ''))
>>> >   setwd("e:/PRISM/NetCDF/")
>>> >
>>> >   for (year in 1981:2012)
>>> >   {
>>> >     # open the combined netCDF file for each variable
>>> >     tminfile <- paste("tmin", "_", year, ".nc", sep = '')
>>> >     b_tmin <- brick(tminfile, varname = 'tmin')
>>> >     pptfile <- paste("ppt", "_", year, ".nc", sep = '')
>>> >     b_ppt <- brick(pptfile, varname = 'ppt')
>>> >     tmaxfile <- paste("tmax", "_", year, ".nc", sep = '')
>>> >     b_tmax <- brick(tmaxfile, varname = 'tmax')
>>> >
>>> >     print(paste("processing year :", year, sep = ''))
>>> >
>>> >     for (l in 1:length(pl))
>>> >     {
>>> >       # file name convention N(latitude)W(longitude).wth,
>>> >       # with 5 digits after the decimal point
>>> >       filename <- paste("e:/PRISM/US3/N",
>>> >                         round(coordinates(pl[l, ])[2], 5), "W",
>>> >                         abs(round(coordinates(pl[l, ])[1], 5)),
>>> >                         ".wth", sep = '')
>>> >       print(paste("processing file :", filename, sep = ''))
>>> >
>>> >       tmin <- as.numeric(round(extract(b_tmin, coordinates(pl[l, ])), digits = 1))
>>> >       tmax <- as.numeric(round(extract(b_tmax, coordinates(pl[l, ])), digits = 1))
>>> >       ppt  <- as.numeric(round(extract(b_ppt,  coordinates(pl[l, ])), digits = 2))
>>> >
>>> >       v <- data.frame(tmax, tmin, ppt)
>>> >       v["default"] <- 0
>>> >       v["year"] <- year
>>> >
>>> >       date <- seq(as.Date(paste(year, "/1/1", sep = '')),
>>> >                   as.Date(paste(year, "/12/31", sep = '')), "days")
>>> >       v["month"] <- as.numeric(substr(date, 6, 7))
>>> >       v["day"]   <- as.numeric(substr(date, 9, 10))
>>> >       v <- v[c("year", "month", "day", "default", "tmin", "tmax", "ppt")]
>>> >
>>> >       # write to the file in the APEX fixed-width format,
>>> >       # appending if it already exists
>>> >       write.fwf(x = v, filename, append = file.exists(filename), na = "NA",
>>> >                 rownames = FALSE, colnames = FALSE,
>>> >                 width = c(6, 3, 3, 5, 5, 5, 6))
>>> >     }
>>> >   }
>>> > }
>>> >
>>> >
>>> >
>>> > Then I ran the following code, opening several R sessions to make the
>>> > process "parallel":
>>> >
>>> >
>>> >
>>> > States <- c("AL", "AZ", "AR", "CO", "CT", "DC", "FL", "GA", "ID", "IA",
>>> >             "IN", "KS", "KY", "LA", "MA", "ME", "MI", "MN", "MT", "NC")
>>> >
>>> > for (state in States)
>>> > {
>>> >   weather4Point(state)
>>> > }
>>> >
>>> >
>>> >
>>> > However, this still feels slow when I plan to do the whole U.S.  Is there
>>> > a better solution to speed up this process (any possibility of employing
>>> > parallel programming in R)?
>>> >
>>> >
>>> >
>>> > Environment information:
>>> >
>>> > R version 3.0.1 (2013-05-16)
>>> >
>>> > Platform: x86_64-w64-mingw32/x64 (64-bit)
>>> >
>>> > Memory: 16 GB
>>> >
>>> > Processor: Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz
>>> >
>>> >
>>> >
>>> > Looking forward to any suggestions.
>>> >
>>> >
>>> >
>>> > Regards,
>>> >
>>> >
>>> >
>>> > Ping
>>> >
>
>
> --
> Julian Mariano Burgos, PhD
> Hafrannsóknastofnun/Marine Research Institute
> Skúlagata 4, 121 Reykjavík, Iceland
> Sími/Telephone : +354-5752037
> Bréfsími/Telefax:  +354-5752001
> Netfang/Email: julian at hafro.is
>



-- 
Jonathan A. Greenberg, PhD
Assistant Professor
Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
Department of Geography and Geographic Information Science
University of Illinois at Urbana-Champaign
259 Computing Applications Building, MC-150
605 East Springfield Avenue
Champaign, IL  61820-6371
Phone: 217-300-1924
http://www.geog.illinois.edu/~jgrn/
AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007


