[R] Please help

Mon Mar 31 19:28:45 CEST 2014

Hi,
replace `lst2` with:
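library(stringr)  # provides str_trim()

# Subset of the data: the first 25 lines plus lines 18707-18708,
# which include the record that triggered the error
lst1Sub <- lapply(lst1Not1970, function(x) x[c(1:25, 18707:18708)])

lst2 <- lapply(lst1Sub, function(x) {
  # Date and site code: everything up to the last "G" plus the three characters after it
  dateSite <- gsub("(.*G.{3}).*", "\\1", x)
  dat1 <- data.frame(Year = as.numeric(substr(dateSite, 1, 4)),
                     Month = as.numeric(substr(dateSite, 5, 6)),
                     Day = as.numeric(substr(dateSite, 7, 8)),
                     Site = substr(dateSite, 9, 12),
                     stringsAsFactors = FALSE)
  # The rest of the line holds Precipitation, Tmin and Tmax
  Sims <- str_trim(gsub(".*G.{3}\\s?(.*)", "\\1", x))
  # Insert a space where two values run together, e.g. "0.00-14.74"
  idx <- grep("\\d+-", Sims)
  Sims[idx] <- gsub("(.*)([- ][0-9]+\\.[0-9]+)", "\\1 \\2",
                    gsub("^([0-9]+\\.[0-9]+)(.*)", "\\1 \\2", Sims[idx]))
  Sims1 <- read.table(text = Sims, header = FALSE)
  names(Sims1) <- c("Precipitation", "Tmin", "Tmax")
  cbind(dat1, Sims1)
})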
lapply(lst2,tail,3)
[[1]]
   Year Month Day Site Precipitation   Tmin  Tmax
25 1971     1   1 GG25          0.36 -14.32  3.87
26 1971     6   5 G107        144.09  11.25 30.44
27 1971     6   5 G108          0.66   9.33 32.96
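
One caveat that may be worth checking on the full data: because `.*` is greedy, `G.{3}` anchors on the last `G` in the line, so for site codes such as GGG1 the three characters after it reach into the precipitation field. A minimal sketch with a made-up record (the line below is hypothetical, not taken from the files), compared against slicing by the column positions described later in this thread:

```r
# Hypothetical record: site GGG1 with a two-digit precipitation value
rec <- "1971 6 5GGG1 13.89 -3.94  7.90"

# Pattern from the code above: the last "G" is matched, and ".{3}" then
# consumes "1 1", i.e. the site digit plus part of the precipitation field
by_regex <- trimws(gsub(".*G.{3}\\s?(.*)", "\\1", rec))
by_regex                      # "3.89 -3.94  7.90"

# Slicing columns 13-18 (assumed field position) keeps the full value
by_position <- as.numeric(substr(rec, 13, 18))
by_position                   # 13.89
```

If such records exist in the real files, the regex result would still parse as three numbers, so the problem would not show up as an error.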

A.K.

On Monday, March 31, 2014 2:35 AM, Zilefac Elvis <zilefacelvis at yahoo.com> wrote:

Hi AK,
I figured out that the error is from "Sim1971-2000_Daily_Sim001.dat".
The other files had no error when I ran this section of the code which detects an error:

               lst2 <- lapply(lst1Sub,
               function(x) {dateSite <- gsub("(.*G\\d+).*","\\1",x); 
                            dat1 <- data.frame(Year=as.numeric(substr(dateSite,1,4)),Month=as.numeric(substr(dateSite,5,6)),Day=as.numeric(substr(dateSite,7,8)),Site=substr(dateSite,9,12),stringsAsFactors=FALSE);
                            Sims <- gsub(".*G\\d+\\s+(.*)","\\1",x); Sims[grep("\\d+-",Sims)] <- gsub("(.*)([- ][0-9]+\\.[0-9]+)","\\1 \\2",gsub("^([0-9]+\\.[0-9]+)(.*)","\\1 \\2", Sims[grep("\\d+-",Sims)])); 
                            Sims1 <- read.table(text=Sims,header=FALSE); 
                            names(Sims1) <- c("Precipitation", "Tmin", "Tmax");dat2 <- cbind(dat1,Sims1)})

I examined line 18707 of lst1Sub, built from "Sim1971-2000_Daily_Sim001.dat" only.
It reads 1971 6 5G107144.09 11.25 30.44. When I replace 144.09 with 44.09, the code runs without error. 144.09 is a very high value, but that is what the simulation from the calibrated model produced. In most cases, Precip values have two digits before the decimal point. However, in some cases, as above, there can be three digits before the decimal point.
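
The failure on this record can be reproduced in isolation. A minimal sketch, assuming the fixed-width layout described elsewhere in this thread (site code in columns 9-12, followed by three 6-character fields); `rec` and `parse_fixed` are names made up for illustration:

```r
# The record at line 18707, copied from the message above
rec <- "1971 6 5G107144.09 11.25 30.44"

# "G\\d+" is greedy, so it runs through "G107144" and no whitespace follows;
# the pattern fails to match and gsub() returns the line unchanged
old <- gsub(".*G\\d+\\s+(.*)", "\\1", rec)
identical(old, rec)   # TRUE: read.table() then sees 5 fields instead of 3

# Slicing by column position does not depend on separators at all
# (hypothetical helper; the 13-18 / 19-24 / 25-30 positions are an assumption)
parse_fixed <- function(x) {
  data.frame(Site          = substr(x, 9, 12),
             Precipitation = as.numeric(substr(x, 13, 18)),
             Tmin          = as.numeric(substr(x, 19, 24)),
             Tmax          = as.numeric(substr(x, 25, 30)),
             stringsAsFactors = FALSE)
}
parsed <- parse_fixed(rec)
parsed
```

The three-digit precipitation value fills its whole field, which is why no space separates it from the site code.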

How can we avoid the error and read the data as is? Please try to include the bold text in the code below and see what happens:

lst2 <- lapply(lst1Sub,
               function(x) {dateSite <- gsub("(.*G\\d+).*","\\1",x); 
                            dat1 <- data.frame(Year=as.numeric(substr(dateSite,1,4)),Month=as.numeric(substr(dateSite,5,6)),Day=as.numeric(substr(dateSite,7,8)),Site=substr(dateSite,9,12)),Precipitation=substr(dateSite,13,18)),Tmin=substr(dateSite,14,24)),Tmax=substr(dateSite,25,30),stringsAsFactors=FALSE);
                            Sims <- gsub(".*G\\d+\\s+(.*)","\\1",x); Sims[grep("\\d+-",Sims)] <- gsub("(.*)([- ][0-9]+\\.[0-9]+)","\\1 \\2",gsub("^([0-9]+\\.[0-9]+)(.*)","\\1 \\2", Sims[grep("\\d+-",Sims)])); 
                            Sims1 <- read.table(text=Sims,header=FALSE); 
                            names(Sims1) <- c("Precipitation", "Tmin", "Tmax");dat2 <- cbind(dat1,Sims1)})

Thanks AK.
Atem.
On Sunday, March 30, 2014 11:01 PM, Zilefac Elvis <zilefacelvis at yahoo.com> wrote:

Hi AK,
You did just what I wanted. I tried it using this subset:
#Using a small subset:
lst1Sub <- lapply(lst1Not1970,function(x) x[1:1000]) 

and it worked so well.

However, I would like to do it for all the data, so I changed x[1:1000]
to lst1Sub <- lapply(lst1Not1970,function(x) x) # did I make a mistake here?
and got an error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 18707 did not have 3 elements

Where could the mistake be coming from? 
I tried lst1Sub <- lapply(lst1Not1970,function(x) x[1:18708])
but encountered same error.

lst1Sub <- lapply(lst1Not1970,function(x) x[1:18706]) works perfect.

I have opened all the three files and checked line 18707 but found nothing wrong with the values. Please help.
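
One way to narrow this kind of error down is to count the whitespace-separated fields on each line before calling `read.table()`. A minimal sketch using `count.fields()` from the utils package; the vector `sims` is a made-up stand-in for the text that gets passed to `read.table(text = ...)`:

```r
# Stand-in for the strings handed to read.table(); the third entry mimics
# a line where the date/site prefix was not stripped
sims <- c("0.48 -3.25 13.69",
          "0.00 -9.82 10.75",
          "1971 6 5G107144.09 11.25 30.44")

con <- textConnection(sims)
n_fields <- count.fields(con, sep = "")   # fields per line, split on whitespace
close(con)

bad <- which(n_fields != 3)   # indices of lines that would break read.table()
sims[bad]
```

Printing the offending strings rather than the raw file lines shows what `read.table()` actually receives after the regular expressions have run.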

Thanks,
Atem.
On Sunday, March 30, 2014 7:21 PM, arun <smartpink111 at yahoo.com> wrote:

I did exactly as you described, but on a smaller dataset, as the full run takes time.  Also, the formatting of your dataset is not very consistent, especially in the Precipitation, Tmin, and Tmax columns.  For example, some values are:
0.48-2.14 -1.48
1.48 -2.12-1.21

Note where the space is missing in each of the two lines above.  Anyway, I did change those in the subset dataset.  I am not sure whether there are other problems in your original dataset.
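
Because the files are written with fixed-width fields, values that run together (such as `0.00-14.74`) can also be separated by column position rather than by whitespace. A rough sketch with `read.fwf()`, assuming field widths of 4, 2, 2, 4, 6, 6, 6 characters and the column order Precipitation, Tmin, Tmax; the sample lines are the ones quoted elsewhere in this thread:

```r
lines <- c("1989 4 5GG38  0.48 -3.25 13.69",
           "1989 4 5GG39  0.00 -9.82 10.75",
           "1989 4 5GG40  0.00-14.74 -1.13",
           "1989 4 5GG41  0.00 -4.37  8.06")

con <- textConnection(lines)
dat <- read.fwf(con,
                widths = c(4, 2, 2, 4, 6, 6, 6),   # assumed layout
                col.names = c("Year", "Month", "Day", "Site",
                              "Precipitation", "Tmin", "Tmax"),
                stringsAsFactors = FALSE)
close(con)

dat[3, ]   # the run-together record is split into three numeric values
```

For a file on disk, the file name can be passed to `read.fwf()` directly instead of a text connection.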
Arun

On Sunday, March 30, 2014 7:09 PM, Zilefac Elvis <zilefacelvis at yahoo.com> wrote:

Hi AK,
I will try the code you just sent when I reach home.
However, let me use the example you just provided and be clearer on how the output should look.
list.files(pattern="Sim1971-2000")
#[1] "Sim1971-2000_Daily_Sim001.dat" "Sim1971-2000_Daily_Sim002.dat"
#[3] "Sim1971-2000_Daily_Sim003.dat"

The above 3 files represent 3 simulations.
In the three folders, I will have 120 files per folder.
As you said:
1989 4 5GG38 0.48 -3.25 13.69 
#represents:
year month day site Precipitation Tmax Tmin.
So, in the output, I wanted in the:
#Precipitation folder
1989 4 5GG38  0.48 
1989 4 5GG39  0.00 
1989 4 5GG40  0.00
1989 4 5GG41  0.00 

But the individual files should have as filenames:
filename38=GG38
filename40=GG40 and so on for 120 sites.
The contents of each file should be Year, Month, Day, sim001, sim002, sim003.
So, take all the site codes and use them to name the files in each folder. For each site code, there are precipitation values for 1971-2005 from sim001, sim002 and sim003.

In essence, I had 120 sites, and each site had Precipitation, Tmin and Tmax. I did 3 simulations and put them in the 'sample' file. The simulations cover the 1971-2005 period.
Now I want to take each site and, for each variable (in 3 folders), create a dataframe where the 3 simulations are stored with colnames year, Month, Day, sim001, sim002, sim003. Do this for Precip, Tmin and Tmax separately.
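
The layout described above could be sketched as follows. This is only an illustration under stated assumptions: `make_sim`, `sim_list` and `tables_by_site` are hypothetical names, the values are made up, and every simulation is assumed to list the same dates and sites in the same order.

```r
# Made-up stand-ins for three parsed simulation files (two sites, two days);
# in practice each element would come from one Sim1971-2000_Daily_ file
make_sim <- function(offset) {
  data.frame(Year = 1989, Month = 4,
             Day  = rep(5:6, each = 2),
             Site = rep(c("GG38", "GG39"), times = 2),
             Precipitation = c(0.48, 0.00, 0.25, 0.10) + offset,
             stringsAsFactors = FALSE)
}
sim_list <- list(sim001 = make_sim(0), sim002 = make_sim(1), sim003 = make_sim(2))

# For one variable, build one data frame per site with a column per simulation
tables_by_site <- function(sim_list, variable) {
  first <- sim_list[[1]]
  sites <- unique(first$Site)
  out <- lapply(sites, function(s) {
    res <- first[first$Site == s, c("Year", "Month", "Day")]
    for (nm in names(sim_list)) {
      d <- sim_list[[nm]]
      res[[nm]] <- d[d$Site == s, variable]
    }
    rownames(res) <- NULL
    res
  })
  names(out) <- sites
  out
}

precip <- tables_by_site(sim_list, "Precipitation")

# Write one file per site into a folder named after the variable
out_dir <- file.path(tempdir(), "Precipitation")
dir.create(out_dir, showWarnings = FALSE)
for (s in names(precip)) {
  write.table(precip[[s]], file.path(out_dir, paste0(s, ".txt")),
              row.names = FALSE, quote = FALSE)
}
```

Calling `tables_by_site()` again with "Tmin" or "Tmax" (once those columns are present) would fill the other two folders in the same way.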
#############################################################################

#Tmin folder
1989 4 5GG38   -3.25 
1989 4 5GG39   -9.82 
1989 4 5GG40  -14.74 
1989 4 5GG41   -4.37  

.... do same as for precipitation

#Tmax folder
1989 4 5GG38   13.69
1989 4 5GG39   10.75
1989 4 5GG40   -1.13
1989 4 5GG41   8.06

Do same as for precipitation

Thanks very much,
Atem.
On Sunday, March 30, 2014 2:11 PM, arun <smartpink111 at yahoo.com> wrote:

HI,
I have one more doubt.  You mentioned 120 files in the Precipitation folder, and similarly that many files in Tmin and Tmax.  In the "sample" folder, you have simulation files:

list.files(pattern="Sim1971-2000")
#[1] "Sim1971-2000_Daily_Sim001.dat" "Sim1971-2000_Daily_Sim002.dat"
#[3] "Sim1971-2000_Daily_Sim003.dat" 

As I understand the problem, you would have 120*3 ie. 360 files each for Precipitation, Tmin and Tmax from the Simulation datasets.  Please be clear about what you wanted.
Arun

On Sunday, March 30, 2014 1:34 PM, arun <smartpink111 at yahoo.com> wrote:
HI,
Just to be clear:

1989 4 5GG38
0.48 -3.25 13.69 
#represents:
year month day site Precipitation Tmax Tmin.
So, in the output, you wanted:
Precipitation:
1989 4 5GG38 0.48
Tmax:
1989 4 5GG38 -3.25
Tmin:
1989 4 5GG38 13.69

Instead of long description, if you have just what you wanted just like above, we wouldn't have to do this back and forth emails.

Also, you mentioned a lot of simulations (sim1 to sim10).  According to your statement:
" For example, I will take precip from site GGG1 and have a data frame with colnames such as Year,Month,Day, sim1,sim2,...,sim100. Repeat this for all 120 sites. So that for Precip, you will have 120 files corresponding to the site codes. Each file has nrows with Year,Month,Day, sim1...sim100 columns."

What I understand is that in the Precipitation folder, there are 120 files:
For example (using the same data):
1989 4 5GGG1 0.48

1989 4 6GGG1 0.25
------------------

#2nd precipitation file:

1989 4 5GGG2 0.74
1989 4 6GGG2 0.84

etc.

Now, back to the Sim1 ....sim100.  

indxdat <- cbind(paste0("Sim",1:100),rep(c("Precipitation","Tmax","Tmin"),length.out=100)) 

indxdat[1:5,]
     [,1]   [,2]
[1,] "Sim1" "Precipitation"
[2,] "Sim2" "Tmax"
[3,] "Sim3" "Tmin"
[4,] "Sim4" "Precipitation"
[5,] "Sim5" "Tmax"

In your original file, if this is how the values are repeated, then:
In your result dataset:
Precipitation folders contain:
year month day Site Sim1 Sim4 Sim7 ....Sim100

Tmax folder:
year month day Site Sim2 Sim5 ....

Let me know if this is what you wanted the output.
Arun

Also, if you respect the positions I gave you, the variables will be perfectly split. Precipitation has no - (minus). The positions ensure that values do not cross from one variable to another when the splitting is done.

Thanks, Atem.

------ Original Message ------

From : arun
>To : Zilefac Elvis;
>Sent : 30-03-2014 02:14
>Subject : Re: Re: Please help
>
>HI Atem,
It is still not clear.  You mentioned that Precipitation occupies positions 13-18. But, in the file, after the site, it is "Sim1", "Sim2", "Sim3", etc.  So, I am not sure what you are referring to as Precipitation.
Tell me, in this data:
1989 4 5GG38  0.48 -3.25 13.69
1989 4 5GG39  0.00 -9.82 10.75
1989 4 5GG40  0.00-14.74 -1.13
1989 4 5GG41  0.00 -4.37  8.06
which one is Precipitation, Tmax, and Tmin?
Arun

On Sunday, March 30, 2014 1:38 AM, Zilefac Elvis wrote:

Hi AK,
> I was able to download the files.  In those files, the formatting is not consistent.

I am glad you finally downloaded the files. As I indicated in the email description, the analysis runs from 1971 to 2005. Any values before 1971 are meaningless; I used 1970 to initialize my simulation. So, let's start from 1971 onwards.

> Also, I didn't quite understand about splitting by Precipitation, Tmin, Tmax.

If you look at this output:
1989 4 5GG38  0.48 -3.25 13.69
1989 4 5GG39  0.00 -9.82 10.75
1989 4 5GG40  0.00-14.74 -1.13
1989 4 5GG41  0.00 -4.37  8.06
The first 4 characters represent the Year, and the next two the Month. For example, April is coded as 04, but the zero is just a space, and December is coded as 12. After the month, the next two characters represent the Day (1 to 31/30/28, depending on the month of the year). The GGGs represent the site code. For example, site 1 = GGG1 and site 120 = G120.

Now, if you open one of the Sim1971-2000_Daily_ files in an editor, you will see that it is written for a Fortran-style read. For example, in all the files (see code below), "Year" occupies positions 1-4, "Month" positions 5-6, "Day" positions 7-8, "Site" positions 9-12, Precipitation positions 13-18, Tmin positions 19-24, and Tmax positions 25-30. In another project, I read such files into the R workspace using this code:
rain.data <- scan("gaugvals.all",what=character(),sep="\n") # change 'gaugvals.all' to file names in your directory
rain.data <- data.frame(Year=as.numeric(substr(rain.data,1,4)),
                        Month=as.numeric(substr(rain.data,5,6)),
                        Day=as.numeric(substr(rain.data,7,8)),
                        Site=substr(rain.data,9,12),
                        Precip=as.numeric(substr(rain.data,13,18)),
                        Tmin=as.numeric(substr(rain.data,19,24)),
                        Tmax=as.numeric(substr(rain.data,25,30)))
Now you should begin to get a feel for the data coding and how to split precipitation, Tmin and Tmax.

> You mentioned that the columns are Year month date Sim1, Sim2, Sim3.  So, where is the info to split to the three folders?

The original data file, before I did the simulation, was a dataframe which you helped me to re-arrange following the instructions I gave you. You used this code to reshape the data into the format which now appears in the Sim1971-2000_Daily_ files:
dat1 <- read.table("predictand.csv",header=TRUE,stringsAsFactors=TRUE,sep="\t") # predictand.csv had 123 columns, with columns 1,2,3 as the date
dat1<-precipitation
dat2M <- melt(dat1,id.var=c("year","month","day"))
dat2M1 <- dat2M[with(dat2M,order(year,month,day,variable)),]
dim(dat2M1)
#[1] 1972320       5
row.names(dat2M1) <- 1:nrow(dat2M1)
PrecipTminTmax<-cbind(precipitation,Tmin,Tmax)

So you can see that here we reshaped the original data to [year,month,day,site,variable]. I did this for Precip, Tmin and Tmax separately and then combined them using PrecipTminTmax<-cbind(precipitation,Tmin,Tmax). This is just how the sim files are structured. Our task now is to do the opposite of the above code and undo cbind(precipitation,Tmin,Tmax), so that Precip, Tmin and Tmax will have separate folders. In each of the 3 folders, there will be 120 files named by site codes. Each final file has nrows with Year,Month,Day, sim1...sim100 columns. But for the sample data I sent you, I think there are only 3 simulations, so we will have Year,Month,Day, sim1, sim2, sim3 columns as the final output. Let me know if you get a feel for what I am trying to achieve.
Thanks very much AK.
Atem.

On Saturday, March 29, 2014 9:30 PM, arun wrote:

HI Atem,
I was able to download the files.  In those files, the formatting is not consistent.
1989 4 5GG38  0.48 -3.25 13.69
1989 4 5GG39  0.00 -9.82 10.75
1989 4 5GG40  0.00-14.74 -1.13
1989 4 5GG41  0.00 -4.37  8.06
Compared to:
19701228GGG1  3.89 -3.94  7.90
19701228GGG2  3.89 -3.94  7.90
19701228GGG3  3.89 -3.94  7.90
19701228GGG4  3.89 -3.94  7.90
Also, I didn't quite understand about splitting by Precipitation, Tmin, Tmax.  You mentioned that the columns are Year month date Sim1, Sim2, Sim3.  So, where is the info to split to the three folders?
Arun

On Friday, March 28, 2014 1:56 AM, Zilefac Elvis wrote:

Hi AK,
Attached is a sample from the large file. The expected output is explained at the end of this message (bold).
It is a little lengthy, but it is worth it given the large number of sites. I have attached three simulations, so you will have sim1,sim2,sim3 instead of sim1 to sim100 as in the previous message.
############################################################################
I have done some simulations in R and would like to arrange my data in a usable format.
The data is too large to attach, so I have shared it via Dropbox.
When you load Calibration.RData into the workspace, you will find the site codes (column 1) in "Prairies.Sites".
My initial dataset was in the form of a dataframe with columns denoting stations. So I had three dataframes, one each for precipitation, Tmin, and Tmax. You reshaped the dataframes individually into three column vectors (see file called PrecipTminTmax) using this code:
library(reshape2)
dat1 <- read.table("predictand.csv",header=TRUE,stringsAsFactors=TRUE,sep="\t") # predictand.csv had 123 columns, with columns 1,2,3 as the date
dat1<-precipitation
dat2M <- melt(dat1,id.var=c("year","month","day"))
dat2M1 <- dat2M[with(dat2M,order(year,month,day,variable)),]
dim(dat2M1)
#[1] 1972320       5
row.names(dat2M1) <- 1:nrow(dat2M1)
PrecipTminTmax<-cbind(precipitation,Tmin,Tmax)

The problem to be solved

Attached is a large file (SimCalibration.zip) containing my simulations (001 to 100). Please import files starting with "Sim1971-2000_Daily_" only. The rest is not important. My analysis is for the period 1971-2000. Any data before or after this period should be ignored.
My simulation was done in R using Fortran encoding to read data values. All files are ".dat".
In each file, the columns are as follows:
Year, Month, Day, Site, Precip, Tmin, Tmax.

In another project involving rainfall only, I read such files into R using this code:
rain.data <- scan("gaugvals.all",what=character(),sep="\n",n=257212)
rain.data <- data.frame(Year=as.numeric(substr(rain.data,1,4)),
                        Month=as.numeric(substr(rain.data,5,6)),
                        Day=as.numeric(substr(rain.data,7,8)),
                        Site=substr(rain.data,10,12),
                        Rain=as.numeric(substr(rain.data,13,18)))

Q1) So, I would like to read all files beginning with "Sim1971-2000_Daily_".
2) Split each file by variable name (Precip, Tmin, Tmax) and then arrange each variable in the form of a dataframe. For example, I will take precip from site GGG1 and have a data frame with colnames such as Year,Month,Day, sim1,sim2,...,sim100. Repeat this for all 120 sites. So that for Precip, you will have 120 files corresponding to the site codes. Each file has nrows with Year,Month,Day, sim1...sim100 columns.
3) Please repeat the above for Tmin and Tmax so that in the end I will have three folders (Precip, Tmin and Tmax). Each folder has 120 files, with each file being a dataframe containing the date and 100 columns.

When you successfully go through this "difficult" section, I will access each folder, read each file and apply a function to it one at a time.
Thanks AK, this is part of my MSc thesis project. Your help will be fully acknowledged. You have helped me a lot towards the success of this project.
Atem.

On Thursday, March 27, 2014 9:09 PM, arun wrote:

HI Atem,
I tried to download the first file.
It is taking me forever.  With the speed I have, I doubt it would be successful.  Can you just provide some small reproducible example data and what your expected output would be?
Arun

On Thursday, March 27, 2014 9:50 AM, "zilefacelvis at yahoo.com" wrote:

Oh! Hope you had a safe trip.  No problem AK. Please try and see what you can do. I will be waiting.  Have a great time in a beautiful country.
Atem.

------ Original Message ------

From : arun
>To : Zilefac Elvis;
>Sent : 27-03-2014 04:07
>Subject : Re: Please help
>
>HI Zilefac,
I was on a flight.  Right now, I am in India.  At my place, the connection is not fast enough to download large files.  I will try later and see if it works.
Arun

On Thursday, March 27, 2014 12:54 AM, Zilefac Elvis wrote:

Hi AK,
Please I need your help again.
I have done some simulations in R and would like to arrange my data in a usable format.
The data is too large to attach, so I have shared it via Dropbox.
When you load Calibration.RData into the workspace, you will find the site codes (column 1) in "Prairies.Sites".
My initial dataset was in the form of a dataframe with columns denoting stations. So I had three dataframes, one each for precipitation, Tmin, and Tmax. You reshaped the dataframes individually into three column vectors (see file called PrecipTminTmax) using this code:
library(reshape2)
dat1 <- read.table("predictand.csv",header=TRUE,stringsAsFactors=TRUE,sep="\t") # predictand.csv had 123 columns, with columns 1,2,3 as the date
dat1<-precipitation
dat2M <- melt(dat1,id.var=c("year","month","day"))
dat2M1 <- dat2M[with(dat2M,order(year,month,day,variable)),]
dim(dat2M1)
#[1] 1972320       5
row.names(dat2M1) <- 1:nrow(dat2M1)
PrecipTminTmax<-cbind(precipitation,Tmin,Tmax)

The problem to be solved

Attached is a large file (SimCalibration.zip) containing my simulations (001 to 100). Please import files starting with "Sim1971-2000_Daily_" only. The rest is not important. My analysis is for the period 1971-2000. Any data before or after this period should be ignored.
My simulation was done in R using Fortran encoding to read data values. All files are ".dat".
In each file, the columns are as follows:
Year, Month, Day, Site, Precip, Tmin, Tmax

In another project involving rainfall only, I read such files into R using this code:
rain.data <- scan("gaugvals.all",what=character(),sep="\n",n=257212)
rain.data <- data.frame(Year=as.numeric(substr(rain.data,1,4)),
                        Month=as.numeric(substr(rain.data,5,6)),
                        Day=as.numeric(substr(rain.data,7,8)),
                        Site=substr(rain.data,10,12),
                        Rain=as.numeric(substr(rain.data,13,18)))

Q1) So, I would like to read all files beginning with "Sim1971-2000_Daily_".
2) Split each file by variable name (Precip, Tmin, Tmax) and then arrange each variable in the form of a dataframe. For example, I will take precip from site GGG1 and have a data frame with colnames such as Year,Month,Day, sim1,sim2,...,sim100. Repeat this for all 120 sites. So that for Precip, you will have 120 files corresponding to the site codes. Each file has nrows with Year,Month,Day, sim1...sim100 columns.
3) Please repeat the above for Tmin and Tmax so that in the end I will have three folders (Precip, Tmin and Tmax). Each folder has 120 files, with each file being a dataframe containing the date and 100 columns.

When you successfully go through this "difficult" section, I will access each folder, read each file and apply a function to it one at a time.
Thanks AK, this is part of my MSc thesis project. Your help will be fully acknowledged. You have helped me a lot towards the success of this project.
Atem.