[R] Splitting strings in data files R

David Winsemius dwinsemius at comcast.net
Thu Jan 21 03:33:55 CET 2016


> On Jan 20, 2016, at 3:33 PM, Zilefac Elvis <zilefacelvis at yahoo.com> wrote:
> 
> I did not want to include attachments but as they are requested I am attaching the original files.
> File1=dx701S001
> 
> File2= dt402DAF0

These are fixed-width files. The upper-left corner of the file looks like this in a text editor:

402DAF0,LEADER AIRPORT           ,SK,station joined    ,Daily adjusted precipitation, mm, Updated to December 2014
1923  1 -9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-
1923  2 -9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-
1923  3 -9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-
1923  4     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00  
1923  5     0.00     0.00     0.00     0.00    10.83     0.00     0.00     0.00     0.00     0.00     0.00     0.00  
1923  6     6.87     0.00     0.00     0.00     0.00     0.00     5.52     0.00     0.00     0.00     0.00     0.00  
1923  7     0.00     0.00     0.00     0.00     2.09     0.00     7.91     1.57     3.65     0.00     0.00     0.00  
1923  8     0.00     0.00     0.00     0.00     0.00     0.00     0.00     8.12     0.00     0.00     0.00     0.00  
1923  9     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00  
1923 10     0.00     0.00     0.00     0.00     0.00     0.00     0.00     4.17     0.00     0.00     0.00     0.00  
1923 11     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00  
1923 12     0.11T    0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00  
1924  1     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     6.27     0.00  
1924  2     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00  

So the file authors' convention is to put the numeric values at fixed positions (each eight characters wide, following the year and month fields), with interleaved one-character columns that carry some sort of annotation. Two annotation types are visible here, "M" and "T", and I can see several others (A and C at the very least) elsewhere in the file in a text editor. "M" clearly means missing, and -9999.99 is clearly the missing-value code that goes with it. What "T" means cannot be told from the file.
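
To make the layout concrete, here is a minimal sketch that picks one line apart by position with substring. The column positions are my reading of the excerpt above, so treat them as an assumption to check against the real files. The line is made up: the first two values match the start of the December 1923 row, and the third is a missing value added for illustration.

```r
# Assumed layout: year in columns 1-4, month in 5-7, a blank in column 8,
# then repeating pairs of an 8-character value and a 1-character flag.
line <- paste0("1923", " 12", " ",
               "    0.11", "T",
               "    0.00", " ",
               "-9999.99", "M")

year   <- as.integer(substring(line, 1, 4))
month  <- as.integer(substring(line, 5, 7))

starts <- 9 + 9 * (0:2)                       # first column of each value field
values <- as.numeric(substring(line, starts, starts + 7))
flags  <- substring(line, starts + 8, starts + 8)

print(data.frame(year, month, value = values, flag = flags))
```

Splitting on "-9999.99M" instead would throw away the positions, which are the only thing that tells you which day a value belongs to.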

You need to use the read.fwf function in package utils (shipped with every copy of R and attached by default). You may also want to loop through these files with readLines just to get the first line, which holds the station metadata and is clearly not a column-header line.
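
A minimal sketch of that approach, assuming the layout read off the excerpt above (year in columns 1-4, month in 5-7, one blank, then 31 pairs of an 8-character value and a 1-character flag). The helper names read_station and fmt_line, the column names, and the stand-in file with its values are all made up for illustration; read.fwf and readLines are the only pieces taken from R itself.

```r
# Widths are an assumption: year (4), month (3), one blank column
# (skipped via a negative width), then 31 x [value (8), flag (1)].
widths    <- c(4, 3, -1, rep(c(8, 1), 31))
day_cols  <- paste0("Day_", 1:31)
flag_cols <- paste0("Flag_", 1:31)
col_names <- c("Year", "Month", as.vector(rbind(day_cols, flag_cols)))

# Hypothetical helper: read one station file into a header string and a data frame.
read_station <- function(path) {
  header <- readLines(path, n = 1)   # station metadata, not column names
  dat <- read.fwf(path, widths = widths, skip = 1, col.names = col_names,
                  colClasses = c("integer", "integer",
                                 rep(c("numeric", "character"), 31)))
  # Values at or near -9999.99 are the missing-value code; flags are kept as read.
  dat[day_cols] <- lapply(dat[day_cols], function(v) replace(v, v < -9000, NA))
  list(header = header, data = dat)
}

# Stand-in file with made-up values so the sketch runs on its own.
fmt_line <- function(year, month, values, flags) {
  paste0(sprintf("%4d%3d ", year, month),
         paste0(sprintf("%8.2f", values), flags, collapse = ""))
}
tmp_dir <- tempfile("stations")
dir.create(tmp_dir)
writeLines(c("402DAF0,LEADER AIRPORT           ,SK,station joined    ,Daily adjusted precipitation, mm, Updated to December 2014",
             fmt_line(1923L, 1L,  rep(-9999.99, 31),   rep("M", 31)),
             fmt_line(1923L, 12L, c(0.11, rep(0, 30)), c("T", rep(" ", 30)))),
           file.path(tmp_dir, "dt402DAF0.txt"))

# Loop over every .txt file in the directory.
files    <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)
stations <- lapply(files, read_station)
str(stations[[1]]$data[, 1:6])
```

Dropping the flag columns afterwards (for example dat[c("Year", "Month", day_cols)]) would give the Year, Month, Day_1 ... Day_31 shape asked for. The temperature file appears to use narrower fields, so its widths vector would need to be worked out separately.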

-- 
David.

> 
> I read them into R using:
> 
> temp1 = list.files(pattern="\\.txt$") # list all text file names in your working directory
> myfiles = lapply(temp1, read.delim) # read each file (read.delim assumes tab-separated fields)
> 
> Then I started processing them with:
> 
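> library(stringr)  # provides str_split() and fixed()
> res <- lapply(temp1, function(x) {
>   Lines1 <- readLines(x)    # read every line of the file
>   Lines2 <- Lines1[-1]      # drop the first (station header) line
>   str_split(Lines2, fixed("-9999.99M"))  # split on the literal missing-value code
> })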
> 
> Thanks
> AT
> 
> 
> 
> 
> 
> 
> On Wednesday, January 20, 2016 4:47 PM, David Winsemius <dwinsemius at comcast.net> wrote:
> 
>> On Jan 20, 2016, at 12:53 PM, Zilefac Elvis via R-help <r-help at r-project.org> wrote:
>> 
>> 
>> 
>> 
>> Please, I need help processing files with strings in R. The files follow two patterns (so they should be examined separately):
> 
> You do need help, that much is clear. But the first thing to do is retrace your initial data-entry steps. You have used the wrong read function. The data in the input file are either whitespace-separated or in fixed-width format, and you apparently thought it was a CSV file.
> 
> Asking us to now work with this mess is just unreasonable. Post the original input file.
> 
> -- 
> David.
> 
> 
>> Pattern 1 (see file1 below): Delete lines 1, 2 and 4 in file1. Line 3 contains the column names. Then find any character annotation and delete it, but please do not delete any values (e.g. delete the T in 0.21T and keep 0.21). Also find -9999.9M and -9999.9 and replace them with NA.
>> 
>> File1 output format should be: Year Month Day_1 Day_2 ... Day_31  ## so all months should have 31 days. Months with fewer than 31 days should have NA where appropriate (e.g. Feb 30=NA, 31=NA)
>> 
>> Pattern 2 (see file2 below): Delete line 1 in file2. Then find any character annotation and delete it, but please do not delete any values (e.g. delete the T in 0.21T and keep 0.21). Also find -9999.99M and -9999.99 and replace them with NA. File2 has no column names. Please do not include any.
>> File2 output format: Year Month Day_1 Day_2 ... Day_31 but no column names
>> 
>> Here is a simple reproducible example for both files/cases: 
>> 
>> 
>> df1=list(structure(list(X7011982.....DONNACONA........QC..station.joined......Homogenized.daily.maximum.temperature..........Deg.Celcius...........Updated.to.December.2014 =structure(c(20L,19L,21L,1L,2L,3L,4L,5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,16L,17L,18L),.Label =c(
>> " 1918  7 -9999.9M-9999.9M-9999.9M-9999.9M-9999.9M-9999.9M-9999.9M-9999.9M-9999.9M-9999.9M-9999.9M-9999.9M-9999.9M-9999.9M-9999.9M-9999.9M  23.6a  25.9a  25.8a  24.9a  24.9a  29.6a  27.4a  24.5a  28.5a  28.5a  30.1a  25.3a  28.5a  19.6a  24.1a",
>> " 1918  8   23.7a  18.6a  17.6a  19.0a  23.7a  24.7a  18.6a  22.6a  20.1a  21.4a  22.6a  24.9a  24.1a  23.2a  22.0a  17.6a  19.0a  19.0a  23.7a  24.1a  24.9a  27.9a  26.2a  22.6a  24.0a  25.4a  21.4a  24.4a  19.0a  22.6a  23.7a",
>> " 1918  9   22.0a  22.0a  24.0a  19.0a  14.4a  11.2a  17.1a  18.1a  19.0a  12.0a  13.5a   9.6a  11.2a  10.7a  18.1a  18.1a  16.3a  14.4a  14.3a  15.9a  10.1a   9.8a  11.3a  11.4a  13.6a  14.4a   9.3a   9.6a   9.2a   8.4a-9999.9M",
>> " 1918 10    9.3a   9.5a  11.3a  10.2a   9.9a-9999.9M   4.6a-9999.9M   9.8a  13.6a  17.0a  15.2a  15.1a  15.9a   8.1a   9.3a   8.8a   6.0a   8.7a   9.8a   9.8a  10.7a  11.3a   9.5a  10.7a-9999.9M  10.7a  16.9a  17.1a  10.7a  12.7a",
>> " 1918 11    8.8a-9999.9M   4.0a   3.4a   4.0a   6.6a   4.0a   7.3a   8.1a   7.3a   2.5a   3.4a   7.7a   6.1a   2.2a   4.0a   4.6a   2.5a   2.2a   1.6a   2.2a   3.0a  -3.4a   2.5a   1.6a  -3.4a   2.1a   0.0a   2.6a   0.6a-9999.9M",
>> " 1918 12  -10.1a  -8.3a  -6.3a  -5.5a  -5.1a  -7.2a-9999.9M  -3.4a  -2.2a  -5.5a  -6.0a  -3.4a   0.6a   3.0a   4.7a   0.6a  -5.0a  -6.4a  -5.9a  -2.2a   1.2a   4.0a   5.3a-9999.9M  -2.2a  -5.5a  -7.5a  -9.6a  -7.3a  -6.6a-9999.9M",
>> " 1919  1    2.5a   0.0a  -7.3a  -6.7a  -6.6a  -9.2a  -5.9a  -0.7a  -2.9a  -13.2a  -8.0a  -17.1a  -7.4a  -4.0a  -5.5a   0.6a  -7.1a  -5.5a  -2.2a  -7.6a  -7.0a  -3.4a  -2.2a  -6.7a  -8.0a  -2.9a  -1.5a  -5.9a  -5.5a  -5.8a  -3.4a",
>> " 1919  2 -9999.9M   0.0a  -4.0a  -3.4a  -1.5a  -2.1a  -3.4a  -7.2a  -2.8a  -5.5a  -6.7a  -5.1a  -2.1a  -2.1a   1.2a  -2.1a  -5.9a  -2.8a  -4.5a  -4.5a  -3.4a   2.1a   0.0a   0.0a   1.2a  -2.1a  -8.3a  -6.6a-9999.9M-9999.9M-9999.9M",
>> " 1919  3 -9999.9M   0.0a   1.6a   1.7a  -5.1a  -6.7a  -5.1a  -3.4a  -2.0a   1.2a   3.4a   1.2a  -8.3a  -7.9a  -3.4a  -2.0a   1.2a   3.4a   6.6a   1.2a   6.6a   1.2a   6.6a   6.6a   3.4a  10.7a   6.6a   6.6a   0.6a   0.6a  -6.8a",
>> " 1919  4   -5.9a  -3.4a   2.1a   3.4a   3.0a   3.0a   8.1a   8.1a   6.0a   2.1a   6.6a   8.5a   6.0a   1.7a   4.7a   2.4a   2.1a   8.5a   1.2a   1.7a   9.6a   8.6a  12.8a   9.5a  -2.8a   9.8a   4.7a  10.7a   6.6a  11.2a-9999.9M",
>> " 1919  5   16.4a   8.5a   9.4a   6.0a  10.7a   9.8a   8.5a  13.6a  14.4a-9999.9M  16.4a  19.0a  23.2a  16.9a  17.0a  19.6a  11.3a   9.4a  12.1a  17.2a  15.2a  17.0a  15.2a  17.5a  10.2a  22.6a  14.5a  22.0a  24.9a  23.8a  19.0a",
>> " 1919  6   17.7a  25.4a  31.2a  25.3a  26.8a  22.0a  15.8a  19.0a  12.7a  19.6a  19.0a  24.5a  25.1a  27.4a  26.8a  19.0a  20.8a  26.8a  27.9a  25.8a  20.1a  17.7a  19.0a  32.4a  30.7a  22.6a  19.0a  13.6a  17.5a  24.1a-9999.9M",
>> " 1919  7   23.7a  24.4a  27.9a  29.6a  23.7a  21.3a  23.7a  20.1a  23.7a  21.3a  17.0a  17.8a  23.7a  27.4a  18.2a  23.2a  24.5a  26.2a  25.8a  27.9a  29.0a  25.3a  25.1a  23.9a  22.6a  23.9a  20.8a  25.8a  20.1a  23.2a  23.7a",
>> " 1919  8   20.8a  18.2a  20.1a  20.1a  25.1a  20.8a  24.6a  18.5a  17.6a  22.0a  24.0a  23.2a  24.0a  24.0a  20.8a  24.0a  23.7a  23.8a  17.1a  23.8a  24.6a  23.8a  19.6a  24.0a  24.0a  16.9a  18.2a  18.6a  18.6a  23.2a  20.8a",
>> " 1919  9   24.0a  21.3a  24.4a  18.1a  19.0a  19.0a  17.7a  11.4a  10.7a  12.7a  15.2a  15.2a  18.6a  12.7a  15.2a  10.1a  12.0a  12.7a  19.6a  18.5a  28.5a  28.5a  10.7a  14.5a  15.8a  11.3a  11.3a  20.8a  23.2a  11.3a-9999.9M",
>> " 1919 10   11.3a   8.2a   8.2a  16.4a  10.7a  17.5a   7.7a   6.0a  11.3a   7.3a  12.1a   7.7a  10.2a  15.9a  18.2a   9.0a  10.7a   9.8a   8.2a   7.3a   7.7a   8.9a   9.5a  12.1a  10.2a  10.2a   4.0a  10.7a   2.9a   5.3a   3.0a",
>> " 1919 11    8.2a   2.2a   1.2a   2.6a   1.7a   2.6a   6.1a   8.2a   7.7a   5.3a   4.7a   8.9a   4.7a   1.7a  -4.0a   2.2a   7.7a   7.7a   0.6a  -4.5a   2.6a   3.4a   2.5a  -3.4a  -5.9a  -5.1a  -5.5a  -6.4a   8.9a   3.0a-9999.9M",
>> " 1919 12   -4.0a  -9.2a  -10.5a  -5.1a  -4.5a  -6.9a  -4.0a  -4.0a   3.0a   2.2a  -9.2a  -3.4a   5.3a  -6.4a  -6.9a  -20.4a  -20.4a  -17.6a  -10.5a  -13.8a  -8.7a  -3.4a  -2.9a  -4.5a  -5.5a  -5.5a  -2.9a  -0.8a  -10.1a  -6.9a  -5.9a",
>> " Year Mo  Day 01  Day 02  Day 03  Day 04  Day 05  Day 06  Day 07  Day 08  Day 09  Day 10  Day 11  Day 12  Day 13  Day 14  Day 15  Day 16  Day 17  Day 18  Day 19  Day 20  Day 21  Day 22  Day 23  Day 24  Day 25  Day 26  Day 27  Day 28  Day 29  Day 30  Day 31",
>> "7011982,   DONNACONA    , QC, station jointe   , Temperature quotidienne maximale homogeneisee, Deg Celcius, Mise a jour jusqu a decembre 2014",
>> "Annee Mo Jour 01 Jour 02 Jour 03 Jour 04 Jour 05 Jour 06 Jour 07 Jour 08 Jour 09 Jour 10 Jour 11 Jour 12 Jour 13 Jour 14 Jour 15 Jour 16 Jour 17 Jour 18 Jour 19 Jour 20 Jour 21 Jour 22 Jour 23 Jour 24 Jour 25 Jour 26 Jour 27 Jour 28 Jour 29 Jour 30 Jour 31"
>> ),class ="factor")),.Names ="X7011982.....DONNACONA........QC..station.joined......Homogenized.daily.maximum.temperature..........Deg.Celcius...........Updated.to.December.2014",class ="data.frame",row.names =c(NA,-21L)))
>> file1=list(df1,df1)
>> 
>> 
>> 
>> 
>> df2=list(structure(list(X250M001.MOULD.BAY.................NT.station.joined.....Daily.adjusted.precipitation..mm..Updated.to.December.2014 =structure(1:24,.Label =c(
>> "1948  1 -9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M",
>> "1948  2 -9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M",
>> "1948  3 -9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M",
>> "1948  4 -9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M",
>> "1948  5 -9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M-9999.99M   0.00    0.00    0.21T   0.21T   1.69    0.21T   0.21T   0.21T   0.21T   0.00    0.21T   0.00    0.00    0.00    0.00    1.39    0.00    0.21T",
>> "1948  6    0.00    0.00    0.30T   3.34T   0.21T   0.00    0.00    7.19T   0.21T   1.04    0.00    4.29    1.69    0.21T   0.00    0.00    0.21T   0.65    0.21T   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.21T   0.00    0.21T-9999.99M",
>> "1948  7    0.21T   0.51T   0.00    2.74    0.00    0.00    0.00    0.00    0.00    1.05    0.00    0.00    1.57    1.57    2.30    0.74T   0.00    0.30T   0.74    0.53    0.00    0.21T   0.00    0.00    0.00    0.00    0.00    0.53    3.34    2.30   13.43 ",
>> "1948  8    0.30T   2.61    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.53    0.21T   0.65    3.65    3.25    0.21T   3.90    0.21T   0.21T   0.21T   0.30T   0.21T   0.21T   0.21T   0.00    0.21T   0.65    1.95    0.21T   0.21T",
>> "1948  9    0.00    0.21T   0.21T   0.21T   0.21T   0.69T   7.54    0.00    0.00    0.00    0.21T   0.21T   0.00    0.21T   0.21T   0.21T   0.00    0.21T   1.04    0.00    0.00    0.00    0.00    7.28    4.68    2.34    1.95    3.90    1.30    0.21T-9999.99M",
>> "1948 10    1.04    0.00    0.00    1.69    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.65    0.21T   0.21T   0.00    0.21T   0.21T   0.21T   0.00    0.21T   0.21T   0.21T   0.21T   0.21T   0.21T",
>> "1948 11    0.00    0.00    0.00    0.21T   0.21T   0.00    0.21T   1.04    0.21T   0.00    0.00    1.69    0.21T   0.21T   0.00    0.00    0.21T   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00 -9999.99M",
>> "1948 12    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.21T   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.21T   0.21T   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00 ",
>> "1949  1    0.00    0.00    0.00    0.00    0.00    0.39    0.65    0.00    0.21T   0.21T   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.21T   0.21T   0.65    0.39    0.21T   0.00    0.00    0.00    0.21T   0.00    0.00 ",
>> "1949  2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.21T   0.00    0.00    0.00    0.00    0.00    0.00    0.21T   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.21T   0.21T   0.00    0.00    0.00 -9999.99M-9999.99M-9999.99M",
>> "1949  3    0.00    0.00    0.00    0.00    0.00    0.00    0.21T   0.21T   0.00    1.69    0.00    1.04    1.69    0.65    0.21T   0.51T   0.21T   0.21T   0.21T   0.21T   0.21T   0.00    0.00    0.21T   0.00    0.00    0.00    0.00    0.21T   0.21T   0.00 ",
>> "1949  4    0.00    0.00    0.39    0.21T   0.00    0.00    0.39    0.21T   0.00    0.00    0.00    0.21T   0.00    0.21T   0.00    0.00    0.00    0.21T   0.21T   0.00    0.00    0.65    0.21T   0.21T   0.00    0.00    0.00    0.00    0.00    0.00 -9999.99M",
>> "1949  5    0.00    0.00    0.00    0.00    0.00    0.21T   0.39    0.21T   0.00    0.00    0.21T   0.00    0.00    0.00    0.00    0.00    0.21T   0.39    0.39    0.39    0.00    0.00    0.21T   0.21T   0.21T   0.39    0.21T   0.39    0.39    0.21T   1.04 ",
>> "1949  6    0.39    0.21T   0.00    0.21T   0.00    0.21T   0.21T   0.65    0.00    0.00    0.00    0.21T   0.21T   0.00    0.00    0.51T   0.21T   0.00    0.39    0.21T   0.00    0.21T   0.21T   0.39    0.39    0.21T   0.00    0.00    0.00    0.00 -9999.99M",
>> "1949  7    0.00    0.00    0.53    0.51T   0.51T   0.21T   0.51T   0.30T   0.00    0.00    0.00    0.00    0.00    0.00    0.30T   0.30T   0.00    0.00    0.00    0.00    0.51T   0.51T   6.25   22.63    0.00    0.51T   0.21T   0.30T   0.30T   0.00    0.00 ",
>> "1949  8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.51T   0.51T   0.00    0.30T   0.30T   0.00    0.00    6.56    0.00    0.00    0.00    0.00    1.05    0.21T   0.21T   0.21T   0.00    0.21T   0.21T   0.51T   0.00    0.30T   0.21T   0.21T",
>> "1949  9    0.30T   0.30T   0.00    0.39    0.39    0.21T   0.00    0.00    0.21T   0.21T   0.00    0.00    0.00    0.21T   0.00    0.21T   0.00    0.00    0.00    0.00    0.00    0.21T   0.00    0.00    0.00    0.00    0.00    0.21T   0.21T   0.21T-9999.99M",
>> "1949 10    0.21T   0.00    0.00    0.00    1.04    0.39    0.65    0.21T   0.00    0.00    0.21T   0.21T   0.21T   0.39    0.21T   0.65    0.65    0.21T   0.65    0.00    0.00    0.21T   0.00    0.21T   0.00    0.21T   0.21T   0.00    0.00    0.00    0.00 ",
>> "1949 11    0.00    0.00    0.00    0.00    1.04    0.21T   0.00    0.00    0.21T   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.21T-9999.99M",
>> "1949 12    0.21T   0.21T   0.21T   0.00    0.00    0.00    0.00    0.00    0.21T   0.21T   0.00    0.00    0.21T   0.39    0.00    0.00    0.00    0.00    0.21T   0.21T   0.21T   0.21T   0.21T   0.21T   0.00    0.00    0.00    0.00    0.21T   0.00    0.00 "
>> ),class ="factor")),.Names ="X250M001.MOULD.BAY.................NT.station.joined.....Daily.adjusted.precipitation..mm..Updated.to.December.2014",class ="data.frame",row.names =c(NA,-24L)))
>> file2=list(df2,df2)
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> David Winsemius
> Alameda, CA, USA
> <dt402DAF0.txt><dx701S001.txt>

David Winsemius
Alameda, CA, USA


