[R] reading data from web data sources

Sun Feb 28 02:21:15 CET 2010

On Feb 27, 2010, at 6:17 PM, Phil Spector wrote:

> Tim -
>   I don't understand what you mean about interleaving rows.  I'm guessing
> that you want a single large data frame with all the data, and not a
> list with each year separately.  If that's the case:
>
> x = read.table('http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat',
>                header=FALSE,fill=TRUE,skip=13)
> cts = apply(x,1,function(x)sum(is.na(x)))
> wh = which(cts == 12)
> start = wh+1
> end = c(wh[-1] - 1,nrow(x))
> ans = mapply(function(i,j)x[i:j,],start,end,SIMPLIFY=FALSE)
> names(ans) = x[wh,1]
> alldat = do.call(rbind,ans)
> alldat$year = rep(names(ans),sapply(ans,nrow))
> names(alldat) = c('day',month.name,'year')
>
> On the other hand, if you want a long data frame with month, day, year
> and value:
>
> longdat = reshape(alldat,idvar=c('day','year'),
>                   varying=list(month.name),direction='long',times=month.name)
> names(longdat)[c(3,4)] = c('Month','value')
>
> Next, if you want to create a Date variable:
>
> longdat = transform(longdat,date=as.Date(paste(Month,day,year),'%B %d %Y'))
> longdat = na.omit(longdat)
> longdat = longdat[order(longdat$date),]
>
> and finally:
>
> library(zoo)
> zoodat = zoo(longdat$value,longdat$date)
>
> which should be suitable for time series analysis.
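
A compact variant of the same idea, as a sketch on made-up data rather than the Armagh file: mark the rows that hold only a year, carry the year down with cumsum (the trick from Gabor's post), then drop the marker rows. The toy table has three value columns instead of twelve, and the names txt, year_row and out are only illustrative.

```r
# Toy stand-in for the file layout: a year on its own line, then day rows.
# All values are invented.
txt <- c("1910",
         " 1  6.4  5.1  4.0",
         " 2  6.5  5.2  4.1",
         "1911",
         " 1  7.0  6.0  5.0")

# fill = TRUE pads the short year-only rows with NA
x <- read.table(text = txt, header = FALSE, fill = TRUE)

# A year row has a value in the first column only
year_row <- rowSums(is.na(x)) == ncol(x) - 1

# cumsum() counts year rows seen so far, so it indexes the year for each row
year <- x$V1[year_row][cumsum(year_row)]

# Keep only the day rows, with the year attached
out <- cbind(x[!year_row, ], year = year[!year_row])
print(out)
```

This avoids the rows that contain a year and nothing else, which is what Tim described as interleaving.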

OK, I think I get it:

(From Gabor's DF)

 > dta <- data.matrix(DF[, -c(1,14)])
 > dtafrm <-data.frame(rdta=dta[!is.na(dta)],
                       d.o.m= DF[row(dta)[!is.na(dta)], 1],
                       month= col(dta)[!is.na(dta)],
                       year=DF[row(dta)[!is.na(dta)], 14])

 > library(zoo)

 > zoodat2 <- with(dtafrm, zoo(rdta, as.Date(paste(month, d.o.m, year),
                                             "%m %d %Y")))
 > str(zoodat2)
‘zoo’ series from 1910-01-01 to 1919-12-31
   Data: num [1:3652] 6.4 6.5 6.3 6.7 6.7 6.8 7 7.1 7.1 7.2 ...
   Index: Class 'Date'  num [1:3652] -21915 -21914 -21913 -21912 -21911 ...
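
To show why the row()/col() indexing works, here is a minimal sketch on a made-up 3-day by 2-month table (DF_toy and its values are invented, not taken from the Armagh data). Subsetting a matrix with a logical matrix walks it column by column, and row() and col() subset the same way, so the three vectors stay aligned.

```r
# Toy table: 3 days x 2 months; NA marks a day that does not exist in month 2
DF_toy <- data.frame(day  = 1:3,
                     m1   = c(1.1, 1.2, 1.3),
                     m2   = c(2.1, 2.2, NA),
                     year = 1910)

m    <- data.matrix(DF_toy[, c("m1", "m2")])
keep <- !is.na(m)

# Values come out in column-major order; row(m) and col(m) give, for each
# kept cell, the row (day) and the column (month) it came from
long <- data.frame(value = m[keep],
                   day   = DF_toy$day[row(m)[keep]],
                   month = col(m)[keep],
                   year  = DF_toy$year[row(m)[keep]])

long$date <- as.Date(paste(long$year, long$month, long$day), "%Y %m %d")
long <- long[order(long$date), ]
print(long)
```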

>
> Hope this helps.
>                                                    - Phil
>
> On Sat, 27 Feb 2010, Tim Coote wrote:
>
>> Thanks, Gabor. My take away from this and Phil's post is that I'm  
>> going to have to construct some code to do the parsing, rather than  
>> use a standard function. I'm afraid that neither approach works, yet:
>>
>> Gabor's has an off-by-one error (days start on the 2nd, not the
>> first), and the years get messed up around the 29th day.  I think
>> that the na.omit(DF) line is throwing out the baby with the
>> bathwater.  It's interesting that this approach is based on
>> read.table; I'd assumed that I'd need read.ftable, which I couldn't
>> understand the documentation for.  What is it that's removing the
>> -999 and -888 values in this code?  They seem to be gone, but I
>> cannot see why.
>>
>> Phil's reads in the data, but interleaves rows with just a year and  
>> all other values as NA.
>>
>> Tim
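
On what removes the -999 and -888 values: the pattern "[^ 0-9.]" matches any character that is not a space, a digit or a dot, and a minus sign is such a character. So grepl() flags every line that contains a negative number and the whole line is dropped, not just the special value. The sketch below uses invented lines and assumes the special values appear in the file with a leading minus sign, exactly as "-999" and "-888". The na.strings alternative is only a suggestion, not something tested against the real file.

```r
# Invented lines in the same spirit as the file (two value columns only)
lines_toy <- c("Soil temperature (made-up header)",
               "1910",
               " 28  6.4  5.1",
               " 29  6.5  -999",
               " 30  6.6  -999")

# TRUE where a line contains anything other than space, digit or dot
has_other <- grepl("[^ 0-9.]", lines_toy)
print(lines_toy[!has_other])   # the day 29 and day 30 lines are gone entirely

# Possible alternative: keep the lines and turn the special values into NA.
# Assumes the sentinels are spelled exactly "-999" and "-888".
x2 <- read.table(text = lines_toy[-1], fill = TRUE,
                 na.strings = c("-999", "-888"))
print(x2)
```

If the file writes the special values differently (for example "-999.0"), na.strings would need to match that spelling.
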
>> On 27 Feb 2010, at 17:33, Gabor Grothendieck wrote:
>>
>>> Mark Leeds pointed out to me that the code wrapped around in the post,
>>> so it may not be obvious that the regular expression in the grep is
>>> (i.e. it contains a space):
>>> "[^ 0-9.]"
>>> On Sat, Feb 27, 2010 at 7:15 AM, Gabor Grothendieck
>>> <ggrothendieck at gmail.com> wrote:
>>>> Try this.  First we read the raw lines into R using grep to remove any
>>>> lines containing a character that is not a number or space.  Then we
>>>> look for the year lines and repeat them down V1 using cumsum.  Finally
>>>> we omit the year lines.
>>>> myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
>>>> raw.lines <- readLines(myURL)
>>>> DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]",raw.lines)]), fill = TRUE)
>>>> DF$V1 <- DF[cumsum(is.na(DF[[2]])), 1]
>>>> DF <- na.omit(DF)
>>>> head(DF)
>>>> On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project.org at coote.org> wrote:
>>>>> Hullo
>>>>> I'm trying to read some time series data of meteorological records
>>>>> that are available on the web (e.g.
>>>>> http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat).
>>>>> I'd like to be able to read the digital data directly into R.
>>>>> However, I cannot work out the right function and set of parameters
>>>>> to use.  It could be that the only practical route is to write a
>>>>> parser, possibly in some other language, reformat the files and then
>>>>> read these into R.  As far as I can tell, the informal grammar of the
>>>>> file is:
>>>>> <comments terminated by a blank line>
>>>>> [<year number on a line on its own>
>>>>> <daily readings lines> ]+
>>>>> and the <daily readings> are of the form:
>>>>> <whitespace> <day number> [<whitespace> <reading on day of month>] 12
>>>>> Readings for days in months where a day does not exist have special
>>>>> values.  Missing values have a different special value.
>>>>> And then I've got the problem of iterating over all relevant files to
>>>>> get a whole time series.
>>>>> Is there a way to read this type of file into R? I've read all of the
>>>>> examples that I can find, but cannot work out how to do it. I don't
>>>>> think that read.table can handle the separate sections of data
>>>>> representing each year. read.ftable can maybe be coerced to parse the
>>>>> data, but I cannot see how after reading the documentation and
>>>>> experimenting with the parameters.
>>>>> I'm using R 2.10.1 on OS X 10.5.8 and 2.10.0 on Fedora 10.
>>>>> Any help/suggestions would be greatly appreciated. I can see that
>>>>> this type of issue is likely to grow in importance, and I'd also like
>>>>> to give the data owners suggestions on how to reformat their data so
>>>>> that it is easier for machines to consume, while being easy for
>>>>> humans to read.
>>>>> The early records are a serious machine parsing challenge as they are
>>>>> tiff images of old notebooks ;-)
>>>>> tia
>>>>> Tim
>>>>> Tim Coote
>>>>> tim at coote.org
>>>>> vincit veritas
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT