[R] reading data from web data sources
David Winsemius
dwinsemius at comcast.net
Sun Feb 28 02:21:15 CET 2010
On Feb 27, 2010, at 6:17 PM, Phil Spector wrote:
> Tim -
> I don't understand what you mean about interleaving rows. I'm guessing
> that you want a single large data frame with all the data, and not a
> list with each year separately. If that's the case:
>
> x = read.table('http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat',
>                header=FALSE,fill=TRUE,skip=13)
> cts = apply(x,1,function(x)sum(is.na(x)))
> wh = which(cts == 12)
> start = wh+1
> end = c(wh[-1] - 1,nrow(x))
> ans = mapply(function(i,j)x[i:j,],start,end,SIMPLIFY=FALSE)
> names(ans) = x[wh,1]
> alldat = do.call(rbind,ans)
> alldat$year = rep(names(ans),sapply(ans,nrow))
> names(alldat) = c('day',month.name,'year')
>
> On the other hand, if you want a long data frame with month, day,
> year and value:
>
> longdat = reshape(alldat,idvar=c('day','year'),
>                   varying=list(month.name),direction='long',times=month.name)
> names(longdat)[c(3,4)] = c('Month','value')
>
> Next, if you want to create a Date variable:
>
> longdat = transform(longdat,date=as.Date(paste(Month,day,year),
>                                          '%B %d %Y'))
> longdat = na.omit(longdat)
> longdat = longdat[order(longdat$date),]
>
> and finally:
>
> library(zoo)
> zoodat = zoo(longdat$value,longdat$date)
>
> which should be suitable for time series analysis.
OK, I think I get it:
(From Gabor's DF)
> dta <- data.matrix(DF[, -c(1,14)])
> dtafrm <- data.frame(rdta=dta[!is.na(dta)],
                       d.o.m=DF[row(dta)[!is.na(dta)], 1],
                       month=col(dta)[!is.na(dta)],
                       year=DF[row(dta)[!is.na(dta)], 14])
> library(zoo)
> zoodat2 <- with(dtafrm, zoo(rdta, as.Date(paste(month, d.o.m, year),
                                            "%m %d %Y")))
> str(zoodat2)
‘zoo’ series from 1910-01-01 to 1919-12-31
Data: num [1:3652] 6.4 6.5 6.3 6.7 6.7 6.8 7 7.1 7.1 7.2 ...
Index: Class 'Date' num [1:3652] -21915 -21914 -21913 -21912 -21911 ...
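
The same row()/col() idea can be tried on a small made-up frame when the real DF is not to hand. This is only a sketch: the column layout (day first, year last), the readings, and the rule that anything at or below -888 is a sentinel are assumptions for illustration, not something checked against the Armagh file.

```r
# Toy stand-in for DF: day-of-month, 12 monthly readings, year.
# All values are made up; -999 marks a day that does not exist (assumed).
DF <- data.frame(day = c(28, 29, 30),
                 matrix(c(5.1, 5.2, 5.3), nrow = 3, ncol = 12),
                 year = 1911)
names(DF)[2:13] <- month.abb
DF[DF$day >= 29, "Feb"] <- -999   # 1911 is not a leap year

# Wide matrix of readings only; turn assumed sentinels into NA.
dta <- data.matrix(DF[, -c(1, 14)])
dta[dta <= -888] <- NA            # hypothetical sentinel cut-off
keep <- !is.na(dta)

# row() gives the source row (day, year); col() gives the month number.
long <- data.frame(value = dta[keep],
                   day   = DF[row(dta)[keep], 1],
                   month = col(dta)[keep],
                   year  = DF[row(dta)[keep], 14])
long$date <- as.Date(paste(long$year, long$month, long$day), "%Y %m %d")
long <- long[order(long$date), ]
```

Because the sentinel cells are set to NA before indexing, the impossible dates (29 and 30 February here) never reach as.Date().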
>
> Hope this helps.
> - Phil
>
> On Sat, 27 Feb 2010, Tim Coote wrote:
>
>> Thanks, Gabor. My take away from this and Phil's post is that I'm
>> going to have to construct some code to do the parsing, rather than
>> use a standard function. I'm afraid that neither approach works, yet:
>>
>> Gabor's has an off-by-one error (days start on the 2nd, not
>> the first), and the years get messed up around the 29th day. I
>> think that na.omit(DF) line is throwing out the baby with the
>> bathwater. It's interesting that this approach is based on
>> read.table; I'd assumed that I'd need read.ftable, whose
>> documentation I couldn't understand. What is it that's removing the
>> -999 and -888 values in this code? They seem to be gone, but I
>> cannot see why.
>>
>> Phil's reads in the data, but interleaves rows with just a year and
>> all other values as NA.
>>
>> Tim
>> On 27 Feb 2010, at 17:33, Gabor Grothendieck wrote:
>>
>>> Mark Leeds pointed out to me that the code wrapped around in the
>>> post, so it may not be obvious that the regular expression in the
>>> grep is (i.e. it contains a space):
>>> "[^ 0-9.]"
>>> On Sat, Feb 27, 2010 at 7:15 AM, Gabor Grothendieck
>>> <ggrothendieck at gmail.com> wrote:
>>>> Try this. First we read the raw lines into R, using grepl to remove
>>>> any lines containing a character that is not a digit, a dot or a
>>>> space. Then we look for the year lines and repeat them down V1 using
>>>> cumsum. Finally we omit the year lines.
>>>> myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
>>>> raw.lines <- readLines(myURL)
>>>> DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]",raw.lines)]),
>>>>                  fill = TRUE)
>>>> DF$V1 <- DF[cumsum(is.na(DF[[2]])), 1]
>>>> DF <- na.omit(DF)
>>>> head(DF)
>>>> On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project.org at coote.org> wrote:
>>>>> Hullo
>>>>> I'm trying to read some time series data of meteorological records
>>>>> that are available on the web (e.g.
>>>>> http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat).
>>>>> I'd like to be able to read the digital data directly into R.
>>>>> However, I cannot work out the right function and set of parameters
>>>>> to use. It could be that the only practical route is to write a
>>>>> parser, possibly in some other language, reformat the files and then
>>>>> read these into R. As far as I can tell, the informal grammar of the
>>>>> file is:
>>>>> <comments terminated by a blank line>
>>>>> [<year number on a line on its own>
>>>>> <daily readings lines> ]+
>>>>> and the <daily readings> are of the form:
>>>>> <whitespace> <day number> [<whitespace> <reading on day of month>] 12
>>>>> Readings for days in months where a day does not exist have special
>>>>> values. Missing values have a different special value.
>>>>> And then I've got the problem of iterating over all relevant files
>>>>> to get a whole timeseries.
>>>>> Is there a way to read this type of file into R? I've read all of
>>>>> the examples that I can find, but cannot work out how to do it. I
>>>>> don't think that read.table can handle the separate sections of data
>>>>> representing each year. read.ftable maybe can be coerced to parse
>>>>> the data, but I cannot see how after reading the documentation and
>>>>> experimenting with the parameters.
>>>>> I'm using R 2.10.1 on OS X 10.5.8 and 2.10.0 on Fedora 10.
>>>>> Any help/suggestions would be greatly appreciated. I can see that
>>>>> this type of issue is likely to grow in importance, and I'd also
>>>>> like to give the data owners suggestions on how to reformat their
>>>>> data so that it is easier to consume by machines, while being easy
>>>>> to read for humans.
>>>>> The early records are a serious machine parsing challenge as they
>>>>> are tiff images of old notebooks ;-)
>>>>> tia
>>>>> Tim
>>>>> Tim Coote
>>>>> tim at coote.org
>>>>> vincit veritas
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> Tim Coote
>> tim at coote.org
>> vincit veritas
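
On Tim's question about where the -999 and -888 values went: the pattern "[^ 0-9.]" matches any character other than a space, a digit or a dot, and a minus sign is such a character, so any line holding a negative sentinel is dropped whole along with the comment lines. A minimal sketch of the grammar Tim describes, on made-up lines rather than the real file, might look like the following. The sample values, the column names, and the -888 cut-off for sentinels are assumptions for illustration.

```r
# Made-up lines in the shape Tim describes; not taken from the real file.
raw.lines <- c("Some header comment",
               "",
               "1910",
               paste(" 1", paste(rep("6.4", 12), collapse = " ")),
               paste("30", paste(c("6.5", "-999", rep("6.5", 10)), collapse = " ")),
               "1911",
               paste(" 1", paste(rep("7.0", 12), collapse = " ")))

# Keep non-blank lines made only of spaces, digits, dots and minus signs.
# Allowing "-" is a deliberate change from "[^ 0-9.]".
ok <- !grepl("[^ 0-9.-]", raw.lines) & nzchar(trimws(raw.lines))
DF <- read.table(text = raw.lines[ok], fill = TRUE,
                 col.names = c("day", month.abb))

# Year lines have a single field, so every month column is NA there.
is.year <- is.na(DF$Jan)
DF$year <- DF$day[is.year][cumsum(is.year)]  # carry each year down its block
DF <- DF[!is.year, ]

# Replace assumed sentinels with NA, column by column.
DF[month.abb] <- lapply(DF[month.abb],
                        function(v) replace(v, which(v <= -888), NA))
```

Indexing the vector of year values by cumsum(is.year), rather than indexing DF itself, keeps each block tied to its own year line. To cover several decades, the same steps could be wrapped in a function and applied with lapply over a vector of file URLs, then combined with do.call(rbind, ...).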
David Winsemius, MD
Heritage Laboratories
West Hartford, CT