[R] reading data from web data sources

Sat Feb 27 13:15:00 CET 2010

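Try this.  First we read the raw lines into R, using grepl to drop any
line containing a character that is not a digit, space or period.  Then
we pick out the year lines (the rows whose second field is NA) and use
cumsum to repeat each year down a new year column, which leaves the day
number in V1 intact.  Finally we omit the year lines.

myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
raw.lines <- readLines(myURL)
# keep only the lines made up of digits, spaces and periods
DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]", raw.lines)]),
   fill = TRUE)
# a year line has a single field, so V2 is NA on it
is.year <- is.na(DF[[2]])
# cumsum(is.year) counts the year lines seen so far, one index per row
DF$year <- DF$V1[is.year][cumsum(is.year)]
DF <- na.omit(DF)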
head(DF)
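
If you want to try the idea without a network connection, here is a
minimal sketch on invented data.  It also touches the second part of
the question, stacking several files into one data frame.  The sample
values, the three readings per day (the real files have 12) and the
name parse.soil are all made up for illustration.

```r
# Parse one file's worth of lines (as returned by readLines) into a
# data frame with the day in V1 and an added year column.
# parse.soil is a hypothetical helper name, not an existing function.
parse.soil <- function(lines) {
  con <- textConnection(lines[!grepl("[^ 0-9.]", lines)])
  on.exit(close(con))
  DF <- read.table(con, fill = TRUE)
  is.year <- is.na(DF[[2]])                 # year lines have one field
  DF$year <- DF$V1[is.year][cumsum(is.year)]
  na.omit(DF)                               # drops the year lines
}

# Two invented "files": header, blank line, then year and day lines.
file1 <- c("Invented header", "",
           "1910", "  1   5.1   5.3   6.0", "  2   5.2   5.4   6.1",
           "1911", "  1   4.9   5.0   5.8", "  2   5.0   5.1   5.9")
file2 <- c("Invented header", "",
           "1920", "  1   4.7   4.8   5.5", "  2   4.8   4.9   5.6")

# Parse each file and stack the results.
all.DF <- do.call(rbind, lapply(list(file1, file2), parse.soil))
print(all.DF)
```

With real files you would pass lapply a vector of URLs that you supply
yourself and call parse.soil(readLines(u)) on each one.  I have not
checked how the other files on that site are named.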

On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project.org at coote.org> wrote:
> Hullo
> I'm trying to read some time series data of meteorological records that are
> available on the web (e.g.
> http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat). I'd
> like to be able to read in the digital data directly into R. However, I
> cannot work out the right function and set of parameters to use.  It could
> be that the only practical route is to write a parser, possibly in some
> other language, reformat the files and then read these into R. As far as I
> can tell, the informal grammar of the file is:
>
> <comments terminated by a blank line>
> [<year number on a line on its own>
> <daily readings lines> ]+
>
> and the <daily readings> are of the form:
> <whitespace> <day number> [<whitespace> <reading on day of month>]{12}
>
> Readings for days in months where a day does not exist have special values.
> Missing values have a different special value.
>
> And then I've got the problem of iterating over all relevant files to get a
> whole time series.
>
> Is there a way to read in this type of file into R? I've read all of the
> examples that I can find, but cannot work out how to do it. I don't think
> that read.table can handle the separate sections of data representing each
> year. read.ftable maybe can be coerced to parse the data, but I cannot see
> how after reading the documentation and experimenting with the parameters.
>
> I'm using R 2.10.1 on OS X 10.5.8 and 2.10.0 on Fedora 10.
>
> Any help/suggestions would be greatly appreciated. I can see that this type
> of issue is likely to grow in importance, and I'd also like to give the data
> owners suggestions on how to reformat their data so that it is easier to
> consume by machines, while being easy to read for humans.
>
> The early records are a serious machine parsing challenge as they are TIFF
> images of old notebooks ;-)
>
> tia
>
> Tim
> Tim Coote
> tim at coote.org
> vincit veritas