[R] reading data from web data sources

Sat Feb 27 18:33:41 CET 2010

Mark Leeds pointed out to me that the code wrapped in my post, so it
may not be obvious that the regular expression passed to grepl
contains a space after the ^:
"[^ 0-9.]"

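To make clear what that pattern keeps and drops, here is a small sketch
on made-up lines. The sample text is invented for illustration and is
not taken from the Armagh file.

```r
# Made-up sample lines (an assumption for illustration, not the real file)
x <- c("Daily soil temperatures",  # comment line: contains letters
       "1910",                     # year line: digits only
       "  1  5.2  4.8  6.1",       # readings: digits, spaces and dots
       "")                         # blank line: nothing for the pattern to match

# TRUE where a line contains any character other than space, digit or dot
bad <- grepl("[^ 0-9.]", x)
kept <- x[!bad]
print(kept)
```

The blank line survives the filter, but read.table skips blank lines by
default. A line containing a minus sign would be dropped, so if the file
encodes its special values as negative numbers the character class would
probably need a "-" added.
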
On Sat, Feb 27, 2010 at 7:15 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Try this.  First we read the raw lines into R, using grepl to drop any
> line containing a character that is not a digit, space or dot.  Then we
> look for the year lines and repeat them down V1 using cumsum.  Finally
> we omit the year lines.
>
> myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
> raw.lines <- readLines(myURL)
> DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]", raw.lines)]),
>                  fill = TRUE)
> DF$V1 <- DF[is.na(DF[[2]]), 1][cumsum(is.na(DF[[2]]))]
> DF <- na.omit(DF)
> head(DF)
>
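The same steps can be tried without a network connection. This is only a
sketch on an invented two-year sample with two reading columns instead
of twelve. It also puts the year in a new column (called "year" here, my
own choice) so that the day number in V1 is kept.

```r
# Invented sample in the shape described in the question (not real data)
txt <- c("Soil temperature, invented sample",
         "",
         "1910",
         " 1 5.2 4.8",
         " 2 5.0 4.9",
         "1911",
         " 1 6.1 5.5",
         " 2 6.0 5.7")

# Keep only lines made of digits, spaces and dots
keep <- txt[!grepl("[^ 0-9.]", txt)]

con <- textConnection(keep)
DF <- read.table(con, fill = TRUE)  # year lines get NA in V2 onwards
close(con)

# A year line is one whose second field is missing
is_year <- is.na(DF[[2]])

# Look up the k-th year value for every row in the k-th block
DF$year <- DF[[1]][is_year][cumsum(is_year)]

# Drop the year lines themselves
DF <- DF[!is_year, ]
print(DF)
```
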
>
> On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project.org at coote.org> wrote:
>> Hullo
>> I'm trying to read some time series data of meteorological records that are
>> available on the web (eg
>> http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat). I'd
>> like to be able to read in the digital data directly into R. However, I
>> cannot work out the right function and set of parameters to use.  It could
>> be that the only practical route is to write a parser, possibly in some
>> other language,  reformat the files and then read these into R. As far as I
>> can tell, the informal grammar of the file is:
>>
>> <comments terminated by a blank line>
>> [<year number on a line on its own>
>> <daily readings lines> ]+
>>
>> and the <daily readings> are of the form:
>> <whitespace> <day number> [<whitespace> <reading on day of month>]{12}
>>
>> Readings for days in months where a day does not exist have special values.
>> Missing values have a different special value.
>>
>> And then I've got the problem of iterating over all relevant files to get a
>> whole timeseries.
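For the iteration over files, one possible approach is to wrap the
parsing in a function and rbind the results. The following is a sketch
under assumptions: the function name parse_soil_lines is hypothetical,
and the two "files" are invented character vectors standing in for what
readLines() would return for each URL.

```r
# Parse one file's worth of lines into a data frame with a 'year' column.
# parse_soil_lines is a hypothetical helper name used for this sketch.
parse_soil_lines <- function(lines) {
  con <- textConnection(lines[!grepl("[^ 0-9.]", lines)])
  on.exit(close(con))
  DF <- read.table(con, fill = TRUE)
  is_year <- is.na(DF[[2]])                     # year lines have one field
  DF$year <- DF[[1]][is_year][cumsum(is_year)]  # repeat each year down its block
  DF[!is_year, ]                                # drop the year lines
}

# Invented stand-ins for two files (not real data)
file_lines <- list(
  c("header text", "", "1910", " 1 5.2 4.8", " 2 5.0 4.9"),
  c("header text", "", "1920", " 1 6.1 5.5", " 2 6.0 5.7")
)

# Parse each one and stack the results into a single data frame
all_data <- do.call(rbind, lapply(file_lines, parse_soil_lines))
print(all_data)
```

With real data each list element would come from readLines() on a URL.
The names of the other decade files are not given in this thread, so
they would need to be checked on the site first.
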
>>
>> Is there a way to read in this type of file into R? I've read all of the
>> examples that I can find, but cannot work out how to do it. I don't think
>> that read.table can handle the separate sections of data representing each
>> year. read.ftable maybe can be coerced to parse the data, but I cannot see
>> how after reading the documentation and experimenting with the parameters.
>>
>> I'm using R 2.10.1 on OS X 10.5.8 and R 2.10.0 on Fedora 10.
>>
>> Any help/suggestions would be greatly appreciated. I can see that this type
>> of issue is likely to grow in importance, and I'd also like to give the data
>> owners suggestions on how to reformat their data so that it is easier to
>> consume by machines, while being easy to read for humans.
>>
>> The early records are a serious machine parsing challenge as they are tiff
>> images of old notebooks ;-)
>>
>> tia
>>
>> Tim
>> Tim Coote
>> tim at coote.org
>> vincit veritas