[R] how to read a web page and extract an html table?
James Howison
jhowison at syr.edu
Tue May 6 18:01:42 CEST 2003
> On Tue, 6 May 2003 07:31:29 -0700 (PDT), you wrote in message
> <20030506143129.33487.qmail at web12105.mail.yahoo.com>:
>
>> I want to extract the table from the html file.
>> Is there a function html2R, the opposite of R2html?
>> How should I do this?
>
> I don't think there is anything that does that, but the XML package
> (from CRAN) contains a function called htmlTreeParse that should get
> you partway there.
>
> Duncan Murdoch
Or if you know (or can learn) Perl, here is a script that will do it
(and output the table as CSV). You need to edit $url and @tableheaders,
and install WWW::Mechanize and HTML::TableExtract from CPAN.
http://cpan.org
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use WWW::Mechanize;

my $url = "http://shangorilla.syr.edu/testR.html";
my @tableheaders = qw(Firstcol Secondcol Thirdcol);

# Fetch the page
my $agent = WWW::Mechanize->new();
$agent->get($url);

# Output headers
print join(',', @tableheaders), "\n";

# Find the table whose column headers match @tableheaders
my $te = HTML::TableExtract->new( headers => \@tableheaders );
$te->parse( $agent->content() );

# Print every row of every matching table (there should be only one)
foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}
(Copy this into an editor, save it as testRtable.pl, then run
chmod u+x testRtable.pl.)
Run it as ./testRtable.pl to check the output,
then redirect the output to a file:
./testRtable.pl > csvforReadingIntoR.txt
Then in R
> data <- read.csv("csvforReadingIntoR.txt")
I think that should work for you. (or just send me the url and I'll run
it and mail you back the csv - if this is a one off.)
Speaking of Perl: does anyone know if there is a standard way to use
Perl scripts from within R? I guess one can call them as one does from
the command line. Is it possible to program R modules in Perl (or would
the CPAN dependencies kill us)?
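On the command-line route, here is a minimal sketch, assuming perl is on
the PATH and the script above is saved as testRtable.pl in R's working
directory. R's pipe() and system() both run a shell command, so the CSV
output can be read without an intermediate file. The helper name
read_script_csv is made up for illustration.

```r
# Hypothetical helper: run a shell command and parse its stdout as CSV.
read_script_csv <- function(command) {
  # pipe() gives a connection to the command's standard output;
  # read.csv() opens it, reads it and closes it again.
  read.csv(pipe(command))
}

# Assumes testRtable.pl is in the working directory:
# tabledata <- read_script_csv("perl testRtable.pl")

# Alternative: capture the output lines first, then parse them.
# lines <- system("perl testRtable.pl", intern = TRUE)
# tabledata <- read.csv(textConnection(lines))
```

This only covers running a script as an external process; it does not
answer whether R modules themselves can be written in Perl.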
If this is a one-off (i.e. not for scripting) then I think you can
directly select a table in IE and paste it into Excel, then save it as
CSV to read into R.
Cheers
James
On Tuesday, May 6, 2003, at 11:17 AM, Duncan Murdoch wrote:
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
>