[R] Read HTML table

Duncan Temple Lang duncan at wald.ucdavis.edu
Tue Nov 20 06:55:13 CET 2007



theta wrote:
> 
> f.jamitzky wrote:
>> You can use htmlTreeParse and xpathApply from the XML library.
>> something like:
>>
>> xpathApply(htmlTreeParse("http://blabla", useInternalNodes = TRUE),
>>            "//td", xmlValue)
>>
>> should do it.
>>
> 
> Thank you. Any further ideas on how to transform the result into a matrix,
> something that R could easily search to find values? I want to use the
> imported data in various calculations (Rmetrics) and hope to automate the
> process somewhat.
> 
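
On the matrix question, one possible approach is to walk the table row by
row rather than pulling every <td> into one flat list. The following is
only a sketch: html_table_to_matrix is a hypothetical helper name, and it
assumes the page holds a single simple table with no nested tables or
spanning cells.

```r
library(XML)

# Hypothetical helper: collect the <td> cells of each row into a
# character matrix, one matrix row per <tr> that contains <td> cells.
html_table_to_matrix <- function(doc) {
  rows <- getNodeSet(doc, "//tr[td]")
  if (length(rows) == 0)
    return(matrix(character(0), nrow = 0, ncol = 0))

  # One character vector of cell text per row.
  cells <- lapply(rows, function(r)
    as.character(unlist(xpathApply(r, "./td", xmlValue))))

  # Pad short rows with NA so every row has the same width.
  width <- max(sapply(cells, length))
  padded <- lapply(cells, function(x)
    c(x, rep(NA_character_, width - length(x))))

  do.call(rbind, padded)
}
```

The result is a character matrix, so numeric columns still need
as.numeric() before they go into any calculation. More recent versions of
the XML package also provide readHTMLTable(), which may be simpler if it
is available to you.
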
> Another thing: htmlTreeParse takes a while to complete. For a 15-row
> table it takes about 10-15 seconds. Considering I am planning to use this
> method on multiple (15-20) tables with up to 1000 rows, it might not be the
> ideal solution?

I doubt the parsing is taking very long at all.
On a Linux box running virtually on my Mac, I can parse a 4566-line
HTML file in 0.3 seconds.

If you pass a URL rather than a local file, then you have to separate
the download time and the parsing time to figure out where the time
is consumed.
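
A rough way to separate the two is to download the page to a temporary
file first and time each step on its own. This is a sketch only:
time_fetch_and_parse is a hypothetical helper name, and the URL you pass
in is whatever page you are actually reading.

```r
library(XML)

# Hypothetical helper: time the download and the parse separately.
time_fetch_and_parse <- function(url) {
  tmp <- tempfile(fileext = ".html")
  on.exit(unlink(tmp))

  # Step 1: network transfer only.
  download_time <- system.time(
    download.file(url, tmp, quiet = TRUE)
  )[["elapsed"]]

  # Step 2: parsing only, from the local copy.
  parse_time <- system.time(
    doc <- htmlTreeParse(tmp, useInternalNodes = TRUE)
  )[["elapsed"]]

  n_cells <- length(getNodeSet(doc, "//td"))
  free(doc)

  c(download = download_time, parse = parse_time, cells = n_cells)
}
```

If the download figure dominates, the parser is not the bottleneck and
the time is being spent on the network or the server.
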

And if you are going to download multiple tables from the same server in
rapid succession, then you might want to use some advanced features of
HTTP such as persistent connections or multiple interleaved requests.
These can all be done via the RCurl package and the results fed to
htmlTreeParse().  There is a paper on the RCurl web site that describes
some of these advanced features.
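
As a sketch of the persistent-connection idea: reuse one curl handle for
all requests so that libcurl can keep the connection to the server open,
and hand the downloaded text to htmlTreeParse() with asText = TRUE. The
name fetch_cells is hypothetical, and the URLs are assumed to point at
pages on the same server.

```r
library(RCurl)
library(XML)

# Hypothetical helper: fetch several pages over one reused curl handle
# and return the text of the <td> cells of each page.
fetch_cells <- function(urls) {
  handle <- getCurlHandle()  # shared across requests
  lapply(urls, function(u) {
    html <- getURL(u, curl = handle)
    doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)
    cells <- as.character(unlist(xpathApply(doc, "//td", xmlValue)))
    free(doc)
    cells
  })
}
```

For the interleaved requests, RCurl has getURIAsynchronous(), which takes
a vector of URLs and returns the documents as text; each one can then be
parsed in the same way.
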

 D.


