[R] XML htmlTreeParse fails with no obvious error
Nicolas Delhomme
nicolas.delhomme at plantphys.umu.se
Fri Jun 8 14:34:07 CEST 2012
Hi all,
Sorry for the rather uninformative subject, but the error I get is not very informative either.
When using the XML and RCurl packages to retrieve the content of an HTML page, htmlTreeParse fails, printing out the beginning of the HTML:
Error in htmlTreeParse(getURL(url)) :
File <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
<head>
<title>Deutsches Krebsforschungszentrum</title>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta http-equiv="imagetoolbar" content="no" />
<meta name="MSSmartTagsPreventParsing" content="true" />
<meta name="revisit-after" content="5 days" />
<meta name="language" content="de" />
<meta lang="de" content="" xml:lang="de" name="keywords">
<meta lang="de" xml:lang="de" name="description" content="Das Deutsche Krebsforschungszentrum hat die Aufgabe, die Mechanismen der Krebsentstehung systematisch zu erforschen und Risikofaktoren für Krebserkrankungen zu erfassen. Aus den Ergebnissen dieser grundlegenden Arbeiten sollen neue Ans
This code reproduces the error:
library(RCurl)
library(XML)
url <- "www.dkfz.de/en/genetics/pages/projects/bioinformatics/Custom_Chip_Definition_File.html"
htmlTreeParse(getURL(url))
The issue seems to originate in htmlTreeParse, as getURL alone works and returns the expected content. I also checked whether it could be an encoding issue and, as far as I can tell, it is not.
Moreover, using htmlParse(paste("http://",url,sep="")) works. Note that htmlTreeParse(getURL(paste("http://",url,sep=""))) fails too; the "http://" prefix matters only for htmlParse, so that it identifies the string as a URL.
This issue is rather new, and as I've been using the same versions of XML and RCurl throughout, I suppose it might have to do with some of the content of the website having been updated, but given the error, I can't quite figure out what is raising it.
Although it works on that simple example, using htmlParse is not really a workaround, as I need to pass additional arguments to the getURL call (such as userpwd), which I can't provide to htmlParse.
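
One direction that may be worth testing, offered as a sketch rather than a confirmed fix: htmlTreeParse has an asText argument that tells the parser to treat its first argument as the document content itself rather than as a file name or URL. Since the error above prints the page content after "File", it is an assumption (not verified against this page) that the string is being interpreted as a file name. The helper fetch_and_parse and the credentials shown in the comment are hypothetical placeholders, not part of XML or RCurl.

```r
library(RCurl)
library(XML)

# Hypothetical helper (not part of XML or RCurl): fetch a page with
# arbitrary curl options, then parse the returned string as HTML content.
fetch_and_parse <- function(url, ...) {
  # Extra arguments such as userpwd = "user:password" (placeholder value)
  # are passed through to getURL as curl options.
  content <- getURL(url, ...)
  # asText = TRUE marks 'content' as the document itself,
  # not as a file name or URL to open.
  htmlTreeParse(content, asText = TRUE)
}

# Offline check: an in-memory HTML string is parsed as content.
doc <- htmlTreeParse(
  "<html><head><title>Test</title></head><body></body></html>",
  asText = TRUE
)
```

With this in place, a call such as fetch_and_parse(url, userpwd = "user:password") would keep the download under getURL's control, so the additional curl options remain available, while the parsing step no longer has to guess what the string represents.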
Any hints would be greatly appreciated,
Cheers,
Nico
sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.9-4 RCurl_1.91-1 bitops_1.0-4.1
loaded via a namespace (and not attached):
[1] tools_2.15.0
---------------------------------------------------------------
Nicolas Delhomme
Nathaniel Street Lab
Department of Plant Physiology
Umeå Plant Science Center
Tel: +46 90 786 7989
Email: nicolas.delhomme at plantphys.umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden
---------------------------------------------------------------