[R] Removing Embedded Null characters from text/html

David Young dyoung at telefonica.net
Wed Oct 14 19:49:34 CEST 2009


Hi,

I'm trying to download some data from the web and am running into
problems with 'embedded nul' characters.  R seems to treat them as the
end of the string and stops processing the page, so I'd like to remove
them.  I've been looking around but can't identify exactly what the
character is, or consequently how to remove it.

# THE CODE WORKS ON THIS PAGE
library(RCurl)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)

# BUT DOES NOT WORK HERE DUE TO EMBEDDED NULL CHARACTERS
theurl <- "http://screen.yahoo.com/b?pr=1/&s=nm&db=stocks&vw=0&b=21"
webpage <- getURL(theurl)

Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  Failed writing body (1371 != 1461)
In addition: Warning messages:
1: In curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  truncating string with embedded nul: 'ttp://finance.  
  ## I DELETED SOME HERE FOR BREVITY##  al>\nData and  [... truncated]
2: In curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
  only read 1371 of the 1461 input bytes/characters

# THIS CODE COPIES THE PROBLEMATIC PAGE TO MY COMPUTER
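# DESTFILE IS A PLAIN FILE PATH, NOT A file:// URL
destfile <- "C:/projects/stock data/data/test.htm"
# mode = "wb" WRITES THE BYTES UNCHANGED ON WINDOWS
download.file(theurl, destfile, quiet = TRUE, mode = "wb")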

# WHICH LEAVES ME WITH JUST IDENTIFYING WHAT CHARACTER IS CAUSING THE
# PROBLEM AND THEN GETTING RID OF IT.
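
One possible approach, offered as a sketch and not tested against the Yahoo page: the warning above mentions an "embedded nul", which suggests the offending character is the NUL byte (0x00). If so, the saved file can be read as raw bytes, the zero bytes dropped, and the rest converted to a character string. The helper names strip_nul() and read_without_nul() are made up for illustration, and the temporary file only stands in for test.htm.

```r
# Drop NUL (0x00) bytes from a raw vector.
# strip_nul() is a hypothetical helper, not part of base R or RCurl.
strip_nul <- function(bytes) {
  stopifnot(is.raw(bytes))
  bytes[bytes != as.raw(0)]
}

# Read a saved page as raw bytes, report how many NULs it holds,
# and return the cleaned content as a single character string.
read_without_nul <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  message(sum(bytes == as.raw(0)), " NUL byte(s) removed")
  rawToChar(strip_nul(bytes))
}

# Demonstration on a small temporary file: "<p>", NUL, "hi", NUL.
tmp <- tempfile(fileext = ".htm")
writeBin(as.raw(c(0x3c, 0x70, 0x3e, 0x00, 0x68, 0x69, 0x00)), tmp)
cleaned <- read_without_nul(tmp)
```

For the real page, read_without_nul(destfile) would be the equivalent call, assuming the download step wrote the bytes unchanged. RCurl's getBinaryURL() may also be worth trying, since it returns a raw vector that could be passed to strip_nul() directly without saving a file first.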

I'd appreciate any advice.


-- 
Best regards,

David Young
Marketing and Statistical Consultant
Madrid, Spain
+34 913 540 381
http://www.linkedin.com/in/europedavidyoung


