[R] Externalptr class to character class from (web) scrape
Duncan Murdoch
murdoch.duncan at gmail.com
Fri Jul 26 19:26:11 CEST 2013
On 26/07/2013 12:43 PM, Nick McClure wrote:
> I'm hitting a wall. When I use the 'scrape' function from the package
> 'scrapeR' to get the pagesource from a web page, I do the following:
> (as an example)
>
> website.doc = parse("http://www.google.com")
>
> When I look at it, it seems fine:
>
> website.doc[[1]]
>
> This seems to have the information I need. Then when I try to get it
> into a character vector,
>
> character.website = as.character(website.doc[[1]])
>
> I get the error:
>
> Error in as.vector(x, "character") :
> cannot coerce type 'externalptr' to vector of type 'character'
>
> I'm trying very very hard to wrap my head around how to get this
> external pointer to a character, but after reading many help files, I
> cannot understand how to do this. Any ideas?
You should use str() in cases like this. When I look at
str(website.doc[[1]]) (after producing website.doc with scrape(), not
parse()), I see
> str(website.doc[[1]])
Classes 'HTMLInternalDocument', 'HTMLInternalDocument',
'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
- attr(*, "headers")= Named chr [1:2] "<HTML><HEAD><meta
http-equiv=\"content-type\"
content=\"text/html;charset=utf-8\">\n<TITLE>302
Moved</TITLE></HEAD><BODY>\n<H1>"| __truncated__ "</BODY></HTML>"
..- attr(*, "names")= chr [1:2] "<HTML><HEAD><meta
http-equiv=\"content-type\"
content=\"text/html;charset=utf-8\">\n<TITLE>302
Moved</TITLE></HEAD><BODY>\n<H1>"| __truncated__ "</BODY></HTML>"
So it is an external pointer with a number of classes. One or more of
those will have a print method. methods(print) will list all the print
methods, and I see there's a (hidden) print.XMLInternalDocument method
somewhere. Then
> getAnywhere("print.XMLInternalDocument")
A single object matching ‘print.XMLInternalDocument’ was found
It was found in the following places
registered S3 method for print from namespace XML
namespace:XML
with value
function (x, ...)
{
    cat(as(x, "character"), "\n")
}
<environment: namespace:XML>
shows that as() should work, even though as.character() doesn't, and
indeed as(website.doc[[1]], "character") does return the document as
text.
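
For anyone who wants to retrace these steps without fetching a live page,
here is a minimal sketch. It assumes the XML package is installed and
builds the document with htmlParse() from a made-up HTML string instead of
scrape(), so the names html, doc, txt and res are illustrative only. The
saveXML() call at the end is an alternative from the XML package that was
not part of the steps above.

```r
library(methods)  # provides as(); attached by default in most sessions
library(XML)      # assumed installed; supplies the *InternalDocument classes

# Hypothetical stand-in for a scraped page, parsed from a literal string
# so that no network access is needed.
html <- "<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"
doc <- htmlParse(html, asText = TRUE)

# Step 1: inspect the object. It is an external pointer with S3 classes.
str(doc)
typeof(doc)
class(doc)

# Step 2: look for methods attached to one of those classes.
methods(class = "XMLInternalDocument")
getAnywhere("print.XMLInternalDocument")

# Step 3: as.character() raised an error in the session quoted above, so it
# is wrapped in tryCatch() here and the message is kept rather than thrown.
res <- tryCatch(as.character(doc), error = function(e) conditionMessage(e))

# Step 4: use the coercion that the print method itself relies on.
txt <- as(doc, "character")
is.character(txt)

# Alternative: serialise the document directly with saveXML().
txt2 <- saveXML(doc)
```

Once the text is in txt it is an ordinary character vector, so the usual
tools such as grepl(), nchar() or writeLines() apply to it.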
Duncan Murdoch