[R] how to loop or lapply over a "XMLNodeSet" object with condition (if else)

Omar André Gonzáles Díaz oma.gonzales at gmail.com
Sun May 24 01:08:52 CEST 2015


Hi, R-Help members,

I'm doing some web scraping. This time I need the image URLs of the
products on an e-commerce site.
I can get the nodes where the URLs are, but extracting the URL itself
takes one additional step:

"src" vs "data-original": in the source code, some urls are in the "src"
attribute, while others in the "data-original" attribute.

How can I write a loop or an apply function that does the following: if the
node has a "data-original" attribute, do:

...     %>%
        html_attr("data-original")

else do:

...     %>%
         html_attr("src")


The result should be a character vector with the URLs.


My code:

1.- I can get the nodes for the images:


##########################################################
# This results in an "XMLNodeSet" object

library(rvest)

PCs <- html("http://www.linio.cl/computacion/pc-escritorio/") %>%
        html_nodes(".product-item-img") %>%
        html_nodes("img")

###########################################################


#for the attr "data-original"

PCs2 <- html("http://www.linio.cl/computacion/pc-escritorio/") %>%
        html_nodes(".product-item-img") %>%
        html_nodes("img") %>%
        html_attr("data-original")

This gives the URLs from the "data-original" attribute, and NA where a node
does not have that attribute.


#for the attr "src"


PCs3 <- html("http://www.linio.cl/computacion/pc-escritorio/") %>%
        html_nodes(".product-item-img") %>%
        html_nodes("img") %>%
        html_attr("src")


This gives the content of the "src" attribute. However, for some products the
URL I need is in the "data-original" attribute, not here.

#### combining both attribute names returns only NAs #####


PCs4 <- html("http://www.linio.cl/computacion/pc-escritorio/") %>%
        html_nodes(".product-item-img") %>%
        html_nodes("img") %>%
        html_attr("data-original|src")
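
A note on why this returns only NAs: html_attr() looks up one attribute by
its exact name, so "data-original|src" is treated as a single literal
attribute name rather than as a pattern. One possible workaround, sketched
here in base R, is to extract both attributes separately (as in PCs2 and
PCs3) and fall back to "src" wherever "data-original" is NA. The vectors
data_original and src are made-up stand-ins for PCs2 and PCs3, and the
example.com URLs are placeholders, not real results.

```r
# Hypothetical stand-ins for PCs2 (html_attr("data-original")) and
# PCs3 (html_attr("src")); the values are placeholders, not scraped data.
data_original <- c("http://example.com/a.jpg", NA, "http://example.com/c.jpg")
src           <- c("placeholder.gif", "http://example.com/b.jpg", "placeholder.gif")

# Prefer "data-original"; fall back to "src" where it is missing (NA).
urls <- ifelse(is.na(data_original), src, data_original)
print(urls)
```

This relies on both vectors coming from the same node set, so that they have
the same length and order.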



################################################

I've also tried something like this:

lapply(PCs, function(e) {
        if ("data-original" %in% i) {
                print("ok")
        }
})

but I get this:


 Error in match(x, table, nomatch = 0L) :
  'match' requires vector arguments



Thanks.
