[R] Web scraping - Having trouble figuring out how to approach this problem

henrique monte henrique.monte66 at gmail.com
Wed Feb 22 23:52:55 CET 2017


Sometimes I need to pull data from the web and organize it into a data
frame, and I waste a lot of time doing it manually. I've been trying to
figure out how to streamline this process. I've tried a few R scraping
approaches but couldn't get them to work, and I suspect there is an
easier way to do this. Can anyone help me out?

Fictional example:

Here's a webpage with countries listed by continents:
https://simple.wikipedia.org/wiki/List_of_countries_by_continents

Each country name is also a link that leads to another webpage specific
to that country, e.g. https://simple.wikipedia.org/wiki/Angola.

As a final result, I would like a data frame with one observation (row)
per country listed and 4 variables (columns): ID = country name,
Continent = continent it belongs to, Language = official language (taken
from the country's own page) and Population = most recent population
count (also taken from the country's own page).
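Just to make the shape concrete, the first rows would look something like
this (Population left blank here because those values would come from the
individual country pages):

data.frame(ID         = c("Algeria", "Angola"),
           Continent  = c("Africa", "Africa"),
           Language   = c("Arabic", "Portuguese"),
           Population = c(NA, NA))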

...

The main issue I'm trying to figure out is handling several webpages.
Would it be possible to scrape the first page for the list of countries
together with the links to their individual pages, and then write a
function that runs a scraping command on each of those links to get the
specific data I'm looking for? A rough sketch of the idea is below.
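Something along these lines is roughly what I have in mind, using rvest.
The CSS selectors and the infobox labels are only guesses on my part and
would surely need adjusting for the real pages, and the Continent column
(which has to come from the headings the countries are listed under) is
not handled yet, so please treat this as a sketch rather than a finished
solution:

library(rvest)

list_url  <- "https://simple.wikipedia.org/wiki/List_of_countries_by_continents"
list_page <- read_html(list_url)

## Country links from the list page. The "li a" selector is a guess and
## would probably need narrowing so it does not also pick up navigation
## and reference links.
country_nodes <- html_nodes(list_page, "li a")
country_names <- html_text(country_nodes)
country_links <- paste0("https://simple.wikipedia.org",
                        html_attr(country_nodes, "href"))

## Visit one country page and pull the infobox fields I'm after.
## "table.infobox" and the "Official language" / "Population" labels are
## assumptions about how the country pages are laid out.
scrape_country <- function(url) {
  page <- read_html(url)
  rows <- html_nodes(page, "table.infobox tr")
  labels <- sapply(rows, function(r) {
    th <- html_nodes(r, "th")
    if (length(th) > 0) html_text(th[1]) else NA_character_
  })
  values <- sapply(rows, function(r) {
    td <- html_nodes(r, "td")
    if (length(td) > 0) html_text(td[1]) else NA_character_
  })
  pick <- function(pattern) {
    hit <- grep(pattern, labels, ignore.case = TRUE)
    if (length(hit) > 0) values[hit[1]] else NA_character_
  }
  data.frame(Language   = pick("official language"),
             Population = pick("population"),
             stringsAsFactors = FALSE)
}

## Run the function over every country link and stack the results.
## (A Sys.sleep() between requests would be polite for a real run.)
details <- do.call(rbind, lapply(country_links, scrape_country))
details$ID <- country_names

Is this the right general approach, or is there a cleaner way to go from
the list page to the individual pages?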

