[R] help getting a research project started on regulations.gov
Drake Gossi
drake.gossi sending from gmail.com
Tue Feb 19 21:30:20 CET 2019
Hello everyone,
I will be using R to manipulate this data:
<https://www.regulations.gov/docketBrowser?rpp=25&so=DESC&sb=commentDueDate&po=0&dct=PS&D=ED-2018-OCR-0064>
Specifically, it's the docket for the proposed changes to Title IX, with over
11,000 publicly available comments. The end goal is to tabulate all 11,000
comments in a CSV file so that I can begin to manipulate and visualize the
data.
But I'm not there yet. I just applied for an API key and, while I have one,
I'm waiting for it to be activated. After that, though, I'm a little lost.
Do I need to scrape the comments from the site, or does having the API make
that unnecessary? There is this interface
<https://regulationsgov.github.io/developers/console/> that works with the
API, but I don't know whether I can get the data I need through it. I'm still
trying to figure out what JSON is.
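For what it's worth, here is a minimal sketch of the kind of call I think the
API expects, based on the query string in my URL above. The endpoint and the
parameter names (dktid, dct, rpp, po) are assumptions on my part until the
key is active. (As I understand it, JSON is just nested name/value pairs as
text, and jsonlite::fromJSON() turns it into ordinary R lists and data
frames.)

    library(httr)
    library(jsonlite)

    api_key <- "YOUR_KEY_HERE"  # placeholder: my key, once it is activated

    resp <- GET(
      "https://api.data.gov/regulations/v3/documents.json",  # assumed endpoint
      query = list(
        api_key = api_key,
        dktid   = "ED-2018-OCR-0064",  # docket ID, from the URL above
        dct     = "PS",                # PS = public submissions (comments)
        rpp     = 25,                  # results per page
        po      = 0                    # page offset
      )
    )
    stop_for_status(resp)

    ## fromJSON() converts the JSON text into R lists / data frames
    parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
    str(parsed, max.level = 1)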
Or, if I do have to scrape the comments, can I do that with R? I can't get a
straight answer from the Python people; I can't tell whether I need to do
this through Beautiful Soup or through Scrapy (or whether I need to do it at
all, as I said). The trouble with the comments is that each one sits on its
own URL, so, again assuming I have to scrape them, I don't know how to write
code that grabs all of the comments from all of the URLs.
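If the API does work, I imagine the loop is over result pages rather than
over 11,000 individual URLs. Here is a sketch of what I have in mind, reusing
api_key from above; the "documents" field name and the 1,000-per-page limit
are guesses on my part:

    library(httr)
    library(jsonlite)

    get_page <- function(offset) {
      resp <- GET(
        "https://api.data.gov/regulations/v3/documents.json",
        query = list(api_key = api_key, dktid = "ED-2018-OCR-0064",
                     dct = "PS", rpp = 1000, po = offset)
      )
      stop_for_status(resp)
      ## "documents" as the field holding one row per comment is a guess
      fromJSON(content(resp, as = "text", encoding = "UTF-8"))$documents
    }

    offsets  <- seq(0, 11000, by = 1000)  # enough pages for ~11,000 comments
    pages    <- lapply(offsets, function(o) { Sys.sleep(1); get_page(o) })
    comments <- do.call(rbind, pages)     # stack pages into one data frame
    write.csv(comments, "title_ix_comments.csv", row.names = FALSE)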
I'm also trying to figure out how to isolate the essence of the comments in
the HTML. From the Python people, I've heard the following:
> scrapy fetch 'url'
>
> will download the raw page you are interested in, and you can look at the
> raw source code. It's important to appreciate that what you see in the
> browser is often processed in your browser before you see it. Of course, a
> scraper can do the same processing, but it's complicated. So, start by
> looking at the raw source code. Maybe you can grab what you need with
> simple parsing like Beautiful Soup does. Maybe you need to do more. Scrapy
> is your friend.
> Beautiful Soup is your friend here. It can analyze the data within the
> html tags on your scraped page. But often javascript is used on 'modern'
> web pages, so the page is actually not just html, but javascript that
> changes the html. For this you need another tool; I think one is called
> Scrapy. Others here probably have experience with that.
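In R, I think the equivalent of scrapy fetch is just downloading the raw
source with httr and looking at it. A sketch; the single-comment URL below is
a hypothetical example, not one I have verified:

    library(httr)

    ## hypothetical single-comment URL; substitute a real one from the docket
    url <- "https://www.regulations.gov/document?D=ED-2018-OCR-0064-0001"

    raw_html <- content(GET(url), as = "text", encoding = "UTF-8")
    cat(substr(raw_html, 1, 1000))  # eyeball the start of the raw source
    grepl("GIY1LSJIXD", raw_html)   # is that div class even in the raw HTML?

If that grepl() comes back FALSE, the comment text is being injected by
javascript and the raw source alone won't contain it, which I take to be the
point of the second quote.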
I think part of my problem relates to that last point, the one about
javascript changing the html (the part highlighted in yellow in my original
message). I was saying things like:
> I think what I might be looking for is a div class = GIY1LSJIXD, since
> that's where the hierarchy seems to taper off in the html for the comment
> I'm looking to scrape.
What I'm trying to do here is locate the comment in the HTML so that I can
tell the request function to extract it.
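In R, I gather rvest plays the role of Beautiful Soup, so the extraction
itself might look like the sketch below. Again the URL is hypothetical, and
class names like GIY1LSJIXD are auto-generated, so this is only the shape of
the thing:

    library(rvest)

    page <- read_html("https://www.regulations.gov/document?D=ED-2018-OCR-0064-0001")

    comment_text <- page %>%
      html_nodes("div.GIY1LSJIXD") %>%  # CSS selector: <div class="GIY1LSJIXD">
      html_text(trim = TRUE)
    comment_text  # character(0) here would mean the div is built by javascript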
Any help anyone could offer here would be much appreciated. I'm very lost.
Drake