[R] help getting a research project started on regulations.gov

Bert Gunter bgunter.4567 at gmail.com
Wed Feb 20 00:16:28 CET 2019


Please search yourself first!

"scrape JSON from web" at the rseek.org site produced what appeared to be
several relevant hits,
especially this CRAN task view:
https://cran.r-project.org/web/views/WebTechnologies.html
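
For instance, httr plus jsonlite (both covered in that task view) handle the
basic request-and-parse step. A minimal sketch -- the URL here is just a
placeholder:

library(httr)
library(jsonlite)

## request a JSON resource and parse it into R structures
resp <- GET("https://example.com/api/items", query = list(page = 1))
stop_for_status(resp)   # error out on a bad HTTP status
dat <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(dat)                # inspect what came back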


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)


On Tue, Feb 19, 2019 at 3:07 PM Drake Gossi <drake.gossi at gmail.com> wrote:

> Hello everyone,
>
> I will be using R to manipulate this data:
> https://www.regulations.gov/docketBrowser?rpp=25&so=DESC&sb=commentDueDate&po=0&dct=PS&D=ED-2018-OCR-0064
> Specifically, it's the proposed changes to Title IX--over 11,000 publicly
> available comments. The end goal is to tabulate each of these 11,000
> comments in a csv file so I can begin to manipulate and visualize the
> data.
>
> But I'm not there yet. I just put in for an API key and, while I have one,
> I'm waiting for it to be activated. After that, though, I'm a little lost.
> Do I need to scrape the comments from the site? Or does having the API
> render that unnecessary? There is this interface
> <https://regulationsgov.github.io/developers/console/> that works with the
> API, but I don't know whether, through it, I can get the data I need. I'm
> still trying to figure out what JSON is.
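>
> From poking at that console, I think a request from R would look something
> like the sketch below. The endpoint and parameter names (dktid, dct, rpp,
> po) are my guesses from the console and from the docket URL above, so they
> need checking against the API documentation:
>
> library(httr)
> library(jsonlite)
>
> key  <- Sys.getenv("REGS_API_KEY")   # the key, once it is activated
> resp <- GET("https://api.data.gov/regulations/v3/documents.json",
>             query = list(api_key = key,
>                          dktid = "ED-2018-OCR-0064",  # the Title IX docket
>                          dct   = "PS",                # public submissions
>                          rpp   = 25,                  # results per page
>                          po    = 0))                  # page offset
> stop_for_status(resp)
> docs <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
> str(docs, max.level = 2)   # JSON parses into nested lists/data frames
>
> If that works, the plan would be to page through with po, stack the rows
> into one data frame, and write.csv() it out.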
>
> Or, if I have to scrape the comments, can I do that with R? I can't get a
> straight answer from the python people. I can't tell if I need to do this
> through Beautiful Soup or through Scrapy (or even if I need to do it at
> all, as I said...). The trouble with the comments is that each one sits on
> its own URL, so--again assuming I have to scrape them--I don't know how to
> write code that grabs all of the comments from all of the URLs.
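>
> Something like the loop below is what I can't quite picture. The two URLs
> and the ".comment-text" selector are made up for illustration -- the real
> document IDs and selector would have to come from the API or the page
> source:
>
> library(rvest)
>
> ## hypothetical per-comment URLs
> urls <- c("https://www.regulations.gov/document?D=ED-2018-OCR-0064-0001",
>           "https://www.regulations.gov/document?D=ED-2018-OCR-0064-0002")
> comments <- vapply(urls, function(u) {
>   page <- read_html(u)                          # fetch one comment page
>   html_text(html_node(page, ".comment-text"))   # placeholder selector
> }, character(1))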
>
> I'm also trying to figure out how to isolate the actual text of the
> comments in the html. From the python people, I've heard the following:
>
> scrapy fetch 'url'
> will download the raw page you are interested in, and you can look at
> the raw source code. It's important to appreciate that what you see in
> the browser has often been processed by your browser before you see it.
>
> Of course, a scraper can do the same processing, but it's complicated.
> So, start by looking at the raw source code. Maybe you can grab what you
> need with simple parsing like Beautiful Soup does. Maybe you need to do
> more. Scrapy is your friend.
>
> Beautiful Soup is your friend here. It can analyze the data within
> the html tags on your scraped page. But often javascript is used on
> 'modern' web pages, so the page is actually not just html but
> javascript that changes the html. For this you need another tool -- I
> think one is called Scrapy. Others here probably have experience with
> that.
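>
> The R equivalent of that "look at the raw source first" check seems to be
> something like the lines below; if the comment text doesn't show up in the
> raw response, the page is built by javascript and a plain scraper won't
> see it (people mention RSelenium for that case):
>
> library(httr)
> raw <- content(GET("https://www.regulations.gov/document?D=ED-2018-OCR-0064"),
>                as = "text", encoding = "UTF-8")
> cat(substr(raw, 1, 2000))   # eyeball the raw html the server actually sends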
>
> I think part of my problem relates to that highlighted point above about
> javascript changing the html. I was saying things like
>
> I think what I might be looking for is a div class = GIY1LSJIXD, since
> that's where the hierarchy seems to taper off in the html for the comment
> I'm looking to scrape.
>
>
> What I'm trying to do here is locate the comment in the html so I can tell
> the request function to extract it.
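>
> If that div really is in the raw html, rvest should be able to grab it by
> class -- though I gather machine-generated class names like GIY1LSJIXD can
> change whenever the site is rebuilt, so this is fragile at best:
>
> library(rvest)
>
> page <- read_html("https://www.regulations.gov/document?D=ED-2018-OCR-0064")
> html_text(html_nodes(page, "div.GIY1LSJIXD"))   # select by that class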
>
> Any help anyone could offer here would be much appreciated. I'm very lost.
>
> Drake
>


