[R] help getting a research project started on regulations.gov
bgunter.4567 at gmail.com
Wed Feb 20 00:16:28 CET 2019
Please search for yourself first!

Searching "scrape JSON from web" at the rseek.org site produced what
appeared to be several relevant hits, especially this CRAN task view:
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Tue, Feb 19, 2019 at 3:07 PM Drake Gossi <drake.gossi at gmail.com> wrote:
> Hello everyone,
> I will be using R to manipulate this data.
> Specifically, it's comments on proposed changes to Title IX--over 11,000
> publicly available comments. The end goal is for me to tabulate each of
> these 11,000 comments in a CSV file, so I can begin to manipulate and
> visualize the data.
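A minimal sketch of that end goal in R: one row per comment in a data frame, written out as CSV. The IDs and column names below are made-up placeholders, not the API's actual field names.

```r
# Sketch: tabulating comments into a CSV (placeholder data, not real comments).
comments <- data.frame(
  id   = c("COMMENT-0001", "COMMENT-0002"),
  text = c("First comment text.", "Second comment text."),
  stringsAsFactors = FALSE
)

out <- file.path(tempdir(), "comments.csv")
write.csv(comments, out, row.names = FALSE)

# Round-trip check: read it back in.
back <- read.csv(out, stringsAsFactors = FALSE)
nrow(back)  # 2
```

Once all 11,000 rows are in a single data frame like this, the manipulation and visualization steps are ordinary R work.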
> But I'm not there yet. I just put in for an API key and, while I have one,
> I'm waiting for it to be activated. After that, though, I'm a little lost.
> Do I need to scrape the comments from the site? Or does having the API
> render that unnecessary? There is this interface
> <https://regulationsgov.github.io/developers/console/> that works with the
> API, but I don't know if, through it, I can get the data I need. I'm still
> trying to figure out what JSON is.
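On the JSON question: JSON is just nested name/value data, and jsonlite turns it into R lists and data frames, so with a working API key you should not need to scrape rendered pages at all. The endpoint and parameter names in the commented-out request are assumptions sketched from the v3 API documentation, not verified here.

```r
library(jsonlite)

# Hypothetical request sketch (endpoint and parameter names are assumptions),
# left commented out since it needs a live key:
# library(httr)
# resp <- GET("https://api.data.gov/regulations/v3/documents.json",
#             query = list(api_key = "YOUR_KEY", dktid = "YOUR-DOCKET-ID",
#                          rpp = 100))
# parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# What parsing the returned JSON looks like, on a tiny inline example:
sample_json <- '{"documents":[{"documentId":"ABC-0001","title":"Comment 1"}]}'
parsed <- fromJSON(sample_json)
parsed$documents$documentId  # "ABC-0001"
```

jsonlite flattens an array of objects like `documents` into a data frame, which is exactly the tabular shape wanted for the CSV.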
> Or, if I have to scrape the comments, can I do that with R? I can't get a
> straight answer from the Python people. I can't tell if I need to do this
> through Beautiful Soup or through Scrapy (or even if I need to do it at
> all, as I said...). The trouble with the comments is, they are each on
> their own URL, so--and again this is assuming that I will have to scrape
> them--I don't know how to code in order to grab all of the comments from
> all of the URLs.
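Each comment having its own URL is not an obstacle in R: build the URLs programmatically from the comment IDs and loop with `lapply`. The URL pattern below is a made-up placeholder; substitute whatever the API actually uses for a single comment.

```r
# Sketch: one URL per comment ID, built programmatically (placeholder pattern).
ids  <- sprintf("COMMENT-%04d", 1:3)
urls <- sprintf("https://example.org/comments/%s.json", ids)

# In real use, fetch each one politely (pause between requests):
# pages <- lapply(urls, function(u) { Sys.sleep(1); httr::GET(u) })

length(urls)  # 3
urls[1]
```

With ~11,000 URLs, adding a `Sys.sleep()` between requests and caching each response to disk is worth the trouble.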
> I am also trying to figure out how to isolate the essence of the comments
> in the HTML. From the Python people, I've heard the following:
> scrapy fetch 'url'
> will download the raw page you are interested in. And you can look at
> the raw source code. Important to appreciate that what you see in the
> browser is often processed in your browser before you see it.
> Of course, a scraper can do the same processing, but it's complicated.
> So, start by looking at the raw source code. Maybe you can grab what you
> need with simple parsing like Beautiful Soup does. Maybe you need to do
> more. Scrapy is your friend.
> Beautiful Soup is your friend here. It can analyze the data within
> 'modern' web pages, where the page is actually not just HTML. There are
> also full scraping frameworks; I think one is called Scrapy. Others here
> probably have experience with them.
> I think part of my problem relates to that yellow highlighted part of the
> HTML.
> I think what I might be looking for is a <div class="GIY1LSJIXD">, since
> that's where the hierarchy seems to taper off in the HTML for the comment
> I'm looking to scrape.
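If you do end up scraping, rvest is the usual R counterpart to Beautiful Soup. A sketch of pulling the text of a `<div>` by class, on an inline example page. Note that a class name like GIY1LSJIXD looks machine-generated (GWT-style) and may change between page loads, so treat the selector as an assumption to verify against the raw source.

```r
library(rvest)

# Inline stand-in for a downloaded page; in real use, page <- read_html(url).
page <- read_html('
  <html><body>
    <div class="GIY1LSJIXD">This is the comment text we want.</div>
    <div class="nav">Navigation junk.</div>
  </body></html>')

# Select by tag + class, then extract the text.
comment <- html_text(html_nodes(page, "div.GIY1LSJIXD"))
comment  # "This is the comment text we want."
```

If the raw source turns out not to contain the comment text at all (because the browser builds the page with JavaScript), that is the strongest sign to use the API instead of scraping.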
> What I'm trying to do here is locate the comment in the HTML so I can tell
> the request function to extract it.
> Any help anyone could offer here would be much appreciated. I'm very lost.
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.