[R] Creating a web site checker using R

Enrico Schumann es sending from enricoschumann.net
Fri Aug 9 08:36:02 CEST 2019


>>>>> "Chris" == Chris Evans <chrishold using psyctc.org> writes:

    Chris> I use R a great deal but the huge web crawling power of
    Chris> it isn't an area I've used. I don't want to reinvent a
    Chris> cyberwheel and I suspect someone has done what I want.
    Chris> That is a program that would run once a day (easy for
    Chris> me to set up as a cron task) and would crawl a single
    Chris> root of a web site (mine) and get the file size and a
    Chris> CRC or some similar check value for each page as pulled
    Chris> off the site (and, obviously, I'd want it not to follow
    Chris> off site links). The other key thing would be for it to
    Chris> store the values and URLs and be capable of being run
    Chris> in "create/update database" mode or in "check pages"
    Chris> mode and for the change mode run to Email me a warning
    Chris> if a page changes.  The reason I want this is that two
    Chris> of my sites have recently had content "disappear":
    Chris> neither I nor the ISP can see what's happened and we
    Chris> are lacking the very useful diagnostic of the date when
    Chris> the change happened which might have mapped it some
    Chris> component of WordPress, plugins or themes having
    Chris> updated.

    Chris> I am failing to find anything such and all the services
    Chris> that offer site checking of this sort are prohibitively
    Chris> expensive for me (my sites are zero income and either
    Chris> personal or offering free utilities and information).

    Chris> If anyone has done this, or something similar, I'd love
    Chris> to hear if you were willing to share it.  Failing that,
    Chris> I think I will have to create this but I know it will
    Chris> take me days as this isn't my area of R expertise and
    Chris> as, to be brutally honest, I'm a pretty poor
    Chris> programmer.  If I go that way, I'm sure people may be
    Chris> able to point me to things I may be (legitimately) able
    Chris> to recycle in parts to help construct this.

    Chris> Thanks in advance,

    Chris> Chris

    Chris> -- 
    Chris> Chris Evans <chris using psyctc.org> Skype: chris-psyctc
    Chris> Visiting Professor, University of Sheffield <chris.evans using sheffield.ac.uk>
    Chris> I do some consultation work for the University of Roehampton <chris.evans using roehampton.ac.uk> and other places but this <chris using psyctc.org> remains my main Email address.
    Chris> I have "semigrated" to France, see: https://www.psyctc.org/pelerinage2016/semigrating-to-france/ if you want to book to talk, I am trying to keep that to Thursdays and my diary is now available at: https://www.psyctc.org/pelerinage2016/ecwd_calendar/calendar/
    Chris> Beware: French time, generally an hour ahead of UK.  That page will also take you to my blog which started with earlier joys in France and Spain!

Not an answer, but perhaps two pointers/ideas:

1) Since you know cron, I suppose you work on a
   Unix-like system, and you likely either have a
   programme called 'wget' installed or can easily
   install it. 'wget' has an option '--mirror', which
   lets you mirror a website; when recursing, it does
   not follow links to other hosts by default.
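   For illustration, a minimal sketch of driving 'wget'
   from R via system2(); the function name, URL and
   target directory are placeholders of my own, not an
   existing API:

   ```r
   ## Build the wget argument vector (hypothetical helper;
   ## adjust the options to taste).
   mirror_args <- function(site, destdir)
       c("--mirror",              ## recursion + timestamping
         "--no-parent",           ## do not ascend above the start URL
         "--directory-prefix", destdir,
         site)

   ## Run the mirror; returns wget's exit status.
   mirror_site <- function(site, destdir)
       system2("wget", mirror_args(site, destdir))

   ## e.g.  mirror_site("https://www.example.org", "site-mirror")
   ```

   Such a wrapper makes it easy to call the whole thing
   from a single Rscript invoked by cron.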

2) There is tools::md5sum() for computing checksums.
   You could store those in a file and, on later runs,
   check for changes in the files' content (e.g. by
   comparing the stored and current checksums, or via
   'diff').
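   A sketch of that bookkeeping, assuming the mirrored
   files all sit below one directory; the helper names
   are my own invention, not an existing API:

   ```r
   ## md5 checksum for every file below 'dir', named by path
   checksums <- function(dir) {
       files <- list.files(dir, recursive = TRUE, full.names = TRUE)
       tools::md5sum(files)
   }

   ## compare two named checksum vectors from successive runs
   compare_checksums <- function(old, new) {
       common <- intersect(names(old), names(new))
       list(added   = setdiff(names(new), names(old)),
            removed = setdiff(names(old), names(new)),
            changed = common[old[common] != new[common]])
   }
   ```

   Store the result of checksums() with saveRDS() after a
   "create/update database" run, readRDS() it in "check"
   mode, and mail yourself any non-empty component of the
   comparison (e.g. through the system 'mail' command via
   system2(), if available).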


regards
        Enrico
-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
