[R] Proposal: Archive UseR conference presentations at www.r-project.org/useR-yyyy

Friedrich Leisch friedrich.leisch at stat.uni-muenchen.de
Wed Sep 26 09:05:47 CEST 2007


>>>>> On Tue, 25 Sep 2007 09:17:23 -0500,
>>>>> hadley wickham (hw) wrote:

  >> > but I can understand your desire to do
  >> > that.  Perhaps just taking a static snapshot using something like
  >> > wget, and hosting that on the R-project website would be a good
  >> > compromise.
  >> 
  >> Hmm, wouldn't it be easier if the hosting institution made a tgz
  >> file? wget over HTTP is rather bad at resolving links etc.

  > Really?  I've always found it to be rather excellent.

Sorry, my statement was rather ambiguous: wget is excellent, but
mirroring via HTTP is terrible (administering CRAN for a couple of
years gives you more experience in that arena than you ever wanted to
have).

With "links" I meant symbolic links on the server filesystem. HTTP
does not make a difference between a symbolic link and a real
file/directory, so for every symbolic link you get a copy of the
target (and there is nothing wget can do about it AFAIK).
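
To make the difference concrete, here is a minimal sketch (assuming
Python and a hypothetical document root htdocs/; all names are
illustrative only): tar stores a symbolic link as a link, while an
HTTP mirror necessarily downloads the target's content once per path.

    import tarfile

    # Hypothetical layout: "htdocs/" is the conference site's document
    # root. tarfile stores symbolic links as links by default
    # (dereference=False), so a link to a shared stylesheet or image
    # directory stays a small link entry in the archive. An HTTP
    # mirror cannot reproduce this: the server serves the target's
    # content for every path, so wget stores one full copy per link.
    with tarfile.open("useR-200x-snapshot.tgz", "w:gz") as tar:
        tar.add("htdocs", arcname="useR-200x")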


  > The reason I suggest it is that unless you have some way to generate a
  > static copy of the site, you'll need to ensure that the R-project
  > supports any dynamic content.  For example, the useR 2008 site
  > uses some (fairly vanilla) PHP for including the header and
  > footer.

I don't care how the tgz file I receive is created, but it is
probably better if the local authors create (and check) it than if I
do. So it is no problem if the tarball is created using wget ... I
would just prefer not to do it myself.


  >> we could include a note on the top page that this is only a snapshot
  >> copy and have a link to the original site (in case something changes
  >> there).

  > That's reasonable, although it would be even better to have it on
  > every page.

Again, if the authors create the tarball, they can put the note
wherever they like. My thought was to add a "local copy from
200x-yy-zz" link to the list of conferences at www.R-project.org,
next to the links to the original sites.
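
If the note should instead appear on every page, stamping it into an
unpacked snapshot is a few lines of scripting. A rough sketch (again
assuming Python; the directory name, date placeholder, encoding, and
the naive string replacement on "<body>" are all illustrative):

    import pathlib

    NOTE = ('<p><em>Local copy from 200x-yy-zz; '
            'see the original site for updates.</em></p>')

    # Walk the unpacked snapshot and insert the note right after
    # <body> on every page, not just the top one. The bare string
    # replace is naive (it misses <body ...> with attributes) but
    # shows the idea.
    for page in pathlib.Path("useR-200x").rglob("*.html"):
        html = page.read_text(encoding="latin-1")
        page.write_text(html.replace("<body>", "<body>\n" + NOTE, 1),
                        encoding="latin-1")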

  >> > The one problem is setting up a redirect so that existing links and
  >> > google searches aren't broken.  This would need to be put in place at
  >> > least 6 months before the old website closed.
  >> 
  >> Yes, very good point, I didn't think about that. But the R site is
  >> searched very often, so material there appears rather quickly in
  >> Google searches. As for bookmarks: I don't want to remove the old
  >> site, just have an archive copy at a central location.

  > In that case, should it be labelled no-index as it's just a cache of
  > material that should be available elsewhere?  We need some
  > machine-readable way of indicating where the canonical resource is.
  > It's always frustrated me a little that, when googling for R
  > documentation, you find hundreds of copies of the same page hosted
  > at different sites.

Well, two copies are not as bad as hundreds. But material might be
found faster via the www.R-project.org site, because it ranks
surprisingly high in many Google searches.
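
Should the no-index labelling ever be wanted, a robots meta tag is
the standard machine-readable mechanism; in the same hypothetical
spirit as the sketch above:

    import pathlib

    META = '<meta name="robots" content="noindex">'

    # Mark every archived page as "do not index", so search engines
    # keep pointing at the original site rather than at the copy.
    for page in pathlib.Path("useR-200x").rglob("*.html"):
        html = page.read_text(encoding="latin-1")
        page.write_text(html.replace("<head>", "<head>\n" + META, 1),
                        encoding="latin-1")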

Best,
Fritz


