[Rd] I have corrected a dead link in the treering documentation

Thomas Levine _ at thomaslevine.com
Fri Sep 1 15:23:47 CEST 2017


Martin Maechler writes:
> There may be one small problem:  IIUC, the wayback machine is a
> +- private endeavor and really great and phantastic but it does
> need (US? tax deductible) donations, https://archive.org/donate/,
> to continue thriving.
> This makes me hesitate a bit to link to it within the "base R"
> documentation.  But that may be wrong -- and I should really use
> it to *help* the project ?

I agree that the Wayback Machine is a private endeavor. After reviewing
other base library documentation, I have concluded that referencing it
in the base documentation would nonetheless be consistent with current
practice.

I share your concern regarding the support of other institutions, and
I have found some references that are more problematic to me than the
one of present interest. I would thus support an initiative to consider
the social implications of the different references and to adjust the
references accordingly.

Below I start by making a distinction between two types of references
that I think should be treated differently in terms of your concern.
Next, I assess whether there is a precedent for inclusion of references
to private publishers, as in the present patch; I conclude that there
is such a precedent. Then I present my opinion regarding the present
patch. Finally, I present some other considerations that I find relevant
to the discussion.

Distinguishing between two link types
-------------------------------------
For discussion of this issue, I think it is helpful to distinguish
between references to sources and references to other materials.

In the case of references to sources, there is little choice but to
reference the publisher, even though the overwhelming majority of
referenced publishers are private companies that impose restrictive
licenses on their journals and books and cannot reasonably be trusted
to maintain access to the materials or the availability of webpages.

With other references, it is possible to replace the reference
with a different document that contains similar information.

For example, if a function implements a method based on a particular
journal article, that article's citation needs to stay, even if the
journal is published by a private institution. On the other hand, if the
reference just provides context or suggestions related to usage, then
the reference is provided just as information and can be replaced.

Precedent for inclusion of private non-source materials
-------------------------------------------------------
The dead link of interest is only informational, not a citation of a
source, and so it could be replaced. So I assessed whether it would
match current practice to include it, and I concluded that there is
substantial precedent for inclusion of private reference materials other
than strict sources. Not having access to a good library at the moment,
I have limited my research on this matter to website references.

In SVN revision 73164, \url calls are distributed among 148 files, from
1 call to 13 calls per file, with a mean of 1.75 and a median of 1.

  grep '\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c | sort -n
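The mean and median quoted above can be recomputed from the output of
that pipeline. The following is a minimal sketch, not the command that
was originally used; the function name url_count_summary is made up for
illustration, and the sample input at the end is invented.

```shell
# Hypothetical helper: summarize the per-file counts printed by `uniq -c`.
# Reads lines of the form "  <count> <filename>" on standard input.
url_count_summary() {
  awk '{ print $1 }' | sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) { print "no input"; exit 1 }
      # Median: middle value, or the average of the two middle values.
      if (NR % 2) med = v[(NR + 1) / 2]
      else med = (v[NR / 2] + v[NR / 2 + 1]) / 2
      printf "files=%d min=%d max=%d mean=%.2f median=%g\n", NR, v[1], v[NR], sum / NR, med
    }'
}

# Usage against the R sources (not run here):
#   grep '\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c | url_count_summary

# Small made-up input, just to show the output format.
printf '   2 a.Rd\n   1 b.Rd\n   6 c.Rd\n   1 d.Rd\n' | url_count_summary
# prints: files=4 min=1 max=6 mean=2.50 median=1.5
```
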

The total number of library documentation files is 1419.

  find src/library/ -name \*.Rd | wc -l

I randomly selected 20 matching files for further study.

  % grep '\\url' src/library/*/*/*.Rd | 
    cut -d: -f1 | uniq -c | sort -R | head -n 20 | tee /tmp/rd
   2 src/library/grDevices/man/pdf.Rd
   1 src/library/base/man/taskCallbackNames.Rd
   1 src/library/stats/man/shapiro.test.Rd
   1 src/library/tcltk/man/TkWidgets.Rd
   2 src/library/graphics/man/assocplot.Rd
   1 src/library/base/man/sprintf.Rd
   6 src/library/base/man/regex.Rd
   3 src/library/datasets/man/HairEyeColor.Rd
   1 src/library/stats/man/optimize.Rd
   1 src/library/datasets/man/UKDriverDeaths.Rd
   1 src/library/utils/man/object.size.Rd
   1 src/library/utils/man/unzip.Rd
   1 src/library/base/man/dcf.Rd
   1 src/library/base/man/DateTimeClasses.Rd
   3 src/library/stats/man/GammaDist.Rd
   2 src/library/utils/man/maintainer.Rd
   2 src/library/base/man/libcurlVersion.Rd
   2 src/library/base/man/eigen.Rd
   2 src/library/base/man/chol2inv.Rd
   1 src/library/tools/man/update_pkg_po.Rd

From these 20 I composed a table whose statistical unit is the \url call,
with variables filename, url, type of reference, and type of publisher.
The following commands were helpful.

  sed -e 's/^[ 0-9]*//' /tmp/rd | xargs grep \\\\url |
    sed -e 's/$/::/' -e 's/:.*\\url./:/' > urls.csv
  sed 's/^[ 0-9]*//' /tmp/rd | xargs grep -A5 -B5 \\\\url | less

I realized that I need to be a bit more precise about what I mean by a
"source". I wound up grouping the type of reference for \url calls into
the following categories.

1 Necessary sources, such as the specific file from which an algorithm
  or dataset was copied (as in stats/man/optimize.Rd)
2 Upstream documentation for bound libraries (tcltk/man/TkWidgets.Rd)
3 Extra information, such as tutorials on portable programming
  referenced in base/man/sprintf.Rd
4 Ambiguous, such as a general introduction on the topic that may have
  been used during the development of the function or may have been
  added just as further documentation (as in grDevices/man/pdf.Rd).
  These references did not include the date on which the webpage was
  accessed, so they aren't clear enough to count as source references
  even if they were in fact used during the development of the function.
5 Comments (stats/man/shapiro.test.Rd) and duplicates
  (stats/man/GammaDist.Rd)

Earlier, I distinguished between references to sources and references to
other materials. I think that the first and second categories should be
considered the source type references and the third and fourth should be
considered the non-source type references.

I separated publisher types into the following categories.

* academic (I think that they were all public universities, but I did
  not check very thoroughly.)
* government
* private
* R project

The resulting categorization was as follows (urls.csv and urls.r are
attached).

             publisher
source        academic government private r-project
  1 necessary        8          0       5         3
  2 upstream         3          0       6         0
  3 extra            0          0       1         1
  4 ambiguous        0          1       4         0
  5 ignore           0          1       1         1

The references of concern are the replaceable references (types 3 and 4)
to private publishers, which account for 5 out of the 35 \url calls
and 5 of the 20 files. To fit the table in an email, I have truncated
the URLs to their domains.

                             filename                  domain
     src/library/grDevices/man/pdf.Rd        en.wikipedia.org
      src/library/base/man/sprintf.Rd developer.r-project.org
        src/library/base/man/regex.Rd        perldoc.perl.org
 src/library/utils/man/object.size.Rd        en.wikipedia.org
   src/library/stats/man/GammaDist.Rd        en.wikipedia.org
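A table like the one above can be derived from a file laid out like the
attached urls.csv. This is only a sketch under stated assumptions: the
function name replaceable_private_domains is hypothetical, the input is
assumed to have the columns filename,url,source,publisher with no commas
inside any field, and the three data rows shown are copied from the
attachment purely as sample input.

```shell
# Hypothetical helper: list filename and URL domain for references of
# type 3 or 4 whose publisher is "private".
replaceable_private_domains() {
  awk -F, 'NR > 1 && ($3 == 3 || $3 == 4) && $4 == "private" {
    split($2, part, "/")   # part[3] is the host in scheme://host/path
    print $1, part[3]
  }'
}

# Sample input: a header line plus three rows from urls.csv.
replaceable_private_domains <<'EOF'
filename,url,source,publisher
src/library/grDevices/man/pdf.Rd,https://en.wikipedia.org/wiki/CMYK_color_model#Mapping_RGB_to_CMYK,4,private
src/library/grDevices/man/pdf.Rd,https://www.r-project.org/doc/Rnews/Rnews_2006-2.pdf,1,r-project
src/library/base/man/regex.Rd,http://perldoc.perl.org/perlre.html,4,private
EOF
# prints:
#   src/library/grDevices/man/pdf.Rd en.wikipedia.org
#   src/library/base/man/regex.Rd perldoc.perl.org
```
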

Note that the sprintf documentation is in fact a link to an r-project
page (https://developer.r-project.org/Portability.html) that has lots of
other links on it, including links to fortran.com, en.wikipedia.org,
pubs.opengroup.org, and people.redhat.com.

So we see that several \url calls reference private publishers even
though the links could be replaced with alternatives. By my
categorization, 5 out of 20 sampled files include a replaceable
reference to a private publisher. This corresponds to a 95% confidence
interval of 9 files to 65 files in the population of 148 matching
files, based on a bespoke t interval with finite population correction
(see urls.r); I wrote it by hand because I had trouble compiling the
sampling package.
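The arithmetic behind that interval can be checked without R. The sketch
below mirrors the formula in the attached urls.r; the function name
fpc_interval is hypothetical, and because awk has no equivalent of qt(),
the t quantile is passed in by hand (2.093 is an approximation of
qt(0.975, 19)).

```shell
# Hypothetical helper: t-based interval for a population count under
# simple random sampling without replacement.
# Arguments: N (population size), n (sample size), x (sample hits),
# tcrit (t quantile, supplied by the caller).
fpc_interval() {
  awk -v N="$1" -v n="$2" -v x="$3" -v tcrit="$4" 'BEGIN {
    p = x / n
    fpc = sqrt((N - n) / (N - 1))            # finite population correction
    se = sqrt(p * (1 - p)) / sqrt(n) * fpc   # standard error of p, as in urls.r
    printf "%.0f %.0f\n", N * (p - tcrit * se), N * (p + tcrit * se)
  }'
}

# 148 matching files, 20 sampled, 5 with a replaceable private reference.
fpc_interval 148 20 5 2.093
# prints: 9 65
```
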

I briefly looked through the full population of Rd files, and I got the
impression that this sort of private reference may be restricted to
just a few publishers, with Wikimedia possibly being the most prominent.

To summarize, several other documentation files already reference
private publishers, and the set of publishers is small enough that
it would be feasible to review each publisher in order to consider
whether the references to it should be replaced with alternatives.

Opinion regarding the present patch
-----------------------------------
I think that linking to the Wayback Machine, by the Internet Archive, is
consistent with the practice in many other base libraries and that it is
thus acceptable.

At present, base makes no references to the Wayback Machine but makes
several references to English Wikipedia. An even more consistent option
is thus to link to the English Wikipedia article for Great Basin
bristlecone pine (https://en.wikipedia.org/wiki/Pinus_longaeva) or for
Methuselah (https://en.wikipedia.org/wiki/Methuselah_(tree)) instead of
the Wayback Machine page that I reference in the patch. (Note that the
treering data are from a tree in the Methuselah Walk but not from
Methuselah itself.)

On the other hand, if we are to avoid referencing private institutions
unnecessarily, we should create a broader initiative to replace private
non-source references in base documentation. For me, more worrisome than
references to the Open Group or to Wikimedia are the references to the
private company GitHub, as in utils/man/tar.Rd; aside from the social
implications of supporting a private company whose repository hosting
service has been assessed by the Free Software Foundation as unethical
(https://www.gnu.org/software/repo-criteria-evaluation.html), I do not
even trust in the long-term availability of its webpages.

And of course, if we do not correct the dead link in treering, I think
we should remove the dead link. We can optionally replace it with a very
short description of Great Basin bristlecone pines.

Further discussion
------------------

RESTRICTION CRITERIA

If R is to have formal restrictions as to what sorts of references may
be included in the base documentation, I think that private versus
public is not an appropriate criterion. To start, private universities
may be similarly acceptable to public universities, and certain
government institutions may be problematic. Also, most of the present
references to software specifications refer to private institutions.
Considering the goals of the R project and its status as a component
of the GNU project, I think that it would make more sense for the
criteria to be based on the license of the referenced work, rather than
on characteristics of the legal entity that has published it.

AVOIDING LINKS

For practical reasons, I think it would be nice to avoid the sort of
link that we are presently discussing and instead to distribute the
contents of that link. If the contents are incorporated into R, then
dead links are not an issue, we are free to edit the extra documentation
that otherwise would have been linked, and users can view the
documentation without an internet connection. I think that the datasets
documentation, in particular, could benefit substantially from a few
sentences of context being added to each documentation file.

That said, it is possible that this would be enough work that it would
not be worthwhile; this extra documentation could easily become much
larger than the rest of the R source code, especially if images are
included as in the case of the Methuselah Walk photographs, so
implementing this would be more involved than simply obtaining
acceptable licenses on the extra documentation and copying passages to
Rd files.
-------------- next part --------------
filename,url,source,publisher
src/library/grDevices/man/pdf.Rd,https://en.wikipedia.org/wiki/CMYK_color_model#Mapping_RGB_to_CMYK,4,private
src/library/grDevices/man/pdf.Rd,https://www.r-project.org/doc/Rnews/Rnews_2006-2.pdf,1,r-project
src/library/base/man/taskCallbackNames.Rd,https://developer.r-project.org/TaskHandlers.pdf,3,r-project
src/library/stats/man/shapiro.test.Rd,http://lib.stat.cmu.edu/apstat/R94,5,private
src/library/tcltk/man/TkWidgets.Rd,http://www.tkdocs.com,2,private
src/library/graphics/man/assocplot.Rd,http://www.math.yorku.ca/SCS/sugi/sugi17-paper.html,1,academic
src/library/graphics/man/assocplot.Rd,http://epub.wu.ac.at/dyn/openURL?id=oai:epub.wu-wien.ac.at:epub-wu-01_8a1,1,academic
src/library/base/man/sprintf.Rd,https://developer.r-project.org/Portability.html,3,private
src/library/base/man/regex.Rd,http://www.pcre.org,2,private
src/library/base/man/regex.Rd,http://www.pcre.org/original/doc/html/,2,private
src/library/base/man/regex.Rd,http://laurikari.net/tre/documentation/regex-syntax/,2,private
src/library/base/man/regex.Rd,http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html,1,private
src/library/base/man/regex.Rd,http://www.pcre.org/original/pcre.txt,1,private
src/library/base/man/regex.Rd,http://perldoc.perl.org/perlre.html,4,private
src/library/datasets/man/HairEyeColor.Rd,http://euclid.psych.yorku.ca/ftp/sas/vcd/catdata/haireye.sas,1,academic
src/library/datasets/man/HairEyeColor.Rd,http://www.math.yorku.ca/SCS/sugi/sugi17-paper.html,1,academic
src/library/datasets/man/HairEyeColor.Rd,http://www.math.yorku.ca/SCS/Papers/asa92.html,1,academic
src/library/stats/man/optimize.Rd,http://www.netlib.org/fmm/fmin.f,1,academic
src/library/datasets/man/UKDriverDeaths.Rd,http://www.ssfpack.com/dkbook/,1,private
src/library/utils/man/object.size.Rd,https://en.wikipedia.org/wiki/Binary_prefix,4,private
src/library/utils/man/unzip.Rd,http://zlib.net,1,private
src/library/base/man/dcf.Rd,https://www.debian.org/doc/debian-policy/ch-controlfields.html,1,private
src/library/base/man/DateTimeClasses.Rd,https://www.r-project.org/doc/Rnews/Rnews_2001-2.pdf,1,r-project
src/library/stats/man/GammaDist.Rd,https://en.wikipedia.org/wiki/Incomplete_gamma_function,4,private
src/library/stats/man/GammaDist.Rd,http://dlmf.nist.gov/8.2#i,4,government
src/library/stats/man/GammaDist.Rd,http://dlmf.nist.gov/,5,government
src/library/utils/man/maintainer.Rd,https://stat.ethz.ch/pipermail/r-help/2010-February/230027.html,1,r-project
src/library/utils/man/maintainer.Rd,http://n4.nabble.com/R-help-question-How-can-we-enable-useRs-to-contribute-corrections-to-help-files-faster-tp1572568p1572868.html,5,r-project
src/library/base/man/libcurlVersion.Rd,http://curl.haxx.se/docs/sslcerts.html,2,private
src/library/base/man/libcurlVersion.Rd,http://curl.haxx.se/docs/ssl-compared.html,2,private
src/library/base/man/eigen.Rd,http://www.netlib.org/lapack,1,academic
src/library/base/man/eigen.Rd,http://www.netlib.org/lapack/lug/lapack_lug.html,2,academic
src/library/base/man/chol2inv.Rd,http://www.netlib.org/lapack,1,academic
src/library/base/man/chol2inv.Rd,http://www.netlib.org/lapack/lug/lapack_lug.html,2,academic
src/library/tools/man/update_pkg_po.Rd,https://www.stats.ox.ac.uk/pub/Rtools/goodies/gettext-tools.zip,2,academic
-------------- next part --------------
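# Read text columns as character; recent R versions do this by default.
urls <- read.csv('urls.csv', stringsAsFactors = FALSE)

# Cross-tabulate reference type against publisher type.
urls.tab <- function(urls) {
  urls$source <- factor(urls$source)
  levels(urls$source) <- paste(levels(urls$source), c('necessary', 'upstream', 'extra', 'ambiguous', 'ignore'))
  print(table(urls[c('source', 'publisher')]))
}

urls.tab(urls)
# Replaceable references (types 3 and 4) to private publishers
interesting <- (urls$source==3|urls$source==4) & (urls$publisher=='private')
urls.interesting <- urls[interesting,1:2]

N <- 148                                        # files containing \url
n <- length(unique(urls$filename))              # sampled files
x <- length(unique(urls.interesting$filename))  # sampled files with such a reference
p <- x/n
fpc <- sqrt((N-n)/(N-1))                        # finite population correction
se <- (sqrt(p*(1-p))/sqrt(n)) * fpc
t.crit <- qt(1-.025, n-1)                       # renamed so that base t() is not masked

# 95% confidence interval for the number of such files among the N
print(round(N*(p+c(-1,1)*t.crit*se)))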

