[Bioc-devel] A handful of check to follow up on R CMD BiocCheck

Martin Morgan martin.morgan at roswellpark.org
Thu Nov 10 00:31:45 CET 2016

On 11/03/2016 08:14 AM, Kevin RUE wrote:
> Apologies for the additional spam, for two reasons:
>   * The diff files that I've previously sent had the base and modified
>     versions swapped. This new one fixes that.
>   * This new diff file (always relative to the code I cloned from
>     Bioconductor-mirror) also fixes a bug whereby the updated code would
>     fail on packages that do not break any of the three guidelines.

Hi Kevin -- thanks for these, I incorporated a version in 1.11.1.

The report is for the first 6 lines of offenders; the code for finding 
all offenders is mentioned in the vignette

   BiocCheck:::checkFormatting("path/to/YourPackage", nlines=Inf)

Marcel Ramos mentions the lintr (CRAN) and goodpractice (github only, 
https://github.com/MangoTheCat/goodpractice) packages as useful 
resources for these types of questions; also formatR (CRAN).

For the record, it was fun to work through your code. The basic pattern 
was to read lines and flag those that violated a (usually simple) test, 
add these to a data frame, then if after parsing all files the data 
frame had any rows report these to the user.

I standardized the columns of the data frame across checks to facilitate 
re-use, and imposed the standard by using a constructor -- non-exported 
Context(). I took a little bit of pain to make the constructor work 
(return a data frame with the correct columns but no rows) when there 
are no lines of code flagged to be added to the context, so that I could 
use it without have to insert a bunch of conditional `if (...)` in my 
code that make it both harder to understand and harder to test. Also, 
the constructor is vectorized, whereas your implementation was iterative.

For reporting to the user, I wrote a small helper handleContext() that I 
could re-use across the different types of checks (and likely elsewhere 
in the package) and also maintain the separation of what is reported to 
the user (e.g., in handleContext()) from how it is reported to the user 
-- .msg, .verbatim, etc in the utils.R file. Probably the reporting 
could have been refactored further.

I feel guilty about not writing unit tests; the package does include 
tests, including a handy package-generator function to make packages 
that are invalid in various ways. It seems like a better way to go would 
be to further refactor the body of the checkFormatting() function so 
that one could for instance invoke a function checkLineLengths(pkg, 
file, lines) and get back a data.frame created by Context(); it would 
then be easy to test these for various 'lines' without having to figure 
out how to make test packages. But this seemed of course like too much 
additional work for this feature (I'll probably regret not expending the 
effort at a future date).

I also found myself using the horrible 'copy-and-append' paradigm

    ctxt = Context()
    for (fl in files) {
        ctxt = rbind(ctxt, Context(...)

which makes a copy of (the increasingly large) ctxt each time through 
the loop and therefore scales quadratically (when each file has some 
problems) with the number of files. I'm counting on the number of files 
to be small (e.g., less than 10000, when the copy-and-append pattern 
starts to be expensive even for trivial operations; mzR has ~8750 files, 
though many of these are not checked for formatting) so that this 
pattern does not become a bottleneck. It was still painful to use...


> Best,
> Kevin
> On Thu, Nov 3, 2016 at 11:49 AM, Kevin RUE <kevinrue67 at gmail.com
> <mailto:kevinrue67 at gmail.com>> wrote:
>     Hi all,
>     Please find attached the diff relative to the code that I cloned
>     from Bioconductor-mirror yesterday (please ignore the previous diff
>     file).
>     Basically three new features:
>       * As per previous email: display up to the first 6 lines that are
>         over 80 characters long
>       * *New*: display up to the first 6 lines that are not indented by
>         a multiple of 4 spaces
>       * *New*: display up to the first 6 lines that use TAB instead of 4
>         spaces for indentation
>     I also attach the output of the updated code
>     <https://github.com/kevinrue/BiocCheck>, to illustrate the changes.
>     Notes:
>       * For demonstration purpose, I indented a handful of lines from
>         the checks.R file itself with TAB characters. I assume that's
>         OK, as some lines were already longer than 80 characters and not
>         indented by a multiple of 4 spaces.
>     All the best,
>     Kevin
>     On Wed, Nov 2, 2016 at 10:00 PM, Kevin RUE <kevinrue67 at gmail.com
>     <mailto:kevinrue67 at gmail.com>> wrote:
>         Me again :)
>         Please find attached the first patch to print the first 6 lines
>         over 80 characters long. (I'll get to the tabulation offenders
>         next).
>         Note that all the offending lines are stored in the "df.length"
>         data.frame. How about an option like "fullReport=c(FALSE, TRUE)"
>         that print *all* the offending lines?
>         The data.frame also stores the content of the lines for the
>         record, but does not print them. I think Kasper is right:
>         filename and line should be enough to track down the line.
>         All the best,
>         Kevin
>         On Wed, Nov 2, 2016 at 8:08 PM, Kevin RUE <kevinrue67 at gmail.com
>         <mailto:kevinrue67 at gmail.com>> wrote:
>             Thanks for the feedback!
>             I also tend to prefer *all* the lines being reported (or to
>             be honest, that was really true when I had lots of them; a
>             problem that I largely mitigated by fixing all of them once
>             and subsequently paying more attention while developing).
>             Printing the content of the offending line somewhat helps me
>             spot the line faster (more so for tab issues). But I must
>             admit that showing the whole line is somewhat "overkill". I
>             just started thinking of a compromise being to only show the
>             first N characters of the line, with N being 80 minus the
>             number of characters necessary to print the filename and
>             line number.
>             Thanks Martin for pointing out the lines in BiocCheck. (Now
>             I feel bad for not having checked sooner.. hehe!)
>             I think the idea of BiocCheck showing the first 6 offenders
>             in BiocCheck quite nice, as I rarely have more since I use
>             using the RStudio "Tools > Global Options > Code > Display >
>             Show Margin > Margin column: 80" feature.
>             I'll give a go at both approaches (developing BiocCheck and
>             my own scripts)
>             Cheers,
>             Kevin
>             On Wed, Nov 2, 2016 at 7:41 PM, Kasper Daniel Hansen
>             <kasperdanielhansen at gmail.com
>             <mailto:kasperdanielhansen at gmail.com>> wrote:
>                 I would prefer all line numbers reported, but on the
>                 other hand I am indifferent wrt. the content of the
>                 line, unless (say) TABs are marked up somehow.
>                 Kasper
>                 On Wed, Nov 2, 2016 at 3:17 PM, Martin Morgan
>                 <martin.morgan at roswellpark.org
>                 <mailto:martin.morgan at roswellpark.org>> wrote:
>                     On 11/02/2016 02:49 PM, Kevin RUE wrote:
>                         Dear all,
>                         Just thought I'd share a handful of scripts that
>                         I wrote to follow up on
>                         certain NOTE messages thrown by R CMD BiocCheck.
>                         https://github.com/kevinrue/BiocCheckTools
>                         <https://github.com/kevinrue/BiocCheckTools>
>                         They're very simple, but I occasionally find
>                         them quite convenient.
>                         Apologies if something similar already exists
>                         somewhere :)
>                     Maybe consider creating a diff against the source
>                     code that, e.g., reported the first 6 offenders? The
>                     relevant lines are near
>                     https://github.com/Bioconductor-mirror/BiocCheck/blob/master/R/checks.R#L1081
>                     <https://github.com/Bioconductor-mirror/BiocCheck/blob/master/R/checks.R#L1081>
>                     Martin
>                         All the best,
>                         Kevin
>                                 [[alternative HTML version deleted]]
>                         _______________________________________________
>                         Bioc-devel at r-project.org
>                         <mailto:Bioc-devel at r-project.org> mailing list
>                         https://stat.ethz.ch/mailman/listinfo/bioc-devel
>                         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>                     This email message may contain legally privileged
>                     and/or...{{dropped:2}}
>                     _______________________________________________
>                     Bioc-devel at r-project.org
>                     <mailto:Bioc-devel at r-project.org> mailing list
>                     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>                     <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

This email message may contain legally privileged and/or...{{dropped:2}}

More information about the Bioc-devel mailing list