[Bioc-devel] A handful of check to follow up on R CMD BiocCheck

Thu Nov 10 11:36:33 CET 2016

Hi Martin,

Thanks for spending the time and effort to refactor and integrate my code
into BiocCheck. I didn't intend to give you so much to think about !

For the record, I used the "checkFunctionLengths" method (now starting at
line 829 of checks.R) as a template to offer code somewhat homogenous with
what I could see in place already. I believe that function should also be
refactored using your new "Context" and "handleContext" helpers, right?

Side note:
Far from trying to compete with the BiocCheck Bioconductor package, I
couldn't help but update my "follow-up" "BiocCheckTools" Python code
<https://github.com/kevinrue/BiocCheckTools/blob/master/README.md> in the
meantime (sorry for unoriginal name, I'll take suggestions!) taking
Kasper's advice on board:

   - Single master file which runs everything and gives a report.
   - *Bonus*: each module can be run separately (line_chars; tab_width).
   - All line numbers are reported
   - Content of the line is not displayed
   - I didn't mark up TABs, but I indicate how many spaces they are made of
   (when not a multiple of 4)

All the best,
Kevin

On Wed, Nov 9, 2016 at 11:31 PM, Martin Morgan <
martin.morgan at roswellpark.org> wrote:

> On 11/03/2016 08:14 AM, Kevin RUE wrote:
>
>> Apologies for the additional spam, for two reasons:
>>
>>   * The diff files that I've previously sent had the base and modified
>>     versions swapped. This new one fixes that.
>>   * This new diff file (always relative to the code I cloned from
>>     Bioconductor-mirror) also fixes a bug whereby the updated code would
>>     fail on packages that do not break any of the three guidelines.
>>
>
> Hi Kevin -- thanks for these, I incorporated a version in 1.11.1.
>
> The report is for the first 6 lines of offenders; the code for finding all
> offenders is mentioned in the vignette
>
>   BiocCheck:::checkFormatting("path/to/YourPackage", nlines=Inf)
>
> Marcel Ramos mentions the lintr (CRAN) and goodpractice (github only,
> https://github.com/MangoTheCat/goodpractice) packages as useful resources
> for these types of questions; also formatR (CRAN).
>
> For the record, it was fun to work through your code. The basic pattern
> was to read lines and flag those that violated a (usually simple) test, add
> these to a data frame, then if after parsing all files the data frame had
> any rows report these to the user.
>
> I standardized the columns of the data frame across checks to facilitate
> re-use, and imposed the standard by using a constructor -- non-exported
> Context(). I took a little bit of pain to make the constructor work (return
> a data frame with the correct columns but no rows) when there are no lines
> of code flagged to be added to the context, so that I could use it without
> have to insert a bunch of conditional `if (...)` in my code that make it
> both harder to understand and harder to test. Also, the constructor is
> vectorized, whereas your implementation was iterative.
>
> For reporting to the user, I wrote a small helper handleContext() that I
> could re-use across the different types of checks (and likely elsewhere in
> the package) and also maintain the separation of what is reported to the
> user (e.g., in handleContext()) from how it is reported to the user --
> .msg, .verbatim, etc in the utils.R file. Probably the reporting could have
> been refactored further.
>
> I feel guilty about not writing unit tests; the package does include
> tests, including a handy package-generator function to make packages that
> are invalid in various ways. It seems like a better way to go would be to
> further refactor the body of the checkFormatting() function so that one
> could for instance invoke a function checkLineLengths(pkg, file, lines) and
> get back a data.frame created by Context(); it would then be easy to test
> these for various 'lines' without having to figure out how to make test
> packages. But this seemed of course like too much additional work for this
> feature (I'll probably regret not expending the effort at a future date).
>
> I also found myself using the horrible 'copy-and-append' paradigm
>
>    ctxt = Context()
>    for (fl in files) {
>        ...
>        ctxt = rbind(ctxt, Context(...)
>     }
>
> which makes a copy of (the increasingly large) ctxt each time through the
> loop and therefore scales quadratically (when each file has some problems)
> with the number of files. I'm counting on the number of files to be small
> (e.g., less than 10000, when the copy-and-append pattern starts to be
> expensive even for trivial operations; mzR has ~8750 files, though many of
> these are not checked for formatting) so that this pattern does not become
> a bottleneck. It was still painful to use...
>
> Martin
>
>
>> Best,
>> Kevin
>>
>>
>> On Thu, Nov 3, 2016 at 11:49 AM, Kevin RUE <kevinrue67 at gmail.com
>> <mailto:kevinrue67 at gmail.com>> wrote:
>>
>>     Hi all,
>>
>>     Please find attached the diff relative to the code that I cloned
>>     from Bioconductor-mirror yesterday (please ignore the previous diff
>>     file).
>>
>>     Basically three new features:
>>
>>       * As per previous email: display up to the first 6 lines that are
>>         over 80 characters long
>>       * *New*: display up to the first 6 lines that are not indented by
>>         a multiple of 4 spaces
>>       * *New*: display up to the first 6 lines that use TAB instead of 4
>>         spaces for indentation
>>
>>     I also attach the output of the updated code
>>     <https://github.com/kevinrue/BiocCheck>, to illustrate the changes.
>>
>>     Notes:
>>
>>       * For demonstration purpose, I indented a handful of lines from
>>         the checks.R file itself with TAB characters. I assume that's
>>         OK, as some lines were already longer than 80 characters and not
>>         indented by a multiple of 4 spaces.
>>
>>     All the best,
>>     Kevin
>>
>>
>>     On Wed, Nov 2, 2016 at 10:00 PM, Kevin RUE <kevinrue67 at gmail.com
>>     <mailto:kevinrue67 at gmail.com>> wrote:
>>
>>         Me again :)
>>
>>         Please find attached the first patch to print the first 6 lines
>>         over 80 characters long. (I'll get to the tabulation offenders
>>         next).
>>
>>         Note that all the offending lines are stored in the "df.length"
>>         data.frame. How about an option like "fullReport=c(FALSE, TRUE)"
>>         that print *all* the offending lines?
>>         The data.frame also stores the content of the lines for the
>>         record, but does not print them. I think Kasper is right:
>>         filename and line should be enough to track down the line.
>>
>>         All the best,
>>         Kevin
>>
>>
>>
>>         On Wed, Nov 2, 2016 at 8:08 PM, Kevin RUE <kevinrue67 at gmail.com
>>         <mailto:kevinrue67 at gmail.com>> wrote:
>>
>>             Thanks for the feedback!
>>
>>             I also tend to prefer *all* the lines being reported (or to
>>             be honest, that was really true when I had lots of them; a
>>             problem that I largely mitigated by fixing all of them once
>>             and subsequently paying more attention while developing).
>>
>>             Printing the content of the offending line somewhat helps me
>>             spot the line faster (more so for tab issues). But I must
>>             admit that showing the whole line is somewhat "overkill". I
>>             just started thinking of a compromise being to only show the
>>             first N characters of the line, with N being 80 minus the
>>             number of characters necessary to print the filename and
>>             line number.
>>
>>             Thanks Martin for pointing out the lines in BiocCheck. (Now
>>             I feel bad for not having checked sooner.. hehe!)
>>             I think the idea of BiocCheck showing the first 6 offenders
>>             in BiocCheck quite nice, as I rarely have more since I use
>>             using the RStudio "Tools > Global Options > Code > Display >
>>             Show Margin > Margin column: 80" feature.
>>
>>             I'll give a go at both approaches (developing BiocCheck and
>>             my own scripts)
>>
>>             Cheers,
>>             Kevin
>>
>>
>>             On Wed, Nov 2, 2016 at 7:41 PM, Kasper Daniel Hansen
>>             <kasperdanielhansen at gmail.com
>>             <mailto:kasperdanielhansen at gmail.com>> wrote:
>>
>>                 I would prefer all line numbers reported, but on the
>>                 other hand I am indifferent wrt. the content of the
>>                 line, unless (say) TABs are marked up somehow.
>>
>>                 Kasper
>>
>>                 On Wed, Nov 2, 2016 at 3:17 PM, Martin Morgan
>>                 <martin.morgan at roswellpark.org
>>                 <mailto:martin.morgan at roswellpark.org>> wrote:
>>
>>                     On 11/02/2016 02:49 PM, Kevin RUE wrote:
>>
>>                         Dear all,
>>
>>                         Just thought I'd share a handful of scripts that
>>                         I wrote to follow up on
>>                         certain NOTE messages thrown by R CMD BiocCheck.
>>
>>                         https://github.com/kevinrue/BiocCheckTools
>>                         <https://github.com/kevinrue/BiocCheckTools>
>>
>>                         They're very simple, but I occasionally find
>>                         them quite convenient.
>>                         Apologies if something similar already exists
>>                         somewhere :)
>>
>>
>>                     Maybe consider creating a diff against the source
>>                     code that, e.g., reported the first 6 offenders? The
>>                     relevant lines are near
>>
>>                     https://github.com/Bioconducto
>> r-mirror/BiocCheck/blob/master/R/checks.R#L1081
>>                     <https://github.com/Bioconduct
>> or-mirror/BiocCheck/blob/master/R/checks.R#L1081>
>>
>>                     Martin
>>
>>
>>                         All the best,
>>                         Kevin
>>
>>                                 [[alternative HTML version deleted]]
>>
>>                         _______________________________________________
>>                         Bioc-devel at r-project.org
>>                         <mailto:Bioc-devel at r-project.org> mailing list
>>                         https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>                         <https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >
>>
>>
>>
>>                     This email message may contain legally privileged
>>                     and/or...{{dropped:2}}
>>
>>
>>                     _______________________________________________
>>                     Bioc-devel at r-project.org
>>                     <mailto:Bioc-devel at r-project.org> mailing list
>>                     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>                     <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>
>>
>>
>>
>>
>>
>>
>
> This email message may contain legally privileged and/or confidential
> information.  If you are not the intended recipient(s), or the employee or
> agent responsible for the delivery of this message to the intended
> recipient(s), you are hereby notified that any disclosure, copying,
> distribution, or use of this email message is prohibited.  If you have
> received this message in error, please notify the sender immediately by
> e-mail and delete this email message from your computer. Thank you.
>

	[[alternative HTML version deleted]]