[Bioc-devel] A handful of check to follow up on R CMD BiocCheck
Kevin RUE
kevinrue67 at gmail.com
Thu Nov 10 11:36:33 CET 2016
Hi Martin,
Thanks for spending the time and effort to refactor and integrate my code
into BiocCheck. I didn't intend to give you so much to think about !
For the record, I used the "checkFunctionLengths" method (now starting at
line 829 of checks.R) as a template to offer code somewhat homogenous with
what I could see in place already. I believe that function should also be
refactored using your new "Context" and "handleContext" helpers, right?
Side note:
Far from trying to compete with the BiocCheck Bioconductor package, I
couldn't help but update my "follow-up" "BiocCheckTools" Python code
<https://github.com/kevinrue/BiocCheckTools/blob/master/README.md> in the
meantime (sorry for unoriginal name, I'll take suggestions!) taking
Kasper's advice on board:
- Single master file which runs everything and gives a report.
- *Bonus*: each module can be run separately (line_chars; tab_width).
- All line numbers are reported
- Content of the line is not displayed
- I didn't mark up TABs, but I indicate how many spaces they are made of
(when not a multiple of 4)
All the best,
Kevin
On Wed, Nov 9, 2016 at 11:31 PM, Martin Morgan <
martin.morgan at roswellpark.org> wrote:
> On 11/03/2016 08:14 AM, Kevin RUE wrote:
>
>> Apologies for the additional spam, for two reasons:
>>
>> * The diff files that I've previously sent had the base and modified
>> versions swapped. This new one fixes that.
>> * This new diff file (always relative to the code I cloned from
>> Bioconductor-mirror) also fixes a bug whereby the updated code would
>> fail on packages that do not break any of the three guidelines.
>>
>
> Hi Kevin -- thanks for these, I incorporated a version in 1.11.1.
>
> The report is for the first 6 lines of offenders; the code for finding all
> offenders is mentioned in the vignette
>
> BiocCheck:::checkFormatting("path/to/YourPackage", nlines=Inf)
>
> Marcel Ramos mentions the lintr (CRAN) and goodpractice (github only,
> https://github.com/MangoTheCat/goodpractice) packages as useful resources
> for these types of questions; also formatR (CRAN).
>
> For the record, it was fun to work through your code. The basic pattern
> was to read lines and flag those that violated a (usually simple) test, add
> these to a data frame, then if after parsing all files the data frame had
> any rows report these to the user.
>
> I standardized the columns of the data frame across checks to facilitate
> re-use, and imposed the standard by using a constructor -- non-exported
> Context(). I took a little bit of pain to make the constructor work (return
> a data frame with the correct columns but no rows) when there are no lines
> of code flagged to be added to the context, so that I could use it without
> have to insert a bunch of conditional `if (...)` in my code that make it
> both harder to understand and harder to test. Also, the constructor is
> vectorized, whereas your implementation was iterative.
>
> For reporting to the user, I wrote a small helper handleContext() that I
> could re-use across the different types of checks (and likely elsewhere in
> the package) and also maintain the separation of what is reported to the
> user (e.g., in handleContext()) from how it is reported to the user --
> .msg, .verbatim, etc in the utils.R file. Probably the reporting could have
> been refactored further.
>
> I feel guilty about not writing unit tests; the package does include
> tests, including a handy package-generator function to make packages that
> are invalid in various ways. It seems like a better way to go would be to
> further refactor the body of the checkFormatting() function so that one
> could for instance invoke a function checkLineLengths(pkg, file, lines) and
> get back a data.frame created by Context(); it would then be easy to test
> these for various 'lines' without having to figure out how to make test
> packages. But this seemed of course like too much additional work for this
> feature (I'll probably regret not expending the effort at a future date).
>
> I also found myself using the horrible 'copy-and-append' paradigm
>
> ctxt = Context()
> for (fl in files) {
> ...
> ctxt = rbind(ctxt, Context(...)
> }
>
> which makes a copy of (the increasingly large) ctxt each time through the
> loop and therefore scales quadratically (when each file has some problems)
> with the number of files. I'm counting on the number of files to be small
> (e.g., less than 10000, when the copy-and-append pattern starts to be
> expensive even for trivial operations; mzR has ~8750 files, though many of
> these are not checked for formatting) so that this pattern does not become
> a bottleneck. It was still painful to use...
>
> Martin
>
>
>> Best,
>> Kevin
>>
>>
>> On Thu, Nov 3, 2016 at 11:49 AM, Kevin RUE <kevinrue67 at gmail.com
>> <mailto:kevinrue67 at gmail.com>> wrote:
>>
>> Hi all,
>>
>> Please find attached the diff relative to the code that I cloned
>> from Bioconductor-mirror yesterday (please ignore the previous diff
>> file).
>>
>> Basically three new features:
>>
>> * As per previous email: display up to the first 6 lines that are
>> over 80 characters long
>> * *New*: display up to the first 6 lines that are not indented by
>> a multiple of 4 spaces
>> * *New*: display up to the first 6 lines that use TAB instead of 4
>> spaces for indentation
>>
>> I also attach the output of the updated code
>> <https://github.com/kevinrue/BiocCheck>, to illustrate the changes.
>>
>> Notes:
>>
>> * For demonstration purpose, I indented a handful of lines from
>> the checks.R file itself with TAB characters. I assume that's
>> OK, as some lines were already longer than 80 characters and not
>> indented by a multiple of 4 spaces.
>>
>> All the best,
>> Kevin
>>
>>
>> On Wed, Nov 2, 2016 at 10:00 PM, Kevin RUE <kevinrue67 at gmail.com
>> <mailto:kevinrue67 at gmail.com>> wrote:
>>
>> Me again :)
>>
>> Please find attached the first patch to print the first 6 lines
>> over 80 characters long. (I'll get to the tabulation offenders
>> next).
>>
>> Note that all the offending lines are stored in the "df.length"
>> data.frame. How about an option like "fullReport=c(FALSE, TRUE)"
>> that print *all* the offending lines?
>> The data.frame also stores the content of the lines for the
>> record, but does not print them. I think Kasper is right:
>> filename and line should be enough to track down the line.
>>
>> All the best,
>> Kevin
>>
>>
>>
>> On Wed, Nov 2, 2016 at 8:08 PM, Kevin RUE <kevinrue67 at gmail.com
>> <mailto:kevinrue67 at gmail.com>> wrote:
>>
>> Thanks for the feedback!
>>
>> I also tend to prefer *all* the lines being reported (or to
>> be honest, that was really true when I had lots of them; a
>> problem that I largely mitigated by fixing all of them once
>> and subsequently paying more attention while developing).
>>
>> Printing the content of the offending line somewhat helps me
>> spot the line faster (more so for tab issues). But I must
>> admit that showing the whole line is somewhat "overkill". I
>> just started thinking of a compromise being to only show the
>> first N characters of the line, with N being 80 minus the
>> number of characters necessary to print the filename and
>> line number.
>>
>> Thanks Martin for pointing out the lines in BiocCheck. (Now
>> I feel bad for not having checked sooner.. hehe!)
>> I think the idea of BiocCheck showing the first 6 offenders
>> in BiocCheck quite nice, as I rarely have more since I use
>> using the RStudio "Tools > Global Options > Code > Display >
>> Show Margin > Margin column: 80" feature.
>>
>> I'll give a go at both approaches (developing BiocCheck and
>> my own scripts)
>>
>> Cheers,
>> Kevin
>>
>>
>> On Wed, Nov 2, 2016 at 7:41 PM, Kasper Daniel Hansen
>> <kasperdanielhansen at gmail.com
>> <mailto:kasperdanielhansen at gmail.com>> wrote:
>>
>> I would prefer all line numbers reported, but on the
>> other hand I am indifferent wrt. the content of the
>> line, unless (say) TABs are marked up somehow.
>>
>> Kasper
>>
>> On Wed, Nov 2, 2016 at 3:17 PM, Martin Morgan
>> <martin.morgan at roswellpark.org
>> <mailto:martin.morgan at roswellpark.org>> wrote:
>>
>> On 11/02/2016 02:49 PM, Kevin RUE wrote:
>>
>> Dear all,
>>
>> Just thought I'd share a handful of scripts that
>> I wrote to follow up on
>> certain NOTE messages thrown by R CMD BiocCheck.
>>
>> https://github.com/kevinrue/BiocCheckTools
>> <https://github.com/kevinrue/BiocCheckTools>
>>
>> They're very simple, but I occasionally find
>> them quite convenient.
>> Apologies if something similar already exists
>> somewhere :)
>>
>>
>> Maybe consider creating a diff against the source
>> code that, e.g., reported the first 6 offenders? The
>> relevant lines are near
>>
>> https://github.com/Bioconducto
>> r-mirror/BiocCheck/blob/master/R/checks.R#L1081
>> <https://github.com/Bioconduct
>> or-mirror/BiocCheck/blob/master/R/checks.R#L1081>
>>
>> Martin
>>
>>
>> All the best,
>> Kevin
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org
>> <mailto:Bioc-devel at r-project.org> mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >
>>
>>
>>
>> This email message may contain legally privileged
>> and/or...{{dropped:2}}
>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org
>> <mailto:Bioc-devel at r-project.org> mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>
>>
>>
>>
>>
>>
>>
>
> This email message may contain legally privileged and/or confidential
> information. If you are not the intended recipient(s), or the employee or
> agent responsible for the delivery of this message to the intended
> recipient(s), you are hereby notified that any disclosure, copying,
> distribution, or use of this email message is prohibited. If you have
> received this message in error, please notify the sender immediately by
> e-mail and delete this email message from your computer. Thank you.
>
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list