[R] Intended use-case for data.matrix

Wed Nov 4 22:43:53 CET 2020

Hi Duncan,

Thanks; that's really useful info, and now that you point it out I completely agree that the description of the frame argument does make my original use invalid. I will pay closer attention to such details in future. Would you suggest sapply(..., as.numeric) is the most 'R'-y way of converting a character data frame to a numeric matrix, or is there a cleaner pattern?
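For reference, a minimal sketch of three base-R ways to do this conversion, using a cut-down copy of the example data from the original post (quoted below). None of these is presented as the recommended pattern; they are simply calls that exist in base R.

```r
# Cut-down copy of the example data: intensities stored as character strings
quant.df.char <- data.frame(
  Intensity.SampleA = c("919948", "1346170", "15870"),
  Intensity.SampleB = c("1625540", "710272", "83624"),
  stringsAsFactors = FALSE
)

# 1) Column-wise conversion; sapply() binds the columns into a matrix
m1 <- sapply(quant.df.char, as.numeric)

# 2) Convert to a character matrix first, then change the storage mode
#    in place; dimensions and dimnames are kept
m2 <- as.matrix(quant.df.char)
storage.mode(m2) <- "double"

# 3) Let type.convert() pick a type for each column, then call data.matrix()
#    on columns that are already numeric
m3 <- data.matrix(type.convert(quant.df.char, as.is = TRUE))
```

Variant 2 always returns a matrix, whereas sapply() simplifies to a plain vector when the data frame has only one row, so it may be the safer choice in scripts that handle inputs of varying size.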

Best wishes,

Phil

On 4 Nov 2020, at 20:37, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:

You can see the change to the help page here:

https://github.com/wch/r-source/commit/d1d3863d72613660727379dd5dffacad32ac9c35#diff-9143902e81e6ad39faace2d926725c4c72b078dd13fbb1223c4a35f833b58ee6

Before the change, it said the input should be

a data frame whose components are logical vectors, factors or numeric vectors

which suggests your input was invalid. But later it says

Logical and factor columns are converted to integers. Any other
column which is not numeric (according to \code{\link{is.numeric}}) is
converted by \code{\link{as.numeric}} or, for S4 objects,
\code{\link{as}(, "numeric")}.

which suggests what you were doing was supported.
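
The two passages can be compared against what data.matrix() actually does. A small sketch follows; the data frame `d` and its column names are made up for illustration. The result for the character column depends on the R version, so it is described in a comment rather than stated as fixed output.

```r
d <- data.frame(
  flag = c(TRUE, FALSE, TRUE),         # logical column
  grp  = factor(c("b", "a", "b")),     # factor column
  txt  = c("10", "9", "10"),           # character column
  stringsAsFactors = FALSE
)

m <- data.matrix(d)

# Logical and factor columns become integers in every version:
# TRUE/FALSE map to 1/0, and factors map to their internal level codes.
m[, "flag"]
m[, "grp"]

# The character column is where versions differ. From R 4.0.0 it goes
# through factor() first, so the values are level codes in the sort order of
# the strings; in earlier versions it went through as.numeric().
m[, "txt"]
```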

It's unfortunate that you didn't know about this change, but it was made in August 2019, and appeared on the news feed here:

https://developer.r-project.org/blosxom.cgi/R-devel/NEWS/2019/08/08#n2019-08-08

so some of the blame for this goes to you for not paying attention and not testing unreleased R versions.

To protect yourself against this kind of unpleasant surprise in the future, I'd suggest this:

- Follow the news feed.

- Put your code in a package, and test it against R-devel now and then. (If your package is on CRAN the testing will happen automatically; if it's not on CRAN and not in a package, you could still test against R-devel, but why make your life more difficult by *not* putting it in a package?)
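
As a sketch of what such a test might look like, the script below could live in a package's tests/ directory, where R CMD check runs it. The helper name to_numeric_matrix() is hypothetical, and plain stopifnot() is used instead of a testing framework to keep the example self-contained.

```r
# Hypothetical helper: convert an all-character data frame to a numeric matrix
to_numeric_matrix <- function(df) {
  m <- as.matrix(df)
  storage.mode(m) <- "double"
  m
}

input <- data.frame(
  Intensity.SampleA = c("919948", "1346170", "15870"),
  Intensity.SampleB = c("1625540", "710272", "83624"),
  stringsAsFactors = FALSE
)

result <- to_numeric_matrix(input)

# Fail loudly if a future R version changes the outcome of the conversion
stopifnot(
  is.matrix(result),
  is.double(result),
  identical(dim(result), c(3L, 2L)),
  identical(colnames(result), names(input)),
  isTRUE(all.equal(result[, "Intensity.SampleA"], c(919948, 1346170, 15870),
                   check.attributes = FALSE))
)
```

Running the same script under both the released R and R-devel would surface a change like the one discussed here before it reaches production workflows.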

Duncan Murdoch

On 04/11/2020 6:48 a.m., Philip Charles wrote:
> Hi R gurus,
>
> We do a lot of work with biological -omics datasets (genomics, proteomics etc). The text file inputs to R typically contain a mixture of (mostly) character data and numeric data. The number of columns (both character and numeric data) in the file varies with the number of samples measured (which makes use of colClasses awkward), so a typical approach might be
>
> 1) read in the whole file as a character matrix
>
> #simulated result of read.table (with stringsAsFactors=FALSE)
> raw <- data.frame(
>   Accession = c('P04637', 'P01375', 'P00761'),
>   Description = c('Cellular tumor antigen p53', 'Tumor necrosis factor', 'Trypsin'),
>   Species = c('Homo sapiens', 'Homo sapiens', 'Sus scrofa'),
>   Intensity.SampleA = c('919948', '1346170', '15870'),
>   Intensity.SampleB = c('1625540', '710272', '83624'),
>   Intensity.SampleC = c('1232780', '1481040', '62548'),
>   stringsAsFactors = FALSE  # match the read.table call described above on R < 4.0.0
> )
>
> 2) use grep to identify numeric columns based on column names and split the raw matrix
>
> QUANT_COLS <- grepl('^Intensity\\.',colnames(raw))
> META_COLS <- !QUANT_COLS
> quant.df.char <- raw[,QUANT_COLS]
> meta.df <- raw[, META_COLS]
>
> 3) convert the quantitation data frame to a numeric matrix
>
> Prior to R version 4, my standard method for step 3 was to use data.matrix(). After recently updating from v3.6.3, I've found that all my workflows using this function were giving wildly incorrect results. I figured out that data.matrix now yields a matrix of factor level codes rather than the actual numeric values:
>
> > quant.df.char
>   Intensity.SampleA Intensity.SampleB Intensity.SampleC
> 1            919948           1625540           1232780
> 2           1346170            710272           1481040
> 3             15870             83624             62548
>
> > data.matrix(quant.df.char)
>      Intensity.SampleA Intensity.SampleB Intensity.SampleC
> [1,]                 3                 1                 1
> [2,]                 1                 2                 2
> [3,]                 2                 3                 3
>
> The change in behaviour of this function is documented in the R v4.0.0 changelog, so it is clearly intentional:
>
> "data.matrix() now converts character columns to factors and from this to integers."
>
> Now, I know there are other ways to achieve the same conversion, e.g. sapply(quant.df.char, as.numeric). They aren't quite as straightforward to read in the code as data.matrix (with sapply/lapply in particular I have to think through whether there will be a need to transpose the result!), but the fact that this base function has been changed (without a way to replicate the previous behaviour) leads me to suspect that I have probably not previously been using data.matrix in the intended manner - and I may therefore be making similar mistakes elsewhere! I've certainly distributed/handed out R scripting examples in the past that will now give incorrect results when run on v4+ R.
>
> What is even more confusing to me (but possibly related as regards an answer) is that R v4 broke with long-standing convention to change default.stringsAsFactors() to FALSE. So on one hand the update took away what was (at least from our perspective, with our data; I am sure some here may disagree!) a perennial source of confusion/bugs for R learners, by not introducing string factorisation during data import, and then on the other hand changed a base function to explicitly introduce string factorisation. I can't see when converting a character dataset straight to numeric factor level codes, rather than to factors, might be that useful (but of course that doesn't mean it isn't!).
>
> I've had a look through r-help and r-devel archives and couldn't spot any discussion of this, so apologies if this has been asked before. I'm also pretty sure my misunderstanding is with the intended use-case of data.matrix and R ethos around strings/factors rather than the rationale for the change, which is why I'm asking here.
>
> Best wishes,
>
> Phil
>
> Philip Charles
> Target Discovery Institute, Nuffield Department Of Medicine
> University of Oxford
>