[R] Adding SORT to UNIQUE

Jeff Newmiller
Tue Dec 21 21:37:31 CET 2021


It is not about outlawing matrix notation; to the contrary, it is about consistency. For tibbles, [] always returns another tibble. If you wanted a column vector, you should have asked for a column vector. Does the fact that DF[ 1, ] yields a different type than DF[ , 1 ] and DF[ 1:2, ] satisfy your desire to "support" matrix notation? Matlab has no concept of vectors distinct from row or column matrices, but R tries too hard to blur the lines between vectors and matrix-like objects. The "drop" argument was a mistaken hack in defense of this failure to live with the difference between vectors, matrix-like objects, and data frames.
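
For anyone who wants to check what each indexing form returns, here is a minimal base-R sketch. The data frame DF and its column names are made up for illustration, and stringsAsFactors = FALSE is set only so the result is the same on older R versions. Tibbles are not run here, since that would need the tibble package.

```r
# Hypothetical example data; DF and its columns are illustrative only.
DF <- data.frame(x = c(3, 1, 2, 1),
                 y = c("b", "a", "c", "a"),
                 stringsAsFactors = FALSE)

# List-style indexing: [ keeps the data frame, [[ extracts the column.
class(DF[1])       # "data.frame"
class(DF[[1]])     # "numeric"

# Matrix-style indexing: the result type depends on what is selected.
class(DF[, 1])     # "numeric" (a single column is dropped to a vector)
class(DF[1, ])     # "data.frame"
class(DF[1:2, ])   # "data.frame"

# drop = FALSE keeps the data frame even for a single column.
class(DF[, 1, drop = FALSE])   # "data.frame"

# Sorting the distinct values of a column needs the vector, not the
# one-column data frame.
sort(unique(DF[[1]]))   # 1 2 3
sort(unique(DF[[2]]))   # "a" "b" "c"
```

The last two lines are the pattern from the original question: extract the column as a vector first, then apply unique() and sort().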

On December 21, 2021 10:09:14 AM PST, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>On 21/12/2021 12:53 p.m., Duncan Murdoch wrote:
>> On 21/12/2021 12:29 p.m., Jeff Newmiller wrote:
>>> It is a very rational choice, not a design flaw. I don't like every choice they have made for that class, but this one is very solid, and treating data frames as lists of columns consistently helps all of us.
>> I think outlawing matrix notation is a really bad idea.  It makes code
>> harder to read, and makes it much harder to switch to matrices, which
>> sometimes gives a huge speed boost to code.
>> 
>> For example, John Fox posted an example that showed that operations on
>> whole columns of dataframes are about twice as fast using list notation
>> as using matrix notation.  But for operating on whole rows, 
>
>... or on individual elements ...
>
>> matrices are
>> about 100 times faster than dataframes.  You shouldn't use notation that
>> makes the switch to matrices more difficult.
>> 
>> Duncan Murdoch
>> 
>>>
>>> On December 21, 2021 9:02:56 AM PST, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>>> On 21/12/2021 11:59 a.m., Jeff Newmiller wrote:
>>>>> Intuitive, perhaps, but noticeably slower. And it doesn't work on tibbles by design. Data frames are lists of columns.
>>>>
>>>> That's just one of the design flaws in tibbles, but not the worst one.
>>>>
>>>> Duncan Murdoch
>>>>
>>>>>
>>>>> On December 21, 2021 8:38:35 AM PST, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>>>>> On 21/12/2021 11:31 a.m., Duncan Murdoch wrote:
>>>>>>> On 21/12/2021 11:20 a.m., Stephen H. Dawson, DSL wrote:
>>>>>>>> Thanks for the reply.
>>>>>>>>
>>>>>>>> sort(unique(Data[1]))
>>>>>>>> Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing =
>>>>>>>> decreasing)) :
>>>>>>>>         undefined columns selected
>>>>>>>
>>>>>>> That's the wrong syntax:  Data[1] is not "column one of Data".  Use
>>>>>>> Data[[1]] for that, so
>>>>>>>
>>>>>>>        sort(unique(Data[[1]]))
>>>>>>
>>>>>> Actually, I'd probably recommend
>>>>>>
>>>>>>      sort(unique(Data[, 1]))
>>>>>>
>>>>>> instead.  This treats Data as a matrix rather than as a list.
>>>>>> Dataframes are lists that look like matrices, but to me the matrix
>>>>>> aspect is usually more intuitive.
>>>>>>
>>>>>> Duncan Murdoch
>>>>>>
>>>>>>>
>>>>>>> I think Rui already pointed out the typo in the quoted text below...
>>>>>>>
>>>>>>> Duncan Murdoch
>>>>>>>
>>>>>>>>
>>>>>>>> The recommended syntax did not work, as listed above.
>>>>>>>>
>>>>>>>> What I want is the sort of distinct column output. Again, the column may
>>>>>>>> be text or numbers. This is a huge analysis effort with data coming at
>>>>>>>> me from many different sources.
>>>>>>>>
>>>>>>>>
>>>>>>>> *Stephen Dawson, DSL*
>>>>>>>> /Executive Strategy Consultant/
>>>>>>>> Business & Technology
>>>>>>>> +1 (865) 804-3454
>>>>>>>> http://www.shdawson.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12/21/21 11:07 AM, Duncan Murdoch wrote:
>>>>>>>>> On 21/12/2021 10:16 a.m., Stephen H. Dawson, DSL via R-help wrote:
>>>>>>>>>> Thanks everyone for the replies.
>>>>>>>>>>
>>>>>>>>>> It is clear one either needs to write a function or put the unique
>>>>>>>>>> entries into another dataframe.
>>>>>>>>>>
>>>>>>>>>> It seems odd R cannot sort a list of unique column entries with ease.
>>>>>>>>>> Python and SQL can do it with ease.
>>>>>>>>>
>>>>>>>>> I've seen several responses that looked pretty simple.  It's hard to
>>>>>>>>> beat sort(unique(x)), though there's a fair bit of confusion about
>>>>>>>>> what you actually want.  Maybe you should post an example of the code
>>>>>>>>> you'd use in Python?
>>>>>>>>>
>>>>>>>>> Duncan Murdoch
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> QUESTION
>>>>>>>>>> Is there a simpler means other than the unique function to capture
>>>>>>>>>> distinct column entries, then sort that list?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Stephen Dawson, DSL*
>>>>>>>>>> /Executive Strategy Consultant/
>>>>>>>>>> Business & Technology
>>>>>>>>>> +1 (865) 804-3454
>>>>>>>>>> http://www.shdawson.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 12/20/21 5:53 PM, Rui Barradas wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> Inline.
>>>>>>>>>>>
>>>>>>>>>>> At 21:18 on 20/12/21, Stephen H. Dawson, DSL via R-help wrote:
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> sort(unique(Data[[1]]))
>>>>>>>>>>>>
>>>>>>>>>>>> This syntax provides row numbers, not column values.
>>>>>>>>>>>
>>>>>>>>>>> This is not right.
>>>>>>>>>>> The syntax Data[1] extracts a sub-data.frame, the syntax Data[[1]]
>>>>>>>>>>> extracts the column vector.
>>>>>>>>>>>
>>>>>>>>>>> As for my previous answer, it was not addressing the question, I
>>>>>>>>>>> misinterpreted it as being a question on how to sort by numeric order
>>>>>>>>>>> when the data is not numeric. Here is a, hopefully, complete answer.
>>>>>>>>>>> Still with package stringr.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> cols_to_sort <- 1:4
>>>>>>>>>>>
>>>>>>>>>>> Data2 <- lapply(Data[cols_to_sort], \(x){
>>>>>>>>>>>         stringr::str_sort(unique(x), numeric = TRUE)
>>>>>>>>>>> })
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Or using Avi's suggestion of writing a function to do all the work and
>>>>>>>>>>> simplify the lapply loop later,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> unisort2 <- function(vec, ...) stringr::str_sort(unique(vec), ...)
>>>>>>>>>>> Data2 <- lapply(Data[cols_to_sort], unisort2, numeric = TRUE)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>
>>>>>>>>>>> Rui Barradas
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Stephen Dawson, DSL*
>>>>>>>>>>>> /Executive Strategy Consultant/
>>>>>>>>>>>> Business & Technology
>>>>>>>>>>>> +1 (865) 804-3454
>>>>>>>>>>>> http://www.shdawson.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 12/20/21 11:58 AM, Stephen H. Dawson, DSL via R-help wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Running a simple syntax set to review entries in dataframe columns.
>>>>>>>>>>>>> Here is the working code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Data <- read.csv("./input/Source.csv", header = TRUE)
>>>>>>>>>>>>> describe(Data)
>>>>>>>>>>>>> summary(Data)
>>>>>>>>>>>>> unique(Data[1])
>>>>>>>>>>>>> unique(Data[2])
>>>>>>>>>>>>> unique(Data[3])
>>>>>>>>>>>>> unique(Data[4])
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to sort the unique entries. The data in the various
>>>>>>>>>>>>> columns are not all defined as numbers; some are text. I realize 1 and
>>>>>>>>>>>>> 10 will not sort properly, as the column is not defined as a number,
>>>>>>>>>>>>> but want to see what I have in the columns viewed as sorted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> QUESTION
>>>>>>>>>>>>> What is the best process to sort unique output, please?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> ______________________________________________
>>>>>>>>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>>>>> PLEASE do read the posting guide
>>>>>>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>> 
>

-- 
Sent from my phone. Please excuse my brevity.


