[R] Adding SORT to UNIQUE

Tue Dec 21 20:17:03 CET 2021

Stephen,

Languages have their own philosophies and are often focused initially on doing specific things well. Later, they tend to accumulate additional functionality both in the base language and extensions.

I am wondering if you have explained your need precisely enough to get the answers you want. 

SQL and Python have their own ways and both have advantages but also huge deficiencies relative to just base R. 

But there are rules you live with, and if you choose, say, a data.frame to store things in, the columns must all be the same length. The unique members of each column are likely to differ in number, so storing them in a data.frame does not work. They can be stored quite a few other ways, such as a list of vectors.

And what is your definition of ease? I can program in Python and SQL and way over a hundred other languages and I know I need to adapt my thinking to the flow of the language and not the other way around. Base R was not designed to be like either SQL or Python. But it can be extended quite a few ways to do just about anything.

What you ran into, for example, is the fact that some functionality is more selective in what it works on. A data.frame with one column is logically the same as a matrix with one column and as a vector, but in reality they are not the same thing. Yes, they can be converted into each other fairly trivially. But sort() seems to care what you feed it. If you did not worry about efficiency, you could have a version of sort that accepts a wide variety of inputs, converts any it can to some common internal form, then converts the output back into the form it was received in, or uses an argument to specify the output format. It is not hard in R to make such a function, as it has the primitives needed to examine an arbitrary object and see what type and dimensions it has, and has utilities to do the conversion.
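To make that concrete, here is a rough sketch of such a wrapper. The name sort_any and its decreasing argument are my own invention for illustration, not anything in base R, and it only handles plain vectors, one-column matrices and one-column data.frames:

```r
# Hypothetical helper, not part of base R: sort a vector, a one-column
# matrix, or a one-column data.frame and hand back the same kind of object.
sort_any <- function(x, decreasing = FALSE) {
  # Keep NA values (placed last) so the length of the result never changes.
  do_sort <- function(v) sort(v, decreasing = decreasing, na.last = TRUE)
  if (is.data.frame(x)) {
    stopifnot(ncol(x) == 1)
    x[[1]] <- do_sort(x[[1]])
    rownames(x) <- NULL          # old row labels no longer match the rows
    x
  } else if (is.matrix(x)) {
    stopifnot(ncol(x) == 1)
    matrix(do_sort(x[, 1]), ncol = 1,
           dimnames = list(NULL, colnames(x)))
  } else {
    do_sort(x)
  }
}

sort_any(c(3, 1, 2))
sort_any(data.frame(a = c(3, 1, 2)))
sort_any(matrix(c(3, 1, 2), ncol = 1))
```

A real version would need to decide what to do with wider inputs, factors, lists and so on, which is exactly the cost in generality versus efficiency I mentioned.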

If you want a language that has calculated every possible combination of ways to combine functions and already made tens of thousands available, good luck. What languages (including Python and R) expect is for you to compose such combinations yourself in one of many ways. The annoying discussions here between purists and those wanting to use pre-made packages aside, your question can be handled in many of the ways we already discussed. They include making your own (often very small) function that implements consolidating the many steps into one logical step. It can mean using pipelines like the new "|>" operator recently added to base R or the older versions often used in the tidyverse packages like "%>%".
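As a reminder of how small such a function can be, here is a minimal sketch. The name unisort and the sample vectors are just for illustration:

```r
# Illustrative helper: the unique values of a vector, in sorted order.
# Extra arguments (for example decreasing = TRUE) are passed on to sort().
unisort <- function(vec, ...) sort(unique(vec), ...)

x <- c(3, 1, 3, 2, 1)
unisort(x)                      # 1 2 3
unisort(x, decreasing = TRUE)   # 3 2 1
unisort(c("b", "a", "b", "c"))
```

Once defined, the two steps become one logical step you can call on any column.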

You want to take a data.frame, select a column at a time, and ask for it to be made into unique values, then ordered and shown. So you want a VECTOR, and your initial use of the "[" operator does not take the underlying list structure of a data.frame apart the way you might have thought; it returns a narrow, one-column data.frame. So you MAY need to either extract the column using "[[" or use various routines R supplies like unlist() or as.vector().
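The difference is easy to check for yourself. A small sketch, using a throwaway data.frame I am calling small:

```r
small <- data.frame(ints = c(5, 4, 3, 3, 4, 5))

class(small[1])     # "data.frame": still a (one-column) data.frame
class(small[[1]])   # "numeric": the column itself, as a vector
class(small$ints)   # "numeric": same thing, asked for by name

# unlist() also flattens the one-column data.frame into a vector;
# use.names = FALSE avoids generated names like ints1, ints2, ...
as_vec <- unlist(small[1], use.names = FALSE)
identical(as_vec, small[[1]])   # TRUE
```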

Here is a pipeline using this as my data:

mydf <- data.frame(ints=c(5,4,3,3,4,5), chars=c("z","i","t","s","t","i"))

Note the number of unique items differs, as does the data type:

  > mydf
    ints chars
  1    5     z
  2    4     i
  3    3     t
  4    3     s
  5    4     t
  6    5     i

To handle the columns one at a time can be done using a pipeline like:

  > mydf[2] |> unlist() |> unique() |> sort()
  [1] "i" "s" "t" "z"
  > mydf[1] |> unlist() |> unique() |> sort()
  [1] 3 4 5

The above takes a two-column data.frame and restricts it to a one-column data.frame, then passes that temporary object as the argument to unlist(), which returns another temporary object: a vector (numeric in one case, character in the other). That result is passed as the argument to unique(), which returns a shorter vector in the same order, and that in turn is passed on to sort(), which reorders it.
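If the temporary objects are hard to picture, the same steps can be written out one at a time with named intermediate values. The variable names here are only for illustration:

```r
mydf <- data.frame(ints = c(5, 4, 3, 3, 4, 5),
                   chars = c("z", "i", "t", "s", "t", "i"))

one_col <- mydf[1]                            # one-column data.frame
as_vec  <- unlist(one_col, use.names = FALSE) # numeric vector: 5 4 3 3 4 5
uniq    <- unique(as_vec)                     # first occurrences: 5 4 3
result  <- sort(uniq)                         # reordered: 3 4 5
result
```

Each line is one stage of the pipeline, just with the intermediate result saved under a name.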

Note the first steps can be shortened if using the "[[" notation or by using the named way of asking for a column:

  > mydf[[1]] |> unique() |> sort()
  [1] 3 4 5
  > mydf$ints |> unique() |> sort()
  [1] 3 4 5

But pipelines are mostly just syntactic sugar, so you can also simply nest function calls, as in sort(unique(unlist(mydf[1]))), or do what I suggested earlier: create a function that does the work invisibly and call that.

Python often does its own version of pipelines by adding a dot at the end and calling a method and, if needed, another dot and then calling a method on the resulting object and so on. But that is arguably more limiting in some ways and more powerful in others. Different paradigms. In R, you do not do object.method1.method2(args).method3(args), so a pipeline method is used to sort of do something related.

Now if your need was to do your operation on an entire data.frame at once, then sometimes you will find a ready-made way to do it and sometimes you will use things like functional programming techniques. It is so common to calculate the sums or means of columns in a data.frame (or matrix) that functions like rowSums() and colSums() and colMeans() are available in R. But R also allows fairly arbitrary things to be done, as in the lapply() family of functions, which apply an arbitrary function, perhaps including extra arguments, like:

lapply(mydf, max)

sapply(mydf, `[`, 2)

The latter takes the second value in each and every column of the data.frame and, when possible, consolidates the results. Of course, taking unique values produces results of uneven length, so those do not simplify and are best kept as a list. Below I show how you can do many things, including nested calls:

  > lapply(mydf, sort)
  $ints
  [1] 3 3 4 4 5 5

  $chars
  [1] "i" "i" "s" "t" "t" "z"

  > lapply(lapply(mydf, sort), unique)
  $ints
  [1] 3 4 5

  $chars
  [1] "i" "s" "t" "z"

  > lapply(lapply(mydf, unique), sort)
  $ints
  [1] 3 4 5

  $chars
  [1] "i" "s" "t" "z"

  > lapply(lapply(lapply(mydf, unique), sort), toupper)
  $ints
  [1] "3" "4" "5"

  $chars
  [1] "I" "S" "T" "Z"

R has plenty of other such primitives that allow you to compose things many ways, including other variants like Filter and Reduce and Map in base R, with way more in various packages (pmap in purrr, for example).
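A quick sketch of the base R versions, reusing the same small data.frame. The result names are made up for the example:

```r
mydf <- data.frame(ints = c(5, 4, 3, 3, 4, 5),
                   chars = c("z", "i", "t", "s", "t", "i"))

# Filter() keeps the elements (here, columns) for which the test is TRUE.
numeric_cols <- Filter(is.numeric, mydf)

# Map() applies a function element by element and always returns a list.
n_unique <- Map(function(col) length(unique(col)), mydf)

# Reduce() folds a list down pairwise; here, all distinct values seen in
# any column, after converting each column to character.
all_values <- Reduce(union, lapply(mydf, as.character))

names(numeric_cols)   # "ints"
n_unique              # list with ints = 3 and chars = 4
length(all_values)    # 7
```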

It is simply wrong to insist that a language you are not very familiar with is not able to (often fairly easily) do all kinds of things. 

Back to your question, if I may, I think one of my earlier posts on this topic suggested another. Use factors which are part of base-R to perform the unique() for you and then extract the unique levels and re-order them by sorting.

  > sort(levels(factor(mydf[[1]])))
  [1] "3" "4" "5"
  > sort(levels(factor(mydf[[2]])))
  [1] "i" "s" "t" "z"

But note this converts everything to characters so a numeric may need to be converted back, and yes, the sorting is not done numerically.
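A small sketch of that caveat and one way around it, with a made-up vector that includes a two-digit value. The exact order of a character sort depends on your locale, so I only show the numeric results:

```r
ints <- c(10, 9, 2, 2, 1)

# factor() sorts the distinct values BEFORE turning them into character
# levels, so the levels already come back in numeric order, as strings.
levels(factor(ints))                      # "1" "2" "9" "10"

# Calling sort() on those strings re-sorts them as text, which can put
# "10" ahead of "2". Converting back to numeric first avoids that.
sort(as.numeric(levels(factor(ints))))    # 1 2 9 10
```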

Generally, there are oodles of ways to do anything. If this were Python, you might create an object that maintains a sorted set, for example, but that just hides the complexity, as the methods of the underlying object have to carefully keep track of the current order, insert new items in the right place, and tighten up the data structure whenever something is removed. Others simply supply a sorted() method to use only when you actually need that. This can be done in R in similar ways, and you can create objects of quite a few kinds to implement some things, but it does not often seem necessary, at least to me.

I can imagine writing a function that makes a data.frame even from vectors of unequal length by calculating the length of the longest vector and then setting each shorter vector to be longer with code like:

length(a) <- longest

You can then patch together all the results into a data.frame with trailing NA values on some columns.

I quickly cobbled together a few lines that can do that and can be placed inside a function to return this:

  lapply(lapply(lapply(mydf, unique), sort), toupper) -> uneven
  longest <- max(unlist(lapply(uneven, length)))
  answer <- data.frame(lapply(uneven, `length<-`, longest))
  print(answer)

  ints chars
1    3     I
2    4     S
3    5     T
4 <NA>     Z

Now this has a single NA, but I suggest it generalizes well to a more complex example:

   ints lower upper
1    10     k     Z
2     9     j     A
3     8     i     Z
4     7     h     A
5     6     g     Z
6     5     f     A
7     4     h     Z
8     3     i     A
9     2     j     Z
10    1     k     A
11    2     l     Z
12    3     m     A

These are uneven and three columns so I tried a function version:

  mydf2 <- data.frame(ints = c(10:1, 2:3),
                      lower = c(letters[11:6], letters[8:13]),
                      upper = rep(c("Z", "A"), 6))

  unisortuneven <- function(anydf) {
    uneven <- lapply(lapply(lapply(anydf, unique), sort), toupper)
    longest <- max(unlist(lapply(uneven, length)))
    data.frame(lapply(uneven, `length<-`, longest))
  }

  > unisortuneven(mydf2)
     ints lower upper
  1     1     F     A
  2     2     G     Z
  3     3     H  <NA>
  4     4     I  <NA>
  5     5     J  <NA>
  6     6     K  <NA>
  7     7     L  <NA>
  8     8     M  <NA>
  9     9  <NA>  <NA>
  10   10  <NA>  <NA>

The above does not format great for text, sadly, so is better shown as the transpose for display purposes:

  > t(unisortuneven(mydf2))
        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
  ints  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
  lower "F"  "G"  "H"  "I"  "J"  "K"  "L"  "M"  NA   NA
  upper "A"  "Z"  NA   NA   NA   NA   NA   NA   NA   NA

But hopefully it makes my point that a little thinking and KNOWING about features of R like how to use a functionalized version of length() that sets a changed value using the odd notation of `length<-` can let you solve all kinds of problems in a somewhat abstract manner. Of course the above function is not refined and will not handle some useful transformations or deal with errors. That can make it quite a bit harder and in some cases, make it a good idea to find someone sharing a package where they did the hard work and documented exactly what their function does.
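For what it is worth, here is a sketch of what a slightly more careful version might look like. The name unisort_pad and its transform argument are my own assumptions, and it is still nowhere near a polished package function. It checks its input and skips the toupper() step by default, so numeric columns stay numeric:

```r
# Hypothetical refinement: unique + sort each column, optionally transform,
# then pad the shorter results with NA so they fit in one data.frame.
unisort_pad <- function(anydf, transform = identity) {
  if (!is.data.frame(anydf)) stop("expected a data.frame")
  if (ncol(anydf) == 0) return(anydf)
  cols <- lapply(anydf, function(col) transform(sort(unique(col))))
  longest <- max(lengths(cols))
  padded <- lapply(cols, `length<-`, longest)   # pads with NA
  data.frame(padded, stringsAsFactors = FALSE, check.names = FALSE)
}

mydf2 <- data.frame(ints = c(10:1, 2:3),
                    lower = c(letters[11:6], letters[8:13]),
                    upper = rep(c("Z", "A"), 6))

res <- unisort_pad(mydf2)
res
unisort_pad(mydf2["lower"], transform = toupper)
```

Passing transform = toupper gets back the earlier behavior for the columns where it makes sense.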

I am eclectic and happy to switch tools at a moment's notice if they offer an interesting way to do something. But, within a language, I learn the darn rules and also the idioms often used and then choose from among many ways I can see to solve something and use what is available.  You had a trivial solution available to you to simply do one step at a time and save intermediate values, transforming at times. Some of us have sent you more general solutions. Do you still think what you want is so much harder to do in R, or that perhaps you are not thinking in R and thus want it to do it some other way other languages do?

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Stephen H. Dawson, DSL via R-help
Sent: Tuesday, December 21, 2021 10:16 AM
To: Rui Barradas <ruipbarradas using sapo.pt>; Stephen H. Dawson, DSL via R-help <r-help using r-project.org>
Subject: Re: [R] Adding SORT to UNIQUE

Thanks everyone for the replies.

It is clear one either needs to write a function or put the unique entries into another dataframe.

It seems odd R cannot sort a list of unique column entries with ease. 
Python and SQL can do it with ease.

QUESTION
Is there a simpler means, other than the unique function, to capture distinct column entries, then sort that list?

*Stephen Dawson, DSL*
/Executive Strategy Consultant/
Business & Technology
+1 (865) 804-3454
http://www.shdawson.com <http://www.shdawson.com>

On 12/20/21 5:53 PM, Rui Barradas wrote:
> Hello,
>
> Inline.
>
> Às 21:18 de 20/12/21, Stephen H. Dawson, DSL via R-help escreveu:
>> Thanks.
>>
>> sort(unique(Data[[1]]))
>>
>> This syntax provides row numbers, not column values.
>
> This is not right.
> The syntax Data[1] extracts a sub-data.frame, the syntax Data[[1]] 
> extracts the column vector.
>
> As for my previous answer, it was not addressing the question, I 
> misinterpreted it as being a question on how to sort by numeric order 
> when the data is not numeric. Here is a, hopefully, complete answer.
> Still with package stringr.
>
>
> cols_to_sort <- 1:4
>
> Data2 <- lapply(Data[cols_to_sort], \(x){
>   stringr::str_sort(unique(x), numeric = TRUE)
> })
>
>
> Or using Avi's suggestion of writing a function to do all the work and 
> simplify the lapply loop later,
>
>
> unisort2 <- function(vec, ...) stringr::str_sort(unique(vec), ...)
> Data2 <- lapply(Data[cols_to_sort], unisort2, numeric = TRUE)
>
>
> Hope this helps,
>
> Rui Barradas
>
>
>>
>> *Stephen Dawson, DSL*
>> /Executive Strategy Consultant/
>> Business & Technology
>> +1 (865) 804-3454
>> http://www.shdawson.com <http://www.shdawson.com>
>>
>>
>> On 12/20/21 11:58 AM, Stephen H. Dawson, DSL via R-help wrote:
>>> Hi,
>>>
>>>
>>> Running a simple syntax set to review entries in dataframe columns. 
>>> Here is the working code.
>>>
>>> Data <- read.csv("./input/Source.csv", header=T)
>>> describe(Data)
>>> summary(Data)
>>> unique(Data[1])
>>> unique(Data[2])
>>> unique(Data[3])
>>> unique(Data[4])
>>>
>>> I would like to sort the unique entries. The data in the various 
>>> columns are not defined as numbers, but also text. I realize 1 and 
>>> 10 will not sort properly, as the column is not defined as a number, 
>>> but want to see what I have in the columns viewed as sorted.
>>>
>>> QUESTION
>>> What is the best process to sort unique output, please?
>>>
>>>
>>> Thanks.
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
