[R] counting duplicate items that occur in multiple groups

Avi Gross @v|gro@@ @end|ng |rom ver|zon@net
Wed Nov 18 00:56:24 CET 2020


Many problems like this one can be solved, with some thought, by using the right tools, such as those in the tidyverse.

Without giving a specific answer, you might want to think about using the group_by() functionality in a pipeline to lump together all rows that have the same value in one or more columns. Then, in something like mutate() or summarize(), you can use special functions like n() that return how many rows exist within each group. There are many more such verbs and features that let you build up a result, often by removing the grouping along the way and perhaps adding some other form of grouping, including the newer rowwise(), which lets you work across columns one row at a time.
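
As a rough sketch of that idea only, and not necessarily the best answer, here is how grouping plus a per-group count might look on Bill Dunlap's Data2 example quoted below. I am assuming dplyr is installed, that n_distinct() (distinct values per group) is what is wanted rather than n() (rows per group), and that an unshared account should get 0, as in Tom's expected output. The helper column n_vendors is a name I made up.

```r
library(dplyr)

# Bill Dunlap's second example from this thread
Data2 <- data.frame(Vendor  = c("V1", "V2", "V3", "V1", "V4", "V2"),
                    Account = c("A1", "A2", "A2", "A2", "A3", "A4"))

result <- Data2 %>%
  group_by(Account) %>%                      # one group per bank account
  mutate(n_vendors = n_distinct(Vendor),     # distinct vendors in this group
         # assumption: an account used by a single vendor counts as 0
         Num_Vendors_Sharing_Bank_Acct = if_else(n_vendors > 1L, n_vendors, 0L)) %>%
  ungroup() %>%                              # drop the grouping when done
  select(-n_vendors)

print(result)
```

Swapping n_distinct(Vendor) for n() would count rows instead, which differs only if the same Vendor/Account pair appears more than once.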

I think the point is to think in steps, each producing a result that can be used in the next step.

And, for some problems, you can think outside the pipelines and create multiple intermediate data.frames with parts of what you will need and then combine them with joins or whatever it takes to get a result, whether efficiently or by brute force. Sometimes (as when making graphs) you might want to convert data between the forms often called long and wide.
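
To illustrate the intermediate data.frame idea, here is a small sketch, again on Bill Dunlap's Data2 and again assuming dplyr. It builds a per-account summary table first and then joins it back onto the original rows. The names per_account, joined and n_vendors are mine, chosen for illustration.

```r
library(dplyr)

Data2 <- data.frame(Vendor  = c("V1", "V2", "V3", "V1", "V4", "V2"),
                    Account = c("A1", "A2", "A2", "A2", "A3", "A4"))

# Step 1: an intermediate data.frame with one row per account,
# counting each Vendor/Account pair only once
per_account <- Data2 %>%
  distinct(Vendor, Account) %>%
  count(Account, name = "n_vendors")

# Step 2: join the counts back onto every original row
joined <- Data2 %>%
  left_join(per_account, by = "Account")

print(joined)
```

The intermediate table is handy on its own too, for example filter(per_account, n_vendors > 1) lists just the shared accounts.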

Yes, plenty can be done in base R or using other packages. But a good set of tools might be part of what you need to investigate.
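
For comparison, one possible base R version, with no packages needed. This is only a sketch using tapply() on Bill Dunlap's Data2, with the same assumption as Tom's expected output that an account used by only one vendor gets 0. The names n_by_account and n_vendors are mine.

```r
Data2 <- data.frame(Vendor  = c("V1", "V2", "V3", "V1", "V4", "V2"),
                    Account = c("A1", "A2", "A2", "A2", "A3", "A4"))

# Number of distinct vendors per account, as a vector named by account
n_by_account <- tapply(Data2$Vendor, Data2$Account,
                       function(v) length(unique(v)))

# Look up each row's account by name; as.character() guards against factors
n_vendors <- as.vector(n_by_account[as.character(Data2$Account)])

# assumption: an account that is not shared gets 0
Data2$Num_Vendors_Sharing_Bank_Acct <- ifelse(n_vendors > 1, n_vendors, 0L)

print(Data2)
```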

Of course, others can chime in suggesting that there are negatives to dplyr and other aspects of the tidyverse, and they would be right too.


-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Tom Woolman
Sent: Tuesday, November 17, 2020 6:30 PM
To: Bill Dunlap <williamwdunlap using gmail.com>
Cc: r-help using r-project.org
Subject: Re: [R] counting duplicate items that occur in multiple groups

Hi Bill. Sorry to be so obtuse with the example data. I was trying (too hard) not to share any actual values, so I just created randomized values for my example; of course, I should have specified that the random values would not produce the expected problem pattern. I should have just used simple dummy codes as Bill Dunlap did.

So per Bill's example data for Data1, the expected (hoped for) output should be:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1     V1      A1                             0
2     V2      A2                             3
3     V3      A2                             3
4     V4      A2                             3


Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct.  
The value is 3 for V2, V3 and V4 because they each share bank account A2.


Likewise, in the Data2 frame, the same logic applies:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1     V1      A1                             0
2     V2      A2                             3
3     V3      A2                             3
4     V1      A2                             3
5     V4      A3                             0
6     V2      A4                             0






Thanks!


Quoting Bill Dunlap <williamwdunlap using gmail.com>:

> What should the result be for
>   Data1 <- data.frame(Vendor=c("V1","V2","V3","V4"),
> Account=c("A1","A2","A2","A2"))
> ?
>
> Must each vendor have only one account?  If not, what should the 
> result be for
>    Data2 <- data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"),
> Account=c("A1","A2","A2","A2","A3","A4"))
> ?
>
> -Bill
>
> On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman <twoolman using ontargettek.com>
> wrote:
>
>> Hi everyone.  I have a dataframe that is a collection of Vendor IDs 
>> plus a bank account number for each vendor. I'm trying to find a way 
>> to count the number of duplicate bank accounts that occur in more 
>> than one unique Vendor_ID, and then assign the count value for each 
>> row in the dataframe in a new variable.
>>
>> I can do a count of bank accounts that occur within the same vendor 
>> using dplyr and group_by and count, but I can't figure out a way to 
>> count duplicates among multiple Vendor_IDs.
>>
>>
>> Dataframe example code:
>>
>>
>> #Create a sample data frame:
>>
>> set.seed(1)
>>
>> Data <- data.frame(Vendor_ID = sample(1:10000), Bank_Account_ID =
>> sample(1:10000))
>>
>>
>>
>>
>> Thanks in advance for any help.
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
