[R] Ordering Duplicates for Selection

Tue Oct 5 17:58:47 CEST 2010

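Here is a way of putting an "Order" column on your data. ave() applies seq_along() within each V2 (student number) / V3 (course) group, so repeated attempts are numbered 1, 2, ... in row order: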

> x
  V1       V2     V3 V4         V5
1  1 12345678 Soc101 34 02-04-2003
2  2 12345678 Soc101 62 31-11-2004
3  3 12345678 Psy104 63 03-05-2003
4  4 23456789 Soc101 73 02-04-2003
5  5 23456789 Psy104 76 25-02-2004
> x$order <- ave(x$V1, x$V2, x$V3, FUN=seq_along)
> x
  V1       V2     V3 V4         V5 order
1  1 12345678 Soc101 34 02-04-2003     1
2  2 12345678 Soc101 62 31-11-2004     2
3  3 12345678 Psy104 63 03-05-2003     1
4  4 23456789 Soc101 73 02-04-2003     1
5  5 23456789 Psy104 76 25-02-2004     1
>
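
For anyone who wants to reproduce this from scratch, here is a minimal self-contained sketch of the same idea, followed by the first-attempt vs. second-attempt comparison asked about in the original message. The column names (student, course, mark, completed) are illustrative labels, not names from the original post. The dates are kept as plain strings because 31-11-2004 is not a valid calendar date and would not parse. ave() numbers rows in the order they appear, so this assumes each student's records are already sorted chronologically.

```r
# Rebuild the example data; the column names are hypothetical labels.
x <- data.frame(
  student   = c(12345678, 12345678, 12345678, 23456789, 23456789),
  course    = c("Soc101", "Soc101", "Psy104", "Soc101", "Psy104"),
  mark      = c(34, 62, 63, 73, 76),
  completed = c("02-04-2003", "31-11-2004", "03-05-2003",
                "02-04-2003", "25-02-2004"),
  stringsAsFactors = FALSE
)

# Running count within each student/course pair, in row order.
# This plays the role of Excel's COUNTIF($A$2:A2, A2).
x$order <- ave(seq_len(nrow(x)), x$student, x$course, FUN = seq_along)

# Marks grouped by attempt number, then a mean per attempt.
marks_by_attempt <- split(x$mark, x$order)
mean_by_attempt  <- tapply(x$mark, x$order, mean)

print(x)
print(mean_by_attempt)
```

summary(x$mark[x$order == 1]) and summary(x$mark[x$order == 2]) would give the fuller distribution for each attempt if a mean is not enough.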

On Tue, Oct 5, 2010 at 11:42 AM, C C <psdcc at hotmail.com> wrote:
>
> Hi all,
>
> I've found a lot of helpful info on identifying and deleting duplicates, but I'd like to do something a little different: identify the duplicate values and, instead of deleting them, label them with a value.
>
> I am working with historical data regarding school courses:
>
>
>
>    Student Number  Course  Final Mark  Completed Date
> 1  12345678        Soc101  34          02-04-2003
> 2  12345678        Soc101  62          31-11-2004
> 3  12345678        Psy104  63          03-05-2003
> 4  23456789        Soc101  73          02-04-2003
> 5  23456789        Psy104  76          25-02-2004
>
>
> In this data frame, records 1 and 2 contain data for the same student taking the same course.  In record 1, the student failed (Final Mark), took the course again (Completed Date) and finally passed (Final Mark) in record 2.
>
> I'd like to be able to work with the data so that I could summarize the achievement distribution for the first attempt records and then compare it to the achievement distribution for the second attempt records.  In Excel I'd use something like COUNTIF($A$2:A2,A2) in a new column and then summarize the "1" values and "2" values.
>
>    Order  Student Number  Course  Final Mark  Completed Date
> 1  1      12345678        Soc101  34          02-04-2003
> 2  2      12345678        Soc101  62          31-11-2004
> 3  1      12345678        Psy104  63          03-05-2003
> 4  1      23456789        Soc101  73          02-04-2003
> 5  1      23456789        Psy104  76          25-02-2004
>
>
> I suspect the answer is in the list discussions on "deleting duplicate records" but I'm still familiarizing myself with R and I'm not at a point to be able to see how it could be modified.  Any thoughts?
>
> Cheers,
> Chris
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?